[python] 값이 목록에 있는지 확인하는 가장 빠른 방법

값이 목록 (수백만 개의 값이 포함 된 목록)에 존재하고 그 색인이 무엇인지 아는 가장 빠른 방법은 무엇입니까?

이 예제와 같이 목록의 모든 값이 고유하다는 것을 알고 있습니다.

내가 시도하는 첫 번째 방법은 (실제 코드에서 3.8 초)입니다.

a = [4,2,3,1,5,6]

if a.count(7) == 1:
    b=a.index(7)
    "Do something with variable b"

두 번째 방법은 (2 배 빠름 : 실제 코드의 경우 1.9 초)입니다.

a = [4,2,3,1,5,6]

try:
    b=a.index(7)
except ValueError:
    "Do nothing"
else:
    "Do something with variable b"

Stack Overflow 사용자가 제안한 방법 (실제 코드의 경우 2.74 초) :

a = [4,2,3,1,5,6]
if 7 in a:
    a.index(7)

내 실제 코드에서 첫 번째 방법은 3.81 초가 걸리고 두 번째 방법은 1.88 초가 걸립니다. 좋은 개선이지만 다음과 같습니다.

저는 Python / 스크립트를 사용하는 초보자이며 동일한 작업을 수행하고 처리 시간을 단축하는 더 빠른 방법이 있습니까?

내 응용 프로그램에 대한 더 구체적인 설명 :

Blender API에서 파티클 목록에 액세스 할 수 있습니다.

particles = [1, 2, 3, 4, etc.]

거기서 파티클의 위치에 접근 할 수 있습니다 :

particles[x].location = [x,y,z]

그리고 각 입자에 대해 다음과 같이 각 입자 위치를 검색하여 이웃이 존재하는지 테스트합니다.

if [x+1,y,z] in particles.location
    "Find the identity of this neighbour particle in x:the particle's index
    in the array"
    particles.index([x+1,y,z])

답변

7 in a

가장 명확하고 빠른 방법입니다.

을 사용하는 것도 고려할 수 set있지만 목록에서 해당 세트를 구성하면 멤버쉽 테스트 시간이 단축되는 것보다 시간이 더 걸릴 수 있습니다. 확신 할 수있는 유일한 방법은 벤치 마크를 잘하는 것입니다. (또한 필요한 작업에 따라 다릅니다)

답변

다른 사람들이 말했듯이 in큰 목록의 경우 속도가 매우 느릴 수 있습니다. in, set및 의 성능을 몇 가지 비교합니다 bisect. 시간 (초)은 로그 스케일입니다.

테스트 코드 :

import random
import bisect
import matplotlib.pyplot as plt
import math
import time

def method_in(a,b,c):
    start_time = time.time()
    for i,x in enumerate(a):
        if x in b:
            c[i] = 1
    return(time.time()-start_time)

def method_set_in(a,b,c):
    start_time = time.time()
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = 1
    return(time.time()-start_time)

def method_bisect(a,b,c):
    start_time = time.time()
    b.sort()
    for i,x in enumerate(a):
        index = bisect.bisect_left(b,x)
        if index < len(a):
            if x == b[index]:
                c[i] = 1
    return(time.time()-start_time)

def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []

    Nls = [x for x in range(1000,20000,1000)]
    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        time_method_in.append(math.log(method_in(a,b,c)))
        time_method_set_in.append(math.log(method_set_in(a,b,c)))
        time_method_bisect.append(math.log(method_bisect(a,b,c)))

    plt.plot(Nls,time_method_in,marker='o',color='r',linestyle='-',label='in')
    plt.plot(Nls,time_method_set_in,marker='o',color='b',linestyle='-',label='set')
    plt.plot(Nls,time_method_bisect,marker='o',color='g',linestyle='-',label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

답변

에 상품을 넣을 수 있습니다 set. 집합 조회는 매우 효율적입니다.

시험:

s = set(a)
if 7 in s:
  # do stuff

편집 의견에서 요소의 색인을 얻고 싶다고 말합니다. 불행히도, 세트에는 요소 위치에 대한 개념이 없습니다. 다른 방법은 목록을 미리 정렬 한 다음 요소를 찾을 때마다 이진 검색 을 사용 하는 것입니다.

답변

def check_availability(element, collection: iter):
    return element in collection

용법

check_availability('a', [1,2,3,4,'a','b','c'])

선택한 값이 배열에 있는지 알 수있는 가장 빠른 방법이라고 생각합니다.

답변

a = [4,2,3,1,5,6]

index = dict((y,x) for x,y in enumerate(a))
try:
   a_index = index[7]
except KeyError:
   print "Not found"
else:
   print "found"

이것은 변경되지 않는 경우에만 좋은 아이디어 일 것이므로 dict () 부분을 한 번 수행 한 다음 반복적으로 사용할 수 있습니다. 변경 사항이있는 경우 수행중인 작업에 대한 자세한 정보를 제공하십시오.

답변

원래 질문은 다음과 같습니다.

값이 목록 (수백만 개의 값이 포함 된 목록)에 존재하고 그 색인이 무엇인지 아는 가장 빠른 방법은 무엇입니까?

따라서 찾아야 할 두 가지가 있습니다.

목록의 항목이며
색인은 무엇입니까 (목록에있는 경우).

이를 위해 모든 경우에 인덱스를 계산하기 위해 @xslittlegrass 코드를 수정하고 추가 방법을 추가했습니다.

결과

방법은 다음과 같습니다.

in– 기본적으로 x in b 인 경우 : b.index (x)를 반환
b.index (x)에서 try-try / catch (x가 b인지 확인해야 함)
set– 기본적으로 set (b)의 x 인 경우 : b.index (x)를 반환
bisect-인덱스가있는 b를 정렬하고 sorted (b)에서 x를 이진 검색합니다. @xslittlegrass의 mod는 원래 b가 아닌 정렬 된 b로 인덱스를 반환합니다.
reverse–b에 대한 역방향 조회 사전 d를 형성합니다. d [x]는 x의 인덱스를 제공합니다.

결과는 방법 5가 가장 빠르다는 것을 보여줍니다.

흥미롭게도 try 및 set 메소드는 시간이 동일합니다.

테스트 코드

import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools

def wrapper(func, *args, **kwargs):
    " Use to produced 0 argument function for call it"
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def method_in(a,b,c):
    for i,x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_try(a,b,c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1

def method_set_in(a,b,c):
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_bisect(a,b,c):
    " Finds indexes using bisection "

    # Create a sorted b with its index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])

    for i,x in enumerate(a):
        index = bisect.bisect_left(bsorted,(x, ))
        c[i] = -1
        if index < len(a):
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1]  # index in the b array

    return c

def method_reverse_lookup(a, b, c):
    reverse_lookup = {x:i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c

def profile():
    Nls = [x for x in range(1000,20000,1000)]
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]

    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))

    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))

    for i in range(len(time_methods)):
        plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))

    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

profile()

답변

블룸 필터 데이터 구조를 사용하면 응용 프로그램에서 이점을 얻을 수있을 것 같습니다.

간단히 말해서, 블룸 필터 조회는 값이 세트에 확실하게 존재하지 않으면 매우 빨리 알 수 있습니다. 그렇지 않으면 목록에서 POSSIBLY가 될 수있는 값의 인덱스를 얻기 위해 느린 조회를 수행 할 수 있습니다. 따라서 응용 프로그램에서 “찾을 수 없음”결과가 “찾은”결과보다 훨씬 자주 발생하는 경우 블룸 필터를 추가하면 속도가 향상 될 수 있습니다.

자세한 내용은 Wikipedia에서 Bloom Filters의 작동 방식에 대한 훌륭한 개요를 제공하며 “python bloom filter library”에 대한 웹 검색을 통해 유용한 구현을 제공 할 수 있습니다.