[python] Python 해시 가능 사전

Question 1

연습으로, 주로 내 자신의 즐거움을 위해 역 추적 packrat 파서를 구현하고 있습니다. 이에 대한 영감은 알골과 같은 언어에서 위생적인 매크로가 어떻게 작동하는지에 대해 더 잘 알고 싶습니다 (일반적으로 사용하는 구문이없는 lisp 방언에 따라). 이 때문에 입력을 통과하는 다른 과정에서 다른 문법을 볼 수 있으므로 캐시 된 구문 분석 결과와 함께 현재 버전의 문법을 저장하지 않는 한 캐시 된 구문 분석 결과는 유효하지 않습니다. ( 편집 :이 키-값 컬렉션 사용의 결과는 변경 불가능해야하지만 변경 될 수 있도록 인터페이스를 노출 할 의도가 없으므로 변경 가능하거나 변경 불가능한 컬렉션 모두 괜찮습니다)

문제는 파이썬 사전이 다른 사전에 키로 나타날 수 없다는 것입니다. 어쨌든 튜플을 사용하는 것조차 도움이되지 않습니다.

>>> cache = {}
>>> rule = {"foo":"bar"}
>>> cache[(rule, "baz")] = "quux"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>

나는 그것이 끝까지 튜플이어야한다고 생각합니다. 이제 파이썬 표준 라이브러리는 내가 필요로하는 대략적인 것을 제공 collections.namedtuple하고 매우 다른 구문을 가지고 있지만 키로 사용할 수 있습니다. 위 세션에서 계속 :

>>> from collections import namedtuple
>>> Rule = namedtuple("Rule",rule.keys())
>>> cache[(Rule(**rule), "baz")] = "quux"
>>> cache
{(Rule(foo='bar'), 'baz'): 'quux'}

확인. 하지만 사용하려는 규칙에서 가능한 각 키 조합에 대해 클래스를 만들어야합니다. 이는 각 구문 분석 규칙이 사용하는 매개 변수를 정확히 알고 있기 때문에 클래스를 동시에 정의 할 수 있기 때문입니다. 규칙을 구문 분석하는 함수로.

편집 : namedtuples 의 추가 문제 는 엄격하게 위치가 있다는 것입니다. 서로 달라야하는 것처럼 보이는 두 개의 튜플은 실제로 동일 할 수 있습니다.

>>> you = namedtuple("foo",["bar","baz"])
>>> me = namedtuple("foo",["bar","quux"])
>>> you(bar=1,baz=2) == me(bar=1,quux=2)
True
>>> bob = namedtuple("foo",["baz","bar"])
>>> you(bar=1,baz=2) == bob(bar=1,baz=2)
False

tl’dr : dict다른 dicts의 키로 사용할 수 있는를 어떻게 얻 습니까?

답변을 조금 해킹 한 후 사용중인보다 완벽한 솔루션이 있습니다. 이것은 결과 딕셔너리를 실제 목적을 위해 모호하게 불변으로 만들기 위해 약간의 추가 작업을 수행합니다. 물론 전화로 해킹하는 것은 여전히 쉽지만 dict.__setitem__(instance, key, value)우리는 모두 성인입니다.

class hashdict(dict):
    """
    hashable dict implementation, suitable for use as a key into
    other dicts.

        >>> h1 = hashdict({"apples": 1, "bananas":2})
        >>> h2 = hashdict({"bananas": 3, "mangoes": 5})
        >>> h1+h2
        hashdict(apples=1, bananas=3, mangoes=5)
        >>> d1 = {}
        >>> d1[h1] = "salad"
        >>> d1[h1]
        'salad'
        >>> d1[h2]
        Traceback (most recent call last):
        ...
        KeyError: hashdict(bananas=3, mangoes=5)

    based on answers from
       http://stackoverflow.com/questions/1151658/python-hashable-dicts

    """
    def __key(self):
        return tuple(sorted(self.items()))
    def __repr__(self):
        return "{0}({1})".format(self.__class__.__name__,
            ", ".join("{0}={1}".format(
                    str(i[0]),repr(i[1])) for i in self.__key()))

    def __hash__(self):
        return hash(self.__key())
    def __setitem__(self, key, value):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def __delitem__(self, key):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def clear(self):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def pop(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def popitem(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def setdefault(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def update(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    # update is not ok because it mutates the object
    # __add__ is ok because it creates a new object
    # while the new object is under construction, it's ok to mutate it
    def __add__(self, right):
        result = hashdict(self)
        dict.update(result, right)
        return result

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Question 2

다음은 해시 가능한 사전을 만드는 쉬운 방법입니다. 명백한 이유로 다른 사전에 임베딩 한 후에는 변경하지 마십시오.

class hashabledict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))

Question 3

Hashables는 불변이어야합니다. 이것을 강제하는 것이 아니라 dict를 키로 처음 사용한 후에 변경하지 않도록 믿으면 다음과 같은 접근 방식이 작동합니다.

class hashabledict(dict):
  def __key(self):
    return tuple((k,self[k]) for k in sorted(self))
  def __hash__(self):
    return hash(self.__key())
  def __eq__(self, other):
    return self.__key() == other.__key()

딕셔너리를 변경할 필요가 있고 키로 사용하고 싶다면 복잡성이 수백 배로 폭발합니다. 할 수 없다고 말할 수는 없지만 그 놀라운 모 라스에 들어가기 전에 매우 구체적인 표시가 나올 때까지 기다릴 것입니다! -)

Question 4

목적에 맞게 사전을 사용하기 위해 필요한 것은 __hash__ 메소드를 추가하는 것입니다.

class Hashabledict(dict):
    def __hash__(self):
        return hash(frozenset(self))

참고 frozenset의 변환은 모든 사전 (정렬로 키를 필요로하지 않습니다 그것을 즉)에 대한 작동합니다. 마찬가지로 사전 값에는 제한이 없습니다.

키가 같지만 값이 다른 사전이 많은 경우 해시에서 값을 고려하도록해야합니다. 이를 수행하는 가장 빠른 방법은 다음과 같습니다.

class Hashabledict(dict):
    def __hash__(self):
        return hash((frozenset(self), frozenset(self.itervalues())))

이것은 frozenset(self.iteritems())두 가지 이유 보다 빠릅니다 . 먼저, frozenset(self)단계는 불필요한 통화를 저장, 사전에 저장된 해시 값을 다시 사용합니다 hash(key). 둘째, itervalue 를 사용하면 값에 직접 액세스하고 조회를 수행 할 때마다 메모리에 새로운 많은 키 / 값 튜플을 형성하기 위해 항목 별로 사용하는 많은 메모리 할당 자 호출을 피할 수 있습니다.

Question 5

주어진 답변은 괜찮지 만 해시를 생성하는 frozenset(...)대신 사용하여 개선 할 수 있습니다 tuple(sorted(...)).

>>> import timeit
>>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
4.7758948802947998
>>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
1.8153600692749023

성능 이점은 사전의 내용에 따라 다르지만 테스트 한 대부분의 경우 해싱 frozenset이 최소 2 배 더 빠릅니다 (주로 정렬 할 필요가 없기 때문입니다).

Question 6

합리적으로 깨끗하고 간단한 구현은 다음과 같습니다.

import collections

class FrozenDict(collections.Mapping):
    """Don't forget the docstrings!!"""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        return hash(tuple(sorted(self._d.iteritems())))

Question 7

이 주제로 계속 돌아옵니다 … 여기에 또 다른 변형이 있습니다. 메서드 dict를 추가하는 서브 클래 싱 이 불편합니다 __hash__. dict가 변경 될 수 있다는 문제에서 벗어날 수있는 방법은 거의 없으며, 변경되지 않을 것이라고 신뢰하는 것은 약한 생각처럼 보입니다. 그래서 저는 그 자체로 불변 인 내장형을 기반으로 매핑을 구축하는 방법을 살펴 보았습니다. tuple분명한 선택 이지만 , 그 안에있는 값에 접근하는 것은 정렬과 이분법을 의미합니다. 문제는 아니지만 그것이 구축 된 유형의 힘을 많이 활용하지 않는 것 같습니다.

잼 키, 값 쌍을 frozenset? 그게 무엇을 요구하고 어떻게 작동할까요?

Part 1에서는 frozenset이 주로 키로 처리하는 방식으로 ‘항목’을 인코딩하는 방법이 필요합니다. 나는 그것을 위해 작은 하위 클래스를 만들 것입니다.

import collections
class pair(collections.namedtuple('pair_base', 'key value')):
    def __hash__(self):
        return hash((self.key, None))
    def __eq__(self, other):
        if type(self) != type(other):
            return NotImplemented
        return self.key == other.key
    def __repr__(self):
        return repr((self.key, self.value))

그것만으로도 불변의 매핑의 거리에 있습니다.

>>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
>>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
>>> pair(2, None) in pairs
True
>>> pair(5, None) in pairs
False
>>> goal = frozenset((pair(2, None),))
>>> pairs & goal
frozenset([(2, None)])

오! 불행히도 집합 연산자를 사용할 때 요소는 동일하지만 동일한 개체는 아닙니다. 반환 값에서 끝나는 것은 undefined입니다 . 우리는 더 많은 회전을 거쳐야 할 것입니다.

>>> pairs - (pairs - goal)
frozenset([(2, 'c')])
>>> iter(pairs - (pairs - goal)).next().value
'c'

그러나 이러한 방식으로 값을 찾는 것은 번거롭고 더 나쁜 것은 많은 중간 세트를 생성합니다. 그것은하지 않을 것입니다! 이를 해결하기 위해 ‘가짜’키-값 쌍을 생성합니다.

class Thief(object):
    def __init__(self, key):
        self.key = key
    def __hash__(self):
        return hash(pair(self.key, None))
    def __eq__(self, other):
        self.value = other.value
        return pair(self.key, None) == other

그 결과 덜 문제가됩니다.

>>> thief = Thief(2)
>>> thief in pairs
True
>>> thief.value
'c'

그게 다 깊은 마법입니다. 나머지는 딕셔너리와 같은 인터페이스 를 가진 것으로 모든 것을 감싸는 것입니다 . frozenset인터페이스가 매우 다른 에서 서브 클래 싱하기 때문에 메서드가 상당히 많습니다. 에서 약간의 도움을 collections.Mapping받지만 대부분의 작업은 frozensetdicts처럼 작동하는 버전 의 메서드를 대신 재정의하는 것입니다.

class FrozenDict(frozenset, collections.Mapping):
    def __new__(cls, seq=()):
        return frozenset.__new__(cls, (pair(k, v) for k, v in seq))
    def __getitem__(self, key):
        thief = Thief(key)
        if frozenset.__contains__(self, thief):
            return thief.value
        raise KeyError(key)
    def __eq__(self, other):
        if not isinstance(other, FrozenDict):
            return dict(self.iteritems()) == other
        if len(self) != len(other):
            return False
        for key, value in self.iteritems():
            try:
                if value != other[key]:
                    return False
            except KeyError:
                return False
        return True
    def __hash__(self):
        return hash(frozenset(self.iteritems()))
    def get(self, key, default=None):
        thief = Thief(key)
        if frozenset.__contains__(self, thief):
            return thief.value
        return default
    def __iter__(self):
        for item in frozenset.__iter__(self):
            yield item.key
    def iteritems(self):
        for item in frozenset.__iter__(self):
            yield (item.key, item.value)
    def iterkeys(self):
        for item in frozenset.__iter__(self):
            yield item.key
    def itervalues(self):
        for item in frozenset.__iter__(self):
            yield item.value
    def __contains__(self, key):
        return frozenset.__contains__(self, pair(key, None))
    has_key = __contains__
    def __repr__(self):
        return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')
    @classmethod
    def fromkeys(cls, keys, value=None):
        return cls((key, value) for key in keys)

궁극적으로 내 질문에 대답합니다.

>>> myDict = {}
>>> myDict[FrozenDict(enumerate('ab'))] = 5
>>> FrozenDict(enumerate('ab')) in myDict
True
>>> FrozenDict(enumerate('bc')) in myDict
False
>>> FrozenDict(enumerate('ab', 3)) in myDict
False
>>> myDict[FrozenDict(enumerate('ab'))]
5

Question 8

@Unknown의 답변과 @AlexMartelli의 답변은 완벽하게 작동하지만 다음 제약 조건에서만 작동합니다.

사전의 값은 해시 가능해야합니다. 예를 들어 hash(hashabledict({'a':[1,2]}))올릴 것이다 TypeError.
키는 비교 연산을 지원해야합니다. 예를 들어 hash(hashabledict({'a':'a', 1:1}))올릴 것이다 TypeError.
키에 대한 비교 연산자는 전체 순서를 부과합니다. 예를 들어 사전에있는 두 키가 frozenset((1,2,3))및 frozenset((4,5,6))인 경우 두 방향에서 같지 않은 값을 비교합니다. 따라서 이러한 키를 사용하여 사전 항목을 정렬하면 임의의 순서가 발생할 수 있으므로 동일한 객체가 동일한 해시 값을 가져야한다는 규칙을 위반하게됩니다.

@ObenSonne의 훨씬 더 빠른 답변은 제약 조건 2와 3을 해제하지만 여전히 제약 조건 1에 의해 구속됩니다 (값은 해시 가능해야 함).

@RaymondHettinger의 더 빠르면서도 대답 .values()은 해시 계산에 포함되지 않기 때문에 세 가지 제약 조건을 모두 해제합니다 . 그러나 다음과 같은 경우에만 성능이 좋습니다.

해시해야하는 대부분의 (같지 않은) 사전은 동일하지 않습니다 .keys().

이 조건이 충족되지 않으면 해시 함수는 여전히 유효하지만 너무 많은 충돌이 발생할 수 있습니다. 예를 들어 모든 사전이 웹 사이트 템플릿에서 생성되는 극단적 인 경우 (필드 이름은 키로, 사용자 입력은 값으로) 키는 항상 동일하며 해시 함수는 모든 입력에 대해 동일한 값을 반환합니다. . 결과적으로 이러한 해시 함수에 의존하는 해시 테이블은 항목을 검색 할 때 목록만큼 느려집니다 ( O(N)대신 O(1)).

위에 나열된 4 개의 제약 조건을 모두 위반하더라도 다음 솔루션이 합리적으로 잘 작동 할 것이라고 생각합니다. 딕셔너리뿐만 아니라 모든 컨테이너가 중첩 된 변경 가능한 컨테이너가 있더라도 해시 할 수 있다는 추가적인 이점이 있습니다.

나는 지금까지 이것을 가볍게 테스트했기 때문에 이것에 대한 피드백을 많이 주시면 감사하겠습니다.

# python 3.4
import collections
import operator
import sys
import itertools
import reprlib

# a wrapper to make an object hashable, while preserving equality
class AutoHash:
    # for each known container type, we can optionally provide a tuple
    # specifying: type, transform, aggregator
    # even immutable types need to be included, since their items
    # may make them unhashable

    # transformation may be used to enforce the desired iteration
    # the result of a transformation must be an iterable
    # default: no change; for dictionaries, we use .items() to see values

    # usually transformation choice only affects efficiency, not correctness

    # aggregator is the function that combines all items into one object
    # default: frozenset; for ordered containers, we can use tuple

    # aggregator choice affects both efficiency and correctness
    # e.g., using a tuple aggregator for a set is incorrect,
    # since identical sets may end up with different hash values
    # frozenset is safe since at worst it just causes more collisions
    # unfortunately, no collections.ABC class is available that helps
    # distinguish ordered from unordered containers
    # so we need to just list them out manually as needed

    type_info = collections.namedtuple(
        'type_info',
        'type transformation aggregator')

    ident = lambda x: x
    # order matters; first match is used to handle a datatype
    known_types = (
        # dict also handles defaultdict
        type_info(dict, lambda d: d.items(), frozenset),
        # no need to include set and frozenset, since they are fine with defaults
        type_info(collections.OrderedDict, ident, tuple),
        type_info(list, ident, tuple),
        type_info(tuple, ident, tuple),
        type_info(collections.deque, ident, tuple),
        type_info(collections.Iterable, ident, frozenset) # other iterables
    )

    # hash_func can be set to replace the built-in hash function
    # cache can be turned on; if it is, cycles will be detected,
    # otherwise cycles in a data structure will cause failure
    def __init__(self, data, hash_func=hash, cache=False, verbose=False):
        self._data=data
        self.hash_func=hash_func
        self.verbose=verbose
        self.cache=cache
        # cache objects' hashes for performance and to deal with cycles
        if self.cache:
            self.seen={}

    def hash_ex(self, o):
        # note: isinstance(o, Hashable) won't check inner types
        try:
            if self.verbose:
                print(type(o),
                    reprlib.repr(o),
                    self.hash_func(o),
                    file=sys.stderr)
            return self.hash_func(o)
        except TypeError:
            pass

        # we let built-in hash decide if the hash value is worth caching
        # so we don't cache the built-in hash results
        if self.cache and id(o) in self.seen:
            return self.seen[id(o)][0] # found in cache

        # check if o can be handled by decomposing it into components
        for typ, transformation, aggregator in AutoHash.known_types:
            if isinstance(o, typ):
                # another option is:
                # result = reduce(operator.xor, map(_hash_ex, handler(o)))
                # but collisions are more likely with xor than with frozenset
                # e.g. hash_ex([1,2,3,4])==0 with xor

                try:
                    # try to frozenset the actual components, it's faster
                    h = self.hash_func(aggregator(transformation(o)))
                except TypeError:
                    # components not hashable with built-in;
                    # apply our extended hash function to them
                    h = self.hash_func(aggregator(map(self.hash_ex, transformation(o))))
                if self.cache:
                    # storing the object too, otherwise memory location will be reused
                    self.seen[id(o)] = (h, o)
                if self.verbose:
                    print(type(o), reprlib.repr(o), h, file=sys.stderr)
                return h

        raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))

    def __hash__(self):
        return self.hash_ex(self._data)

    def __eq__(self, other):
        # short circuit to save time
        if self is other:
            return True

        # 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
        # 2) any other situation => lhs.__eq__ will be called first

        # case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
        # => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
        # case 2. neither side is a subclass of the other; self is lhs
        # => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
        # case 3. neither side is a subclass of the other; self is rhs
        # => we can't compare to another type, and the other side already tried and failed;
        # we should return False, but NotImplemented will have the same effect
        # any other case: we won't reach the __eq__ code in this class, no need to worry about it

        if isinstance(self, type(other)): # identifies case 1
            return self._data == other._data
        else: # identifies cases 2 and 3
            return NotImplemented

d1 = {'a':[1,2], 2:{3:4}}
print(hash(AutoHash(d1, cache=True, verbose=True)))

d = AutoHash(dict(a=1, b=2, c=3, d=[4,5,6,7], e='a string of chars'),cache=True, verbose=True)
print(hash(d))