FAQ
The short version of this question is: where can I find the algorithm
used by the tuple class's __hash__ method?

Now, for the long version of this question, I'm working with some
complext Python objects that I want to be able to compare for
equality easily.

These objects are non-mutable once they are created, so I would
like to use a two-step comparison for equality, based on the
assumption that I can compute (either at creation time, or as needed
and memoized) a hashkey/digest for each object. The test for
equality of two of these objects would first compare their hashkeys.
If they are different, the two objects are declared different; if
they match, then a more stringent test for equality is performed.

So the problem is to come up with a reasonable hashkey for each of
these objects. The objects have two significant attributes, and
two of these objects should be regarded as equal if these attributes
are "the same" in both. The first attribute is a simple dictionary
whose keys are integers and values are strings. The second attribute
is more complicated. It is a tree structure, represented as a
dictionary of dictionaries of dictionaries... until we get to the
leaf elements, which are frozensets of strings. The keys at every
level of this structure are strings. E.g. a simple example of such
an attribute would look like:

{'A': {'a': set(['1', '2', '3']),
'b': set(['4', '5'])},
'B': set(['6', '7', '8'])}

I'm looking for a good algorithm for computing a hash key for
something like this? (Basically I'm looking for a good way to
combine hashkeys.)

Thanks!

kj

Search Discussions

  • Duncan Booth at Oct 6, 2010 at 7:13 pm

    kj wrote:

    The short version of this question is: where can I find the algorithm
    used by the tuple class's __hash__ method?
    http://svn.python.org/view/python/trunk/Objects/tupleobject.c?revision�029&view=markup

    static long
    tuplehash(PyTupleObject *v)
    {
    register long x, y;
    register Py_ssize_t len = Py_SIZE(v);
    register PyObject **p;
    long mult = 1000003L;
    x = 0x345678L;
    p = v->ob_item;
    while (--len >= 0) {
    y = PyObject_Hash(*p++);
    if (y == -1)
    return -1;
    x = (x ^ y) * mult;
    /* the cast might truncate len; that doesn't change hash stability */
    mult += (long)(82520L + len + len);
    }
    x += 97531L;
    if (x == -1)
    x = -2;
    return x;
    }


    I'm looking for a good algorithm for computing a hash key for
    something like this? (Basically I'm looking for a good way to
    combine hashkeys.)
    If you want to combine the hashes from several objects then the
    easiest way is just to create a tuple of the objects and hash it.
  • Diez B. Roggisch at Oct 6, 2010 at 7:26 pm

    kj <no.email at please.post> writes:

    The short version of this question is: where can I find the algorithm
    used by the tuple class's __hash__ method?
    Surprisingly, in the source:

    http://google.com/codesearch/p?hl=de#-2BKs-LW4I0/trunk/python/src/Objects/tupleobject.c&q=python%20tuplehash&sa=N&cd=1&ct=rc
    Now, for the long version of this question, I'm working with some
    complext Python objects that I want to be able to compare for
    equality easily.

    These objects are non-mutable once they are created, so I would
    like to use a two-step comparison for equality, based on the
    assumption that I can compute (either at creation time, or as needed
    and memoized) a hashkey/digest for each object. The test for
    equality of two of these objects would first compare their hashkeys.
    If they are different, the two objects are declared different; if
    they match, then a more stringent test for equality is performed.

    So the problem is to come up with a reasonable hashkey for each of
    these objects. The objects have two significant attributes, and
    two of these objects should be regarded as equal if these attributes
    are "the same" in both. The first attribute is a simple dictionary
    whose keys are integers and values are strings. The second attribute
    is more complicated. It is a tree structure, represented as a
    dictionary of dictionaries of dictionaries... until we get to the
    leaf elements, which are frozensets of strings. The keys at every
    level of this structure are strings. E.g. a simple example of such
    an attribute would look like:

    {'A': {'a': set(['1', '2', '3']),
    'b': set(['4', '5'])},
    'B': set(['6', '7', '8'])}

    I'm looking for a good algorithm for computing a hash key for
    something like this? (Basically I'm looking for a good way to
    combine hashkeys.)
    Creating tuples from dicts, recursively, and stabilized by using sorted
    on items.

    Diez
  • Kj at Oct 7, 2010 at 10:53 am
    In <m2fwwjazs2.fsf at web.de> deets at web.de (Diez B. Roggisch) writes:
    kj <no.email at please.post> writes:
    The short version of this question is: where can I find the algorithm
    used by the tuple class's __hash__ method?
    Surprisingly, in the source:
    http://google.com/codesearch/p?hl=de#-2BKs-LW4I0/trunk/python/src/Objects/tupleobject.c&q=python%20tuplehash&sa=N&cd=1&ct=rc
    Thanks to you, and to all who pointed me to the place in the source
    where this is.

    How exactly did you search for this? Taking a hint from the url
    above, I went to Google Code Search and searched for "python tuple
    hash" (minus the quotes, of course), which produced a ton of
    irrelevant stuff (almost 80K hits). Searching for "python tuple
    hash lang:c" cut down the number of hits to ~8K, but still too much
    to wade through. Clearly I need a smarter search strategy (one
    that does not include foreknowledge of the name of the actual
    function in the C source, of course).

    ~kj
  • Diez B. Roggisch at Oct 7, 2010 at 11:59 am

    kj <no.email at please.post> writes:

    In <m2fwwjazs2.fsf at web.de> deets at web.de (Diez B. Roggisch) writes:
    kj <no.email at please.post> writes:
    The short version of this question is: where can I find the algorithm
    used by the tuple class's __hash__ method?
    Surprisingly, in the source:
    http://google.com/codesearch/p?hl=de#-2BKs-LW4I0/trunk/python/src/Objects/tupleobject.c&q=python%20tuplehash&sa=N&cd=1&ct=rc
    Thanks to you, and to all who pointed me to the place in the source
    where this is.

    How exactly did you search for this? Taking a hint from the url
    above, I went to Google Code Search and searched for "python tuple
    hash" (minus the quotes, of course), which produced a ton of
    irrelevant stuff (almost 80K hits). Searching for "python tuple
    hash lang:c" cut down the number of hits to ~8K, but still too much
    to wade through. Clearly I need a smarter search strategy (one
    that does not include foreknowledge of the name of the actual
    function in the C source, of course).
    I tried codesearch first. Not satisfied after 30 seconds with the
    results, I did the most straight forward thing - I downloaded and
    un-packed the python source... and took a look.
    From that I learned the tuplehash function name.
    Diez
  • Kj at Oct 7, 2010 at 1:01 pm
    In <87pqvmp611.fsf at web.de> deets at web.de (Diez B. Roggisch) writes:
    I tried codesearch first. Not satisfied after 30 seconds with the
    results, I did the most straight forward thing - I downloaded and
    un-packed the python source... and took a look.
    From that I learned the tuplehash function name.
    You must be at least somewhat familiar with the Python source.
    Everytime I've peeked into it I just feel lost, but it's clearly
    something I need to master sooner or later... I can't wait for
    the next one of those occasional introductions to the Python
    internals at our local Python club.

    Thanks,

    ~kj
  • Diez B. Roggisch at Oct 7, 2010 at 1:25 pm

    kj <no.email at please.post> writes:

    In <87pqvmp611.fsf at web.de> deets at web.de (Diez B. Roggisch) writes:
    I tried codesearch first. Not satisfied after 30 seconds with the
    results, I did the most straight forward thing - I downloaded and
    un-packed the python source... and took a look.
    From that I learned the tuplehash function name.
    You must be at least somewhat familiar with the Python source.
    Everytime I've peeked into it I just feel lost, but it's clearly
    something I need to master sooner or later... I can't wait for
    the next one of those occasional introductions to the Python
    internals at our local Python club.
    No, I'm not the tiniest bit. I just followed my instincts in looking
    into the "Objects" folder, because that's where I suspected the
    definition of objects to be....

    And grep has been proven useful in these cases as well.

    Diez
  • Geremy condra at Oct 6, 2010 at 7:28 pm

    On Wed, Oct 6, 2010 at 11:58 AM, kj wrote:

    The short version of this question is: where can I find the algorithm
    used by the tuple class's __hash__ method?
    From Objects/tuple.c, line 315 in Python3.2:
    static long
    tuplehash(PyTupleObject *v)
    {
    register long x, y;
    register Py_ssize_t len = Py_SIZE(v);
    register PyObject **p;
    long mult = 1000003L;
    x = 0x345678L;
    p = v->ob_item;
    while (--len >= 0) {
    y = PyObject_Hash(*p++);
    if (y == -1)
    return -1;
    x = (x ^ y) * mult;
    /* the cast might truncate len; that doesn't change hash stability */
    mult += (long)(82520L + len + len);
    }
    x += 97531L;
    if (x == -1)
    x = -2;
    return x;
    }
    Now, for the long version of this question, I'm working with some
    complext Python objects that I want to be able to compare for
    equality easily.

    These objects are non-mutable once they are created, so I would
    like to use a two-step comparison for equality, based on the
    assumption that I can compute (either at creation time, or as needed
    and memoized) a hashkey/digest for each object. ?The test for
    equality of two of these objects would first compare their hashkeys.
    If they are different, the two objects are declared different; if
    they match, then a more stringent test for equality is performed.

    So the problem is to come up with a reasonable hashkey for each of
    these objects. ?The objects have two significant attributes, and
    two of these objects should be regarded as equal if these attributes
    are "the same" in both. ?The first attribute is a simple dictionary
    whose keys are integers and values are strings. ?The second attribute
    is more complicated. ?It is a tree structure, represented as a
    dictionary of dictionaries of dictionaries... until we get to the
    leaf elements, which are frozensets of strings. ?The keys at every
    level of this structure are strings. ?E.g. a simple example of such
    an attribute would look like:

    {'A': {'a': set(['1', '2', '3']),
    ? ? ? 'b': set(['4', '5'])},
    ?'B': set(['6', '7', '8'])}

    I'm looking for a good algorithm for computing a hash key for
    something like this? ?(Basically I'm looking for a good way to
    combine hashkeys.)
    Sounds like you're trying to hash mutable data structures, which is a
    no-no, but assuming you've got that part all figured out then the easy
    way out would be to print the whole thing and take the hash of the
    resulting string. A (possibly) better way would be to do a Schwartzian
    transform[1] and freeze everything. Other approaches may be better,
    but those are the first two out of my head.

    Geremy Condra

    [1] http://en.wikipedia.org/wiki/Schwartzian_transform
  • Robert Kern at Oct 6, 2010 at 7:31 pm

    On 10/6/10 1:58 PM, kj wrote:
    The short version of this question is: where can I find the algorithm
    used by the tuple class's __hash__ method?
    The function tuplehash() in Objects/tupleobject.c, predictably enough.
    Now, for the long version of this question, I'm working with some
    complext Python objects that I want to be able to compare for
    equality easily.

    These objects are non-mutable once they are created, so I would
    like to use a two-step comparison for equality, based on the
    assumption that I can compute (either at creation time, or as needed
    and memoized) a hashkey/digest for each object. The test for
    equality of two of these objects would first compare their hashkeys.
    If they are different, the two objects are declared different; if
    they match, then a more stringent test for equality is performed.
    The most straightforward way to implement __hash__ for a complicated object is
    to normalize its relevant data to a tuple of hashable objects and then call
    hash() on that tuple. A straightforward way to compare such objects is to
    calculate that very same normalized tuple for each one and compare those tuples.
    Cache it if necessary. Don't bother hashing to implement __eq__ unless if you
    are really optimizing for space and time.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
  • Terry Reedy at Oct 6, 2010 at 9:43 pm

    On 10/6/2010 2:58 PM, kj wrote:

    These objects are non-mutable once they are created,
    See below.
    like to use a two-step comparison for equality, based on the
    assumption that I can compute (either at creation time, or as needed
    and memoized) a hashkey/digest for each object. The test for
    equality of two of these objects would first compare their hashkeys.
    If they are different, the two objects are declared different; if
    they match, then a more stringent test for equality is performed.
    I believe Python strings do this (cache the hash). Equality comparison
    can check length, hashes, and only then chars.
    So the problem is to come up with a reasonable hashkey for each of
    these objects. The objects have two significant attributes, and
    two of these objects should be regarded as equal if these attributes
    are "the same" in both. The first attribute is a simple dictionary
    whose keys are integers and values are strings. The second attribute
    is more complicated. It is a tree structure, represented as a
    dictionary of dictionaries of dictionaries... until we get to the
    leaf elements, which are frozensets of strings. The keys at every
    level of this structure are strings. E.g. a simple example of such
    an attribute would look like:

    {'A': {'a': set(['1', '2', '3']),
    'b': set(['4', '5'])},
    'B': set(['6', '7', '8'])}
    If these two attributes, and hence the dicts, are public, then your
    instances are mutable.

    --
    Terry Jan Reedy
  • Kj at Oct 7, 2010 at 10:42 am
    In <mailman.1415.1286438617.29448.python-list at python.org> Terry Reedy <tjreedy at udel.edu> writes:
    If these two attributes, and hence the dicts, are public, then your
    instances are mutable.
    I guess I should have written "immutable among consenting adults."

    As far as I know, Python does not support private attributes, so
    I guess the dicts are public no matter what I do. I suppose that
    I can implement "frozendict", but I can't think of any way to
    enforce the immutability of these "frozendicts" that would not be
    trivial to circumvent (it better be, in fact, otherwise I wouldn't
    be able to initialize the damn things).

    If you had something else in mind, please let me know.

    ~kj
  • MRAB at Oct 7, 2010 at 3:10 pm

    On 07/10/2010 11:42, kj wrote:
    In<mailman.1415.1286438617.29448.python-list at python.org> Terry Reedy<tjreedy at udel.edu> writes:
    If these two attributes, and hence the dicts, are public, then your
    instances are mutable.
    I guess I should have written "immutable among consenting adults."

    As far as I know, Python does not support private attributes, so
    I guess the dicts are public no matter what I do. I suppose that
    I can implement "frozendict", but I can't think of any way to
    enforce the immutability of these "frozendicts" that would not be
    trivial to circumvent (it better be, in fact, otherwise I wouldn't
    be able to initialize the damn things).
    You would initialise them by creating them from a list of tuples (or an
    iterable which yields tuples), like with a dict:
    dict([(1, "foo"), (2, "bar")])
    {1: 'foo', 2: 'bar'}
  • Arnaud Delobelle at Oct 7, 2010 at 8:00 pm

    kj <no.email at please.post> writes:

    The short version of this question is: where can I find the algorithm
    used by the tuple class's __hash__ method?

    Now, for the long version of this question, I'm working with some
    complext Python objects that I want to be able to compare for
    equality easily.

    These objects are non-mutable once they are created, so I would
    like to use a two-step comparison for equality, based on the
    assumption that I can compute (either at creation time, or as needed
    and memoized) a hashkey/digest for each object. The test for
    equality of two of these objects would first compare their hashkeys.
    If they are different, the two objects are declared different; if
    they match, then a more stringent test for equality is performed.

    So the problem is to come up with a reasonable hashkey for each of
    these objects. The objects have two significant attributes, and
    two of these objects should be regarded as equal if these attributes
    are "the same" in both. The first attribute is a simple dictionary
    whose keys are integers and values are strings. The second attribute
    is more complicated. It is a tree structure, represented as a
    dictionary of dictionaries of dictionaries... until we get to the
    leaf elements, which are frozensets of strings. The keys at every
    level of this structure are strings. E.g. a simple example of such
    an attribute would look like:

    {'A': {'a': set(['1', '2', '3']),
    'b': set(['4', '5'])},
    'B': set(['6', '7', '8'])}

    I'm looking for a good algorithm for computing a hash key for
    something like this? (Basically I'm looking for a good way to
    combine hashkeys.)
    You could do something like this:


    deep_methods = {
    list: lambda f, l: tuple(map(f, l)),
    dict: lambda f, d: frozenset((k, f(v)) for k, v in d.items()),
    set: lambda f, s: frozenset(map(f, s)),
    # Add more if needed
    }

    def apply_method(f, obj):
    try:
    method = deep_methods[type(obj)]
    except KeyError:
    return obj
    return method(f, obj)

    def deepfreeze(obj):
    """Return a 'hashable version' of an object
    return apply_method(deepfreeze, obj)

    def deephash(obj):
    """Return hash(deepfreeze(obj)) without deepfreezing"""
    return hash(apply_method(deephash, obj))

    # Example of deepfreezable object:
    obj = [1, "foo", {(2, 4): {7, 5, 4}, "bar": "baz"}]

    deepfreeze(obj)
    (1, 'foo', frozenset({('bar', 'baz'), ((2, 4), frozenset({4, 5, 7}))}))
    deephash(obj)
    1341422540
    hash(deepfreeze(obj))
    1341422540


    --
    Arnaud
  • Kj at Oct 9, 2010 at 7:54 pm
    In <87y6a9lqnj.fsf at gmail.com> Arnaud Delobelle <arnodel at gmail.com> writes:
    You could do something like this:
    deep_methods = {
    list: lambda f, l: tuple(map(f, l)),
    dict: lambda f, d: frozenset((k, f(v)) for k, v in d.items()),
    set: lambda f, s: frozenset(map(f, s)),
    # Add more if needed
    }
    def apply_method(f, obj):
    try:
    method = deep_methods[type(obj)]
    except KeyError:
    return obj
    return method(f, obj)
    def deepfreeze(obj):
    """Return a 'hashable version' of an object
    return apply_method(deepfreeze, obj)
    def deephash(obj):
    """Return hash(deepfreeze(obj)) without deepfreezing"""
    return hash(apply_method(deephash, obj))
    # Example of deepfreezable object:
    obj = [1, "foo", {(2, 4): {7, 5, 4}, "bar": "baz"}]
    ^ ^
    `-------`------- what's this?

    deepfreeze(obj)
    (1, 'foo', frozenset({('bar', 'baz'), ((2, 4), frozenset({4, 5, 7}))}))
    deephash(obj)
    1341422540
    hash(deepfreeze(obj))
    1341422540

    After fixing the missing """ in deepfreeze this code works as
    advertised, but I'm mystified by the identity between hash(deepfreeze(...))
    and deephash(...). Without some knowledge of the Python internals,
    I don't see how this follows.

    More specifically, it is not obvious to me that, for example,

    hash(frozenset((<whatever>,)))

    would be identical to

    hash(frozenset((hash(<whatever>),)))

    but this identity has held every time I've checked it. Similarly
    for other more complicated variations on this theme.

    Anyway, thanks for the code. It's very useful.

    ~kj
  • Arnaud Delobelle at Oct 9, 2010 at 8:39 pm

    kj <no.email at please.post> writes:

    In <87y6a9lqnj.fsf at gmail.com> Arnaud Delobelle <arnodel at gmail.com> writes:
    You could do something like this:
    deep_methods = {
    list: lambda f, l: tuple(map(f, l)),
    dict: lambda f, d: frozenset((k, f(v)) for k, v in d.items()),
    set: lambda f, s: frozenset(map(f, s)),
    # Add more if needed
    }
    def apply_method(f, obj):
    try:
    method = deep_methods[type(obj)]
    except KeyError:
    return obj
    return method(f, obj)
    def deepfreeze(obj):
    """Return a 'hashable version' of an object
    return apply_method(deepfreeze, obj)
    def deephash(obj):
    """Return hash(deepfreeze(obj)) without deepfreezing"""
    return hash(apply_method(deephash, obj))
    # Example of deepfreezable object:
    obj = [1, "foo", {(2, 4): {7, 5, 4}, "bar": "baz"}]
    ^ ^
    `-------`------- what's this?
    This is set literal notation, introduced in Python 3!
    In 2.X, you would write set([7, 5, 4])
    deepfreeze(obj)
    (1, 'foo', frozenset({('bar', 'baz'), ((2, 4), frozenset({4, 5, 7}))}))
    deephash(obj)
    1341422540
    hash(deepfreeze(obj))
    1341422540

    After fixing the missing """ in deepfreeze this code works as
    advertised,
    Oops, I must admit I added the docstrings after pasting the code :)
    but I'm mystified by the identity between hash(deepfreeze(...)) and
    deephash(...). Without some knowledge of the Python internals, I
    don't see how this follows.
    OK. I haven't actually proved it, but it follows from the following
    facts:

    1. hash() is an idempotent function, i.e. hash(hash(x)) == hash(x) hold
    for any hashable x (this is a simple consequence of the fact that
    hash(x) == x for any int x (by 'int' I mean 2.X int)).

    2. Container hashable objects compute their hash from the hash of their
    elements.

    I don't think either of these two facts is documented, but it would be quite
    easy to verify them in the Python source (I must admit I have not done
    it). And it is difficult to conceive how it could be otherwise.

    --
    Arnaud
  • Steven D'Aprano at Oct 10, 2010 at 12:31 am

    On Sat, 09 Oct 2010 21:39:51 +0100, Arnaud Delobelle wrote:

    1. hash() is an idempotent function, i.e. hash(hash(x)) == hash(x) hold
    for any hashable x (this is a simple consequence of the fact that
    hash(x) == x for any int x (by 'int' I mean 2.X int)).
    It's a beautiful theory, but, alas, it is not the case.
    hash(-1) == -1
    False
    hash(2**64) == 2**64
    False

    to give only two of an infinite number of counter-examples.

    Aside: what do you mean by '2.x int'? Do you mean an int in 2.x versions
    before, or after, ints and longs were partially integrated?

    [steve at sylar ~]$ python2.1
    Python 2.1.3 (#1, Aug 12 2010, 01:53:57)
    [GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
    Type "copyright", "credits" or "license" for more information.
    2**64
    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    OverflowError: integer exponentiation
    >>>


    People keep forgetting that 2.2 introduced nearly as many far-reaching
    changes as 3.0.


    --
    Steven
  • Hrvoje Niksic at Oct 10, 2010 at 8:51 am

    Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
    On Sat, 09 Oct 2010 21:39:51 +0100, Arnaud Delobelle wrote:

    1. hash() is an idempotent function, i.e. hash(hash(x)) == hash(x) hold
    for any hashable x (this is a simple consequence of the fact that
    hash(x) == x for any int x (by 'int' I mean 2.X int)).
    It's a beautiful theory, but, alas, it is not the case.
    hash(-1) == -1
    False
    This is a counter-example for the (invalid) premise that hash(x) == x,
    but not for the invariant of hash(hash(x)) == hash(x).
    hash(hash(-1)) == hash(-1)
    True
    hash(hash(2**64)) == hash(2**64)
    True
    Aside: what do you mean by '2.x int'? Do you mean an int in 2.x versions
    before, or after, ints and longs were partially integrated?
    I would take it to mean the type 2.x calls 'int', i.e. fixed-width
    integer type.
  • Arnaud Delobelle at Oct 10, 2010 at 10:06 am

    Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
    On Sat, 09 Oct 2010 21:39:51 +0100, Arnaud Delobelle wrote:

    1. hash() is an idempotent function, i.e. hash(hash(x)) == hash(x) hold
    for any hashable x (this is a simple consequence of the fact that
    hash(x) == x for any int x (by 'int' I mean 2.X int)).
    It's a beautiful theory, but, alas, it is not the case.
    hash(-1) == -1
    False
    hash(2**64) == 2**64
    False

    to give only two of an infinite number of counter-examples.
    I can only see one counterexample, (-1). 2**64 is of type 'long' in 2.X
    on your machine (or, to be pedantic, 2.x where x >= 2). And, in fact,
    (-1) is the only int such that hash(x) != x.

    In can only guess that (-1) is a value that has a special meaning when
    hashing. Try this (Python 2.6):
    class A(object):
    ... def __hash__(self): return -1
    ...
    a = A()
    hash(a)
    -2

    Aside: what do you mean by '2.x int'? Do you mean an int in 2.x versions
    before, or after, ints and longs were partially integrated? Either.
    [steve at sylar ~]$ python2.1
    Python 2.1.3 (#1, Aug 12 2010, 01:53:57)
    [GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
    Type "copyright", "credits" or "license" for more information.
    2**64
    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    OverflowError: integer exponentiation

    People keep forgetting that 2.2 introduced nearly as many far-reaching
    changes as 3.0.
    I didn't forget, but what bearing does it have on this particular issue?
  • Kj at Oct 10, 2010 at 1:49 pm
    In <87hbgu8irb.fsf at gmail.com> Arnaud Delobelle <arnodel at gmail.com> writes:
    Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
    On Sat, 09 Oct 2010 21:39:51 +0100, Arnaud Delobelle wrote:

    1. hash() is an idempotent function, i.e. hash(hash(x)) == hash(x) hold
    for any hashable x (this is a simple consequence of the fact that
    hash(x) == x for any int x (by 'int' I mean 2.X int)).
    It's a beautiful theory, but, alas, it is not the case.
    hash(-1) == -1
    False
    hash(2**64) == 2**64
    False

    to give only two of an infinite number of counter-examples.
    (Infinite???)
    And, in fact,
    (-1) is the only int such that hash(x) != x.
    Arnaud, how did you determine that -1 is the only such int? I
    can't imagine any other approach other than a brute-force check of
    all ints... When I tried this I ran into unforeseen limitations
    in xrange, etc. It's all very doable, but still, it would take at
    least about 3 hours on my laptop.
    In can only guess that (-1) is a value that has a special meaning when
    hashing. Try this (Python 2.6):
    class A(object):
    ... def __hash__(self): return -1
    ...
    a = A()
    hash(a)
    -2
    Very cool.

    BTW, thank you for the explanation in your previous post. It makes
    a lot of sense. I find it interesting (as Hrvoje pointed out) that
    the hash function is (or appears to be) idempotent on integers
    (long or not), even though it is not the identity on the integers.
    Thanks to Steven for the counterexamples to show the latter. I've
    learned tons from this exchange.

    ~kj
  • Steven D'Aprano at Oct 10, 2010 at 3:32 pm

    On Sun, 10 Oct 2010 13:49:09 +0000, kj wrote:

    to give only two of an infinite number of counter-examples.
    (Infinite???)
    I was mistaken. Given Arnaud's specification that we look only at the
    Python 2.x ints, I believe that there is only one counter-example, namely
    -1.

    However, in Python 3.x I would be correct. The reasoning is a simple
    application of the pigeon-hole principle. Since hash(n) is limited to
    returning a finite number of distinct results, but there are an infinite
    number of Python 3 ints, it follows that there must be an infinite number
    of collisions.

    And, in fact,
    (-1) is the only int such that hash(x) != x.
    Arnaud, how did you determine that -1 is the only such int? I can't
    imagine any other approach other than a brute-force check of all ints...
    Reading the source code is also a good approach.

    I don't read C very well, but even I can work out what this is doing:

    static long
    int_hash(PyIntObject *v)
    {
    /* XXX If this is changed, you also need to change the way
    Python's long, float and complex types are hashed. */
    long x = v -> ob_ival;
    if (x == -1)
    x = -2;
    return x;
    }


    (from intobject.c in Python 2.6.1). It takes a Python int object,
    extracts the value of it as a C long, returns -2 if the value is -1
    otherwise returns the value.

    When I tried this I ran into unforeseen limitations in xrange, etc.
    It's all very doable, but still, it would take at least about 3 hours on
    my laptop.

    What are you doing? This takes less than 20 seconds to test up to 2**24:

    import time
    def test(n):
    t = time.time()
    assert hash(0) == 0
    assert hash(1) == 1
    assert hash(-1) == -2
    for i in xrange(2, n):
    assert hash(i) == i, "failed %d" % i
    assert hash(-i) == -i, "failed %d" % -i
    t = time.time() - t
    return t

    test(2**24)
    18.076412916183472

    Since 2**31 is 2**7 = 128 times bigger, I estimate that testing
    everything up to 2**31 should take less than 45 minutes. And my PC is not
    exactly the fastest machine on the block.



    --
    Steven
  • Kj at Oct 10, 2010 at 8:57 pm
    In <4cb1dc9a$0$29970$c3e8da3$5496439d at news.astraweb.com> Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
    Reading the source code is also a good approach.
    Every time I have attempted to answer a question by looking at the
    Python C source, all I've achieved was wasting time, sometimes a
    *lot* of time, so by now I've developed what can only be described
    as a phobia to it. I probably need professional help at this point.

    ~kj
  • Arnaud Delobelle at Oct 10, 2010 at 3:49 pm

    kj <no.email at please.post> writes:

    In <87hbgu8irb.fsf at gmail.com> Arnaud Delobelle <arnodel at gmail.com> writes:
    [...]
    And, in fact,
    (-1) is the only int such that hash(x) != x.
    Arnaud, how did you determine that -1 is the only such int? I
    can't imagine any other approach other than a brute-force check of
    all ints... When I tried this I ran into unforeseen limitations
    in xrange, etc. It's all very doable, but still, it would take at
    least about 3 hours on my laptop.
    I looked at the source, namely the get_hash() function in the
    intobject.c file in the 2.x source, and the get_hash() function in the
    longobject.c file in the 3.x source.
    In can only guess that (-1) is a value that has a special meaning when
    hashing. Try this (Python 2.6):
    class A(object):
    ... def __hash__(self): return -1
    ...
    a = A()
    hash(a)
    -2
    Very cool.

    BTW, thank you for the explanation in your previous post. It makes
    a lot of sense. I find it interesting (as Hrvoje pointed out) that
    the hash function is (or appears to be) idempotent on integers
    (long or not), even though it is not the identity on the integers.
    Thanks to Steven for the counterexamples to show the latter. I've
    learned tons from this exchange.
    It still seems to hold that hash() is idempotent on all objects.

    I have learnt too that hash(-1) is not (-1), and that it seems that a
    hash value is not allowed to be (-1). There is one thing left to find
    out. Why can't it be (-1)? Maybe looking at the source for the hash()
    builtin would yield a clue, or maybe someone who knows would be able to
    tell us?

    --
    Arnaud
  • Hrvoje Niksic at Oct 10, 2010 at 4:24 pm

    Arnaud Delobelle <arnodel at gmail.com> writes:

    I have learnt too that hash(-1) is not (-1), and that it seems that a
    hash value is not allowed to be (-1). There is one thing left to find
    out. Why can't it be (-1)?
    Because -1 has a special meaning in the C function equivalent to
    Python's hash(). PyObject_Hash returning -1 means that an exception was
    raised somewhere inside the object's __hash__ method. For that reason
    hash functions that really return -1 must change that to another value,
    and -2 is as good a replacement as any.

    This is documented in
    http://docs.python.org/c-api/object.html?highlight=pyobject_hash#PyObject_Hash
  • Steven D'Aprano at Oct 10, 2010 at 2:28 pm

    On Sun, 10 Oct 2010 11:06:00 +0100, Arnaud Delobelle wrote:

    Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
    On Sat, 09 Oct 2010 21:39:51 +0100, Arnaud Delobelle wrote:

    1. hash() is an idempotent function, i.e. hash(hash(x)) == hash(x)
    hold for any hashable x (this is a simple consequence of the fact that
    hash(x) == x for any int x (by 'int' I mean 2.X int)).
    It's a beautiful theory, but, alas, it is not the case.
    hash(-1) == -1
    False
    hash(2**64) == 2**64
    False

    to give only two of an infinite number of counter-examples.
    I can only see one counterexample, (-1). 2**64 is of type 'long' in 2.X
    on your machine (or, to be pedantic, 2.x where x >= 2). And, in fact,
    (-1) is the only int such that hash(x) != x.
    Fair point. I was mistaken.

    I had the impression that the integration between ints and longs since
    Python 2.2 was more extensive than it actually is. I made the above
    comments based on the idea that since Python 2.2, longs are a subclass of
    ints, e.g. that isinstance(2**64, int) would return True. Turns out that
    I'm wrong, in which case I agree with you that -1 is the only int counter-
    example.

    As an aside, this makes me glad that I have continued writing
    isinstance(x, (int, long)) in my code, even though I "knew" it was
    unnecessary. Not the first time, and probably not the last, that a "this
    can never happen" test saved the day.


    --
    Steven

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppython-list @
categoriespython
postedOct 6, '10 at 6:58p
activeOct 10, '10 at 8:57p
posts24
users10
websitepython.org

People

Translate

site design / logo © 2022 Grokbase