FAQ

[Tutor] object representation

Spir
Mar 4, 2010 at 7:47 am
Hello,

In python like in most languages, I guess, objects (at least composite ones -- I don't know about ints, for instance -- someone knows?) are internally represented as associative arrays. Python associative arrays are dicts, which in turn are implemented as hash tables. Correct?
Does this mean that the associative arrays representing objects are implemented like python dicts, thus hash tables?

I was wondering about the question because I guess the constraints are quite different:
* Dict keys are of any type, including heterogeneous (mixed). Object keys are names, ie a subset of strings.
* Object keys are very stable, typically they hardly change; usually only values change. Dicts often are created empty and fed in a loop.
* Objects are small mappings: entries are explicitely written in code (*). Dicts can be of any size, only limited by memory; they are often fed by computation.
* In addition, dict keys can be variables, while object keys rarely are: they are literal constants (*).

So, I guess the best implementations for objects and dicts may be quite different. i wonder about alternatives for objects, in particuliar trie and variants: http://en.wikipedia.org/wiki/Trie, because they are specialised for associative arrays which keys are strings.

denis

PS: Would someone point me to typical hash funcs for string keys, and the one used in python?

(*) Except for rare cases using setattr(obj,k,v) or obj.__dict__[k]=v.
--
________________________________

la vita e estrany

spir.wikidot.com
reply

Search Discussions

6 responses

  • Alan Gauld at Mar 4, 2010 at 8:50 am
    "spir" <denis.spir at gmail.com> wrote
    Does this mean that the associative arrays representing objects are
    implemented like python dicts, thus hash tables?
    Yes, in fact I think they are Python dicts - although I've never actually
    looked at the source to confirm that.
    I was wondering about the question because I guess the constraints
    are quite different:
    * Dict keys are of any type, including heterogeneous (mixed).
    Object keys are names, ie a subset of strings.
    The keys must be immutable but ...
    the object keys are a perfect subset of the generic dict.
    * Object keys are very stable, typically they hardly change;
    usually only values change. Dicts often are created empty and fed in a
    loop.
    The normal case is as you say but you can if you with create an empty
    object and then later add attributes - in a loop if you wish.
    * Objects are small mappings: entries are explicitely written in code
    (*).
    Dicts can be of any size, only limited by memory; they are often fed
    by computation.
    But again the object usage is just a subset of the generic.
    * In addition, dict keys can be variables, while object keys rarely
    are: they are literal constants (*).
    variables in Python are just names that efer to objects.
    When you use a variable as a key to a dict you really use
    the value of the variable.
    d = {}
    x = 42
    d[x] = 66
    d[42]
    66
    d[x]
    66
    x = 5
    d[x]
    Traceback (most recent call last):
    File "<pyshell#12>", line 1, in <module>
    d[x]
    KeyError: 5
    >>>

    It is the value of x that is the key not 'x'. If you change the value of x
    it ceases to be a valid key of the dict.
    So, I guess the best implementations for objects and dicts may
    be quite different. i wonder about alternatives for objects,
    They could be different and perhaps optimised but I don't think
    they are. Pythons dicts are already pretty efficient. They form
    the backbone of almost everything in Python not just objects.
    in particuliar trie and variants: http://en.wikipedia.org/wiki/Trie,
    because they are specialised for associative arrays which keys are
    strings.
    PS: Would someone point me to typical hash funcs for string keys,
    and the one used in python?
    Wikipedia for generic and the source for Python?

    HTH,


    --
    Alan Gauld
    Author of the Learn to Program web site
    http://www.alan-g.me.uk/
  • Dave Angel at Mar 4, 2010 at 2:22 pm

    spir wrote:
    Hello,

    In python like in most languages, I guess, objects (at least composite ones -- I don't know about ints, for instance -- someone knows?) are internally represented as associative arrays. Python associative arrays are dicts, which in turn are implemented as hash tables. Correct?
    Does this mean that the associative arrays representing objects are implemented like python dicts, thus hash tables?

    I was wondering about the question because I guess the constraints are quite different:
    * Dict keys are of any type, including heterogeneous (mixed). Object keys are names, ie a subset of strings.
    * Object keys are very stable, typically they hardly change; usually only values change. Dicts often are created empty and fed in a loop.
    * Objects are small mappings: entries are explicitely written in code (*). Dicts can be of any size, only limited by memory; they are often fed by computation.
    * In addition, dict keys can be variables, while object keys rarely are: they are literal constants (*).

    So, I guess the best implementations for objects and dicts may be quite different. i wonder about alternatives for objects, in particuliar trie and variants: http://en.wikipedia.org/wiki/Trie, because they are specialised for associative arrays which keys are strings.

    denis

    PS: Would someone point me to typical hash funcs for string keys, and the one used in python?

    http://effbot.org/zone/python-hash.htm
    (*) Except for rare cases using setattr(obj,k,v) or obj.__dict__[k]=v.
    Speaking without knowledge of the actual code implementing CPython (or
    for that matter any of the other dozen implementations), I can comment
    on my *model* of how Python "works." Sometimes it's best not to know
    (or at least not to make use of) the details of a particular
    implementation, as your code is more likely to port readily to the next
    architecture, or even next Python version.

    I figure every object has exactly three items in it: a ref count, a
    implementation pointer, and a payload. The payload may vary between 4
    bytes for an int object, and about 4 megabytes for a list of size a
    million. (And of course arbitrarily large for larger objects).

    You can see that these add up to 12 bytes (in version 2.6.2 of CPython)
    for an int by using sys.getsizeof(92). Note that if the payload refers
    to other objects, those sizes are not included in the getsizeof()
    function result. So getsizeof(a list of strings) will not show the
    sizes of the strings, but only the list itself.

    The payload for a simple object will contain just the raw data of the
    object. So for a string, it'd contain the count and the bytes. For
    compound objects that can change in size, it'd contain a pointer to a
    malloc'ed buffer that contains the variable-length data. The object
    stays put, but the malloc'ed buffer may move as it size grows and
    shrinks. getsizeof() is smart enough to report not only the object
    itself, but the buffer it references. Note that buffer is referenced by
    only one object, so its lifetime is intimately tied up with the object's.

    The bytes in the payload are meaningless without the implementation
    pointer. That implementation pointer will be the same for all
    instances of a particular type. It points to a structure that defines a
    particular type (or class). That structure for an empty class happens
    to be 452 bytes, but that doesn't matter much, as it only appears once
    per class. The instance of an empty class is only 32 bytes. Now, even
    that might seem a bit big, so Python offers the notion of slots, which
    reduces the size of each instance, at the cost of a little performance
    and a lot of flexibility. Still, slots are important, because I suspect
    that's how built-ins are structured, to make the objects so small.

    Now, some objects, probably most of the built-ins, are not extensible.
    You can't add new methods, or alter the behavior much. Other objects,
    such as instances of a class you write, are totally and very flexible.
    I won't get into inheritance here, except to say that it can be tricky
    to derive new classes from built-in types.

    So where do associative arrays come in? One of the builtin types is a
    dictionary, and that is core to much of the workings of Python. There
    are dictionaries in each class implementation (that 452 bytes I
    mentioned). And there may be dictionaries in the instances themselves.
    There are two syntaxes to directly access these dictionaries, the "dot"
    notation and the bracket [] notation. The former is a simple
    indirection through a special member called __dict__.

    So the behavior of an object depends on its implementation pointer,
    which points to a structure. And parts of that structure ultimately
    point to C code whch does all the actual work. But most of the work is
    some double- or triple-indirection which ultimately calls code.
  • Steven D'Aprano at Mar 5, 2010 at 1:21 am

    On Fri, 5 Mar 2010 01:22:52 am Dave Angel wrote:
    spir wrote:
    [...]
    PS: Would someone point me to typical hash funcs for string keys,
    and the one used in python?
    http://effbot.org/zone/python-hash.htm
    But note that this was written a few years ago, and so may have been
    changed.

    As for "typical" hash functions, I don't think there is such a thing. I
    think everyone creates there own, unless there is a built-in hash
    function provided by the language or the library, which is what Python
    does. I've seen hash functions as simple as:

    def hash(s):
    if not s: return 0
    else: return ord(s[0])

    but of course that leads to *many* collisions.

    Slightly better (but not much!) is

    def hash(s):
    n = 0
    for c in s:
    n += ord(c)
    return n % sys.maxint

    This is a pretty awful hash function though. Don't use it.

    You might also like to read this thread, to get an idea of the thought
    that has been put into Python's hashing:

    http://mail.python.org/pipermail/python-dev/2004-April/044244.html


    [...]
    I figure every object has exactly three items in it: a ref count, a
    implementation pointer, and a payload.
    Not quite. CPython has ref counts. IronPython and Jython don't. Other
    Pythons may or may not.

    I'm not quite sure what you mean by "implementation pointer" -- I think
    you're talking about a pointer back to the type itself. It's normal to
    just to refer to this as "the type", and ignore the fact that it's
    actually a pointer. The type data structure (which itself will be an
    object) itself is not embedded in every object! And of course, other
    Pythons may use some other mechanism for linking objects back to their
    type.



    --
    Steven D'Aprano
  • Dave Angel at Mar 5, 2010 at 8:01 am

    spir wrote:
    On Thu, 04 Mar 2010 09:22:52 -0500
    Dave Angel wrote:

    Still, slots are important, because I suspect
    that's how built-ins are structured, to make the objects so small.
    Sure, one cannot alter their structure. Not even of a direct instance of <object>:
    o = object()
    o.n=1
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    AttributeError: 'object' object has no attribute 'n'

    Now, some objects, probably most of the built-ins, are not extensible.
    You can't add new methods, or alter the behavior much.
    This applies to any attr, not only methods, also plain "information":
    s = "abc"
    s.n=1
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'n'


    Other objects,
    such as instances of a class you write, are totally and very flexible.
    conceptually this is equivalent to have no __slots__ slot. Or mayby they could be implemented using structs (which values would be pointers), instead of dicts. A struct is like a fixed record, as opposed to a dict. What do you think? On the implementation side, this would be much simpler, lighter, and more efficient.
    Oh, this gives me an idea... (to implement so-called "value objects").

    Denis
    having not played much with slots, my model is quite weak there. But I
    figure the dictionary is in the implementation structure, along with a
    flag saying that it's readonly. Each item of such a dictionary would be
    an index into the fixed table in the object. Like a struct, as you say,
    except that in C, there's no need to know the names of the fields at run
    time.

    DaveA
  • Steven D'Aprano at Mar 5, 2010 at 1:45 am

    On Thu, 4 Mar 2010 06:47:04 pm spir wrote:
    Hello,

    In python like in most languages, I guess, objects (at least
    composite ones -- I don't know about ints, for instance -- someone
    knows?) are internally represented as associative arrays.
    No.

    You can consider a Python object to be something like a C struct, or a
    Pascal record. The actual structure of the object depends on the type
    of the object (ints, strings, lists, etc will be slightly different).
    See Dave Angel's post for a good description.

    And of course this is implementation dependent: CPython, being written
    in C, uses C structs to implement objects. Other Pythons, written in
    other languages, will use whatever data structure makes sense for their
    language. That's almost certainly going to be a struct-like structure,
    since that is a pretty fundamental data type, but it could be
    different.

    If you want to know exactly how objects are represented internally by
    Python, I'm afraid you will probably need to read the source code. But
    a good start is here:

    http://effbot.org/zone/python-objects.htm


    Python
    associative arrays are dicts, which in turn are implemented as hash
    tables. Correct? Does this mean that the associative arrays
    representing objects are implemented like python dicts, thus hash
    tables?
    "Associate array" is a generic term for a data structure that maps keys
    to items. There are lots of ways of implementing such an associative
    array: at least two different sorts of hash tables, about a dozen
    different types of binary trees, and so on.

    In Python, associate arrays are called "dicts", and they are implemented
    in CPython as hash tables with chaining. But objects are not hash
    tables. *Some* objects INCLUDE a hash table as one of its fields, but
    not all. For example:
    int.__dict__
    <dictproxy object at 0xb7f09674>
    (2).__dict__
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    AttributeError: 'int' object has no attribute '__dict__'
    class C: pass
    ...
    C.__dict__
    {'__module__': '__main__', '__doc__': None}


    That is why you can't add attributes to arbitrary built-in objects:
    {}.x = None
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    AttributeError: 'dict' object has no attribute 'x'

    Because the dict instance has no __dict__, there is nowhere to insert
    the attribute.

    I was wondering about the question because I guess the constraints
    are quite different: * Dict keys are of any type, including
    heterogeneous (mixed). Object keys are names, ie a subset of strings.
    CPython's implementation of hash tables is highly optimized for speed.
    The biggest bottleneck is the hash function, and that is tuned to be
    extremely efficient for strings and ints.

    [...]
    So, I guess the best implementations for objects and dicts may be
    quite different. i wonder about alternatives for objects, in
    particuliar trie and variants: http://en.wikipedia.org/wiki/Trie,
    because they are specialised for associative arrays which keys are
    strings.
    *shrug*

    Maybe, maybe not. Tries are a more complicated data structure, which
    means bigger code and more bugs. They don't fit in very well with
    CPython's memory management system. And they use a large number of
    pointers, which can be wasteful.

    E.g. a trie needs six pointers just to represent the single
    key "python":

    '' -> 'p' -> 'y' -> 't' -> 'h' -> 'o' -> 'n'

    while a hash table uses just one:

    -> 'python'




    --
    Steven D'Aprano
  • Lie Ryan at Mar 5, 2010 at 4:09 pm

    On 03/05/2010 12:45 PM, Steven D'Aprano wrote:
    E.g. a trie needs six pointers just to represent the single
    key "python":

    '' -> 'p' -> 'y' -> 't' -> 'h' -> 'o' -> 'n'

    while a hash table uses just one:

    -> 'python'
    You can argue that had trie beed used as the datatype, there will
    actually be no need to store the key's string representation; the index
    of the object in the trie implies the textual representation. Such that,
    you will still need 6 pointers, but you won't need to store a string
    object. It will just be:

    '' -> ptrP -> ptrY -> ptrT -> ptrH -> ptrO -> ptrN

    and if for some reason the name is needed (perhaps for debugging?); then
    there could be an algorithm to reverse-map the ptrXs to char. I can
    imagine that to be implementable if variable names in python be limited
    to alphanumerics only and probably a select few of symbols. Unicode
    names makes things difficult though...

Related Discussions

Discussion Navigation
viewthread | post