Hi,

I am running a script that uses dictionaries on Python 2.6.4 on Ubuntu 9.10.
I notice that my script crashes with a MemoryError when my dictionary
reaches its 44739243rd entry. My system has 3GB of RAM (32-bit). Changing the
key or value types did not help either. For simplicity, I tried running this:

#BEGIN:python code
import sys

f = {}
lsize = 0
try:
    for k in range(0, 4*44739243):
        f[k] = k
        if sys.getsizeof(f) > lsize:
            lsize = sys.getsizeof(f)
            print k, lsize
except:
    print k, lsize
    raise
#END:python code

The code terminates with:
"Traceback (most recent call last):
File "pydict-limit.py", line 6, in <module>
f[k]=k
MemoryError"

I have also tried running this on Ubuntu 9.10 with Python 2.6.6 and 3.5GB of
RAM (32-bit), and on a 64-bit Linux cluster machine with 62GB of RAM and
Python 2.4. On both of these machines it crashed at entry 44739243, with the
data structure growing to 1.5GB. On another 64-bit cluster machine with 32GB
of RAM and Python 2.6.6 it crashed at entry 178956970; in that case the data
structure grew to 6GB.

Has anybody encountered this before? Can somebody suggest a fix? I am trying
to read some 10GB of network traces into a hash table (hence a dictionary)
and need constant-time lookup. Clearly, increasing the machine's RAM does not
work. Any suggestions would be greatly appreciated.

Thanks,
Shruti

  • MRAB at Jan 31, 2011 at 3:52 am

    On 31/01/2011 02:43, Shruti Sanadhya wrote:
    [snip]

    sys.getsizeof(...) returns the size of an object itself, not the size
    of an object plus any others which that object references.

    The dict in your example code occupied 1.5GB, but that didn't include
    the size of the int keys and values, only the references to them.

    As for the 64-bit Linux cluster machine, did you run a 32-bit or a
    64-bit build of Python? A 32-bit build can't use more than 4GB (2**32).
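    A quick sketch of that point, assuming Python 2.6 or later; the toy
    loop below is just an example, not the original workload:

    import sys

    d = {}
    for k in xrange(1000):   # toy example only
        d[k] = k

    # getsizeof() reports only the dict's own hash table,
    # not the int objects it references.
    print sys.getsizeof(d)

    # A rough total that also counts the referenced ints:
    total = sys.getsizeof(d) + sum(sys.getsizeof(k) + sys.getsizeof(v)
                                   for k, v in d.iteritems())
    print total
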
  • Dan Stromberg at Jan 31, 2011 at 5:12 am

    On Sun, Jan 30, 2011 at 6:43 PM, Shruti Sanadhya wrote:
    [snip]

    If you need strict control over how much memory is used, you might be
    better off with C or something similar. Or Cython would give you a
    Pythonic syntax with C's data structures (as well as Python-style data
    structures, though those get back into memory unpredictability). On an
    Ubuntu system, you can probably "man hcreate" to get some info on a
    hash table accessible from C.

    If you need a huge "dictionary", then using something like gdbm or
    bsddb might be good, though I doubt these are constant time - they're
    more like single-table databases, where the keys and values must all
    be strings or serialized non-strings. However, they should be fine
    with huge dictionaries. They have the advantage of providing a pretty
    dictionary-like interface. This almost certainly would not bump into
    any memory limits, as they are disk-based.
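    A minimal sketch of that disk-backed approach, using the anydbm module
    from the Python 2 standard library; the filename and keys below are
    made up for illustration:

    import anydbm

    # Opened with 'c', the table lives on disk, so RAM is not the limit.
    db = anydbm.open('traces.db', 'c')
    db['10.0.0.1:80'] = '1234'        # keys and values must be strings
    print db['10.0.0.1:80']
    db.close()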

    If you really do need to keep things in RAM (VM), you might try this
    module: http://stromberg.dnsalias.org/~strombrg/treap/ on one of your
    64-bit systems, preferably with a recent CPython like 2.6 or 2.7 built
    with a 64-bit compiler. It too is not constant time, but it is
    O(c*log(n)) with a small c, at the expense of a higher standard
    deviation in write times compared to something like a red/black tree
    (better average performance, though). It provides a very
    dictionary-like interface. It would implicitly store things sorted by
    key, incidentally, though that may be unimportant in your application.
    This may or may not have the same memory issues as the dictionary -
    and I'd be very interested in hearing whether it does or not.

    You might also be able to read everything into a big list, then sort
    it. Then use the bisect module to do log(n) time lookups. This does
    not give a dictionary-like interface or constant time, but it doesn't
    require anything outside of CPython's standard library. It may or may
    not have the same memory issues you were seeing with dictionaries.
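    A rough sketch of that sorted-list-plus-bisect idea (Python 2); the
    records below are invented placeholders for parsed trace entries:

    import bisect

    records = [('10.0.0.3', 300), ('10.0.0.1', 100), ('10.0.0.2', 200)]
    records.sort()                         # sort once after loading everything
    keys = [k for k, _ in records]         # parallel key list for bisect

    def lookup(key):
        i = bisect.bisect_left(keys, key)  # O(log n) binary search
        if i < len(keys) and keys[i] == key:
            return records[i][1]
        raise KeyError(key)

    print lookup('10.0.0.2')               # prints 200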

    Here's something based on your code above that works with the
    mentioned treap module. I found that it required about 1 gig for each
    10% of the numbers the code was trying to save - on a 32-bit Ubuntu
    10.10 system. However, I have only 3 gig of RAM, so I terminated it
    before it got to a memory error:

    #!/usr/bin/python

    import sys
    import treap

    f = treap.treap()
    top = 4*44739243
    try:
        for k in xrange(0, top):
            sys.stdout.write('%d %f%%\r' % (k, k * 100.0 / top))
            f[k] = k
        print
    except:
        print
        raise
