I'm observing a strange memory usage pattern with strings. Consider
the following session. The idea is to create a list holding strings
whose cumulative length is 100MB.
l = []
for i in xrange(100000):
...     l.append(str(i) * (1000/len(str(i))))

This uses around 100MB of memory as expected and 'del l' will clear that.

for i in xrange(20000):
...     l.append(str(i) * (5000/len(str(i))))

This is using 165MB of memory. I really don't understand where the
additional memory usage is coming from.

If I reduce the string size, memory usage remains high until the size
drops to around 1000 characters; at that point it is back to 100MB.

Python 2.6.4 on FreeBSD.

Regards,
Amit


  • John Gordon at Mar 16, 2011 at 5:51 pm

    In <mailman.988.1300289897.1189.python-list at python.org> Amit Dev <amitdev at gmail.com> writes:

    I don't know anything about the internals of Python storage -- overhead,
    possible merging of like strings, etc. -- but some simple character counting
    shows that these two loops do not produce the same number of characters.

    The first loop produces:

    Ten single-digit values of i which are repeated 1000 times for a total of
    10000 characters;

    Ninety two-digit values of i which are repeated 500 times for a total of
    45000 characters;

    Nine hundred three-digit values of i which are repeated 333 times for a
    total of 299700 characters;

    Nine thousand four-digit values of i which are repeated 250 times for a
    total of 2250000 characters;

    Ninety thousand five-digit values of i which are repeated 200 times for
    a total of 18000000 characters.

    All that adds up to a grand total of 20604700 characters.

    Or, to condense the above long-winded text in table form:

    range         num  digits  1000/len(str(i))  total chars
    0-9            10       1              1000        10000
    10-99          90       2               500        45000
    100-999       900       3               333       299700
    1000-9999    9000       4               250      2250000
    10000-99999 90000       5               200     18000000
                                                    ========
                                 grand total chars  20604700

    The second loop yields this table:

    range         num  digits  5000/len(str(i))  total chars
    0-9            10       1              5000        50000
    10-99          90       2              2500       225000
    100-999       900       3              1666      1499400
    1000-9999    9000       4              1250     11250000
    10000-19999 10000       5              1000     10000000
                                                    ========
                                 grand total chars  23024400

    The two loops do not produce the same numbers of characters, so I'm not
    surprised they do not consume the same amount of storage.

    P.S.: Please forgive me if I've made some basic math error somewhere.

    --
    John Gordon A is for Amy, who fell down the stairs
    gordon at panix.com B is for Basil, assaulted by bears
    -- Edward Gorey, "The Gashlycrumb Tinies"
  • Amit Dev at Mar 16, 2011 at 6:20 pm
    sum(map(len, l)) => 99998200 for the 1st case and 99999100 for the 2nd case.
    Roughly 100MB in each case, as I mentioned.
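    Those totals can be reproduced without building the lists at all, by summing
    the per-string lengths directly (a quick sketch; `total_chars` is our name for
    the helper, not something from the thread):

    ```python
    def total_chars(n, target):
        # str(i) * (target // len(str(i))) has length
        # len(str(i)) * (target // len(str(i))), i.e. just under `target`
        return sum(len(str(i)) * (target // len(str(i))) for i in range(n))

    print(total_chars(100000, 1000))  # first loop:  99999100
    print(total_chars(20000, 5000))   # second loop: 99998200
    ```

    Both totals come out within 0.002% of 100MB, so the two loops really do
    allocate essentially the same number of characters.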
    On Wed, Mar 16, 2011 at 11:21 PM, John Gordon wrote:

    --
    http://mail.python.org/mailman/listinfo/python-list
  • Santoso Wijaya at Mar 16, 2011 at 7:51 pm

    Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> L = []
    >>> for i in xrange(100000):
    ...     L.append(str(i) * (1000 / len(str(i))))
    ...
    >>> sys.getsizeof(L)
    824464
    >>> L = []
    >>> for i in xrange(20000):
    ...     L.append(str(i) * (5000 / len(str(i))))
    ...
    >>> sys.getsizeof(L)
    178024
    >>>

    ~/santa

    On Wed, Mar 16, 2011 at 11:20 AM, Amit Dev wrote:

    sum(map(len, l)) => 99998200 for 1st case and 99999100 for 2nd case.
    Roughly 100MB as I mentioned.

  • Terry Reedy at Mar 16, 2011 at 8:38 pm

    On 3/16/2011 3:51 PM, Santoso Wijaya wrote:

    Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> L = []
    >>> for i in xrange(100000):
    ...     L.append(str(i) * (1000 / len(str(i))))
    ...
    >>> sys.getsizeof(L)
    824464
    This is only the size of the list object itself and does not include the
    sum of the sizes of the string objects. With 8-byte pointers,
    824464 == 8*100000 + (a small bit of overhead) + extra space (so the list
    can grow without reallocation and copying).
    >>> L = []
    >>> for i in xrange(20000):
    ...     L.append(str(i) * (5000 / len(str(i))))
    ...
    >>> sys.getsizeof(L)
    178024
    == 8*20000 + extra
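
    To count the strings as well, the per-object sizes can be summed alongside
    the list -- a rough sketch (it ignores shared references and allocator
    overhead; `deep_size` is our name, not a stdlib function):

    ```python
    import sys

    def deep_size(lst):
        # size of the list object (its pointer array) plus each element's
        # own size; assumes elements are distinct strings, not shared
        return sys.getsizeof(lst) + sum(sys.getsizeof(s) for s in lst)

    L = [str(i) * (1000 // len(str(i))) for i in range(1000)]
    print(sys.getsizeof(L))  # just the pointer array
    print(deep_size(L))      # adds ~1000 chars + fixed per-string overhead each
    ```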

    --
    Terry Jan Reedy
  • Eryksun () at Mar 16, 2011 at 8:13 pm

    On Wednesday, March 16, 2011 2:20:34 PM UTC-4, Amit Dev wrote:
    sum(map(len, l)) => 99998200 for 1st case and 99999100 for 2nd case.
    Roughly 100MB as I mentioned.
    The two lists used approximately the same memory in my test with Python 2.6.6 on Windows. An implementation detail such as this is likely to vary between interpreters across versions and platforms, including Jython and IronPython. A classic implementation detail is the caching of small objects that occur frequently, such as short strings and small integers (CPython caches the ints from -5 through 256). For example:

    In [1]: a = 20 * '5'
    In [2]: b = 20 * '5'
    In [3]: a is b
    Out[3]: True

    In [4]: a = 21 * '5'
    In [5]: b = 21 * '5'
    In [6]: a is b
    Out[6]: False

    It's best not to depend on this behavior.
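
    When identity does matter, interning can be requested explicitly rather
    than relied on as a side effect -- a minimal sketch (`sys.intern` in
    Python 3; in Python 2 `intern` was a builtin):

    ```python
    import sys

    # 21 * '5' is past the constant-folding limit, so these would normally
    # be two distinct objects; explicit interning makes them one
    a = sys.intern('5' * 21)
    b = sys.intern('5' * 21)
    print(a is b)  # True: both names refer to the single interned copy
    ```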
  • Dan Stromberg at Mar 16, 2011 at 9:51 pm

    On Wed, Mar 16, 2011 at 8:38 AM, Amit Dev wrote:

    On Python 2.6.6 on Ubuntu 10.10:

    $ cat pmu
    #!/usr/bin/python

    import os
    import sys

    list_ = []

    if sys.argv[1] == '--first':
        for i in xrange(100000):
            list_.append(str(i) * (1000/len(str(i))))
    elif sys.argv[1] == '--second':
        for i in xrange(20000):
            list_.append(str(i) * (5000/len(str(i))))
    else:
        sys.stderr.write('%s: Illegal sys.argv[1]\n' % sys.argv[0])
        sys.exit(1)

    os.system("ps aux | egrep '\<%d\>|^USER\>'" % os.getpid())

    dstromberg-laptop-dstromberg:~/src/python-mem-use i686-pc-linux-gnu 10916 -
    above cmd done 2011 Wed Mar 16 02:38 PM

    $ make
    ./pmu --first
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    1000 11063 0.0 3.4 110212 104436 pts/5 S+ 14:38 0:00
    /usr/bin/python ./pmu --first
    1000 11064 0.0 0.0 1896 512 pts/5 S+ 14:38 0:00 sh -c ps
    aux | egrep '\<11063\>|^USER\>'
    1000 11066 0.0 0.0 4012 740 pts/5 S+ 14:38 0:00 egrep
    \<11063\>|^USER\>
    ./pmu --second
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    1000 11067 13.0 3.3 107540 101536 pts/5 S+ 14:38 0:00
    /usr/bin/python ./pmu --second
    1000 11068 0.0 0.0 1896 508 pts/5 S+ 14:38 0:00 sh -c ps
    aux | egrep '\<11067\>|^USER\>'
    1000 11070 0.0 0.0 4012 740 pts/5 S+ 14:38 0:00 egrep
    \<11067\>|^USER\>
    dstromberg-laptop-dstromberg:~/src/python-mem-use i686-pc-linux-gnu 10916 -
    above cmd done 2011 Wed Mar 16 02:38 PM

    So on Python 2.6.6 + Ubuntu 10.10, the second is actually a little smaller
    than the first.

    Some issues you might ponder:
    1) Does FreeBSD's malloc/free know how to free unused memory pages in the
    middle of the heap (using mmap games), or does it only sbrk() down when the
    end of the heap becomes unused, or does it never sbrk() back down at all?
    I've heard various *ix's fall into one of these 3 groups in releasing unused
    pages.

    2) It might be just an issue of how frequently the interpreter garbage
    collects; you could try adjusting this; check out the gc module. Note that
    it's often faster not to collect at every conceivable opportunity, but this
    lets the bytes add up pretty quickly in some scripts - for a while,
    until the next collection. So your memory use pattern will often end up
    looking like a bit of a sawtooth function.
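
    For reference, the knobs mentioned above live in the gc module -- a minimal
    sketch (note that CPython frees strings by reference counting as soon as
    they become unreachable; the gc itself only hunts reference cycles, so
    tuning it may not change this particular workload much):

    ```python
    import gc

    print(gc.get_threshold())    # per-generation thresholds, (700, 10, 10) by default
    gc.set_threshold(100, 5, 5)  # collect more eagerly (illustrative values)
    freed = gc.collect()         # force a full collection; returns the number
                                 # of unreachable objects found
    print(freed)
    ```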

    3) If you need strict memory use guarantees, you might be better off with a
    language that's closer to the metal, like C - not being garbage
    collected is one parameter to consider. If you already have something in
    CPython, then Cython might help; Cython allows you to use C data structures
    from a dialect of Python.
  • Amit Dev at Mar 17, 2011 at 6:11 am
    Thanks Dan for the detailed reply. I suspect it is related to FreeBSD
    malloc/free as you suggested. Here is the output of running your
    script:

    [16-bsd01 ~/work]$ python strm.py --first
    USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
    amdev 6899 3.0 6.9 111944 107560 p0 S+ 9:57PM 0:01.20 python
    strm.py --first (python2.5)
    amdev 6900 0.0 0.1 3508 1424 p0 S+ 9:57PM 0:00.02 sh -c ps
    aux | egrep '\\<6899\\>|^USER\\>'
    amdev 6902 0.0 0.1 3380 1188 p0 S+ 9:57PM 0:00.01 egrep
    \\<6899\\>|^USER\\>

    [16-bsd01 ~/work]$ python strm.py --second
    USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
    amdev 6903 0.0 10.5 166216 163992 p0 S+ 9:57PM 0:00.92 python
    strm.py --second (python2.5)
    amdev 6904 0.0 0.1 3508 1424 p0 S+ 9:57PM 0:00.02 sh -c ps
    aux | egrep '\\<6903\\>|^USER\\>'
    amdev 6906 0.0 0.1 3508 1424 p0 R+ 9:57PM 0:00.00 egrep
    \\<6903\\>|^USER\\> (sh)

    Regards,
    Amit
    On Thu, Mar 17, 2011 at 3:21 AM, Dan Stromberg wrote:

Discussion Overview
group: python-list
categories: python
posted: Mar 16, '11 at 3:38p
active: Mar 17, '11 at 6:11a
posts: 8
users: 6
website: python.org
