FAQ
Hello,
I have a huge problem with loading a very simple structure into memory:
it is a list of tuples, it is 6 MB in size and consists of 100000 elements
import cPickle
plik = open("mealy","r")
mealy = cPickle.load(plik)
plik.close()
this takes about 30 seconds!
How can I accelerate it?

Thanks in adv.


  • Alex Martelli at Aug 15, 2003 at 2:57 pm

    Drochom wrote:

    > Hello,
    > I have a huge problem with loading very simple structure into memory
    > it is a list of tuples, it has 6MB and consists of 100000 elements
    >
    > import cPickle
    > plik = open("mealy","r")
    > mealy = cPickle.load(plik)
    > plik.close()
    >
    > this takes about 30 seconds!
    > How can I accelerate it?
    >
    > Thanks in adv.
    What protocol did you pickle your data with? The default (protocol 0,
    ASCII text) is the slowest. I suggest you upgrade to Python 2.3 and
    save your data with the new protocol 2 -- it's likely to be fastest.


    Alex
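    [Alex's suggestion, sketched in today's Python, where the protocol is likewise the third argument to dump()/dumps(); under 2.3 the call was cPickle.dump(obj, plik, 2). The data below is a made-up stand-in shaped like the poster's list of tuples.]

    ```python
    import pickle

    # A structure shaped like the poster's: a list of small tuples.
    data = [(('k', 5, 0),), (('a', 4, 0), ('o', 2, 0))] * 50000

    p0 = pickle.dumps(data, 0)    # protocol 0: the old ASCII text format
    p2 = pickle.dumps(data, 2)    # protocol 2: binary, smaller and faster to load

    assert pickle.loads(p0) == data
    assert pickle.loads(p2) == data
    assert len(p2) < len(p0)      # the binary form is considerably more compact
    ```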
  • Drochom at Aug 15, 2003 at 4:39 pm

    > What protocol did you pickle your data with? The default (protocol 0,
    > ASCII text) is the slowest. I suggest you upgrade to Python 2.3 and
    > save your data with the new protocol 2 -- it's likely to be fastest.
    >
    > Alex

    Thanks :)
    I'm using the default protocol. I'm not sure I can upgrade that simply,
    because I'm using many modules for Py2.2.
  • Irmen de Jong at Aug 15, 2003 at 5:23 pm

    Drochom wrote:
    >> What protocol did you pickle your data with? The default (protocol 0,
    >> ASCII text) is the slowest. I suggest you upgrade to Python 2.3 and
    >> save your data with the new protocol 2 -- it's likely to be fastest.
    >>
    >> Alex
    >
    > Thanks:)
    > i'm using default protocol, i'm not sure if i can upgrade so simply, because
    > i'm using many modules for Py2.2
    Then use protocol 1 instead -- that has been the binary pickle protocol
    for a long time, and works perfectly on Python 2.2.x :-)
    (and it's much faster than protocol 0 -- the text protocol)

    --Irmen
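    [Irmen's protocol-1 suggestion as a minimal sketch in today's Python, where protocol 1 is still supported; the file name and data are made-up stand-ins for the poster's.]

    ```python
    import os
    import pickle
    import tempfile

    # A stand-in for the poster's structure: a list of tuples of tuples.
    mealy = [(('k', 5, 0),), (('t', 1, 1),), (('a', 4, 0), ('o', 2, 0))]

    path = os.path.join(tempfile.mkdtemp(), "mealy")
    with open(path, "wb") as f:       # binary mode matters for binary protocols
        pickle.dump(mealy, f, 1)      # protocol 1: the long-standing binary format
    with open(path, "rb") as f:
        loaded = pickle.load(f)

    assert loaded == mealy
    ```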
  • Michael Peuser at Aug 15, 2003 at 3:23 pm
    Hi,

    I have no idea! I used a similar scheme the other day and made some
    benchmarks (I *like* benchmarks!)

    About 6 MB took 4 seconds dumping as well as loading on an 800 MHz P3 laptop.
    When using binary mode it went down to about 1.5 seconds (and the space to 2 MB).

    This is o.k., because I generally have problems being faster than 1 MB/s
    with my 2" drive, processor and Python ;-)

    Python 2.3 seems to have an even more effective "protocol mode 2".

    Maybe your structures are *very* complex??

    Kindly
    Michael P



    "Drochom" <pedrosch at gazeta.pl> schrieb im Newsbeitrag
    news:bhiqlg$9qj$1 at atlantis.news.tpi.pl...
    Hello,
    I have a huge problem with loading very simple structure into memory
    it is a list of tuples, it has 6MB and consists of 100000 elements
    import cPickle
    plik = open("mealy","r")
    mealy = cPickle.load(plik)
    plik.close()
    this takes about 30 seconds!
    How can I accelerate it?

    Thanks in adv.

  • Drochom at Aug 15, 2003 at 4:37 pm
    Thanks for help:)

    Here is a simple example:
    frankly speaking, it's a graph with 100000 nodes:
    STRUCTURE:
    [(('k', 5, 0),), (('*', 0, 0),), (('t', 1, 1),), (('o', 2, 0),),
     (('t', 3, 0),), (('a', 4, 0), ('o', 2, 0))]

    PICKLED
    (lp1
    ((S'k'
    I5
    I0
    ttp2
    a((S'*'
    I0
    I0
    ttp3
    a((S't'
    I1
    I1
    ttp4
    a((S'o'
    I2
    I0
    ttp5
    a((S't'
    I3
    I0
    ttp6
    a((S'a'
    I4
    I0
    t(S'o'
    I2
    I0
    ttp7
    a.

    Maybe now you can give me more precise advice:)
    Thanks again
  • Michael Peuser at Aug 15, 2003 at 5:27 pm
    o.k. - I modified my test program - let it run on your machine.
    It took 1.5 seconds - I made it 11 million records to get to 2 MByte.
    Kindly
    Michael
    ------------------
    import cPickle as Pickle
    from time import clock

    # generate 1.000.000 records
    r=[(('k', 5, 0),), (('*', 0, 0),), (('t', 1, 1),), (('o', 2, 0),),
       (('t', 3, 0),), (('a', 4, 0), ('o', 2, 0))]

    x=[]

    for i in xrange(1000000):
        x.append(r)

    print len(x), "records"

    t0=clock()
    f=open ("test","w")
    Pickle.dump(x,f,1)
    f.close()
    print "out=", clock()-t0

    t0=clock()
    f=open ("test")
    x=Pickle.load(f)
    f.close()
    print "in=", clock()-t0
    ---------------------

    "Drochom" <pedrosch at gazeta.pl> schrieb im Newsbeitrag
    news:bhj2ah$2ke$1 at nemesis.news.tpi.pl...
    Thanks for help:)

    Here is simple example:
    frankly speaking it's a graph with 100000 nodes:
    STRUCTURE:
    [(('k', 5, 0),), (('*', 0, 0),), (('t', 1, 1),), (('o', 2, 0),), (('t', 3,
    0),), (('a', 4, 0), ('o', 2, 0))]

    I0
    ttp5
    a((S't'
    I3
    I0
    ttp6
    a((S'a'
    I4
    I0
    t(S'o'
    I2
    I0
    ttp7
    a.

    Maybe now you can give me more precise advice:)
    Thanks again
  • Drochom at Aug 15, 2003 at 10:33 pm
    "Michael Peuser" <mpeuser at web.de> wrote in message
    news:bhj56t$1d8$03$1 at news.t-online.com...
    o.k - I modified my testprogram - let it run at your machine.
    It took 1.5 seconds - I made it 11 Million records to get to 2 Mbyte.
    Kindly
    Michael
    ------------------
    import cPickle as Pickle
    from time import clock

    # generate 1.000.000 records
    r=[(('k', 5, 0),), (('*', 0, 0),), (('t', 1, 1),), (('o', 2, 0),), (('t', 3,
    0),), (('a', 4, 0), ('o', 2, 0))]

    x=[]

    for i in xrange(1000000):
    x.append(r)

    print len(x), "records"

    t0=clock()
    f=open ("test","w")
    Pickle.dump(x,f,1)
    f.close()
    print "out=", clock()-t0

    t0=clock()
    f=open ("test")
    x=Pickle.load(f)
    f.close()
    print "in=", clock()-t0
    ---------------------
    Hi, I'm really grateful for your help.
    I've modified your code a bit - check your times and tell me what they are.

    TRY THIS:

    import cPickle as Pickle
    from time import clock
    from random import randrange


    x=[]

    for i in xrange(20000):
        c = []
        for j in xrange(randrange(2,25)):
            c.append((chr(randrange(33,120)),randrange(1,100000),randrange(1,3)))
        c = tuple(c)
        x.append(c)
        if i%1000==0: print i  # it will help you to survive the waiting...
    print len(x), "records"

    t0=clock()
    f=open ("test","w")
    Pickle.dump(x,f,0)
    f.close()
    print "out=", clock()-t0


    t0=clock()
    f=open ("test")
    x=Pickle.load(f)
    f.close()
    print "in=", clock()-t0

    Thanks once again:)
  • Michael Peuser at Aug 16, 2003 at 7:05 am
    Hi Drochom,

    (1) Your dataset seems to break the binary cPickle mode ;-) (I tried it with
    the "new pickle" in 2.3 - same result: "EOF error" when loading back...)
    Maybe there is someone interested in fixing this ....


    (2) I ran your code and - as you noticed - it takes some time to *generate*
    the data structure. To be fair, pickle has to do the same, so it cannot be
    *significantly* faster!!!
    The size of the file was 5.5 MB.

    (3) Timings (2.2):
    Generation of data: 18 secs
    Dumping: 3.2 secs
    Loading: 19.4 secs

    (4) I couldn't refrain from running it under 2.3
    Generation of data: 8.5 secs !!!!
    Dumping: 6.4 secs !!!!
    Loading: 5.7 secs


    So your program might really improve when changing to 2.3 - and if
    anyone can fix the cPickle bug, protocol mode 2 will be even more efficient.

    Kindly
    Michael

    "Drochom" <pedrosch at gazeta.pl> schrieb im Newsbeitrag
    news:bhjn6v$pi8$1 at nemesis.news.tpi.pl...
    >
    [....]
    TRY THIS:

    import cPickle as Pickle
    from time import clock
    from random import randrange


    x=[]

    for i in xrange(20000):
    c = []
    for j in xrange(randrange(2,25)):
    c.append((chr(randrange(33,120)),randrange(1,100000),randrange(1,3)))
    c = tuple(c)
    x.append(c)
    if i%1000==0: print i #it will help you to survive waiting...
    print len(x), "records"

    t0=clock()
    f=open ("test","w")
    Pickle.dump(x,f,0)
    f.close()
    print "out=", clock()-t0


    t0=clock()
    f=open ("test")
    x=Pickle.load(f)
    f.close()
    print "in=", clock()-t0

    Thanks once again:)

  • Tim Evans at Aug 16, 2003 at 11:52 am

    "Michael Peuser" <mpeuser at web.de> writes:

    Hi Drochem,

    > (1) Your dataset seems to break the binary cPickle mode ;-) (I tried it with
    > the "new Pickle" in 2.3 - same result: "EOF error" when loading back...) May
    > be there is someone interested in fixing this ....
    [snip]
    > f=open ("test","w")
    [snip]
    > f=open ("test")
    [snip]

    Note that on windows, you must open binary files using binary mode
    when reading and writing them, like so:

    f = open('test', 'wb')
    f = open('test', 'rb')
    ^^^^

    If you don't do this binary data will be corrupted by the automatic
    conversion of '\n' to '\r\n' by win32. This is very likely what is
    causing the above error.

    --
    Tim Evans
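    [Tim's point as a minimal sketch in today's Python, with a made-up temp file: modern Python refuses a text-mode file outright when pickling (it writes bytes), whereas Python 2 on Windows would instead corrupt the data silently via newline translation.]

    ```python
    import os
    import pickle
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "test")

    # Correct: binary mode for both writing and reading.
    with open(path, "wb") as f:
        pickle.dump([1, 2, 3], f, 1)
    with open(path, "rb") as f:
        assert pickle.load(f) == [1, 2, 3]

    # A text-mode file raises TypeError today, because pickle writes bytes.
    try:
        with open(path, "w") as f:
            pickle.dump([1, 2, 3], f, 1)
        raised = False
    except TypeError:
        raised = True
    assert raised
    ```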
  • Michael Peuser at Aug 17, 2003 at 7:39 am
    So stupid of me :-(((

    Now here are the benchmarks I got from Drochom's dataset. I think it should
    suffice to use the binary mode of 2.2. (I checked the 2.3 data on a different
    disk the other day - that made them not comparable!! I now use the same disk
    for the tests.)

    Timings (2.2.2):
    Generation of data: 18 secs
    Dumping: 3 secs
    Loading: 18.5 secs
    Filesize: 5.5 MB

    Binary dump: 2.4
    Binary load: 3
    Filesize: 2.8 MB

    2.3
    Generation of data: 9 secs
    Dumping: 2.4
    Loading: 2.8


    Binary dump: 1
    Binary load: 1.9
    Filesize: 2.8 MB

    Mode 2 dump: 0.9
    Mode 2 load: 1.7
    Filesize: 2.6 MB

    The much faster time for generating the data in 2.3 could be due to an
    improved random generator (?). That had always been quite slow..

    Kindly
    Michael P



    "Tim Evans" <t.evans at paradise.net.nz> schrieb im Newsbeitrag
    news:87r83l3jmj.fsf at cassandra.evansnet...
    "Michael Peuser" <mpeuser at web.de> writes:
    Hi Drochem,

    (1) Your dataset seems to break the binary cPickle mode ;-) (I tried it
    with
    the "new Pickle" in 2.3 - same result: "EOF error" when loading back...)
    May
    be there is someone interested in fixing this ....
    [snip]
    f=open ("test","w")
    [snip]
    f=open ("test")
    [snip]

    Note that on windows, you must open binary files using binary mode
    when reading and writing them, like so:

    f = open('test', 'wb')
    f = open('test', 'rb')
    ^^^^

    If you don't do this binary data will be corrupted by the automatic
    conversion of '\n' to '\r\n' by win32. This is very likely what is
    causing the above error.

    --
    Tim Evans



  • Scott David Daniels at Aug 15, 2003 at 5:52 pm

    Drochom wrote:
    > Thanks for help:)
    > Here is simple example:
    > frankly speaking it's a graph with 100000 nodes:
    > STRUCTURE:
    > [(('k', 5, 0),), (('*', 0, 0),), (('t', 1, 1),), (('o', 2, 0),),
    >  (('t', 3, 0),), (('a', 4, 0), ('o', 2, 0))]
    Perhaps this matches your spec:

    from random import randrange
    import pickle, cPickle, time

    source = [(chr(randrange(33, 127)), randrange(100000), randrange(i+50))
              for i in range(100000)]


    def timed(module, flag, name='file.tmp'):
        start = time.time()
        dest = file(name, 'wb')
        module.dump(source, dest, flag)
        dest.close()
        mid = time.time()
        dest = file(name, 'rb')
        result = module.load(dest)
        dest.close()
        stop = time.time()
        assert source == result
        return mid-start, stop-mid

    On 2.2:
    timed(pickle, 0): (7.8, 5.5)
    timed(pickle, 1): (9.5, 6.2)
    timed(cPickle, 0): (0.41, 4.9)
    timed(cPickle, 1): (0.15, .53)

    On 2.3:
    timed(pickle, 0): (6.2, 5.3)
    timed(pickle, 1): (6.6, 5.4)
    timed(pickle, 2): (6.5, 3.9)

    timed(cPickle, 0): (6.2, 5.3)
    timed(cPickle, 1): (.88, .69)
    timed(cPickle, 2): (.80, .67)

    (Not tightly controlled -- I'd guess 1.5 digits)

    -Scott David Daniels
    Scott.Daniels at Acm.Org
  • Drochom at Aug 15, 2003 at 10:58 pm

    > Perhaps this matches your spec:
    [snip]
    Hello, and Thanks, your code was extremely helpful:)

    Regards
    Przemo Drochomirecki
  • Batista, Facundo at Aug 15, 2003 at 4:48 pm
    #- > What protocol did you pickle your data with? The default (protocol 0,
    #- > ASCII text) is the slowest. I suggest you upgrade to Python 2.3 and
    #- > save your data with the new protocol 2 -- it's likely to be fastest.
    #- >
    #- > Alex
    #-
    #- Thanks:)
    #- i'm using default protocol, i'm not sure if i can upgrade so simply,
    #- because i'm using many modules for Py2.2

    The last time I upgraded, it was from XML to bin format (there weren't
    "protocols" then).

    It was easy: the load() method detects what format the data is pickled in, so
    I just modified the dump() call, and that was all!

    I don't know if this also works with the new methodology.

    . Facundo
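    [Facundo's observation still holds: a pickle stream is self-describing, so only the writer chooses a protocol. A minimal sketch in today's Python, with a made-up object:]

    ```python
    import pickle

    obj = [(('a', 4, 0), ('o', 2, 0))]
    for proto in range(pickle.HIGHEST_PROTOCOL + 1):
        blob = pickle.dumps(obj, proto)
        # No protocol argument on load: the stream identifies its own format.
        assert pickle.loads(blob) == obj
    ```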






  • Bengt Richter at Aug 15, 2003 at 10:22 pm

    On Fri, 15 Aug 2003 16:27:18 +0200, "Drochom" wrote:
    > Hello,
    > I have a huge problem with loading very simple structure into memory
    > it is a list of tuples, it has 6MB and consists of 100000 elements
    If speed is important, you may want to do different things depending on, e.g.,
    what is in those tuples, and whether they are all the same length, etc. E.g.,
    if they were all fixed-length tuples of integers, you could do hugely better
    than storing the data as a list of tuples.

    Secondly, you might want to consider being lazy about extracting the data you
    are actually going to use, depending on use patterns. One way to do that would
    be to have a compact index to the data, or store it in such a way that you can
    compute an index, and then write some simple class to define access methods.
    That's not a bad idea anyway, since it will let you change the way you store
    and retrieve data later, without changing the code that uses it.

    You could store the whole thing in a mmap image, with a length-prefixed pickle
    string in the front representing index info.
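    [A minimal sketch of Bengt's length-prefix idea in today's Python. The index layout and record region are made up for illustration, and a bytes blob stands in for the mmap'ed file; only the small index gets unpickled up front, records are sliced out lazily on demand.]

    ```python
    import pickle
    import struct

    # Hypothetical layout: 4-byte length, pickled index, then raw record data.
    # The index maps a word to (offset, length) within the record region.
    index = {"kot": (0, 3), "oko": (3, 3)}   # made-up example index
    records = b"kotoko"                      # made-up record region

    payload = pickle.dumps(index, 2)
    blob = struct.pack("<I", len(payload)) + payload + records

    # Reading back: unpickle only the index; fetch a record when needed.
    (n,) = struct.unpack("<I", blob[:4])
    idx = pickle.loads(blob[4:4 + n])
    off, ln = idx["oko"]
    assert blob[4 + n + off : 4 + n + off + ln] == b"oko"
    ```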

    There's a lot of different things you could do. But Alex's suggestion (upgrade to
    2.3 and use protocol 2 pickle) will probably take care of it ;-)
    > import cPickle
    > plik = open("mealy","r")
    > mealy = cPickle.load(plik)
    > plik.close()
    >
    > this takes about 30 seconds!
    > How can I accelerate it?
    Find a way to avoid doing it? Or doing much of it?
    What are your access needs once the data is accessible?

    Regards,
    Bengt Richter
  • Drochom at Aug 15, 2003 at 10:41 pm
    Hello,
    > If speed is important, you may want to do different things depending on e.g.,
    > what is in those tuples, and whether they are all the same length, etc. E.g.,
    > if they were all fixed length tuples of integers, you could do hugely better
    > than store the data as a list of tuples.

    Those tuples have different lengths indeed.

    > You could store the whole thing in a mmap image, with a length-prefixed pickle
    > string in the front representing index info.

    If I only knew how to do it... :-)

    > Find a way to avoid doing it? Or doing much of it?
    > What are your access needs once the data is accessible?

    My structure stores a finite state automaton with a Polish dictionary (lexicon
    to be more precise) and it should be loaded once - but fast!

    Thx
    Regards,
    Przemo Drochomirecki
  • Bengt Richter at Aug 16, 2003 at 3:03 am

    On Sat, 16 Aug 2003 00:41:42 +0200, "Drochom" wrote:
    [snip]
    > My structure stores a finite state automaton with polish dictionary (lexicon
    > to be more precise) and it should be loaded
    > once but fast!
    I wonder how much space it would take to store the complete Polish word list
    with one entry each in a Python dictionary. 300k words of 6-7 characters avg?
    Say 2 MB plus the dict hash stuff. I bet it would be fast.
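    [Bengt's back-of-the-envelope idea, sketched minimally: a plain dict (or set) keyed by word gives one hash lookup per query. The words here are made-up stand-ins for the lexicon.]

    ```python
    # Toy stand-in for a word list loaded from disk.
    words = ["kot", "oko", "tata", "ala"]
    lexicon = dict.fromkeys(words, True)

    assert "kot" in lexicon          # membership is a single hash lookup
    assert "pies" not in lexicon
    ```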

    Is that in effect what you are doing, except sort of like a regex state machine
    to match words character by character?

    Regards,
    Bengt Richter
  • Drochom at Aug 15, 2003 at 10:49 pm
    I forgot to explain why I use tuples instead of lists:
    I was squeezing a lexicon => minimization of the automaton => using a
    dictionary => using hashable objects => using tuples (lists aren't hashable)


    Regards,
    Przemo Drochomirecki
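    [The hashability constraint in one minimal sketch (made-up node values): a tuple of tuples works as a dict key, a list raises TypeError.]

    ```python
    graph = {}
    node = (('a', 4, 0), ('o', 2, 0))    # a tuple of tuples: hashable
    graph[node] = 7                      # works fine as a dict key

    try:
        graph[[('a', 4, 0)]] = 8         # a list is unhashable
        hashable = True
    except TypeError:
        hashable = False

    assert not hashable
    assert graph[node] == 7
    ```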
  • Klaus Alexander Seistrup at Aug 16, 2003 at 8:05 am

    Drochom wrote:

    > import cPickle
    >
    > plik = open("mealy","r")
    > mealy = cPickle.load(plik)
    > plik.close()
    >
    > this takes about 30 seconds!
    > How can I accelerate it?
    Perhaps it's worth looking into PyTables:

    <http://pytables.sourceforge.net/doc/PyCon.html#section4>


    Cheers,

    // Klaus

    --
    <> unselfish actions pay back better
  • Michael Peuser at Aug 17, 2003 at 7:48 am
    This is a highly interesting product - I shall surely come back to it when I
    have to store more data...

    Kindly
    Michael P


    "Klaus Alexander Seistrup" <spam at magnetic-ink.dk> schrieb im Newsbeitrag
    news:3f3de5af-3dcdcb9a-e4f8-43c2-a49e-2e5009909839 at news.szn.dk...
    Drochom wrote:
    import cPickle

    plik = open("mealy","r")
    mealy = cPickle.load(plik)
    plik.close()
    this takes about 30 seconds!
    How can I accelerate it?
    Perhaps it's worth looking into PyTables:

    <http://pytables.sourceforge.net/doc/PyCon.html#section4>


    Cheers,

    // Klaus

    --
    <> unselfish actions pay back better

Discussion Overview
group: python-list
categories: python
posted: Aug 15, '03 at 2:27p
active: Aug 17, '03 at 7:48a
posts: 20
users: 9
website: python.org
