Hi,

I have about 200GB of data that I need to go through and extract the
common first part of a line. Something like this.
a = "abcdefghijklmnopqrstuvwxyz"
b = "abcdefghijklmnopBHLHT"
c = extract(a,b)
print c
"abcdefghijklmnop"

Here I want to extract the common string "abcdefghijklmnop". Basically I
need a fast way to do that for any two given strings. For my situation,
the common string will always be at the beginning of both strings. I can
use regular expressions to do this, but from what I understand there is
a lot of overhead. New data is being generated at the rate of about 1GB
per hour, so this needs to be reasonably fast while leaving CPU time for
other processes.

Thanks
Ravi


  • Jim Richardson at Aug 2, 2003 at 11:03 pm

    On Sat, 02 Aug 2003 17:39:26 -0400, Ravi wrote:
    Hi,

    I have about 200GB of data that I need to go through and extract the
    common first part of a line. Something like this.
    a = "abcdefghijklmnopqrstuvwxyz"
    b = "abcdefghijklmnopBHLHT"
    c = extract(a,b)
    print c
    "abcdefghijklmnop"

    Here I want to extract the common string "abcdefghijklmnop". Basically I
    need a fast way to do that for any two given strings. For my situation,
    the common string will always be at the beginning of both strings. I can
    use regular expressions to do this, but from what I understand there is
    a lot of overhead. New data is being generated at the rate of about 1GB
    per hour, so this needs to be reasonably fast while leaving CPU time for
    other processes.

    Thanks
    Ravi
    Are you trying to match any to any strings? or only a pair as above?


    --
    Jim Richardson http://www.eskimo.com/~warlock

    Linux, because eventually, you grow up enough to be trusted with a fork()
  • Ravi at Aug 3, 2003 at 3:14 am
    Are you trying to match any to any strings? or only a pair as above?
    Just a pair at a time, and I only want the first N characters that are
    common to both strings. The os.path.commonprefix works nicely. Thanks
    for your help.

    Ravi
  • Jim Richardson at Aug 3, 2003 at 5:16 am

    On Sat, 02 Aug 2003 23:14:10 -0400, Ravi wrote:
    Are you trying to match any to any strings? or only a pair as above?
    Just a pair at a time, and I only want the first N characters that are
    common to both strings. The os.path.commonprefix works nicely. Thanks
    for your help.

    Ravi
    Well, it wasn't me, but you're welcome anyway :)




    --
    Jim Richardson http://www.eskimo.com/~warlock

    Linux, because eventually, you grow up enough to be trusted with a fork()
  • Jeff Epler at Aug 3, 2003 at 3:23 am
    This is a naive implementation of the 'extract' function.
    def extract(a, b):
        m = min(len(a), len(b))
        for i in range(m):
            if a[i] != b[i]:
                return a[:i]
        return a[:m]

    Here's one that uses the new zip() function:
    def extract2(a, b):
        m = min(len(a), len(b))
        for i, ai, bi in (range(m), a, b):
            if ai != bi: return a[:i]
        return a[:m]
    .. unfortunately, it seems to be slower than the first method. On my
    machine (800MHz PIII):
    $ python timeit.py -s 'import ravi' \
    'ravi.extract("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
    10000 loops, best of 3: 32.7 usec per loop

    If your goal is actually something slightly different---for instance,
    find the string from a list with the largest shared prefix to a given
    string---then you probably need to research an efficient algorithm.*
    Otherwise, you may be stuck with the above. If you have 200GB, and each
    line is 80 chars, you have about 2.7 billion lines. If you call extract()
    once per line, and it takes 32 usec, you're looking at 24 hours to run.

    Writing extract as a C extension may be wise, you're likely to be able to
    cut those 32 usec down to little more than Python function call overhead,
    or <2usec per loop. That makes your 2.7x10^9 calls only take 70 minutes.

    Jeff
    * The efficient algorithm for this problem might involve arranging the
    list of strings as a tree, which makes finding the right prefix for
    a given string against the list take only about lg(len(l)) steps,
    instead of len(l) steps. This isn't so much a Python problem as a
    computer science problem, though.
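
    A sketch of that footnote's idea (not Jeff's code; best_prefix_match is a
    hypothetical name): if the list is kept sorted, the entry sharing the
    longest prefix with a query string is always one of the query's two
    neighbours in sort order, so a single binary search with the standard
    bisect module finds it in about lg(len(l)) steps:

    import bisect, os.path

    def best_prefix_match(sorted_strings, query):
        # In sorted order, the longest shared prefix is achieved at one of
        # the two entries adjacent to the query's insertion point.
        i = bisect.bisect_left(sorted_strings, query)
        best, best_s = "", None
        for j in (i - 1, i):
            if 0 <= j < len(sorted_strings):
                p = os.path.commonprefix([sorted_strings[j], query])
                if len(p) > len(best):
                    best, best_s = p, sorted_strings[j]
        return best_s, best

    strings = sorted(["abcdeXYZ", "abcdefghij", "zzz"])
    print best_prefix_match(strings, "abcdefgKLM")  # ('abcdefghij', 'abcdefg')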
  • Scott David Daniels at Aug 3, 2003 at 4:18 am

    Ravi wrote:
    Hi,

    I have about 200GB of data that I need to go through and extract the
    common first part of a line. Something like this.
    a = "abcdefghijklmnopqrstuvwxyz"
    b = "abcdefghijklmnopBHLHT"
    c = extract(a,b)
    print c
    "abcdefghijklmnop"

    Here I want to extract the common string "abcdefghijklmnop". Basically I
    need a fast way to do that for any two given strings. For my situation,
    the common string will always be at the beginning of both strings. I can
    use regular expressions to do this, but from what I understand there is
    a lot of overhead. New data is being generated at the rate of about 1GB
    per hour, so this needs to be reasonably fast while leaving CPU time for
    other processes.

    Thanks
    Ravi
    While you can be forgiven for not having guessed, os.path is the place to
    look:
    import os.path
    a = "abcdefghijklmnopqrstuvwxyz"
    b = "abcdefghijklmnopBHLHT"
    print os.path.commonprefix([a,b])

    -Scott David Daniels
    Scott.Daniels at Acm.Org
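
    Worth noting: os.path.commonprefix accepts a list of any length, not just
    a pair, and compares character by character, so it also covers prefix
    extraction over more than two strings:

    import os.path
    print os.path.commonprefix(["abcdefghijklmnopqrstuvwxyz",
                                "abcdefghijklmnopBHLHT",
                                "abcdefgXYZ"])  # prints: abcdefg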
  • Ravi at Aug 2, 2003 at 11:28 pm

    While you can be forgiven for not having guessed, os.path is the place to
    look:
    import os.path
    a = "abcdefghijklmnopqrstuvwxyz"
    b = "abcdefghijklmnopBHLHT"
    print os.path.commonprefix([a,b])

    -Scott David Daniels
    Scott.Daniels at Acm.Org
    Certainly not where I was expecting it, Thanks

    Ravi
  • John Machin at Aug 3, 2003 at 3:18 am

    On Sat, 02 Aug 2003 21:18:32 -0700, Scott David Daniels wrote:

    Ravi wrote:
    Hi,

    I have about 200GB of data that I need to go through and extract the
    common first part of a line. Something like this.
    a = "abcdefghijklmnopqrstuvwxyz"
    b = "abcdefghijklmnopBHLHT"
    c = extract(a,b)
    print c
    "abcdefghijklmnop"

    Here I want to extract the common string "abcdefghijklmnop". Basically I
    need a fast way to do that for any two given strings. For my situation,
    the common string will always be at the beginning of both strings. I can
    use regular expressions to do this, but from what I understand there is
    a lot of overhead. New data is being generated at the rate of about 1GB
    per hour, so this needs to be reasonably fast while leaving CPU time for
    other processes.

    Thanks
    Ravi
    While you can be forgiven for not having guessed, os.path is the place to
    look:
    import os.path
    a = "abcdefghijklmnopqrstuvwxyz"
    b = "abcdefghijklmnopBHLHT"
    print os.path.commonprefix([a,b])
    I don't think that os.path.commonprefix was designed with 200Gb of
    data in mind. Inspection of Lib/*path.py gives one the impression that
    it spends a lot of time discovering that the first element is a prefix
    of itself.

    Ravi, You may need to drop down to C to get the speed you want for
    your requirement to find the longest common prefix of two strings. Two
    things puzzling me: (1) how you would do this with regular expressions
    (2) you have 200Gb now, new data arriving at the rate of 1Gb per hour,
    after a year you have almost 9000Gb; where are you putting it all?
    BTW, I do hope your algorithm is not O(N**2) ...

    Cheers,
    John
  • Ravi at Aug 3, 2003 at 3:46 am

    I don't think that os.path.commonprefix was designed with 200Gb of
    data in mind. Inspection of Lib/*path.py gives one the impression that
    it spends a lot of time discovering that the first element is a prefix
    of itself.

    Ravi, You may need to drop down to C to get the speed you want for
    your requirement to find the longest common prefix of two strings. Two
    things puzzling me: (1) how you would do this with regular expressions
    (2) you have 200Gb now, new data arriving at the rate of 1Gb per hour,
    after a year you have almost 9000Gb; where are you putting it all?
    BTW, I do hope your algorithm is not O(N**2) ...

    Cheers,
    John
    Well, I timed os.path.commonprefix with some typical data and it's
    pulling about 40usec per loop. So I did what I hated and coded a little
    function in C. Goes something like this. My reasoning is that usually
    the point where the strings start to differ is in the 30 - 50 character
    range. Basically the idea is the same as a binary search on a sorted
    array. Divide and conquer by going halfway each time.

    Read in both strings.
    Check to see if the first character matches.
    If yes:
        Check halfway through the string and see if that character matches.
        Repeatedly check halfway until the difference point is found.
        Go back through from the difference point backwards and make sure
        the characters match from the start to the difference point.

    I timed it, and it seems to be doing about 3.5usec per loop. Using
    pipes, I can feed it directly into the processing program. Good enough
    for me.
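
    A rough Python rendering of the halving idea described above (a sketch,
    not Ravi's actual C; extract_bisect is a hypothetical name). Comparing
    whole slices rather than single characters keeps the bisection correct
    even if characters happen to match again beyond the first difference:

    def extract_bisect(a, b):
        # Invariant: a[:lo] == b[:lo]; the prefix length is at most hi.
        lo, hi = 0, min(len(a), len(b))
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if a[lo:mid] == b[lo:mid]:  # whole-slice compare runs at C speed
                lo = mid
            else:
                hi = mid - 1            # first mismatch lies in a[lo:mid]
        return a[:lo]

    print extract_bisect("abcdefghijklmnopqrstuvwxyz",
                         "abcdefghijklmnopBHLHT")  # abcdefghijklmnop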

    As for the data, it's data from a radio telescope that's being recorded.
    We do pattern analysis and reduce it to strings. By examining these
    strings (more analysis than just the first common bit of course), we can
    determine what data should be looked at further and what data is
    garbage. The 9000GB problem isn't all that bad, the stuff compresses
    extremely well, down to about 700GB for a year's worth. A couple of RAID
    arrays makes quick work of that.

    Thanks,

    Ravi
  • Andrew Dalke at Aug 3, 2003 at 6:22 am

    Ravi:
    Read in both strings.
    Check to see if the first character matches.
    If yes:
        Check halfway through the string and see if that character matches.
        Repeatedly check halfway until the difference point is found.
        Go back through from the difference point backwards and make sure
        the characters match from the start to the difference point.

    I timed it, and it seems to be doing about 3.5usec per loop.
    There's a lot of overhead for doing that. Have you tried the simple

    char *s1 = ... the first string ..
    char *s2 = ... the second string ..
    n = ... the shorter of the two ..
    for(i=0; i<n; i++) {
        if (*s1++ != *s2++) {
            break;
        }
    }
    return ... the string s1[:n] (or even just the int)

    Easy to understand, and the CPU is spending almost its whole
    time doing character tests.

    Andrew
    dalke at dalkescientific.com
  • Jim Richardson at Aug 3, 2003 at 7:22 am

    On Sun, 3 Aug 2003 00:22:46 -0600, Andrew Dalke wrote:
    Ravi:
    Read in both strings.
    Check to see if the first character matches.
    If yes:
        Check halfway through the string and see if that character matches.
        Repeatedly check halfway until the difference point is found.
        Go back through from the difference point backwards and make sure
        the characters match from the start to the difference point.

    I timed it, and it seems to be doing about 3.5usec per loop.
    There's a lot of overhead for doing that. Have you tried the simple

    char *s1 = ... the first string ..
    char *s2 = ... the second string ..
    n = ... the shorter of the two ..
    for(i=0; i<n; i++) {
        if (*s1++ != *s2++) {
            break;
        }
    }
    return ... the string s1[:n] (or even just the int)

    Easy to understand, and the CPU is spending almost its whole
    time doing character tests.
    Why bother finding out which one is the shorter? if you try the compare,
    and you run out of the other to compare to, then by default, it's not
    the same :)


    --
    Jim Richardson http://www.eskimo.com/~warlock

    Linux, because eventually, you grow up enough to be trusted with a fork()
  • Andrew Dalke at Aug 3, 2003 at 6:44 pm

    Jim Richardson:
    Why bother finding out which one is the shorter? if you try the compare,
    and you run out of the other to compare to, then by default, it's not
    the same :)
    Because I wasn't sure if the strings had embedded NULs in them.
    Python strings allow those.

    Otherwise something like this would work

    char *s1, *s2 = ... the strings from Python
    char *s = s1;
    while ( *s1 && (*s1++ == *s2++))
        ;
    return the string from s->s1, or just the size s1-s.

    Andrew
    dalke at dalkescientific.com
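
    A two-line check of the embedded-NUL point above (an added aside):

    s = "abc\0def"
    print len(s)  # 7 -- Python counts every byte; C's strlen() stops at 3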
  • Jim Richardson at Aug 4, 2003 at 12:48 am

    On Sun, 3 Aug 2003 12:44:59 -0600, Andrew Dalke wrote:
    Jim Richardson:
    Why bother finding out which one is the shorter? if you try the compare,
    and you run out of the other to compare to, then by default, it's not
    the same :)
    Because I wasn't sure if the strings had embedded NULs in them.
    Python strings allow those.
    Ah, I hadn't considered that.
    Otherwise something like this would work

    char *s1, *s2 = ... the strings from Python
    char *s = s1;
    while ( *s1 && (*s1++ == *s2++))
        ;
    return the string from s->s1, or just the size s1-s.

    Andrew
    dalke at dalkescientific.com

    --
    Jim Richardson http://www.eskimo.com/~warlock

    Linux, because eventually, you grow up enough to be trusted with a fork()


  • Mañungo at Aug 22, 2003 at 6:51 am
    "Andrew Dalke" <adalke at mindspring.com> wrote in message news:<bgi9g8$dc0$1 at slb6.atl.mindspring.net>...
    Ravi:
    Read in both strings.
    Check to see if the first character matches.
    If yes:
        Check halfway through the string and see if that character matches.
        Repeatedly check halfway until the difference point is found.
        Go back through from the difference point backwards and make sure
        the characters match from the start to the difference point.

    I timed it, and it seems to be doing about 3.5usec per loop.
    There's a lot of overhead for doing that. Have you tried the simple

    char *s1 = ... the first string ..
    char *s2 = ... the second string ..
    n = ... the shorter of the two ..
    for(i=0; i<n; i++) {
        if (*s1++ != *s2++) {
            break;
        }
    }
    return ... the string s1[:n] (or even just the int)

    Easy to understand, and the CPU is spending almost its whole
    time doing character tests.

    Andrew
    dalke at dalkescientific.com
    First, I'm not a Python programmer... but I think it is better to test
    ints. Something like this:

    int *a = (int *) ...first string...;
    int *b = (int *) ...second string...;

    for( i = 0; i < n; i += 4 )
        if( *a++ != *b++ )
            break;
    char *aa = (char *) a;
    char *bb = (char *) b;
    if( *aa++ == *bb++ ) i++;
    if( *aa++ == *bb++ ) i++;
    if( *aa == *bb ) i++;

    and return i.
  • Andrew Dalke at Aug 22, 2003 at 7:28 am

    Mañungo:
    First, I'm not a python programmer... but I think is better test int.
    Something like that:

    int *a = (int *) ...first string...;
    int *b = (int *) ...second string...;
    Sure. That's an old trick. However, your code assumes your
    strings are word aligned. (Perhaps for Python they are but
    in general you cannot make that assumption about C char *s)

    You also have an off-by-one/off-by-four error. Suppose the
    two strings are "A", that is, an 'A' followed by a NUL. Then

    for (i=0; i<n; i+=4)
        if (*a++ != *b++)
            break

    will compare the four bytes and increment the counters.
    You then do

    char *aa = (char *) a;

    and work with the characters *after* the first four characters
    tested in the int compare.

    Finally, you assume ints are 4 bytes long, which is
    not universal.

    There's something to be said for simplicity. I also
    wonder if modern optimizing compliers could figure
    out some of this automatically, but I don't wonder
    enough to find out.

    Andrew
    dalke at dalkescientific.com
  • Andrew Bennetts at Aug 22, 2003 at 7:42 am

    On Fri, Aug 22, 2003 at 07:28:11AM +0000, Andrew Dalke wrote:
    There's something to be said for simplicity. I also
    wonder if modern optimizing compliers could figure
    out some of this automatically, but I don't wonder
    enough to find out.
    Judging from this URL, I'd say they probably can:
    http://lwn.net/Articles/45404/

    -Andrew.
  • Bengt Richter at Aug 3, 2003 at 7:44 pm

    On Sat, 2 Aug 2003 22:23:12 -0500, Jeff Epler wrote:
    This is a naive implementation of the 'extract' function.
    def extract(a, b):
        m = min(len(a), len(b))
        for i in range(m):
            if a[i] != b[i]:
                return a[:i]
        return a[:m]

    Here's one that uses the new zip() function:
    I don't see "zip" ;-)
    def extract2(a, b):
        m = min(len(a), len(b))
        for i, ai, bi in (range(m), a, b):
            if ai != bi: return a[:i]
        return a[:m]
    .. unfortunately, it seems to be slower than the first method. On my
    machine (800MHz PIII):
    $ python timeit.py -s 'import ravi' \
    'ravi.extract("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
    10000 loops, best of 3: 32.7 usec per loop
    My timing harness (I seem to need a new getopt for timeit.py under 2.3)
    shows a slight (15-22% less time) improvement for this 2.3 alternative:

    def commonprefix(s1, s2): # very little tested!
        try:
            for i, c in enumerate(s1):
                if c != s2[i]: return s1[:i]
        except IndexError:
            return s1[:i]
        return s1


    [12:39] C:\pywk\clp>timefuns ravi -c extract -s 'abcdefghijklmnopqrstuvwxyz' -s 'abcdefghijklmnopBHLHT' -c commonprefix -s 'abcdefghijklmnopqrstuvwxyz' -s 'abcdefghijklmnopBHLHT'
    timing oh: 0.000007 ratio
    extract: 0.000088 1.00
    commonprefix: 0.000074 0.85

    [12:39] C:\pywk\clp>timefuns ravi -c extract -s 'abcdefghijklmnopqrstuvwxyz' -s 'abcdefghijklmnopBHLHT' -c commonprefix -s 'abcdefghijklmnopqrstuvwxyz' -s 'abcdefghijklmnopBHLHT'
    timing oh: 0.000007 ratio
    extract: 0.000091 1.00
    commonprefix: 0.000071 0.78

    [12:40] C:\pywk\clp>timefuns ravi -c extract -s 'abcdefghijklmnopqrstuvwxyz' -s 'abcdefghijklmnopBHLHT' -c commonprefix -s 'abcdefghijklmnopqrstuvwxyz' -s 'abcdefghijklmnopBHLHT'
    timing oh: 0.000007 ratio
    extract: 0.000091 1.00
    commonprefix: 0.000071 0.78

    [12:40] C:\pywk\clp>timefuns ravi -c extract -s 'abcdefghijklmnopqrstuvwxyz' -s 'abcdefghijklmnopBHLHT' -c commonprefix -s 'abcdefghijklmnopqrstuvwxyz' -s 'abcdefghijklmnopBHLHT'
    timing oh: 0.000007 ratio
    extract: 0.000088 1.00
    commonprefix: 0.000071 0.81

    Regards,
    Bengt Richter
  • Alex Martelli at Aug 4, 2003 at 9:30 am
    Jeff Epler wrote:
    ...
    .. unfortunately, it seems to be slower than the first method. On my
    machine (800MHz PIII):
    $ python timeit.py -s 'import ravi' \
    'ravi.extract("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
    10000 loops, best of 3: 32.7 usec per loop
    Here's my proposal...:

    import sys

    def extract(a, b):
        m = min(len(a), len(b))
        for i in range(m):
            if a[i] != b[i]:
                return a[:i]
        return a[:m]

    def extract2(a, b):
        for i, ai, bi in zip(xrange(sys.maxint), a, b):
            if ai != bi: return a[:i]
        return a[:min(len(a), len(b))]

    def extract3(a, b):
        for i, ai in enumerate(a):
            if ai != b[i:i+1]:
                return a[:i]
        return a

    [alex at lancelot python2.3]$ python -O timeit.py -s 'import exa'
    'exa.extract("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
    100000 loops, best of 3: 13.9 usec per loop
    [alex at lancelot python2.3]$ python -O timeit.py -s 'import exa'
    'exa.extract2("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
    10000 loops, best of 3: 19.7 usec per loop
    [alex at lancelot python2.3]$ python -O timeit.py -s 'import exa'
    'exa.extract3("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
    100000 loops, best of 3: 15.7 usec per loop

    Now add after the "import sys" two lines:

    import psyco
    psyco.full()

    and run again:

    [alex at lancelot python2.3]$ python -O timeit.py -s 'import exa'
    'exa.extract("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
    1000000 loops, best of 3: 0.771 usec per loop

    [alex at lancelot python2.3]$ python -O timeit.py -s 'import exa'
    'exa.extract3("abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopBHLHT")'
    100000 loops, best of 3: 3.37 usec per loop

    However, extract2 doesn't run correctly with psyco (gets a MemoryError).

    Still, the 18-times acceleration that psyco is able to effect on
    the naive extract IS typical of psyco's effect on functions coded
    in simple, elementary terms. When you really need speed, assuming
    that your processor is Intel-compatible, consider psyco (of course,
    you'll generally use psyco.profile, or something more selective
    still, rather than psyco.full) -- orders-of-magnitude improvements
    on low-level bottleneck functions are anything but surprising...


    Alex
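
    For instance, a minimal sketch (assuming the psyco API of the time) of
    the more selective use Alex mentions above: compile only the one hot
    function rather than everything in the process:

    import psyco

    def extract(a, b):
        m = min(len(a), len(b))
        for i in range(m):
            if a[i] != b[i]:
                return a[:i]
        return a[:m]

    psyco.bind(extract)  # specialize just this function; leave the rest alone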
  • Alex Martelli at Aug 4, 2003 at 11:56 am

    Ravi wrote:

    Hi,

    I have about 200GB of data that I need to go through and extract the
    common first part of a line. Something like this.
    a = "abcdefghijklmnopqrstuvwxyz"
    b = "abcdefghijklmnopBHLHT"
    c = extract(a,b)
    print c
    "abcdefghijklmnop"

    Here I want to extract the common string "abcdefghijklmnop". Basically I
    need a fast way to do that for any two given strings. For my situation,
    the common string will always be at the beginning of both strings. I can
    Here's my latest study on this:

    *** pexa.py:

    import sys

    import psyco
    psyco.full()

    import cexa
    import exa

    def extract(a, b):
        m = min(len(a), len(b))
        for i in range(m):
            if a[i] != b[i]:
                return a[:i]
        return a[:m]

    def extract2(a, b):
        for i, ai, bi in zip(xrange(len(a)), a, b):
            if ai != bi: return a[:i]
        return a[:min(len(a), len(b))]

    def extract3(a, b):
        for i, ai in enumerate(a):
            if ai != b[i:i+1]:
                return a[:i]
        return a

    extract_pyrex = exa.exa

    extract_c = cexa.cexa

    *** exa.pyx:

    def exa(a, b):
        cdef int la
        cdef int lb
        la = len(a)
        lb = len(b)
        cdef int lmin
        lmin = min(la, lb)
        cdef int i
        i = 0
        while i < lmin:
            if a[i] != b[i]:
                return a[:i]
            i = i + 1
        if lmin == la:
            return a
        else:
            return b

    *** cexa.c:

    #include <Python.h>

    static PyObject*
    cexa(PyObject* self, PyObject* args)
    {
        char *a, *b;
        int la, lb;
        int lmin, i;
        if(!PyArg_ParseTuple(args, "s#s#", &a, &la, &b, &lb))
            return 0;

        lmin = la;
        if(lb < lmin) lmin = lb;

        for(i=0; i<lmin; i++)
            if(a[i] != b[i])
                break;

        return Py_BuildValue("s#", a, i);
    }

    static PyMethodDef cexaMethods[] = {
        {"cexa", cexa, METH_VARARGS, "Extract common prefix"},
        {0}
    };

    void
    initcexa(void)
    {
        Py_InitModule("cexa", cexaMethods);
    }

    I've built the pyrex-coded extension with:

    from distutils.core import setup
    from distutils.extension import Extension
    from Pyrex.Distutils import build_ext

    setup(
        name = "exa",
        ext_modules=[
            Extension("exa", ["exa.pyx"])
        ],
        cmdclass = {'build_ext': build_ext}
    )

    and the C-coded one with:

    from distutils.core import setup
    from distutils.extension import Extension

    setup(
        name = "cexa",
        ext_modules=[
            Extension("cexa", ["cexa.c"])
        ],
    )

    and my measurements give me:

    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa' \
    'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    100000 loops, best of 3: 2.39 usec per loop
    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
    'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    100000 loops, best of 3: 2.14 usec per loop
    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
    'pexa.extract2("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    10000 loops, best of 3: 30.2 usec per loop
    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
    'pexa.extract3("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    100000 loops, best of 3: 9.59 usec per loop
    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
    'pexa.extract_pyrex("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    10000 loops, best of 3: 21.8 usec per loop
    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
    'pexa.extract_c("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    100000 loops, best of 3: 1.88 usec per loop
    [alex at lancelot exi]$

    So, it seems you can still get a tiny drop of extra speed
    with a C-coded extension, though it's doubtful whether it's
    worth the bother wrt the psyco-optimized simple Python code
    in function 'extract'. I'm not sure where I went wrong in
    the Pyrex coding (it doesn't seem to be performing anywhere
    as well as I thought it might) and I'll be happy for real
    Pyrex expert to show me the way.

    Of course, as others have pointed out, it's unclear from
    your problem description that doing such operations pairwise
    on a lot of pairs of strings is actually what you need. It
    IS quite possible that what you're doing could often be
    better modeled, e.g., by repeated "prefix extractions"
    between ONE fixed string and several other candidate strings;
    or "prefix extraction" between a set of more than two strings.
    In each case, it's likely that you can get much better
    performance by more sophisticated algorithms. However,
    which algorithms those might be is unclear unless you can
    provide more details on what you're doing.


    Alex
  • John Machin at Aug 4, 2003 at 1:55 pm

    On Mon, 04 Aug 2003 11:56:04 GMT, Alex Martelli wrote:

    I'm not sure where I went wrong in
    the Pyrex coding (it doesn't seem to be performing anywhere
    as well as I thought it might) and I'll be happy for real
    Pyrex expert to show me the way.
    I don't call myself an expert, but here's my best shot:

    If you look at the generated C code, you'll see lots of conversion
    between C and Python types. The trick is to get your args into C, stay
    in C as much as possible, and ship back a Python return value.

    It's made harder with strings as there is not (yet) any way of hinting
    to Pyrex to use the "s#" gadget, you have to DIY, see below.

    cdef extern from "Python.h":
    int PyString_Size(object s)

    def exa2(arga, argb):
    cdef int la, lb, lmin, i
    cdef char *a, *b

    a = arga
    b = argb
    la = PyString_Size(arga)
    lb = PyString_Size(argb)
    # living dangerously, not testing for error;
    # Easy to eyeball for correctness in this case,
    # but ...
    if la <= lb:
    lmin = la
    else:
    lmin = lb
    i = 0
    while i < lmin:
    if a[i] != b[i]:
    return arga[:i]
    i = i + 1
    if lmin == la:
    return arga
    else:
    return argb
  • Richie Hindle at Aug 4, 2003 at 3:38 pm
    [Alex]
    I'm not sure where I went wrong in
    the Pyrex coding (it doesn't seem to be performing anywhere
    as well as I thought it might) and I'll be happy for real
    Pyrex expert to show me the way.
    I'm no expert, but I can see a few easily-fixed problems. The line:

    if a[i] != b[i]:

    is working with Python strings when it could be working with C
    strings. Here's the original code and its output on my machine:

    def exa(a, b):
        cdef int la
        cdef int lb
        la = len(a)
        lb = len(b)
        cdef int lmin
        lmin = min(la, lb)
        cdef int i
        i = 0
        while i < lmin:
            if a[i] != b[i]:
                return a[:i]
            i = i + 1
        if lmin == la:
            return a
        else:
            return b

    100000 loops, best of 3: 9.11 usec per loop

    Here's a modified version of the code comparing C strings:

    def exa(a, b):
        cdef char* c_a # `a` as a C string
        cdef char* c_b # `b` as a C string
        cdef int la
        cdef int lb

        c_a = a
        c_b = b
        la = len(a)
        lb = len(b)
        cdef int lmin
        lmin = min(la, lb)
        cdef int i
        i = 0
        while i < lmin:
            if c_a[i] != c_b[i]:
                return a[:i]
            i = i + 1
        if lmin == la:
            return a
        else:
            return b

    100000 loops, best of 3: 5.79 usec per loop

    Almost twice as fast. Looking at the generated C is always
    worthwhile when optimising Pyrex code - here's the code that does
    the comparison against Python strings:

    /* "D:\src\tests\pyrex\exa.pyx":11 */
    __pyx_3 = PyInt_FromLong(__pyx_v_i); if (!__pyx_3) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
    __pyx_1 = PyObject_GetItem(__pyx_v_a, __pyx_3); if (!__pyx_1) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
    Py_DECREF(__pyx_3); __pyx_3 = 0;
    __pyx_5 = PyInt_FromLong(__pyx_v_i); if (!__pyx_5) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
    __pyx_2 = PyObject_GetItem(__pyx_v_b, __pyx_5); if (!__pyx_2) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
    Py_DECREF(__pyx_5); __pyx_5 = 0;
    if (PyObject_Cmp(__pyx_1, __pyx_2, &__pyx_4) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
    __pyx_4 = __pyx_4 != 0;
    Py_DECREF(__pyx_1); __pyx_1 = 0;
    Py_DECREF(__pyx_2); __pyx_2 = 0;
    if (__pyx_4) {

    vs.

    /* "D:\src\tests\pyrex\exa.pyx":16 */
    __pyx_5 = ((__pyx_v_c_a[__pyx_v_i]) != (__pyx_v_c_b[__pyx_v_i]));
    if (__pyx_5) {

    for C strings. There's another similar optimisation that the C
    output leads you to: you can use strlen rather than Python's len:

    cdef extern from "string.h":
    int strlen(char*)

    def exa(a, b):
    cdef char* c_a # `a` as a C string
    cdef char* c_b # `b` as a C string
    cdef int la
    cdef int lb

    c_a = a
    c_b = b
    la = strlen(c_a)
    lb = strlen(c_b)
    cdef int lmin
    lmin = min(la, lb)
    cdef int i
    i = 0
    while i < lmin:
    if c_a[i] != c_b[i]:
    return a[:i]
    i = i + 1
    if lmin == la:
    return a
    else:
    return b

    100000 loops, best of 3: 3.58 usec per loop

    That replaces:

    /* "D:\src\tests\pyrex\exa.pyx":4 */
    __pyx_1 = __Pyx_GetName(__pyx_b, "len"); if (!__pyx_1) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
    __pyx_2 = PyTuple_New(1); if (!__pyx_2) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
    Py_INCREF(__pyx_v_a);
    PyTuple_SET_ITEM(__pyx_2, 0, __pyx_v_a);
    __pyx_3 = PyObject_CallObject(__pyx_1, __pyx_2); if (!__pyx_3) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
    Py_DECREF(__pyx_1); __pyx_1 = 0;
    Py_DECREF(__pyx_2); __pyx_2 = 0;
    __pyx_4 = PyInt_AsLong(__pyx_3); if (PyErr_Occurred()) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
    Py_DECREF(__pyx_3); __pyx_3 = 0;
    __pyx_v_la = __pyx_4;

    /* "D:\src\tests\pyrex\exa.pyx":5 */
    __pyx_1 = __Pyx_GetName(__pyx_b, "len"); if (!__pyx_1) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
    __pyx_2 = PyTuple_New(1); if (!__pyx_2) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
    Py_INCREF(__pyx_v_b);
    PyTuple_SET_ITEM(__pyx_2, 0, __pyx_v_b);
    __pyx_3 = PyObject_CallObject(__pyx_1, __pyx_2); if (!__pyx_3) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
    Py_DECREF(__pyx_1); __pyx_1 = 0;
    Py_DECREF(__pyx_2); __pyx_2 = 0;
    __pyx_4 = PyInt_AsLong(__pyx_3); if (PyErr_Occurred()) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
    Py_DECREF(__pyx_3); __pyx_3 = 0;
    __pyx_v_lb = __pyx_4;

    with :

    /* "D:\src\tests\pyrex\exa.pyx":12 */
    __pyx_v_la = strlen(__pyx_v_c_a);

    /* "D:\src\tests\pyrex\exa.pyx":13 */
    __pyx_v_lb = strlen(__pyx_v_c_b);

    and leaves the call to 'min' as the only remaining huge block of C.
    The final version looks like this, eliminating 'min' (Greg, can we have
    the ternary operator in Pyrex please? <good_mood ? wink : frown>)

    cdef extern from "string.h":
    int strlen(char*)

    def exa(a, b):
    cdef char* c_a # `a` as a C string
    cdef char* c_b # `b` as a C string
    cdef int la
    cdef int lb

    c_a = a
    c_b = b
    la = strlen(c_a)
    lb = strlen(c_b)
    cdef int lmin
    if la < lb:
    lmin = la
    else:
    lmin = lb
    cdef int i
    i = 0
    while i < lmin:
    if c_a[i] != c_b[i]:
    return a[:i]
    i = i + 1
    if lmin == la:
    return a
    else:
    return b


    1000000 loops, best of 3: 0.803 usec per loop

    Over ten times quicker than the original, for the sake of a couple of
    small tweaks driven by looking at the C output. Although the C still
    looks very verbose at first glance, it's now substantially the same as
    Alex's cexa.c.

    Hope that helps,

    --
    Richie Hindle
    richie at entrian.com
  • Bengt Richter at Aug 4, 2003 at 7:27 pm

    On Mon, 04 Aug 2003 11:56:04 GMT, Alex Martelli wrote:
    Ravi wrote:
    Hi,

    I have about 200GB of data that I need to go through and extract the
    common first part of a line. Something like this.
    a = "abcdefghijklmnopqrstuvwxyz"
    b = "abcdefghijklmnopBHLHT"
    c = extract(a,b)
    print c
    "abcdefghijklmnop"

    Here I want to extract the common string "abcdefghijklmnop". Basically I
    need a fast way to do that for any two given strings. For my situation,
    the common string will always be at the beginning of both strings. I can
    Here's my latest study on this:

    *** pexa.py:
    [...]

    JFTHOI, if you have the inclination, I'm curious how this slightly
    different 2.3-dependent version would fare in your harness on your
    system with the rest:

    def commonprefix(s1, s2): # very little tested!
        try:
            for i, c in enumerate(s1):
                if c != s2[i]: return s1[:i]
        except IndexError:
            return s1[:i]
        return s1

    [...]
    and my measurements give me:

    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa' \
    'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    100000 loops, best of 3: 2.39 usec per loop
    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
    'pexa.extract("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    100000 loops, best of 3: 2.14 usec per loop
    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
    'pexa.extract2("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    10000 loops, best of 3: 30.2 usec per loop
    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
    'pexa.extract3("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    100000 loops, best of 3: 9.59 usec per loop
    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
    'pexa.extract_pyrex("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    10000 loops, best of 3: 21.8 usec per loop
    [alex at lancelot exi]$ python -O timeit.py -s 'import pexa'
    'pexa.extract_c("abcdefghijklmonpKOU", "abcdefghijklmonpZE")'
    100000 loops, best of 3: 1.88 usec per loop
    [alex at lancelot exi]$
    Interesting, but I think I will have to write a filter so I can
    see a little more easily what your timeit.py outputs say ;-)

    Regards,
    Bengt Richter
  • Alex Martelli at Aug 4, 2003 at 10:27 pm
    Richie Hindle wrote:
    ...
    output leads you to: you can use strlen rather than Python's len:
    ...IF somebody absolutely assures you that the Python strings contain
    no nulls. I do not recall any such guarantee in the original thread,
    therefore it does not seem to me that this optimization is warranted.

    Over ten times quicker than the original, for the sake of a couple of
    small tweaks driven by looking at the C output. Although the C still
    looks very verbose at first glance, it's now substantially the same as
    Alex's cexa.c.
    Impressive! But IMHO a way must be found to use the actual lengths
    of the Python strings, as cexa.c did, rather than assuming that the
    Python strings contain no nulls. Perhaps this can be done by direct
    use of PyString_AsStringAndSize?


    Alex
  • John Machin at Aug 4, 2003 at 11:07 pm
    Richie Hindle <richie at entrian.com> wrote in message news:<mailman.1060011610.28906.python-list at python.org>...
    for C strings. There's another similar optimisation that the C
    output leads you to: you can use strlen rather than Python's len:
    You can, if you don't care about the possibility that the input may contain NULs.
  • Ravi at Aug 4, 2003 at 10:53 pm

    Ravi wrote:
    Hi,

    I have about 200GB of data that I need to go through and extract the
    common first part of a line. Something like this.
    a = "abcdefghijklmnopqrstuvwxyz"
    b = "abcdefghijklmnopBHLHT"
    c = extract(a,b)
    print c
    "abcdefghijklmnop"

    Here I want to extract the common string "abcdefghijklmnop". Basically I
    need a fast way to do that for any two given strings. For my situation,
    the common string will always be at the beginning of both strings. I can
    use regular expressions to do this, but from what I understand there is
    a lot of overhead. New data is being generated at the rate of about 1GB
    per hour, so this needs to be reasonably fast while leaving CPU time for
    other processes.

    Thanks
    Ravi
    I really appreciate all your help, Alex, Jim, Jeff, Andrew, John, Richie
    and Bengt. However I have this problem taken care of now. Took around 6
    hours to run on a P4 2.8Ghz 1.0GB DDR (I suspect I/O limitations). As
    for the data, if you want to know about it just for the sake of an
    optimized algorithm, there are no Null (\0) characters in the strings
    (actually they're Base64), and I've included a typical pair of strings.
    The version I used was Andrew's.

    Someone suggested that this would be better done in larger sets than
    just pairs. That's not suitable because of the structure of the data,
    two strings might be highly correlated, but are probably quite different
    from another pair of strings. Perhaps more significantly, correlation in
    sets of greater than two has no physical significance to the experiment.

    I grabbed this from a typical data file. So I would want to be
    extracting 'A832nv81a'
    "
    A832nv81a81nW103v9c24jgpy92T
    A832nv81aTyqiep4v9c324jgpy92T
    "

    Thanks for your help everyone, coming from a Perl (It's a four letter
    word to me :) world, I'm very impressed by how helpful all of you are.

    Ravi
  • Bengt Richter at Aug 7, 2003 at 2:35 am

    On Mon, 04 Aug 2003 18:53:27 -0400, Ravi wrote:
    Ravi wrote:
    Hi,

    I have about 200GB of data that I need to go through and extract the
    common first part of a line. Something like this.
    a = "abcdefghijklmnopqrstuvwxyz"
    b = "abcdefghijklmnopBHLHT"
    c = extract(a,b)
    print c
    "abcdefghijklmnop"

    Here I want to extract the common string "abcdefghijklmnop". Basically I
    need a fast way to do that for any two given strings. For my situation,
    the common string will always be at the beginning of both strings. I can
    use regular expressions to do this, but from what I understand there is
    a lot of overhead. New data is being generated at the rate of about 1GB
    per hour, so this needs to be reasonably fast while leaving CPU time for
    other processes.

    Thanks
    Ravi
    I really appreciate all your help, Alex, Jim, Jeff, Andrew, John, Richie
    and Bengt. However I have this problem taken care of now. Took around 6
    hours to run on a P4 2.8Ghz 1.0GB DDR (I suspect I/O limitations). As
    for the data, if you want to know about it just for the sake of an
    optimized algorithm, there are no Null (\0) characters in the strings
    (actually they're Base64), and I've included a typical pair of strings.
    The version I used was Andrew's.

    Someone suggested that this would be better done in larger sets than
    just pairs. That's not suitable because of the structure of the data,
    two strings might be highly correlated, but are probably quite different
    from another pair of strings. Perhaps more significantly, correlation in
    sets of greater than two has no physical significance to the experiment.

    I grabbed this from a typical data file. So I would want to be
    extracting 'A832nv81a'
    "
    A832nv81a81nW103v9c24jgpy92T
    A832nv81aTyqiep4v9c324jgpy92T
    "
    I still don't understand your use case ;-)

    1) Are you periodically batch processing to look for near-pairs in a 200GB
    unsorted list of strings? Are they just in a huge unsorted ascii file with
    line feeds delimiting?

    or

    1a) Does someone send you a single string in a message every now and then (how often?)
    and ask you to find the closest match in the data base? I.e., 200GB at 30 bytes/string
    would be 6.67 billion strings. Which you say you now crunch in ~6hrs? You don't do that
    amount of crunching just for one match request, do you?

    1b) If you get a lot of "match requests," wouldn't you gain a lot by at least partially
    ordering the incoming streams into separate buckets? E.g., the simplest thing would be
    to have 64 files corresponding to the first character, or even 64*64 files for the first two.
    Then have 64*64 file buffers (not open files) of say 1000 strings each, which would be
    64*64*1000*30 or call it 32k/file buffer -> 64*64*32*1024/(1024*1024) ->128 megabytes of buffers
    and on the average you'd expect 1GB/hr to fill a buffer of 32k about
    (1e9/3600)/(32*1024) -> 8.477 times/second. It should be reasonable to open a file for append
    and write 32k and close it that often, IWT. Whatever box writes the 200GB data now could either
    do the partitioning itself or send it by gigabit ethernet to a dedicated box for that, IWT. Or
    you could distribute the load by partitioning the ethernet destinations per the leading n bits
    and let 2**n boxes maybe do even more ordering on the fly.

    Even without distributing the load, this could give you 200GB/4096 (if evenly distributed!!)
    or only '%e'%(200e9/4096) ->4.882813e+007 or less than 50MB to read off disk and search
    to make a probe with a single string. If you sorted when you probed and wrote back the
    sorted data with an indication that it was sorted, further probes to that same partition could
    go a lot faster.

    2) Why do the string lengths vary if they are samples from some process? Are they
    effectively just right-stripped from some fixed length, which would be a guaranteed max length?
    2a) what are the max and min lengths?

    3) what is the distribution of leading characters as they come from the source?
    3a) E.g., is the first character uniformly distributed within its possible Base64 codes?
    3b) Are they uncorrelated timewise? Or e.g. do they arrive in clumps of <<64 possibilities for a time?
    3c) Are the individual strings' characters randomly distributed and uncorrelated within the string?
    3d) You say 9000GB compresses to 700GB. That suggests a lot of correlation and/or uneven distributions.
    Is there a fixed dictionary you could use to repack the strings bit-wise with huffman and/or
    rle compression?
    3d1) What is the significance of the Base64 character boundaries? I.e., would a common prefix of
    8-bit bytes representing sequentially compressed string be good enough, even if the first non-
    matching byte contained some bits that did match and would have been included in base64?

    3d2) Note that compressing during partitioned 4096-bucket storage could save a lot of space
    as well as speed up matching.

    4) 1GB/hr translates to about 10k strings like the your example per second.
    4a) Are they just logged sequentially to a 200GB data store? (cf. 1b)
    4b==2a) Is there a max and/or min length to these strings? (the above are 29 & 30 chars).
    4c) You say they are base 64. That means binary would make (6/8)*200gb = 150GB, and the
    strings (6/8)*30 ~ 22.5 or say 24 for a nice multiple of 8, even without compression.

    Just some thoughts. Might want to check my arithmetic ;-)
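
    A minimal sketch of the first-character bucketing from point 1b above
    (an added illustration; names and paths are hypothetical):

    import os

    BUCKET_DIR = 'buckets'
    if not os.path.isdir(BUCKET_DIR):
        os.makedirs(BUCKET_DIR)

    def bucket_path(line):
        # One file per leading character; '/' is the only Base64 character
        # unsafe in a filename. (A case-insensitive filesystem would also
        # need to disambiguate 'a' from 'A'.)
        c = line[0]
        if c == '/':
            c = 'slash'
        return os.path.join(BUCKET_DIR, c)

    def store(lines):
        # Append each line to its bucket; a later prefix probe then reads
        # only the one matching bucket instead of scanning everything.
        for line in lines:
            f = open(bucket_path(line), 'a')
            f.write(line + '\n')
            f.close()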
    Thanks for your help everyone, coming from a Perl (It's a four letter
    word to me :) world, I'm very impressed by how helpful all of you are.
    Some newsgroups are like that. In the past I spent a fair amount of time on the Borland
    Delphi n.g., and that was mostly very friendly and helpful too. I think that's just the
    way people are unless some stupidities get in the way. Renaissance yes! Armageddon no!

    Regards,
    Bengt Richter
  • Christos TZOTZIOY Georgiou at Aug 6, 2003 at 6:54 pm
    On Sat, 02 Aug 2003 17:39:26 -0400, rumours say that Ravi
    <rxs141 at cwru.edu> might have written:
    I have about 200GB of data that I need to go through and extract the
    common first part of a line. Something like this.
    If you know in advance that many of these comparisons will involve large
    strings, you might improve your apps' response time by using the
    os.path.commonprefix function modified as proposed in

    http://sourceforge.net/tracker/index.php?func=detail&aid=681780&group_id=5470&atid=305470

    which is O(logN)-ish instead of O(N). This patch was sent by me, but
    there is a suggestion by another SF user which seems to be efficient too
    (never benchmarked it against mine).
    --
    TZOTZIOY, I speak England very best,
    Microsoft Security Alert: the Matrix began as open source.
