Ravi wrote:

I have about 200 GB of data that I need to go through, extracting the
common first part of each pair of lines. Something like this:
a = "abcdefghijklmnopqrstuvwxyz"
b = "abcdefghijklmnopBHLHT"
c = extract(a,b)
print c

Here I want to extract the common string "abcdefghijklmnop". Basically I
need a fast way to do that for any two given strings. For my situation,
the common string will always be at the beginning of both strings. I can
use regular expressions to do this, but from what I understand there is
a lot of overhead. New data is being generated at the rate of about 1GB
per hour, so this needs to be reasonably fast while leaving CPU time for
other processes.
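
For reference, here is a minimal sketch of the pairwise prefix extraction described above. It is not necessarily any of the versions posted in this thread; the name common_prefix is made up for illustration. It just walks both strings until the first mismatch, with no regular expressions involved.

```python
import os.path


def common_prefix(a, b):
    """Return the longest string that both a and b start with."""
    limit = min(len(a), len(b))
    i = 0
    # Advance while both strings agree at position i.
    while i < limit and a[i] == b[i]:
        i += 1
    return a[:i]


a = "abcdefghijklmnopqrstuvwxyz"
b = "abcdefghijklmnopBHLHT"
c = common_prefix(a, b)
print(c)  # abcdefghijklmnop

# The standard library's os.path.commonprefix compares character by
# character (it is not path-aware), so it should agree on plain strings.
assert os.path.commonprefix([a, b]) == c
```

Whether the explicit loop or os.path.commonprefix is faster will depend on the data and the Python version, so it is probably worth timing both on a sample before committing to one.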

I really appreciate all your help, Alex, Jim, Jeff, Andrew, John, Richie
and Bengt. However, I have this problem taken care of now. It took around
6 hours to run on a 2.8 GHz P4 with 1.0 GB of DDR (I suspect I/O
limitations). As for the data, if you want to know about it just for the
sake of an optimized algorithm, there are no null (\0) characters in the
strings (actually they're Base64), and I've included a typical pair of
strings. The version I used was Andrew's.

Someone suggested that this would be better done in larger sets than
just pairs. That's not suitable because of the structure of the data:
two strings might be highly correlated, but are probably quite different
from another pair of strings. Perhaps more significantly, correlation in
sets larger than two has no physical significance to the experiment.

I grabbed this from a typical data file. So I would want to be
extracting 'A832nv81a'.

Thanks for your help, everyone. Coming from a Perl (it's a four-letter
word to me :) world, I'm very impressed by how helpful all of you are.


group: python-list @
posted: Aug 2, '03 at 9:39p
active: Aug 22, '03 at 7:42a