I don't think that os.path.commonprefix was designed with 200Gb of
data in mind. Inspection of Lib/*path.py gives one the impression that
it spends a lot of time discovering that the first element is a prefix
of itself.

Ravi, You may need to drop down to C to get the speed you want for
your requirement to find the longest common prefix of two strings. Two
things puzzling me: (1) how you would do this with regular expressions
(2) you have 200Gb now, new data arriving at the rate of 1Gb per hour,
after a year you have almost 9000Gb; where are you putting it all?
BTW, I do hope your algorithm is not O(N**2) ...

Well, I timed os.path.commonprefix with some typical data and it's
pulling about 40usec per loop. So I did what I hated and coded a little
function in C. Goes something like this. My reasoning is that usually
the point where the strings start to differ is in the 30 - 50 character
range. Basically the idea is the same as a binary search on a sorted
array. Divide and conquer by going halfway each time.

Read in both strings.
Check to see if the first character matches.
If yes:
Check halfway through the string and see if that character matches
Repeatedly check halfway until the difference point is found.
Go back through from the difference point backwards and make sure
the characters match from the start to the difference point.

I timed it, and it seems to be doing about 3.5usec per loop. Using
pipes, I can feed it directly into the processing program. Good enough
for me.

As for the data, it's data from a radio telescope that's being recorded.
We do pattern analysis and reduce it to strings. By examining these
strings (more analysis than just the first common bit of course), we can
determine what data should be looked at further and what data is
garbage. The 9000GB problem isn't all that bad, the stuff compresses
extremely well, down to about 700GB for a year's worth. A couple of RAID
arrays makes quick work of that.



Search Discussions

Discussion Posts


Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 7 of 27 | next ›
Discussion Overview
grouppython-list @
postedAug 2, '03 at 9:39p
activeAug 22, '03 at 7:42a



site design / logo © 2022 Grokbase