Hi,
I have two large files, each with more than 200,000,000 lines. Each line
consists of two fields: an id and a value, and the ids are sorted.

for example:

file1
(uin_a y)
1 10000245
2 12333
3 324543
5 3464565
....


file2
(uin_b gift)
1 34545
3 6436466
4 35345646
5 463626
....

I want to merge them into one file whose lines each consist of an id and
the sum of the two values from file1 and file2.
The code is as below:

uin_y = open('file1')
uin_gift = open('file2')

y_line = uin_y.next()
gift_line = uin_gift.next()

while 1:
    try:
        uin_a, y = [int(i) for i in y_line.split()]
        uin_b, gift = [int(i) for i in gift_line.split()]
        if uin_a == uin_b:          # id present in both files: sum the values
            score = y + gift
            print uin_a, score
            y_line = uin_y.next()
            gift_line = uin_gift.next()
        if uin_a < uin_b:           # id only in file1
            print uin_a, y
            y_line = uin_y.next()
        if uin_a > uin_b:           # id only in file2
            print uin_b, gift
            gift_line = uin_gift.next()
    except StopIteration:
        break


The problem is that this code runs for 40+ minutes on a server (16 cores,
32 GB memory).
The time complexity is O(n) and there are not many operations per line,
so I think it should be faster, and I want to ask which part costs so much.
I tried the cProfile module but didn't learn much from it.
I guess maybe it is the int() conversion that costs so much, but I'm not sure,
and I don't know how to avoid it.
Is there a way to avoid type conversion in Python, such as scanf in C?
Thanks for your help!
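
For reference, a minimal, hedged sketch of how the script as a whole could be
run under cProfile to see whether split(), int(), or I/O dominates (the script
and output file names here are illustrative, not from the original post):

# Profile the whole merge script and dump the stats to a file:
#     python -m cProfile -o merge.prof merge_script.py > merged.txt
# Then inspect the most expensive calls:
import pstats
stats = pstats.Stats('merge.prof')
stats.sort_stats('cumulative').print_stats(10)   # ten most expensive entries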


  • Ken Seehart at Jun 21, 2011 at 5:31 am

    On 6/20/2011 7:59 PM, king6cong at gmail.com wrote:
    [quoted original message trimmed]
    Unfortunately Python does not have a scanf equivalent, AFAIK. Most use
    cases for scanf can be handled by regular expressions, but that would
    clearly be useless for you, and would just slow you down more since it
    does not perform the int conversion for you.

    Your code appears to have a bug: I would expect that the last entry will
    be lost unless both files end with the same index value. Be sure to test
    your code on a few short test files.
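
    A rough sketch of one way to keep those trailing entries, assuming the
    loop structure quoted above (this is an illustrative fix, not code from
    the thread): once the paired loop exits, copy whatever is left in the
    other file straight through.

    # Hypothetical tail handling after the while loop: one of the files may
    # still contain lines that were never printed, so flush them unchanged.
    # (The line already held in y_line/gift_line when StopIteration fired
    # would also need to be emitted; that detail is omitted here.)
    for rest in uin_y:
        uid, val = rest.split()
        print uid, val
    for rest in uin_gift:
        uid, val = rest.split()
        print uid, val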

    I recommend psyco to make the whole thing faster.

    Regards,
    Ken Seehart

  • Ken Seehart at Jun 21, 2011 at 5:50 am

    On 6/20/2011 10:31 PM, Ken Seehart wrote:
    [quoted messages trimmed]
    Another thought (a bit of extra work, but you might find it worthwhile
    if psyco doesn't yield a sufficient boost):

    Write a couple of filter programs to convert to and from binary data
    (pairs of 32- or 64-bit integers, depending on your requirements).

    Modify your program to use the subprocess module to open two instances
    of the text-to-binary conversion process on the two input files. Then
    pipe the output of your program into the binary-to-text filter.

    This might turn out to be faster, since each process would make use of
    a core. It also gives you other options, such as keeping your data in
    binary form for processing and converting to text only as needed.

    Ken Seehart

  • Vijay Murthy at Jun 21, 2011 at 9:31 am
    I just wrote something. I could not run a profiler or analyze the timing,
    but I felt it was efficient. Have a look and see if it helps:

    from itertools import *

    def sep_add(line1, line2):
        if line1 and line2:
            val1 = line1.split()
            val2 = line2.split()
            if (val1 and val2) and (len(val1) == len(val2) == 2):
                return (val1[0] == val2[0])

    def add_col_files(file1, file2):
        fs1 = open(file1, "r")
        fs2 = open(file2, "r")

        fsn = open("new_sample.txt", "w")

        # Zip the files together and find the ones that match the index,
        # process the tuple accordingly;
        # output should be what you want to write to the file

        for k in ifilter(lambda (i, j): sep_add(i, j), izip(fs1, fs2)):
            if k:
                output = k[0] + " " + k[1] + "\n"  # sample output
                fsn.write(output)

        fsn.close()
        fs1.close()
        fs2.close()

    if __name__ == "__main__":
        import time
        start = time.localtime(time.time())
        print time.asctime(start)

        #add_col_files("sample1.txt", "sample3.txt")

        end = time.localtime(time.time())
        print time.asctime(end)

    It took about a minute on my machine to compare files of about 50-100 MB.
    As I said, I have not done much testing of this code.

    2011/6/21 Ken Seehart <ken at seehart.com>
    [quoted messages trimmed]
  • Ken Seehart at Jun 21, 2011 at 10:51 am

    On 6/20/2011 11:56 PM, king6cong at gmail.com wrote:
    Thanks for your reply, Ken :)
    I found the two files end with the same id, so I was lazy ^-^
    I tried psyco, and unfortunately it took nearly the same time as before.
    Is it true that we can only get str from files in Python?
    Nope ^_* . There are many applications, such as image processing, that
    involve working with binary data.

    (^_* well, technically yes, actually: read() does in fact return str, but
    the str can contain binary data.)

    But in order to do this, you need to use one of several modules that
    allow Python to operate on flat data.

    Two standard modules exist for this purpose: *array* and *struct*. In
    addition there are others, such as *numpy* (for mathematical
    applications) and *ctypes* (for interoperability between Python and C/C++).

    For your application, the *struct* module is sufficient.
    >>> import struct
    >>> fout = open('junk.dat', 'wb')  # open for writing binary
    >>> fout.write(struct.pack('LL', 123, 234))
    >>> fout.write(struct.pack('LL', 123, 234))
    >>> fout.write(struct.pack('LL', 3, 4))
    >>> fout.close()
    >>> fin = open('junk.dat', 'rb')  # open for reading binary
    >>> print struct.unpack('LL', fin.read(8))
    (123, 234)
    >>> print struct.unpack('LL', fin.read(8))
    (123, 234)
    >>> print struct.unpack('LL', fin.read(8))
    (3, 4)
    >>> print struct.unpack('LL', fin.read(8))  # raises struct.error at end of file (because 0 bytes were read)
    Traceback (most recent call last):
      File "<string>", line 1, in <fragment>
    struct.error: unpack requires a string argument of length 8
    >>>
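
    Applied to the merge problem, a rough sketch (assuming the inputs have
    already been converted to files of packed 'LL' pairs as above; the file
    names and helper are illustrative, not from the thread):

    import struct

    rec = struct.Struct('LL')             # must match the format used to pack

    def records(path):
        # generate (id, value) tuples from a file of packed 'LL' records
        f = open(path, 'rb')
        while True:
            chunk = f.read(rec.size)
            if len(chunk) < rec.size:
                break
            yield rec.unpack(chunk)
        f.close()

    The merge loop could then iterate over records('file1.dat') and
    records('file2.dat') instead of splitting and int()-converting text lines.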

    On June 21, 2011, at 1:50 PM, Ken Seehart <ken at seehart.com> wrote:
    [quoted messages trimmed]
  • Terry Reedy at Jun 21, 2011 at 10:41 pm

    On 6/20/2011 10:59 PM, king6cong at gmail.com wrote:
    > [quoted question trimmed]
    > the codes are as below:
    One minor thing you can do is use bound methods:

    > uin_y=open('file1')
    > uin_gift=open('file2')

    ynext = open('file1').next
    gnext = open('file2').next

    > y_line=uin_y.next()
    > gift_line=uin_gift.next()

    y_line = ynext()
    gift_line = gnext()

    and similarly for all .next appearances in what follows.

    > while 1:
    >     try:
    >         uin_a,y=[int(i) for i in y_line.split()]

    This creates an unnecessary second temporary list. Unroll the loop:

    pair = y_line.split()
    uin_a = int(pair[0])
    y = int(pair[1])

    > uin_b,gift=[int(i) for i in gift_line.split()]

    Same for this line.
    > [rest of quoted code trimmed]


    --
    Terry Jan Reedy
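
    Putting those two suggestions together, a hedged sketch of how the
    original loop might look (an illustrative rewrite, not code from the
    thread; it keeps equivalent control flow, including the known issue with
    trailing lines):

    ynext = open('file1').next
    gnext = open('file2').next

    y_line = ynext()
    gift_line = gnext()

    while 1:
        try:
            pair = y_line.split()
            uin_a = int(pair[0])
            y = int(pair[1])
            pair = gift_line.split()
            uin_b = int(pair[0])
            gift = int(pair[1])
            if uin_a == uin_b:
                print uin_a, y + gift
                y_line = ynext()
                gift_line = gnext()
            elif uin_a < uin_b:
                print uin_a, y
                y_line = ynext()
            else:
                print uin_b, gift
                gift_line = gnext()
        except StopIteration:
            break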
