FAQ
I am new to Python but have used many other (mostly dead) languages in the past. I want to be able to process *.txt and *.csv files. I can now read them and then change them as needed - mostly just take a column and do some if-then to create a new variable. My problem is sorting these files:
1.) How do I sort file1.txt by position and write out file1_sorted.txt? For example, if all the records are 100 bytes long and there is a three-digit id in positions 0-2, here would be some sample data:
a. 001JohnFilben...
b. 002Joe Smith...
2.) How do I sort file1.csv by column name? For example, if all the records have three column headings, "id", "first_name", "last_name", here would be some sample data:
a. Id, first_name,last_name
b. 001,John,Filben
c. 002,Joe, Smith
3.) What about if I have millions of records and I am processing on a laptop with a large external drive - basically, are there space considerations? What are the workarounds?
Any help would be appreciated. Thank you.

John Filben
Cell Phone - 773.401.2822
Email - johnfilben at yahoo.com




  • Mk at Mar 3, 2010 at 6:20 pm

    John Filben wrote:
    I am new to Python but have used many other (mostly dead) languages in
    the past. I want to be able to process *.txt and *.csv files. I can
    now read that and then change them as needed - mostly just take a column
    and do some if-then to create a new variable. My problem is sorting
    these files:

    1.) How do I sort file1.txt by position and write out
    file1_sorted.txt; for example, if all the records are 100 bytes long and
    there is a three digit id in the position 0-2; here would be some sample
    data:

    a. 001JohnFilben...

    b. 002Joe Smith...
    Use a dictionary:

    linedict = {}
    for line in f:
        key = line[:3]
        # or alternatively 'line' if you want to include key in the line anyway
        linedict[key] = line[3:]

    sortedlines = []
    # bug: .sort() returns None, so this loop fails (mk corrects it later in the thread)
    for key in linedict.keys().sort():
        sortedlines.append(linedict[key])

    (untested)

    This is the simplest, and probably inefficient approach. But it should work.
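The question also asks how to write the result to file1_sorted.txt, which the snippet above leaves out. A minimal sketch of the whole round trip follows, assuming one record per line, an id in positions 0-2, and zero-padded ids (so that plain string comparison matches numeric order). The helper names `record_id` and `sort_fixed_width` are made up for illustration and do not come from the thread.

```python
import os
import tempfile


def record_id(line):
    # Sort key: the three-character id in positions 0-2
    return line[:3]


def sort_fixed_width(src_path, dst_path):
    """Read every line of src_path, sort by id, write the result to dst_path."""
    with open(src_path) as src:
        lines = src.readlines()
    # Make sure the last record ends with a newline so records stay separate
    lines = [line if line.endswith("\n") else line + "\n" for line in lines]
    lines.sort(key=record_id)
    with open(dst_path, "w") as dst:
        dst.writelines(lines)


# Small demonstration in a temporary directory
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "file1.txt")
dst_path = os.path.join(workdir, "file1_sorted.txt")
with open(src_path, "w") as f:
    f.write("002Joe Smith\n001JohnFilben\n")

sort_fixed_width(src_path, dst_path)
with open(dst_path) as f:
    print(f.read())
```

Unlike the dictionary version, this keeps every record even when two lines share the same id.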
    2.) How do I sort file1.csv by column name; for example, if all the
    records have three column headings, "id", "first_name", "last_name";
    here would be some sample data:

    a. Id, first_name,last_name

    b. 001,John,Filben

    c. 002,Joe, Smith
    This is more complicated: I would make a list of lines, where each line
    is a list split according to columns (like ['001', 'John', 'Filben']),
    and then I would sort this list using operator.itemgetter, like this:

    # where num is the column number, starting with 0 of course
    lines.sort(key=operator.itemgetter(num))

    Read up on operator.*, it's very useful.
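Since the question asks about sorting by column name rather than column number, one option is the standard library's csv module. This is a hedged sketch, not what mk posted: `sort_csv` is a hypothetical helper, the sample data is a tidied version of the rows in the question (lowercase "id" header), and `skipinitialspace=True` is an assumption made to cope with the stray spaces after commas.

```python
import csv
import io
from operator import itemgetter

SAMPLE = "id, first_name,last_name\n002,Joe, Smith\n001,John,Filben\n"


def sort_csv(infile, outfile, column):
    """Copy CSV rows from infile to outfile, sorted by the named column."""
    reader = csv.DictReader(infile, skipinitialspace=True)
    fieldnames = reader.fieldnames  # read the header row before iterating
    rows = sorted(reader, key=itemgetter(column))
    writer = csv.DictWriter(outfile, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# In-memory demonstration; for real files, open them with newline=""
source = io.StringIO(SAMPLE)
target = io.StringIO()
sort_csv(source, target, "id")
print(target.getvalue())
```

The values are compared as strings, so this only orders ids correctly when they are zero-padded to the same width, as in the sample data.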

    3.) What about if I have millions of records and I am processing on a
    laptop with a large external drive - basically, are there space
    considerations? What are the workarounds?
    The simplest is to use something like SQLite: define a table, fill it up, and
    then do SELECT with ORDER BY.
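The SQLite route could look roughly like this with the standard sqlite3 module. The table name `people` and the in-memory database are assumptions for the sake of a short example; passing a file path to `sqlite3.connect` would keep the data on disk instead of in RAM.

```python
import sqlite3

# Records as they might come out of file1.csv (sample data from the question)
records = [("002", "Joe", "Smith"), ("001", "John", "Filben")]

conn = sqlite3.connect(":memory:")  # use a file path here for large data sets
conn.execute("CREATE TABLE people (id TEXT, first_name TEXT, last_name TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
conn.commit()

# Let the database do the sorting
sorted_rows = conn.execute(
    "SELECT id, first_name, last_name FROM people ORDER BY id"
).fetchall()
conn.close()

print(sorted_rows)
```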

    But with a million records I wouldn't worry about it, it should fit in
    RAM. Observe:
    >>> import sys
    >>> a = {}
    >>> for i in range(1000000):
    ...     a[i] = 'spam'*10
    ...
    >>> sys.getsizeof(a)
    25165960

    So that's what, 25 MB?

    Although I have to note that TEMPORARY ram usage in Python process on my
    machine did go up to 113MB.
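Two caveats on the measurement above: `sys.getsizeof` reports only the dictionary's own table, not the keys and values it refers to, so the real footprint is larger, as the 113MB observation suggests. If a file truly does not fit in RAM, a common workaround (not proposed in this thread) is an external merge sort: sort manageable chunks, write each to a temporary file, then merge them. The sketch below assumes every record ends with a newline and that the key is the id in positions 0-2; `external_sort` and `chunk_size` are illustrative names.

```python
import heapq
import itertools
import os
import tempfile


def record_key(line):
    # The three-character id at the start of each record
    return line[:3]


def external_sort(src_path, dst_path, chunk_size=100000):
    """Sort src_path into dst_path while holding at most chunk_size lines in memory."""
    chunk_paths = []
    with open(src_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_size))
            if not chunk:
                break
            chunk.sort(key=record_key)
            fd, path = tempfile.mkstemp(suffix=".chunk")
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    chunk_files = [open(path) for path in chunk_paths]
    try:
        with open(dst_path, "w") as dst:
            # heapq.merge reads the sorted chunks lazily, one line at a time
            dst.writelines(heapq.merge(*chunk_files, key=record_key))
    finally:
        for chunk_file in chunk_files:
            chunk_file.close()
        for path in chunk_paths:
            os.remove(path)


# Tiny demonstration with a deliberately small chunk size
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "file1.txt")
dst_path = os.path.join(workdir, "file1_sorted.txt")
with open(src_path, "w") as f:
    f.write("003c\n001a\n005e\n002b\n004d\n")

external_sort(src_path, dst_path, chunk_size=2)
with open(dst_path) as f:
    merged = f.read()
print(merged)
```

With a very large input and a small chunk size, the number of simultaneously open chunk files can hit the operating system's limit, so the chunk size should be chosen as large as memory comfortably allows.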

    Regards,
    mk
  • MRAB at Mar 3, 2010 at 7:59 pm

    mk wrote:
    [snip]
    Simpler would be:

    lines = f.readlines()
    lines.sort(key=lambda line: line[ : 3])

    or even:

    lines = sorted(f.readlines(), key=lambda line: line[ : 3])
  • Mk at Mar 3, 2010 at 9:52 pm

    MRAB wrote:

    [snip]
    Simpler would be:

    lines = f.readlines()
    lines.sort(key=lambda line: line[ : 3])

    or even:

    lines = sorted(f.readlines(), key=lambda line: line[ : 3])
    Sure, but a complete newbie (I have this impression about OP) doesn't
    have to know about lambda.

    I expected my solution to be slower, but it's not (on a file with
    100,000 random string lines):

    # time ./sort1.py

    real 0m0.386s
    user 0m0.372s
    sys 0m0.014s

    # time ./sort2.py

    real 0m0.303s
    user 0m0.286s
    sys 0m0.017s


    sort1.py:

    #!/usr/bin/python

    def sortit(fname):
        lines = open(fname).readlines()
        lines.sort(key = lambda x: x[:3])

    if __name__ == '__main__':
        sortit('testfile.txt')



    sort2.py:

    #!/usr/bin/python

    def sortit(fname):
        fo = open(fname)
        linedict = {}
        for line in fo:
            key = line[:3]
            linedict[key] = line
        sortedlines = []
        keys = list(linedict.keys())  # list() so that .sort() also works on Python 3
        keys.sort()
        for key in keys:
            sortedlines.append(linedict[key])
        return sortedlines

    if __name__ == '__main__':
        sortit('testfile.txt')


    Any idea why? After all, I'm "manually" doing quite a lot: allocating
    key in a dict, then sorting dict's keys, then iterating over keys and
    accessing dict value.

    Regards,
    mk
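Nobody in the thread answers the timing question, so the following is only one plausible factor, not a confirmed explanation. A dictionary holds one value per key, so lines that share the same first three characters overwrite each other. With 100,000 random lines, many three-character prefixes are likely to repeat, which would leave sort2.py sorting far fewer items than sort1.py. The sketch below shows the effect on made-up data ("Jane Doe" is a hypothetical extra record).

```python
lines = ["002Joe Smith\n", "001JohnFilben\n", "002Jane Doe\n"]

# Dictionary approach: one slot per key, so a later line replaces an earlier one
linedict = {}
for line in lines:
    linedict[line[:3]] = line
via_dict = [linedict[key] for key in sorted(linedict)]

# Key-function approach: every line is kept, ties stay in input order
via_sort = sorted(lines, key=lambda line: line[:3])

print(len(via_dict), len(via_sort))
```

This also means the two scripts are not interchangeable: the dictionary version silently drops records whenever ids are not unique.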
  • Arnaud Delobelle at Mar 3, 2010 at 8:58 pm

    MRAB <python at mrabarnett.plus.com> writes:

    [snip]
    Simpler would be:

    lines = f.readlines()
    lines.sort(key=lambda line: line[ : 3])

    or even:

    lines = sorted(f.readlines(), key=lambda line: line[ : 3])
    Or even:

    lines = sorted(f)

    --
    Arnaud
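Plain `sorted(f)` works here because the id sits at the very start of each line, so comparing whole lines compares the ids first. The two forms differ only when ids repeat: the whole-line sort breaks ties by the rest of the record, while the key-based sort keeps tied records in their original order. A small sketch with made-up data ("Ann Jones" is a hypothetical record):

```python
import io

f = io.StringIO("002Joe Smith\n001JohnFilben\n002Ann Jones\n")

# Compare entire lines: ties on the id are broken by the rest of the line
by_whole_line = sorted(f)

f.seek(0)  # rewind so the lines can be read a second time
# Compare only the id: tied lines keep their input order (the sort is stable)
by_id_only = sorted(f, key=lambda line: line[:3])

print(by_whole_line)
print(by_id_only)
```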
  • Mk at Mar 3, 2010 at 9:46 pm
    John, there's an error in my program, I forgot that list.sort() method
    doesn't return the list (it sorts in place). So it should look like:

    #!/usr/bin/python

    def sortit(fname):
        fo = open(fname)
        linedict = {}
        for line in fo:
            key = line[:3]
            linedict[key] = line
        sortedlines = []
        keys = list(linedict.keys())  # list() so that .sort() also works on Python 3
        keys.sort()
        for key in keys:
            sortedlines.append(linedict[key])
        return sortedlines

    if __name__ == '__main__':
        sortit('testfile.txt')


    MRAB's solution is obviously better, provided you know about Python's
    lambda.

    Regards,
    mk
  • Jonathan Gardner at Mar 3, 2010 at 10:16 pm

    On Wed, Mar 3, 2010 at 8:19 AM, John Filben wrote:
    [snip]
    You may also want to look at the GNU tools "sort" and "cut". If your
    job is to process files, I'd recommend tools designed to process files
    for the task.

    --
    Jonathan Gardner
    jgardner at jonathangardner.net

Discussion Overview
group: python-list @
categories: python
posted: Mar 3, '10 at 4:19p
active: Mar 3, '10 at 10:16p
posts: 7
users: 5
website: python.org
