I am trying to split a file by a fixed string.
The file is too large to just read it into a string and split it.
I could probably use a lexer, but is there maybe anything simpler?
thanks
m.

  • Steve Holden at Nov 22, 2004 at 1:53 pm

    Martin Dieringer wrote:

    > I am trying to split a file by a fixed string.
    > The file is too large to just read it into a string and split it.
    > I could probably use a lexer, but is there maybe anything simpler?
    > thanks
    > m.

    Depends on your definition of "simple", I suppose. The problem with
    *not* using a lexer is that you'd have to examine the file in a sequence
    of overlapping chunks to make sure that a regex could pick up all
    matches. For me that would be more complex than using a lexer, given the
    excellent range of modules such as SPARK and PLY, to mention but two.

    regards
    Steve
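
For a fixed delimiter string, the overlapping-chunk scan described above can be sketched without a lexer. This is a minimal Python 3 sketch and not code from the thread: the name `find_all`, its parameters, and the 64 KiB default chunk size are illustrative assumptions. It carries the last `len(needle) - 1` bytes of each buffer over to the next read, so a match that straddles a chunk boundary is still found.

```python
def find_all(path, needle, chunksize=64 * 1024):
    """Yield the file offset of each non-overlapping occurrence of needle.

    needle is a non-empty bytes object; the file is read in binary mode.
    (Hypothetical helper, written for illustration only.)
    """
    if not needle:
        raise ValueError("needle must be non-empty")
    keep = len(needle) - 1   # bytes that could be the start of a split match
    tail = b""               # unconsumed end of the previous buffer
    base = 0                 # file offset of tail[0]
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            buf = tail + chunk
            pos = 0
            while True:
                hit = buf.find(needle, pos)
                if hit < 0:
                    break
                yield base + hit
                pos = hit + len(needle)
            # Keep only what may still begin a match, and never re-scan
            # bytes that already belong to a reported match.
            cut = max(pos, len(buf) - keep)
            tail = buf[cut:]
            base += cut
```

Only one chunk plus a short tail is held in memory at a time, so memory use is bounded by `chunksize + len(needle)`.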
  • Diez B. Roggisch at Nov 22, 2004 at 2:09 pm

    > Depends on your definition of "simple", I suppose. The problem with
    > *not* using a lexer is that you'd have to examine the file in a sequence
    > of overlapping chunks to make sure that a regex could pick up all
    > matches. For me that would be more complex than using a lexer, given the
    > excellent range of modules such as SPARK and PLY, to mention but two.

    At least spark operates on whole strings if used as lexer/tokenizer - you
    can of course feed it a lazy sequence of tokens by using a generator - but
    that's up to you.

    --
    Regards,

    Diez B. Roggisch
  • Martin Dieringer at Nov 22, 2004 at 2:12 pm

    Steve Holden <steve at holdenweb.com> writes:

    > Martin Dieringer wrote:
    > > I am trying to split a file by a fixed string.
    > > The file is too large to just read it into a string and split it.
    > > I could probably use a lexer, but is there maybe anything simpler?
    > > thanks
    > > m.
    > Depends on your definition of "simple", I suppose. The problem with
    > *not* using a lexer is that you'd have to examine the file in a
    > sequence of overlapping chunks to make sure that a regex could pick up
    > all matches. For me that would be more complex than using a lexer,
    > given the excellent range of modules such as SPARK and PLY, to mention
    > but two.

    Yes, lexing would be the simplest, but PLY also can't read from streams,
    and it looks to me (from the examples) as if it's the same with SPARK.
    I wonder why something like this is not in any lib.
    Is there any known lexer that can do this?
    I don't have to parse, just write the chunks to separate files.
    I really hate doing that sequence thing...

    m.
  • Denis S. Otkidach at Nov 22, 2004 at 6:20 pm

    On Mon, 22 Nov 2004 08:53:02 -0500 Steve Holden wrote:
    > > I am trying to split a file by a fixed string.
    > > The file is too large to just read it into a string and split it.
    > > I could probably use a lexer, but is there maybe anything simpler?
    > > thanks
    > > m.
    > Depends on your definition of "simple", I suppose. The problem with
    > *not* using a lexer is that you'd have to examine the file in a sequence
    > of overlapping chunks to make sure that a regex could pick up all

    The re module works fine with an mmap-ed file, so there is no need to
    read it into memory.

    > matches. For me that would be more complex than using a lexer, given the
    > excellent range of modules such as SPARK and PLY, to mention but two.

    --
    Denis S. Otkidach
    http://www.python.ru/ [ru]
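
The mmap suggestion above might look roughly like this in Python 3. It is a sketch under assumptions: `delimiter_offsets` is a hypothetical helper name, the delimiter is passed as `bytes`, and `re.escape` is used so that a fixed string is matched literally rather than as a regular expression.

```python
import mmap
import re


def delimiter_offsets(path, delimiter):
    """Return the start offset of every occurrence of delimiter (bytes).

    The file is memory-mapped, so the operating system pages it in on
    demand instead of Python reading it into one big string.
    Note: mmap raises ValueError for an empty file.
    """
    pattern = re.compile(re.escape(delimiter))
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # A bytes pattern can search any bytes-like object,
            # including an mmap.
            return [m.start() for m in pattern.finditer(mm)]
```

For a plain fixed string a regex is not strictly needed, but the same call pattern carries over if the delimiter later becomes a real pattern.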
  • Martin Dieringer at Nov 22, 2004 at 7:48 pm

    "Denis S. Otkidach" <ods at strana.ru> writes:

    > On Mon, 22 Nov 2004 08:53:02 -0500
    > Steve Holden wrote:
    > > > I am trying to split a file by a fixed string.
    > > > The file is too large to just read it into a string and split it.
    > > > I could probably use a lexer, but is there maybe anything simpler?
    > > > thanks
    > > > m.
    > > Depends on your definition of "simple", I suppose. The problem with
    > > *not* using a lexer is that you'd have to examine the file in a sequence
    > > of overlapping chunks to make sure that a regex could pick up all
    > The re module works fine with an mmap-ed file, so there is no need to
    > read it into memory.

    Thank you, this is the solution!
    Now I can use mmap.find to locate all occurrences and then read the
    chunks via file.seek and file.read.

    m.
  • Denis S. Otkidach at Nov 23, 2004 at 11:00 am

    On Mon, 22 Nov 2004 20:48:16 +0100 Martin Dieringer wrote:

    > "Denis S. Otkidach" <ods at strana.ru> writes:
    > [...]
    > > The re module works fine with an mmap-ed file, so there is no need to
    > > read it into memory.
    > Thank you, this is the solution!
    > Now I can use mmap.find to locate all occurrences and then read the
    > chunks via file.seek and file.read.

    mmap-ed files also support subscription and slicing. I guess
    mmfile[start:stop] would be more readable.

    --
    Denis S. Otkidach
    http://www.python.ru/ [ru]
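
Putting `mmap.find` and slicing together, writing the pieces to separate files could be sketched as follows in Python 3. The function name `split_to_files` and the `out_pattern` naming scheme are assumptions made for illustration, not something from the thread. Each slice `mm[start:stop]` is copied into memory as a `bytes` object, so this assumes that any single piece fits in memory even though the whole file does not.

```python
import mmap


def split_to_files(path, delimiter, out_pattern="part%04d.bin"):
    """Write the pieces between occurrences of delimiter to numbered files.

    delimiter is a non-empty bytes object and is not written to any piece.
    Returns the list of file names written (hypothetical naming scheme).
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    names = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = 0
            last = False
            while not last:
                stop = mm.find(delimiter, start)
                if stop < 0:          # no more delimiters: take the rest
                    stop = len(mm)
                    last = True
                name = out_pattern % len(names)
                with open(name, "wb") as out:
                    out.write(mm[start:stop])   # slicing copies one piece
                names.append(name)
                start = stop + len(delimiter)
    return names
```

Like `bytes.split`, this produces an empty first or last piece when the file starts or ends with the delimiter.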
  • Martin Dieringer at Nov 23, 2004 at 4:22 pm

    "Denis S. Otkidach" <ods at strana.ru> writes:

    > On Mon, 22 Nov 2004 20:48:16 +0100
    > Martin Dieringer wrote:
    > > "Denis S. Otkidach" <ods at strana.ru> writes:
    > > [...]
    > > > The re module works fine with an mmap-ed file, so there is no need to
    > > > read it into memory.
    > > Thank you, this is the solution!
    > > Now I can use mmap.find to locate all occurrences and then read the
    > > chunks via file.seek and file.read.
    > mmap-ed files also support subscription and slicing. I guess
    > mmfile[start:stop] would be more readable.

    yes, even better :-)

    m.
  • Jason Rennie at Nov 22, 2004 at 2:05 pm

    On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
    > I am trying to split a file by a fixed string.
    > The file is too large to just read it into a string and split it.
    > I could probably use a lexer, but is there maybe anything simpler?

    If the pattern is contained within a single line, do something like this:

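    import re
    myre = re.compile(r'foo')
    # f, f1 and f2 are placeholders for the input and the two output paths
    fh = open(f)
    fh1 = open(f1, 'w')
    s = fh.readline()
    # copy lines to the first file until one matches
    # (the "s and" test stops at EOF if the pattern never appears)
    while s and not myre.search(s):
        fh1.write(s)
        s = fh.readline()
    fh1.close()
    fh2 = open(f2, 'w')
    # copy the matching line and everything after it to the second file
    while s:
        fh2.write(s)
        s = fh.readline()
    fh2.close()
    fh.close()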

    I'm doing this off the top of my head, so this code almost certainly
    has bugs. Hopefully it's enough to get you started... Note that only
    one line is held in memory at any point in time. Oh, if there's a
    chance that the pattern does not appear in the file, you'll need to
    check for eof in the first while loop.

    Jason
  • Martin Dieringer at Nov 22, 2004 at 2:28 pm

    Jason Rennie <jrennie at csail.mit.edu> writes:
    > On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
    > > I am trying to split a file by a fixed string.
    > > The file is too large to just read it into a string and split it.
    > > I could probably use a lexer, but is there maybe anything simpler?
    > If the pattern is contained within a single line, do something like this:

    Hmm, it's binary data, so I can't tell how long the lines would be. OTOH a
    line would certainly contain the whole pattern, as the pattern has no \n
    in it... and the lines probably wouldn't be too large for memory...

    m.
  • Bengt Richter at Nov 22, 2004 at 5:21 pm

    On Mon, 22 Nov 2004 15:28:54 +0100, Martin Dieringer wrote:
    > Jason Rennie <jrennie at csail.mit.edu> writes:
    > > On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
    > > > I am trying to split a file by a fixed string.
    > > > The file is too large to just read it into a string and split it.
    > > > I could probably use a lexer, but is there maybe anything simpler?
    > > If the pattern is contained within a single line, do something like this:
    > Hmm, it's binary data, so I can't tell how long the lines would be. OTOH a
    > line would certainly contain the whole pattern, as the pattern has no \n
    > in it... and the lines probably wouldn't be too large for memory...
    >
    > m.

    Do you want to keep the splitting string? I.e., if you split with xxx
    from '1231xxx45646xxx45646xxx78' do you want the long-file equivalent of
    >>> '1231xxx45646xxx45646xxx78'.split('xxx')
    ['1231', '45646', '45646', '78']

    or (I chose this for below)
    ['1231', 'xxx', '45646', 'xxx', '45646', 'xxx', '78']

    or maybe

    ['1231xxx', '45646xxx', '45646xxx', '78']

    ??

    Anyway, I'd use a generator to iterate through the file and look for the delimiter.
    This is case-sensitive, BTW (practically untested ;-):

    --< splitfile.py >----------------------------------------------
    # Python 2 code, as posted in 2004
    def splitfile(path, splitstr, chunksize=1024*64): # try a megabyte?
        splen = len(splitstr)
        # iter(callable, sentinel): keep reading until read() returns ''
        chunks = iter(lambda f=open(path,'rb'):f.read(chunksize), '')
        buf = ''
        for chunk in chunks:
            buf += chunk
            start = end = 0
            while end>=0 and len(buf)>=splen:
                start, end = end, buf.find(splitstr, end)
                if end>=0:
                    yield buf[start:end] #not including splitstr
                    yield splitstr # == buf[end:end+splen] # splitstr
                    end += splen
                else:
                    # keep the unmatched tail; the next chunk is appended to it
                    buf = buf[start:]
                    break

        yield buf

    def test(*args):
        for chunk in splitfile(*args):
            print repr(chunk)

    if __name__ == '__main__':
        import sys
        args = sys.argv[1:]
        try:
            if len(args)==3: args[2]=int(args[2])
        except Exception:
            raise SystemExit, 'Usage: python splitfile.py path splitstr [chunksize]'
        test(*args)
    ----------------------------------------------------------------

    Extent of testing follows :-)
    >>> print '%s\n%s%s'%('-'*40, open('splitfile.txt','rb').read(),'-'*40)
    ----------------------------------------
    01234abc5678abc901234
    567ab890abc
    ----------------------------------------
    >>> import ut.splitfile
    >>> ut.splitfile.test('splitfile.txt', 'abc')
    '01234'
    'abc'
    '5678'
    'abc'
    '901234\r\n567ab890'
    'abc'
    '\r\n'
    >>> ut.splitfile.test('splitfile.txt', '012')
    ''
    '012'
    '34abc5678abc9'
    '012'
    '34\r\n567ab890abc\r\n'
    >>> it = ut.splitfile.splitfile('splitfile.txt','ab89',4)
    >>> it.next
    <method-wrapper object at 0x02EF1C6C>
    >>> it.next()
    '01234abc5678abc901234\r\n567'
    >>> it.next()
    'ab89'
    >>> it.next()
    '0abc\r\n'
    >>> it.next()
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    StopIteration

    (I put it in my ut package directory but you can put splitfile.py anywhere handy
    and mod it to do what you need).

    Regards,
    Bengt Richter
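
The generator above is Python 2 code of its time. A Python 3 port might look like the sketch below. It is an adaptation, not Bengt's code: it assumes `splitstr` is passed as `bytes` (the file is opened in binary mode), closes the file with a `with` block, and simplifies the inner loop while keeping the same output, namely pieces and delimiters in alternation.

```python
def splitfile(path, splitstr, chunksize=64 * 1024):
    """Yield the pieces of a file, with splitstr yielded between them.

    splitstr must be a non-empty bytes object. At most one chunk plus the
    unmatched tail of the previous chunks is held in memory.
    """
    if not splitstr:
        raise ValueError("splitstr must be non-empty")
    splen = len(splitstr)
    buf = b""
    with open(path, "rb") as f:
        # iter(callable, sentinel) calls f.read until it returns b"" (EOF)
        for chunk in iter(lambda: f.read(chunksize), b""):
            buf += chunk
            start = 0
            while True:
                end = buf.find(splitstr, start)
                if end < 0:
                    break
                yield buf[start:end]   # piece, not including splitstr
                yield splitstr
                start = end + splen
            # keep the unmatched tail; a delimiter split across two reads
            # is completed when the next chunk is appended
            buf = buf[start:]
    yield buf
```

Usage is the same as before, except that arguments and results are bytes, for example `for piece in splitfile('splitfile.txt', b'abc'): ...`. If delimiters are rare, the buffer grows and is rescanned from its start after every read, so a larger `chunksize` reduces the overhead.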
  • William Park at Nov 22, 2004 at 7:51 pm

    Martin Dieringer wrote:
    > Jason Rennie <jrennie at csail.mit.edu> writes:
    > > On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
    > > > I am trying to split a file by a fixed string.
    > > > The file is too large to just read it into a string and split it.
    > > > I could probably use a lexer, but is there maybe anything simpler?
    > > If the pattern is contained within a single line, do something like this:
    > Hmm, it's binary data, so I can't tell how long the lines would be. OTOH a
    > line would certainly contain the whole pattern, as the pattern has no \n
    > in it... and the lines probably wouldn't be too large for memory...

    man strings (-o option)

    --
    William Park <opengeometry at yahoo.ca>
    Linux solution for data management and processing.
  • Martin Dieringer at Nov 22, 2004 at 8:54 pm

    William Park <opengeometry at yahoo.ca> writes:

    > Martin Dieringer wrote:
    > > Jason Rennie <jrennie at csail.mit.edu> writes:
    > > > On Mon, Nov 22, 2004 at 09:38:55AM +0100, Martin Dieringer wrote:
    > > > > I am trying to split a file by a fixed string.
    > > > > The file is too large to just read it into a string and split it.
    > > > > I could probably use a lexer, but is there maybe anything simpler?
    > > > If the pattern is contained within a single line, do something like this:
    > > Hmm, it's binary data, so I can't tell how long the lines would be. OTOH a
    > > line would certainly contain the whole pattern, as the pattern has no \n
    > > in it... and the lines probably wouldn't be too large for memory...
    > man strings (-o option)

    this doesn't make sense at all

    m.

Discussion Overview
group: python-list
categories: python
posted: Nov 22, '04 at 8:38a
active: Nov 23, '04 at 4:22p
posts: 13
users: 7
website: python.org
