Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using
multi-byte encodings) - which is why I'm not letting Python handle the
line endings.

Is the following safe and sane :

text = open('test.txt', 'rb').read()
if encoding:
    text = text.decode(encoding)
ending = '\n' # default
if '\r\n' in text:
    text = text.replace('\r\n', '\n')
    ending = '\r\n'
elif '\n' in text:
    ending = '\n'
elif '\r' in text:
    text = text.replace('\r', '\n')
    ending = '\r'


My worry is that if '\n' *doesn't* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
'\n'`` prematurely ?

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml


  • Sybren Stuvel at Feb 6, 2006 at 3:17 pm

    Fuzzyman enlightened us with:
    My worry is that if '\n' *doesn't* signify a line break on the Mac,
    then it may exist in the body of the text - and trigger ``ending =
    '\n'`` prematurely ?
    I'd count the number of occurrences of '\r\n', '\n' without a preceding
    '\r', and '\r' without a following '\n', and let the majority decide.
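    A rough sketch of that idea (illustrative only, not tested code; a
    tie simply falls to whichever count ``max`` happens to pick):

```python
import re

def guess_ending(text):
    # Count each ending separately: '\r\n' as a unit, '\n' only when
    # not preceded by '\r', and '\r' only when not followed by '\n'.
    counts = {
        '\r\n': len(re.findall('\r\n', text)),
        '\n': len(re.findall('(?<!\r)\n', text)),
        '\r': len(re.findall('\r(?!\n)', text)),
    }
    # Let the majority decide.
    return max(counts, key=counts.get)
```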

    Sybren
    --
    The problem with the world is stupidity. Not saying there should be a
    capital punishment for stupidity, but why don't we just take the
    safety labels off of everything and let the problem solve itself?
    Frank Zappa
  • Fuzzyman at Feb 6, 2006 at 3:26 pm

    Sybren Stuvel wrote:
    Fuzzyman enlightened us with:
    My worry is that if '\n' *doesn't* signify a line break on the Mac,
    then it may exist in the body of the text - and trigger ``ending =
    '\n'`` prematurely ?
    I'd count the number of occurrences of '\r\n', '\n' without a preceding
    '\r', and '\r' without a following '\n', and let the majority decide.
    Sounds reasonable, edge cases for small files be damned. :-)

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml
  • Fuzzyman at Feb 6, 2006 at 9:56 pm

    Sybren Stuvel wrote:
    Fuzzyman enlightened us with:
    My worry is that if '\n' *doesn't* signify a line break on the Mac,
    then it may exist in the body of the text - and trigger ``ending =
    '\n'`` prematurely ?
    I'd count the number of occurrences of '\r\n', '\n' without a preceding
    '\r', and '\r' without a following '\n', and let the majority decide.
    This is what I came up with. As you can see from the docstring, it
    attempts to do sensible(-ish) things in the event of a tie, or no line
    endings at all.

    Comments/corrections welcomed. I know the tests aren't very useful
    (because they make no *assertions*, they won't tell you if it
    breaks), but you can see what's going on :

    import re
    import os

    rn = re.compile('\r\n')
    r = re.compile('\r(?!\n)')
    n = re.compile('(?<!\r)\n')

    # Sequence of (regex, literal, priority) for each line ending
    line_ending = [(n, '\n', 3), (rn, '\r\n', 2), (r, '\r', 1)]


    def find_ending(text, default=os.linesep):
        """
        Given a piece of text, use a simple heuristic to determine the line
        ending in use.

        Returns the value assigned to default if no line endings are found.
        This defaults to ``os.linesep``, the native line ending for the
        machine.

        If there is a tie between two endings, the priority chain is
        ``'\n', '\r\n', '\r'``.
        """
        results = [(len(exp.findall(text)), priority, literal) for
                   exp, literal, priority in line_ending]
        results.sort()
        print results
        if not sum([m[0] for m in results]):
            return default
        else:
            return results[-1][-1]

    if __name__ == '__main__':
        tests = [
            'hello\ngoodbye\nmy fish\n',
            'hello\r\ngoodbye\r\nmy fish\r\n',
            'hello\rgoodbye\rmy fish\r',
            'hello\rgoodbye\n',
            '',
            '\r\r\r \n\n',
            '\n\n \r\n\r\n',
            '\n\n\r \r\r\n',
            '\n\r \n\r \n\r',
        ]
        for entry in tests:
            print repr(entry)
            print repr(find_ending(entry))
            print

    All the best,


    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml
  • Sybren Stuvel at Feb 7, 2006 at 9:17 am

    Fuzzyman enlightened us with:
    This is what I came up with. [...] Comments/corrections welcomed.
    You could use a little more comments in the code, but apart from that
    it looks nice.

    Sybren
  • Alex Martelli at Feb 7, 2006 at 4:38 am

    Fuzzyman wrote:

    Hello all,

    I'm trying to detect line endings used in text files. I *might* be
    decoding the files into unicode first (which may be encoded using
    Open the file with 'rU' mode, and check the file object's newlines
    attribute.
    My worry is that if '\n' *doesn't* signify a line break on the Mac,
    It does, and has for a few years now, since Mac OS X is a version of
    Unix to all practical intents and purposes.
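    (In today's Python 3, where universal-newline translation is the
    default for text mode, the equivalent check looks like this; the
    2.x spelling is mode 'rU'. A sketch only, using a throwaway file:)

```python
import os
import tempfile

# Write a file with Windows line endings, then read it back in text
# mode; the file object records the endings it has actually seen.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'wb') as f:
    f.write(b'hello\r\ngoodbye\r\n')

f = open(path)
f.read()                 # .newlines stays None until some are seen
print(repr(f.newlines))  # '\r\n'
f.close()
```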


    Alex
  • Fuzzyman at Feb 7, 2006 at 9:27 am

    Alex Martelli wrote:
    Fuzzyman wrote:
    Hello all,

    I'm trying to detect line endings used in text files. I *might* be
    decoding the files into unicode first (which may be encoded using
    Open the file with 'rU' mode, and check the file object's newline
    attribute.
    Ha, so long as it works with Python 2.2, that makes things a bit
    easier.

    Rats, I liked that snippet of code (I'm a great fan of list
    comprehensions). :-)
    My worry is that if '\n' *doesn't* signify a line break on the Mac,
    It does, and has for a few years now, since Mac OS X is a version of
    Unix to all practical intents and purposes.
    I wondered if that might be the case. I think I've worried about this
    more than enough now.

    Thanks

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml
  • Fuzzyman at Feb 7, 2006 at 9:29 am

    Alex Martelli wrote:
    Fuzzyman wrote:
    Hello all,

    I'm trying to detect line endings used in text files. I *might* be
    decoding the files into unicode first (which may be encoded using
    Open the file with 'rU' mode, and check the file object's newline
    attribute.
    Do you know if this works for multi-byte encodings ? Do files have
    metadata associated with them showing the line-ending in use ?

    I suppose I could test this...

    All the best,


    Fuzzy
    My worry is that if '\n' *doesn't* signify a line break on the Mac,
    It does, and has for a few years now, since Mac OS X is a version of
    Unix to all practical intents and purposes.


    Alex
  • Alex Martelli at Feb 7, 2006 at 3:51 pm
    Fuzzyman wrote:
    ...
    Open the file with 'rU' mode, and check the file object's newline
    attribute.
    Do you know if this works for multi-byte encodings ? Do files have
    You mean when you open them with the codecs module?
    metadata associated with them showing the line-ending in use ?
    Not in the filesystems I'm familiar with (they did use to, in
    filesystems used on VMS and other ancient OSs, but that was a very long
    time ago).


    Alex
  • Fuzzyman at Feb 7, 2006 at 4:43 pm

    Alex Martelli wrote:
    Fuzzyman wrote:
    ...
    Open the file with 'rU' mode, and check the file object's newline
    attribute.
    Do you know if this works for multi-byte encodings ? Do files have
    You mean when you open them with the codecs module?
    No, if I open a UTF16 encoded file in universal mode - will it still
    have the correct ``newlines`` attribute ?

    I can't open with a codec unless an encoding is explicitly supplied. I
    still want to detect UTF16 even if the encoding isn't specified.

    As I said, I ought to test this... Without metadata I wonder how Python
    determines it ?
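    (For what it's worth, one common heuristic - a sketch only, and it
    assumes the file actually starts with a byte-order mark - is to
    sniff the first bytes of the raw data:)

```python
import codecs

def sniff_utf16(raw):
    # Guess UTF-16 from a byte-order mark in the raw bytes; returns a
    # codec name, or None when no BOM is found. Caveat: a UTF-32-LE
    # BOM also begins with '\xff\xfe' and would match the LE branch.
    if raw.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'
    if raw.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    return None
```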

    All the best,

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml
    metadata associated with them showing the line-ending in use ?
    Not in the filesystems I'm familiar with (they did use to, in
    filesystems used on VMS and other ancient OSs, but that was a very long
    time ago).


    Alex
  • Alex Martelli at Feb 8, 2006 at 3:35 am
    Fuzzyman wrote:
    ...
    I can't open with a codec unless an encoding is explicitly supplied. I
    still want to detect UTF16 even if the encoding isn't specified.

    As I said, I ought to test this... Without metadata I wonder how Python
    determines it ?
    It doesn't. Python doesn't even try to guess: nor would any other
    sensible programming language.


    Alex
  • Fuzzyman at Feb 8, 2006 at 9:00 am

    Alex Martelli wrote:
    Fuzzyman wrote:
    ...
    I can't open with a codec unless an encoding is explicitly supplied. I
    still want to detect UTF16 even if the encoding isn't specified.

    As I said, I ought to test this... Without metadata I wonder how Python
    determines it ?
    It doesn't. Python doesn't even try to guess: nor would any other
    sensible programming language.
    Right, so opening in "rU" mode and testing the ``newlines`` attribute
    *won't* work for UTF16-encoded files. (Which was what I was asking.)

    I'll have to read, determine encoding, decode, then *either* use my
    code to determine line endings *or* use ``splitlines(True)``.
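    (``splitlines(True)`` keeps each line's own terminator, whatever mix
    of endings is present:)

```python
text = 'hello\r\ngoodbye\rmy fish\n'
# keepends=True preserves each line's terminator
print(text.splitlines(True))
# ['hello\r\n', 'goodbye\r', 'my fish\n']
```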

    All the best,

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml
  • Fuzzyman at Feb 8, 2006 at 11:34 am

    Alex Martelli wrote:
    Fuzzyman wrote:
    ...
    Open the file with 'rU' mode, and check the file object's newline
    attribute.
    Just to confirm, for a UTF16 encoded file, the newlines attribute is
    ``None``.

    All the best,

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml
  • Fuzzyman at Feb 8, 2006 at 12:10 pm

    Fuzzyman wrote:
    Alex Martelli wrote:
    Fuzzyman wrote:
    ...
    Open the file with 'rU' mode, and check the file object's newline
    attribute.
    Just to confirm, for a UTF16 encoded file, the newlines attribute is
    ``None``.
    Hmmm... having read the documentation, the newlines attribute remains
    None until some newlines are encountered. :oops:

    I don't think its technique is any better than mine though. ;-)

    Fuzzy
    http://www.voidspace.org.uk/python/index.shtml
  • Arthur at Feb 7, 2006 at 11:59 am

    Alex Martelli wrote:
    Fuzzyman wrote:

    Hello all,

    I'm trying to detect line endings used in text files. I *might* be
    decoding the files into unicode first (which may be encoded using

    Open the file with 'rU' mode, and check the file object's newline
    attribute.
    Do you think it would be sensible to have file.readline use universal
    newline support by default?

    I just got flummoxed by this issue, working with a (pre-alpha) package
    by very experienced Python programmers who sent file.readline to
    tokenizer.py without universal newline support. Went on a long (and
    educational) journey trying to figure out why my file was not being
    processed as expected.

    Are there circumstances in which it would be sensible to have the
    tokenizer process files without universal newline support?

    The result here was the tokenizer detecting indentation inconsistencies
    that did not exist - in the sense that the files compiled and ran
    fine under Python.exe.

    Art
  • Arthur at Feb 7, 2006 at 12:21 pm

    Arthur wrote:
    Alex Martelli wrote:

    I just got flummoxed by this issue, working with a (pre-alpha) package
    by very experienced Python programmers who sent file.readline to
    tokenizer.py without universal newline support. Went on a long (and
    educational) journey trying to figure out why my file was not being
    processed as expected.
    For example, the widely used MoinMoin source code colorizer sends files
    to tokenizer without universal newline support:

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52298

    Is my premise correct that the tokenizer needs universal newline
    support to be reliable?

    What else could put it out of sync with the compiler?

    Art
  • Ajsiegel at Feb 7, 2006 at 6:08 pm

    Arthur wrote:
    Arthur wrote:
    Is my premise correct that the tokenizer needs universal newline
    support to be reliable?

    What else could put it out of sync with the compiler?
    Anybody out there?

    Is my question, and the real-world issue that provoked it, unclear?

    Is the answer too obvious?

    Have I made *everybody's* kill list?

    Isn't it a prima facie issue if the tokenizer fails in ways
    incompatible with what the compiler is seeing?

    Is this just easy, and I am making it hard? As I apparently do with
    Python more generally.

    Art
  • Bengt Richter at Feb 7, 2006 at 3:42 pm

    On 6 Feb 2006 06:35:14 -0800, "Fuzzyman" wrote:
    Hello all,

    I'm trying to detect line endings used in text files. I *might* be
    decoding the files into unicode first (which may be encoded using
    multi-byte encodings) - which is why I'm not letting Python handle the
    line endings.

    Is the following safe and sane :

    text = open('test.txt', 'rb').read()
    if encoding:
        text = text.decode(encoding)
    ending = '\n' # default
    if '\r\n' in text:
        text = text.replace('\r\n', '\n')
        ending = '\r\n'
    elif '\n' in text:
        ending = '\n'
    elif '\r' in text:
        text = text.replace('\r', '\n')
        ending = '\r'


    My worry is that if '\n' *doesn't* signify a line break on the Mac,
    then it may exist in the body of the text - and trigger ``ending =
    '\n'`` prematurely ?
    Are you guaranteed that text bodies don't contain escape or quoting
    mechanisms for binary data where it would be a mistake to convert
    or delete an '\r' ? (E.g., I think XML CDATA might be an example).

    Regards,
    Bengt Richter
  • Fuzzyman at Feb 7, 2006 at 3:57 pm

    Bengt Richter wrote:
    On 6 Feb 2006 06:35:14 -0800, "Fuzzyman" wrote:

    Hello all,

    I'm trying to detect line endings used in text files. I *might* be
    decoding the files into unicode first (which may be encoded using
    multi-byte encodings) - which is why I'm not letting Python handle the
    line endings.

    Is the following safe and sane :

    text = open('test.txt', 'rb').read()
    if encoding:
        text = text.decode(encoding)
    ending = '\n' # default
    if '\r\n' in text:
        text = text.replace('\r\n', '\n')
        ending = '\r\n'
    elif '\n' in text:
        ending = '\n'
    elif '\r' in text:
        text = text.replace('\r', '\n')
        ending = '\r'


    My worry is that if '\n' *doesn't* signify a line break on the Mac,
    then it may exist in the body of the text - and trigger ``ending =
    '\n'`` prematurely ?
    Are you guaranteed that text bodies don't contain escape or quoting
    mechanisms for binary data where it would be a mistake to convert
    or delete an '\r' ? (E.g., I think XML CDATA might be an example).
    My personal use case is for reading config files in arbitrary encodings
    (so it's not an issue).

    How would Python handle opening such files when not in binary mode ?
    That may be an issue even on Linux - if you open a Windows file and
    use splitlines, does Python convert '\r\n' to '\n' ? (or does it
    leave the extra '\r's in place, which is *different* to the
    behaviour under Windows).
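    (A quick check: ``splitlines`` treats '\r\n', '\r' and '\n' all as
    line boundaries on any platform, so no stray '\r's are left
    behind:)

```python
# str.splitlines recognises all three endings on every platform, so a
# Windows file split on Linux leaves no stray '\r' characters.
print('hello\r\nworld\r\n'.splitlines())
# ['hello', 'world']
```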

    All the best,

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml

Discussion Overview
group: python-list
category: python
posted: Feb 6, '06 at 2:35p
active: Feb 8, '06 at 12:10p
posts: 19
users: 5
website: python.org
