I've a character encoding issue that has stumped me (not that hard to
do). I am parsing a small text file with some possibility of various
currencies being involved, and want to handle them without messing up.

Initially I was simply doing:

currs = [u'$', u'£', u'?', u'?']
aFile = open(thisFile, 'r')
for mline in aFile:  # mline might be "£5.50"
    if mline[0] in currs:
        mline = mline[1:]

But the problem was:
SyntaxError: Non-ASCII character '\xa3' in file

The remedy was of course to declare the file encoding for my Python
module, at the start of the file I used:

# -*- coding: UTF-8 -*-

That allowed me to progress. But now when I come to a line item that is
in a non-$ currency, I get this warning:

views.py:3364: UnicodeWarning: Unicode equal comparison failed to
convert both arguments to Unicode - interpreting them as being
unequal.

...which I think means Python is unable to convert the chars in the file
I'm reading from into unicode to compare to the items in the list
currs.

I think this is saying that u'£' == '£' is false.
(I hope those chars show up okay in my post here)

Since I can't control the encoding of the input file that users
submit, how do I get past this? How do I make such comparisons be
True?

Thanks in advance for any suggestions
Ross.


  • Ross at Dec 10, 2010 at 8:07 pm

    On Dec 10, 2:51 pm, Ross wrote:

    Initially I was simply doing:

    currs = [u'$', u'£', u'?', u'?']
    aFile = open(thisFile, 'r')
    for mline in aFile:  # mline might be "£5.50"
        if mline[0] in currs:
            mline = mline[1:]

    Don't you love it when someone solves their own problem? Posting a
    reply here so that other poor chumps like me can get around this...

    I found I could import codecs that allow me to read the file with my
    desired encoding. Huzzah!

    Instead of opening the file with a standard
    aFile = open(thisFile, 'r')

    I instead ensure I've imported the codecs:

    import codecs

    ... and then I used a specific encoding on the file read:

    aFile = codecs.open(thisFile, encoding='utf-8')

    Then all my compares seem to work fine.
    If I'm off-base and kludgey here and should be doing something
    differently please give me a poke.

    Regards,
    Ross.
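
A minimal sketch of the approach Ross describes: decode the file while reading it, so every line is text and compares cleanly against the currency symbols. It uses io.open, which takes an encoding argument much like codecs.open (in Python 3 the built-in open does the same). The symbol tuple, the function names and the sample file are assumptions made for illustration; only '$' and the pound sign can be recovered from the original post.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Assumed symbol set: the original list was partly lost, so only the
# dollar and pound signs are included here.
CURRENCY_SYMBOLS = (u'$', u'\u00a3')


def strip_currency(line):
    """Return the line's text without a leading currency symbol."""
    text = line.strip()
    if text and text[0] in CURRENCY_SYMBOLS:
        return text[1:]
    return text


def read_amounts(path, encoding='utf-8'):
    """Decode the file while reading, so each line is text, not bytes."""
    with io.open(path, 'r', encoding=encoding) as handle:
        return [strip_currency(line) for line in handle if line.strip()]


# Usage: write a small UTF-8 sample file, then read it back.
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with io.open(fd, 'w', encoding='utf-8') as handle:
    handle.write(u'\u00a35.50\n$3.25\n7.00\n')
amounts = read_amounts(sample_path)
os.remove(sample_path)
print(amounts)
```

This only works when the file really is in the encoding passed to read_amounts; a file in another encoding will either raise UnicodeDecodeError or decode to the wrong characters.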
  • Nobody at Dec 10, 2010 at 9:09 pm

    On Fri, 10 Dec 2010 11:51:44 -0800, Ross wrote:

    Since I can't control the encoding of the input file that users
    submit, how do I get past this? How do I make such comparisons be
    True?

    On Fri, 10 Dec 2010 12:07:19 -0800, Ross wrote:

    I found I could import codecs that allow me to read the file with my
    desired encoding. Huzzah!
    If I'm off-base and kludgey here and should be doing something

    Er, do you know the file's encoding or don't you? Using:

    aFile = codecs.open(thisFile, encoding='utf-8')

    is telling Python that the file /is/ in utf-8. If it isn't in utf-8,
    you'll get decoding errors.

    If you are given a file with no known encoding, then you can't reliably
    determine what /characters/ it contains, and thus can't reliably compare
    the contents of the file against strings of characters, only against
    strings of bytes.
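
To make the bytes-versus-characters point concrete, here is a small sketch using the pound sign. The variable names are illustrative only. It shows that the same character is stored as different bytes under UTF-8 and Latin-1, and that decoding with the wrong codec fails.

```python
pound = u'\u00a3'  # POUND SIGN

utf8_bytes = pound.encode('utf-8')      # two bytes: C2 A3
latin1_bytes = pound.encode('latin-1')  # one byte: A3

# The same character has a different byte representation per encoding.
same_bytes = (utf8_bytes == latin1_bytes)

# Decoding with the matching codec recovers the character in both cases.
from_utf8 = utf8_bytes.decode('utf-8')
from_latin1 = latin1_bytes.decode('latin-1')

# Decoding Latin-1 bytes as UTF-8 fails: a lone 0xA3 byte is not valid UTF-8.
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

So a comparison against u'£' can only succeed once the bytes from the file have been decoded with the encoding they were actually written in.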

    About the best you can do is to use an autodetection library such as:

    http://chardet.feedparser.org/
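
If a detection library is not available, a cruder heuristic is to try a short list of likely encodings in order. The sketch below is a stand-in for illustration, not the chardet approach suggested above: the function name and the candidate list are assumptions, and the result is still a guess, since many byte sequences decode without error under more than one encoding.

```python
def decode_with_fallback(raw, candidates=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    UTF-8 goes first because it is strict: bytes from most other
    encodings fail to decode as UTF-8 instead of silently passing.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a character, so this last resort never raises.
    return raw.decode('latin-1'), 'latin-1'


# A pound sign written as UTF-8, then as a single Windows-1252 byte.
text_a, used_a = decode_with_fallback(b'\xc2\xa35.50')
text_b, used_b = decode_with_fallback(b'\xa35.50')
```

Both inputs decode to the same text, so a later comparison against u'£' behaves the same whichever encoding the user's file happened to use, as long as it is one of the candidates.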
  • Ross at Dec 13, 2010 at 3:33 pm

    That's right, I don't know what encoding the user will have used. The
    use of autodetection sounds good - I'll look into that. Thx.

    R.

Discussion Overview
group: python-list @
categories: python
posted: Dec 10, '10 at 7:51p
active: Dec 13, '10 at 3:33p
posts: 4
users: 2
website: python.org
