Dear all,

I have some applications which fetch HTML documents off the web, parse
their content and do stuff with it. Every once in a while it happens
that the web site administrators put up files which are encoded
incorrectly.

Thus my Python script dies a horrible death:

  File "./update_db", line 67, in <module>
    for line in open(tempfile, "r"):
  File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3286: unexpected code byte

This is well and ok usually, but I'd like to be able to tell Python:
"Don't worry, some idiot encoded that file, just skip over such
parts/replace them by some character sequence".

Is that possible? If so, how?

Kind regards,
Johannes

--
"Strong potentials can result in strong earthquakes; but small ones
can also arise - and "you" won't believe it possible (!), but look:
no earthquakes at all may result, either."
-- "Rüdiger Thomas" alias Thomas Schulz in dsa on his "predictions"
<1a30da36-68a2-4977-9eed-154265b17d28 at q14g2000vbi.googlegroups.com>

  • Bruno Desthuilliers at Dec 6, 2009 at 8:04 pm

    Johannes Bauer wrote:
    > [...] I'd like to be able to tell Python:
    > "Don't worry, some idiot encoded that file, just skip over such
    > parts/replace them by some character sequence".
    >
    > Is that possible? If so, how?

    This might get you started:

    """
    help(str.decode)
    decode(...)
        S.decode([encoding[,errors]]) -> object

        Decodes S using the codec registered for encoding. encoding defaults
        to the default encoding. errors may be given to set a different error
        handling scheme. Default is 'strict' meaning that encoding errors raise
        a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
        as well as any other name registered with codecs.register_error that is
        able to handle UnicodeDecodeErrors.
    """

    HTH
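The help text above comes from Python 2, where str has a decode method. On Python 3, which the original traceback shows, the same errors argument is available on bytes.decode. A minimal sketch of the three standard handlers, using sample bytes invented for illustration:

```python
# Sample data is made up for this example: 0x92 is not valid UTF-8 here.
raw = b"it\x92s broken"

# 'replace' substitutes U+FFFD (the Unicode replacement character).
replaced = raw.decode("utf-8", errors="replace")

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# The default, 'strict', raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

This only helps when you hold the raw bytes yourself and call decode explicitly.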
  • Johannes Bauer at Dec 7, 2009 at 7:16 pm

    Bruno Desthuilliers wrote:
    >> Is that possible? If so, how?
    > This might get you started:
    >
    > """
    > help(str.decode)
    > decode(...)
    >     S.decode([encoding[,errors]]) -> object

    Hmm, this would work nicely if I called "decode" explicitly - but what
    I'm doing is:

    #!/usr/bin/python3
    for line in open("broken", "r"):
        pass

    Which still raises the UnicodeDecodeError when I do not even do any
    decoding explicitly. How can I achieve this?

    Kind regards,
    Johannes

  • Benjamin Kaplan at Dec 7, 2009 at 7:42 pm

    On Mon, Dec 7, 2009 at 2:16 PM, Johannes Bauer wrote:
    > [...]
    > Hmm, this would work nicely if I called "decode" explicitly - but what
    > I'm doing is:
    >
    > #!/usr/bin/python3
    > for line in open("broken", "r"):
    >     pass
    >
    > Which still raises the UnicodeDecodeError when I do not even do any
    > decoding explicitly. How can I achieve this?

    Looking at the Python 3 docs, it seems that open takes the encoding
    and errors parameters as optional arguments. So you can call
    open('broken', 'r', errors='replace')
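A small sketch of that suggestion applied to line-by-line reading. The temporary file and its contents are invented for illustration, and encoding="utf-8" is passed explicitly on the assumption that the files are meant to be UTF-8; without it, open uses a locale-dependent default.

```python
import os
import tempfile

# Create a throwaway file containing one byte (0x92) that is invalid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first line\nit\x92s broken\n")
    path = tmp.name

# errors='replace' turns each undecodable byte into U+FFFD instead of raising.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    lines = [line.rstrip("\n") for line in f]

os.remove(path)
```

The loop body never sees an exception; the bad byte simply shows up as the replacement character in the second line.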
  • Martin v. Loewis at Dec 8, 2009 at 6:26 pm

    > Thus my Python script dies a horrible death:
    >
    >   File "./update_db", line 67, in <module>
    >     for line in open(tempfile, "r"):
    >   File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    >     (result, consumed) = self._buffer_decode(data, self.errors, final)
    > UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3286: unexpected code byte
    >
    > This is well and ok usually, but I'd like to be able to tell Python:
    > "Don't worry, some idiot encoded that file, just skip over such
    > parts/replace them by some character sequence".
    >
    > Is that possible? If so, how?

    As Benjamin says: if you pass errors='replace' to open, then it will
    replace the faulty characters; if you pass errors='ignore', it will
    skip over them.
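If neither built-in handler fits, the help text quoted earlier in the thread mentions codecs.register_error, which covers the "replace them by some character sequence" part of the question. A rough sketch; the handler name mark_bad_bytes and the marker format are made up for this example:

```python
import codecs

def mark_bad_bytes(exc):
    """Replace each undecodable byte with a visible marker such as <0x92>."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    bad = exc.object[exc.start:exc.end]
    marker = "".join("<0x{:02x}>".format(b) for b in bad)
    # Return the replacement text and the position at which to resume decoding.
    return marker, exc.end

# Register the handler under a hypothetical name of our choosing.
codecs.register_error("mark_bad_bytes", mark_bad_bytes)

text = b"it\x92s broken".decode("utf-8", errors="mark_bad_bytes")
```

The registered name should also be usable as the errors argument to open, since open accepts any handler name known to the codecs module.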

    Alternatively, you can open the files in binary ('rb'), so that no
    decoding will be attempted at all, or you can specify latin-1 as
    the encoding, which means that you can decode all files successfully
    (though possibly not correctly).

    Regards,
    Martin
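The remaining options from this reply can be compared side by side. This is only a sketch: the temporary file and its contents are invented, and utf-8 is assumed as the intended encoding.

```python
import os
import tempfile

# Throwaway file with one byte (0x92) that is invalid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"it\x92s broken\n")
    path = tmp.name

# errors='ignore': the undecodable byte is dropped.
with open(path, "r", encoding="utf-8", errors="ignore") as f:
    ignored = f.read()

# Binary mode: no decoding is attempted, iteration yields bytes objects.
with open(path, "rb") as f:
    raw_lines = list(f)

# latin-1 maps every byte to a code point, so decoding cannot fail,
# but 0x92 comes out as the control character U+0092.
with open(path, "r", encoding="latin-1") as f:
    latin1 = f.read()

os.remove(path)
```

The latin-1 result shows what "though possibly not correctly" means in practice. A stray 0x92 often points to Windows-1252 text, where that byte is a right single quotation mark, but that is a guess about the source files and not something the thread confirms.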

Discussion Overview
group: python-list @
categories: python
posted: Dec 6, '09 at 8:04p
active: Dec 8, '09 at 6:26p
posts: 5
users: 4
website: python.org
