Hi experts,

I'm new to Python. I'm writing crawlers to extract data from some
Chinese websites, and I've run into an encoding problem.

I have a unicode object that looks like this: u'\xd6\xd0\xce\xc4'.
Its contents are encoded in "gb2312", but I have no idea how to convert
it back to UTF-8.

Re-creating the problem is easy.

This works:
============================
>>> su = u"中文".encode('gb2312')
>>> su
'\xd6\xd0\xce\xc4'
>>> print su.decode('gb2312')
中文 -> (same as the original string)

============================
But this doesn't. Why?
===========================
>>> su = u'\xd6\xd0\xce\xc4'
>>> su
u'\xd6\xd0\xce\xc4'
>>> print su.decode('gb2312')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
===========================

thank you

  • Chris Rebert at Apr 1, 2010 at 11:22 am

    2010/4/1 Mister Yu <eryan.yu at gmail.com>:
    > hi experts,
    >
    > i m new to python, i m writing crawlers to extract data from some
    > chinese websites, and i run into a encoding problem.
    >
    > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
    > which is encoded in "gb2312",

    No! Instances of type 'unicode' (i.e. strings with a leading 'u')
    ***aren't encoded at all***.

    > but i have no idea of how to convert it
    > back to utf-8

    To convert u'\xd6\xd0\xce\xc4' to UTF-8, do u'\xd6\xd0\xce\xc4'.encode('utf-8')

    > to re-create this one is easy:
    >
    > this will work
    > ============================
    > su = u"中文".encode('gb2312')
    > su
    > '\xd6\xd0\xce\xc4'
    > print su.decode('gb2312')
    > 中文 -> (same as the original string)
    >
    > ============================
    > but this doesn't,why
    > ===========================
    > su = u'\xd6\xd0\xce\xc4'
    > su
    > u'\xd6\xd0\xce\xc4'
    > print su.decode('gb2312')

    You can't decode a unicode string, it's already been decoded!

    One decodes a bytestring to get a unicode string.
    One **encodes** a unicode string to get a bytestring.

    So the last line of your example should be:
    print su.encode('gb2312')

    Only call .encode() on things of type 'unicode'.
    Only call .decode() on things of type 'str'.
    [When using Python 2.x that is. Python 3.x renames the types in question.]
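
For readers on Python 3, the same rule can be sketched with the renamed types: `bytes` is the encoded form and `str` is decoded text. This is only an illustration, not part of the original session, and the variable names are made up. (In Python 2, calling .decode() on a unicode object first encodes it implicitly with the ASCII codec, which is where the UnicodeEncodeError in the traceback comes from.)

```python
# Python 3: str is decoded text, bytes is the encoded form.
text = "\u4e2d\u6587"  # the two characters from the question

gb2312_bytes = text.encode("gb2312")        # str -> bytes
utf8_bytes = text.encode("utf-8")           # same text, different encoding
round_trip = gb2312_bytes.decode("gb2312")  # bytes -> str

print(gb2312_bytes)
print(utf8_bytes)
print(round_trip == text)

# str has no .decode() in Python 3, so the mistake from the question
# fails right away with AttributeError instead of an implicit ASCII step.
has_decode = hasattr(text, "decode")
print(has_decode)
```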

    Cheers,
    Chris
  • Mister Yu at Apr 1, 2010 at 11:38 am

    Hi, thanks for the tips.

    But I'm still not sure how to convert the unicode object
    u'\xd6\xd0\xce\xc4' back to "中文", the string it is supposed to be.

    Thanks.

    Sorry, I'm really new to Python.
  • Chris Rebert at Apr 1, 2010 at 12:13 pm

    On Thu, Apr 1, 2010 at 4:38 AM, Mister Yu wrote:
    > hi, thanks for the tips.
    >
    > but i m still not very sure how to convert a unicode object
    > u'\xd6\xd0\xce\xc4' back to "中文" the string it supposed to be?

    Ah, my apologies! I overlooked something (sorry, it's early in the
    morning where I am).
    What you have there is ***really*** screwy. It's the 2 Chinese
    characters, encoded in gb2312, and then somehow cast *directly* into a
    'unicode' string (which ought never to be done).

    In answer to your original question (after some experimentation):
    gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
    unicode_string = gb2312_bytes.decode('gb2312')
    utf8_bytes = unicode_string.encode('utf-8') #as you wanted
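
A rough Python 3 rendering of the three steps above, for comparison. It assumes every character in the input has a code point below 256, which is what the broken string looks like; `bytes()` raises ValueError otherwise. The name `mojibake` is just a label chosen here.

```python
mojibake = "\xd6\xd0\xce\xc4"  # GB2312 bytes wrongly stored as code points

# Step 1: turn each code point back into the byte it came from.
gb2312_bytes = bytes(ord(c) for c in mojibake)
# Step 2: decode those bytes with the encoding the page really used.
unicode_string = gb2312_bytes.decode("gb2312")
# Step 3: encode to UTF-8, as asked in the original question.
utf8_bytes = unicode_string.encode("utf-8")

print(unicode_string)
```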

    If possible, I'd look at the code that's giving you that funky
    "string" in the first place and see if it can be fixed to give you
    either a proper bytestring or proper unicode string rather than the
    bastardized mess you're currently having to deal with.

    Apologies again and Cheers,
    Chris
  • Stefan Behnel at Apr 1, 2010 at 12:16 pm

    Mister Yu, 01.04.2010 13:38:
    > i m still not very sure how to convert a unicode object
    > u'\xd6\xd0\xce\xc4' back to "中文" the string it supposed to be?

    You are confused. '\xd6\xd0\xce\xc4' is an encoded byte string, not a
    unicode string. The fact that you have it stored in a unicode string
    implies that something in your code (or in a library) has done an incorrect
    conversion from bytes to unicode that did not take into account the real
    character set in use. So you end up with a completely meaningless unicode
    string.

    Please show us the code that does the conversion to a unicode string.

    Stefan
  • Mister Yu at Apr 1, 2010 at 12:26 pm

    Hi Chris,

    Thanks for the great tips! It works like a charm.

    I'm using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
    to write my crawler. When it extracts data with XPath, it puts the
    Chinese characters directly into the unicode object.

    Thanks again, Chris, and have a good April Fools' Day.

    Cheers,
    Yu
  • Stefan Behnel at Apr 1, 2010 at 1:31 pm

    Mister Yu, 01.04.2010 14:26:
    > On Apr 1, 8:13 pm, Chris Rebert wrote:
    >> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
    >> unicode_string = gb2312_bytes.decode('gb2312')
    >> utf8_bytes = unicode_string.encode('utf-8') #as you wanted

    Simplifying this hack a bit:

    gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
    unicode_string = gb2312_bytes.decode('gb2312')
    utf8_bytes = unicode_string.encode('utf-8')

    Although I have to wonder why you want a UTF-8 encoded byte string as
    output instead of Unicode.
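
Why ISO-8859-1 works here: that codec maps the code points 0 to 255 one to one onto the bytes 0 to 255, so encoding with it undoes a wrong Latin-1 decode without losing anything. A short Python 3 check of that property (a sketch added for illustration, not part of Stefan's message):

```python
all_bytes = bytes(range(256))

# Decoding as ISO-8859-1 never fails and maps byte n to code point n ...
as_text = all_bytes.decode("iso-8859-1")
one_to_one = all(ord(ch) == b for ch, b in zip(as_text, all_bytes))

# ... so encoding again returns the original bytes unchanged.
lossless = as_text.encode("iso-8859-1") == all_bytes

# Applied to the string from the thread:
recovered = "\xd6\xd0\xce\xc4".encode("iso-8859-1").decode("gb2312")
print(one_to_one, lossless, recovered)
```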

    >> If possible, I'd look at the code that's giving you that funky
    >> "string" in the first place and see if it can be fixed to give you
    >> either a proper bytestring or proper unicode string rather than the
    >> bastardized mess you're currently having to deal with.
    > thanks for the great tips! it works like a charm.

    I hope you're aware that it's a big ugly hack, though. You should really
    try to fix your input instead.

    > i m using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
    > to write my crawler, when it extract data with xpath,
    > it puts the chinese characters directly into the unicode object.

    My guess is that the HTML page you are parsing is broken and doesn't
    specify its encoding. In that case, all that scrapy can do is guess, and it
    seems to have guessed incorrectly.

    You should check if there is a way to tell scrapy about the expected page
    encoding, so that it can return correctly decoded unicode strings directly,
    instead of resorting to dirty hacks that may or may not work depending on
    the page you are parsing.
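
To make the diagnosis concrete, here is a small Python 3 sketch of how a wrong charset guess yields exactly the string from the original question. That the crawler fell back to Latin-1 is an assumption for illustration; the thread does not confirm which encoding Scrapy actually guessed.

```python
page_bytes = "\u4e2d\u6587".encode("gb2312")  # what the server sends

wrong_guess = page_bytes.decode("latin-1")   # assumed bad fallback
right_charset = page_bytes.decode("gb2312")  # the declared page encoding

print(repr(wrong_guess))
print(right_charset)
```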

    Stefan
  • Mister Yu at Apr 1, 2010 at 2:53 pm

    Hi Stefan,

    I don't think the page is broken. You can take a look at the page
    http://www.7176.com/Sections/Genre/Comedy ; it's kind of like a
    Chinese IMDb rip-off.

    From what I can see in the page source, the header contains the
    encoding info:
    <HTML><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
    <meta http-equiv="Content-Language" content="zh-CN" />
    <meta content="all" name="robots" />
    <meta name="author" content="admin(at)7176.com" />
    <meta name="Copyright" content="www.7176.com" />
    <meta content="??? ?? ????? ?1?" name="keywords" />
    <TITLE>??? ?? ????? ?1?</TITLE>
    <LINK href="http://www.7176.com/images/pro.css" rel=stylesheet></HEAD>
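
Since the page declares its charset in a meta tag, the declared value can be read from the raw bytes before decoding. The helper below is hypothetical: its name, the 2048-byte window, and the regular expression are choices made for this sketch, and it is not how Scrapy or a real HTML parser does it. It only shows the idea in Python 3.

```python
import re

# Matches charset=gb2312 as well as charset="gb2312" (hypothetical pattern).
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE)

def sniff_meta_charset(html_bytes, default="utf-8"):
    """Return the charset named near the top of the page, or `default`."""
    match = _CHARSET_RE.search(html_bytes[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()

head = (b'<HTML><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=gb2312" />')
print(sniff_meta_charset(head))
```

The sniffed name could then be passed to `bytes.decode()` so the text is decoded correctly once, instead of being repaired afterwards.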

    Maybe I should take a look at the Scrapy source code, but I've been
    using Python for less than a week, so I'm not sure I can understand it.

    Chris's workaround looked fine until it met a string like this:
    >>> su = u'一二三四 12345 一二三四'
    >>> su
    u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'
    >>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: chr() arg not in range(256)

    The digits don't get encoded, so it messes up the code.

    Any ideas?

    Once again, thanks for everybody's help!
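
One reading of the traceback: the second string already contains real code points such as U+4E00, so it was decoded correctly and needs no repair. chr() fails because those values are above 255, not because of the digits. A hedged Python 3 sketch of a guard that applies the workaround only when it can apply (the function name and the gb2312 default are assumptions made here):

```python
def repair_if_mojibake(text, real_encoding="gb2312"):
    """Undo a wrong Latin-1 decode when possible, else return text as-is."""
    try:
        # Only succeeds if every code point is below 256.
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        return text  # contains real non-Latin-1 characters: leave alone
    try:
        return raw.decode(real_encoding)
    except UnicodeDecodeError:
        return text  # bytes are not valid in real_encoding: leave alone

broken = "\xd6\xd0\xce\xc4"
already_fine = "\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db"

print(repair_if_mojibake(broken))
print(repair_if_mojibake(already_fine) == already_fine)
```

This remains a workaround, as the replies above point out; getting the crawler to decode with the right charset in the first place is the cleaner fix.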

Discussion Overview
group: python-list @
categories: python
posted: Apr 1, '10 at 10:56a
active: Apr 1, '10 at 2:53p
posts: 8
users: 3
website: python.org
