Hello

I have a text file which is encoded in Latin1 (ISO-8859-1). I can't change that, as I do not create those files myself.

I have to read those files and convert the umlauts like ö to entities like &ouml;, as the text files should become HTML files.

I have this code:


#!/usr/bin/python
# -*- coding: latin1 -*-

import codecs

f = codecs.open('abc.txt', encoding='latin1')

for line in f:
    print line
    for c in line:
        if c == "ö":
            print "oe"
        else:
            print c


and I get this error message:

$ ./read.py
Abc

./read.py:11: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if c == "ö":
A
b
c



Traceback (most recent call last):
  File "./read.py", line 9, in <module>
    print line
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)




I checked the web and tried several approaches but I also get some strange encoding errors.
Has anyone ever done this before?
I am currently using Python 2.5 and may be able to use 2.6 but I cannot yet move to 3.1 as many libs we use don't yet work with Python 3.

Any help is more than welcome. This has been driving me crazy for two days now.

Best wishes

Claus


  • Stefan Behnel at Jul 7, 2009 at 2:04 pm

    Claus Hausberger wrote:
        for line in f:
            print line
            for c in line:
                if c == "ö":

    You are reading Unicode strings, so you have to compare each character to a
    Unicode string, as in

                if c == u"ö":
                    print "oe"
                else:
                    print c

    Note that printing non-ASCII characters may not always work, depending on
    your terminal.
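
    As a minimal sketch of that comparison (not code from the thread): the
    helper name transliterate and the sample bytes below are hypothetical, and
    the b"..." literal needs Python 2.6 or later. The point it illustrates is
    that the line is decoded once and then only compared against Unicode
    literals.

```python
# -*- coding: utf-8 -*-
# Sketch only: `transliterate` and the sample data are illustrative assumptions.

def transliterate(line):
    """Return `line` (a Unicode string) with o-umlaut spelled out as 'oe'."""
    pieces = []
    for c in line:
        # u"\xf6" is the o-umlaut. Both operands are Unicode, so Python 2
        # does not attempt an implicit ASCII decode (no UnicodeWarning).
        if c == u"\xf6":
            pieces.append(u"oe")
        else:
            pieces.append(c)
    return u"".join(pieces)

raw = b"R\xf6hre"             # bytes as they would appear in a Latin-1 file
line = raw.decode("latin1")   # decode once, right after reading
result = transliterate(line)
```

    Writing the character as the escape u"\xf6" sidesteps any mismatch between
    the source file's real encoding and its coding declaration.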

    Stefan
  • Michiel Overtoom at Jul 7, 2009 at 3:05 pm

    umlaut-in.txt:
    ----
    This file is contains data in the unicode
    character set and is encoded with utf-8.
    Viele Röhre. Macht spaß! Tsüsch!


    umlaut-in.txt hexdump:
    ----
    000000: 54 68 69 73 20 66 69 6C 65 20 69 73 20 63 6F 6E This file is con
    000010: 74 61 69 6E 73 20 64 61 74 61 20 69 6E 20 74 68 tains data in th
    000020: 65 20 75 6E 69 63 6F 64 65 0D 0A 63 68 61 72 61 e unicode..chara
    000030: 63 74 65 72 20 73 65 74 20 61 6E 64 20 69 73 20 cter set and is
    000040: 65 6E 63 6F 64 65 64 20 77 69 74 68 20 75 74 66 encoded with utf
    000050: 2D 38 2E 0D 0A 56 69 65 6C 65 20 52 C3 B6 68 72 -8...Viele R..hr
    000060: 65 2E 20 4D 61 63 68 74 20 73 70 61 C3 9F 21 20 e. Macht spa..!
    000070: 20 54 73 C3 BC 73 63 68 21 0D 0A 00 00 00 00 00 Ts..sch!.......


    umlaut.py:
    ----
    # -*- coding: utf-8 -*-
    import codecs
    text=codecs.open("umlaut-in.txt",encoding="utf-8").read()
    text=text.replace(u"ö",u"oe")
    text=text.replace(u"ß",u"ss")
    text=text.replace(u"ü",u"ue")
    of=open("umlaut-out.txt","w")
    of.write(text)
    of.close()


    umlaut-out.txt:
    ----
    This file is contains data in the unicode
    character set and is encoded with utf-8.
    Viele Roehre. Macht spass! Tsuesch!


    umlaut-out.txt hexdump:
    ----
    000000: 54 68 69 73 20 66 69 6C 65 20 69 73 20 63 6F 6E This file is con
    000010: 74 61 69 6E 73 20 64 61 74 61 20 69 6E 20 74 68 tains data in th
    000020: 65 20 75 6E 69 63 6F 64 65 0D 0D 0A 63 68 61 72 e unicode...char
    000030: 61 63 74 65 72 20 73 65 74 20 61 6E 64 20 69 73 acter set and is
    000040: 20 65 6E 63 6F 64 65 64 20 77 69 74 68 20 75 74 encoded with ut
    000050: 66 2D 38 2E 0D 0D 0A 56 69 65 6C 65 20 52 6F 65 f-8....Viele Roe
    000060: 68 72 65 2E 20 4D 61 63 68 74 20 73 70 61 73 73 hre. Macht spass
    000070: 21 20 20 54 73 75 65 73 63 68 21 0D 0D 0A 00 00 ! Tsuesch!.....
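
    The same decode-then-replace pattern could also produce the HTML entities
    the original question asked for. The following is only a sketch under
    assumptions, not part of the posted script: the helper name
    to_html_entities is hypothetical, and the sample text is decoded from
    Latin-1 bytes as in the question. It relies on the standard library's
    codepoint2name table (htmlentitydefs in Python 2, html.entities in
    Python 3).

```python
# -*- coding: utf-8 -*-
# Sketch only: `to_html_entities` is an illustrative helper, not from the thread.
try:
    from html.entities import codepoint2name   # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2

def to_html_entities(text):
    """Replace each non-ASCII character that has a named HTML entity."""
    pieces = []
    for c in text:
        code = ord(c)
        if code > 127 and code in codepoint2name:
            pieces.append(u"&" + codepoint2name[code] + u";")
        else:
            # ASCII (and anything without a named entity) passes through.
            pieces.append(c)
    return u"".join(pieces)

# The sample sentence from this post, as Latin-1 bytes.
sample = b"Viele R\xf6hre. Macht spa\xdf! Ts\xfcsch!".decode("latin1")
html = to_html_entities(sample)
```

    The result contains only ASCII characters, so it can be written to a
    plain byte file. Escaping of &, < and > is deliberately left out here and
    would need to be handled separately.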





    --
    "The ability of the OSS process to collect and harness
    the collective IQ of thousands of individuals across
    the Internet is simply amazing." - Vinod Valloppillil
    http://www.catb.org/~esr/halloween/halloween4.html
  • Claus Hausberger at Jul 7, 2009 at 7:00 pm
    Thanks a lot. Now I am one step further, but I get another strange error:

    Traceback (most recent call last):
      File "./read.py", line 12, in <module>
        of.write(text)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

    According to Google, \ufeff has something to do with byte order.

    I use a Linux system; maybe this helps to find the error.

    Claus
  • MRAB at Jul 7, 2009 at 8:40 pm

    'text' contains Unicode, but you're writing it to a file that's not
    opened for Unicode. Either open the output file for Unicode:

    of = codecs.open("umlaut-out.txt", "w", encoding="latin1")

    or encode the text before writing:

    text = text.encode("latin1")

    (I'm assuming you want the output file to be in Latin1.)
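
    A sketch of the first option, with one extra assumption spelled out: the
    u'\ufeff' in the traceback is what a byte order mark decodes to, which
    suggests (the input file itself is not shown) that the file starts with a
    UTF-8 BOM. U+FEFF has no Latin-1 representation, so it would have to be
    dropped before writing; the "utf-8-sig" codec does that while decoding.
    The file names and the temporary directory below are hypothetical demo
    scaffolding.

```python
# -*- coding: utf-8 -*-
# Sketch only: paths and sample data are illustrative assumptions.
import codecs
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "umlaut-in.txt")
out_path = os.path.join(workdir, "umlaut-out.txt")

# Build a small UTF-8 input that starts with a byte order mark.
f = open(in_path, "wb")
f.write(codecs.BOM_UTF8 + u"Viele R\xf6hre".encode("utf-8"))
f.close()

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
f = codecs.open(in_path, encoding="utf-8-sig")
text = f.read()
f.close()

# Opened through codecs.open, the output file accepts Unicode strings and
# encodes them as Latin-1 when writing.
of = codecs.open(out_path, "w", encoding="latin1")
of.write(text)
of.close()

# Read the raw bytes back to see what actually landed on disk.
f = open(out_path, "rb")
written = f.read()
f.close()

shutil.rmtree(workdir)
```

    If the input really is Latin-1, as the original question states, it would
    be opened with encoding="latin1" instead and no BOM handling is needed.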
  • Claus Hausberger at Jul 9, 2009 at 7:41 am
    Thanks a lot. I will try that on the weekend.

    Claus
  • Stefan Behnel at Jul 7, 2009 at 4:50 pm

    Michiel Overtoom wrote:
    Viele Röhre. Macht spaß! Tsüsch!

    LOL! :)

    Stefan

Discussion Overview
group: python-list
category: python
posted: Jul 7, '09 at 1:59p
active: Jul 9, '09 at 7:41a
posts: 7
users: 4
website: python.org
