Hi!
I have a set of strings (all letters are capitalized) in UTF-8,
in Russian. I need to convert them to lowercase, but
my_string.lower() doesn't work.
See the sample script:
# -*- coding: utf-8 -*-
[skip]
s1 = self.title
s2 = self.title.lower()
print s1 == s2

prints True.
I have no problems with lower() for English letters, or with
something like this:
u'russian_letters_here'.lower(), but I don't need constants. I need to
modify variables, and nothing changes when I apply the lower()
method to my strings.
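
For readers who want to reproduce the symptom, here is a minimal sketch. It assumes that self.title holds UTF-8 encoded bytes (the post does not show where the value comes from), and the names raw_title and text_title are made up for illustration. It is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Assumption: the title arrives as UTF-8 encoded bytes, e.g. from a file or a database.
raw_title = u'ПРИВЕТ'.encode('utf-8')

# On the byte string, lower() leaves the Cyrillic bytes alone
# (Python 3 always; Python 2 in the default C locale).
unchanged = raw_title.lower() == raw_title

# On the decoded Unicode string, lower() really changes the letters.
text_title = raw_title.decode('utf-8')
changed = text_title.lower() != text_title

print(unchanged)
print(changed)
```

Both flags should come out true, which matches the behaviour described in the question: the comparison only looks wrong because the value being lowered is bytes, not text.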


  • Diez B. Roggisch at Oct 5, 2008 at 9:15 pm

    Can you give a concrete example? I doubt that there is any
    difference between lowering a unicode object given as a literal and one
    acquired somewhere else. And because my Russian skills equal my Chinese
    (a total of zero), I can't create a test myself :)
  • Martin v. Löwis at Oct 5, 2008 at 9:30 pm
    > I have a set of strings (all letters are capitalized) at utf-8,
    That's the problem. If these are really utf-8 encoded byte strings,
    then .lower likely won't work. It uses the C library's tolower API,
    which works on a byte level, i.e. can't work for multi-byte encodings.

    What you need to do is to operate on Unicode strings. I.e. instead
    of

    s.lower()

    do

    s.decode("utf-8").lower()

    or (if you need byte strings back)

    s.decode("utf-8").lower().encode("utf-8")

    If you find that you write the latter, I recommend that you redesign
    your application. Don't use byte strings to represent text, but use
    Unicode strings all the time, except at the system boundary (where
    you decode/encode as appropriate).

    There are some limitations with Unicode .lower also, but I don't
    think they apply to Russian (specifically, SpecialCasing.txt is
    not considered).

    HTH,
    Martin
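
A minimal sketch of the decode, lower, encode round trip described in the post above. The helper name lower_utf8 is hypothetical and not part of any library, and the input is assumed to be valid UTF-8. It runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

def lower_utf8(raw):
    """Hypothetical helper: lowercase UTF-8 encoded bytes, returning UTF-8 bytes."""
    text = raw.decode('utf-8')       # boundary in: bytes -> Unicode text
    lowered = text.lower()           # case mapping works on characters, not bytes
    return lowered.encode('utf-8')   # boundary out: Unicode text -> bytes

raw = u'МОСКВА'.encode('utf-8')
result = lower_utf8(raw)
print(result.decode('utf-8') == u'москва')
```

As the post suggests, a helper like this is only worth having at the edge of a program. Inside the application it is usually simpler to keep the decoded text and call .lower() on it directly.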
  • Alexey Moskvin at Oct 6, 2008 at 4:39 am
    Martin, thanks for the fast reply, everything is OK now!
  • Konstantin at Oct 6, 2008 at 11:35 am

    Alexey,

    if your strings are stored in a text file, you can use the "codecs" module:
    import codecs
    handler = codecs.open('somefile', 'r', 'utf-8')
    # ... do the job
    handler.close()
    I prefer this way of dealing with Russian in UTF-8.

    Konstantin.
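
A self-contained sketch of the codecs approach from the post above. The temporary file and its two sample lines are invented here so the example can run anywhere; 'somefile' in the post would be your own data file.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Create a throwaway UTF-8 file so the sketch does not depend on existing data.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with codecs.open(path, 'w', 'utf-8') as out:
    out.write(u'ПЕРВАЯ СТРОКА\nВТОРАЯ СТРОКА\n')

# codecs.open decodes while reading, so every line is already a Unicode string
# and lower() handles the Cyrillic letters.
with codecs.open(path, 'r', 'utf-8') as handler:
    lowered = [line.rstrip(u'\n').lower() for line in handler]

os.remove(path)
print(lowered == [u'первая строка', u'вторая строка'])
```

The with statement closes the file even if the processing raises, which replaces the explicit handler.close() call.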

Discussion Overview
group: python-list
categories: python
posted: Oct 5, '08 at 8:58p
active: Oct 6, '08 at 11:35a
posts: 5
users: 4
website: python.org
