Hello!

While reading the XML Schema docs, I found some useful extensions
to regular expressions, such as the ability to specify a class of
characters. For example,

[\p{Lu}]

will match any uppercase letter.

Is this feature planned for Python's re module too?

Sincerely yours, Roman A.Suzi
--
- Petrozavodsk - Karelia - Russia - mailto:rnd at onego.ru -

  • Martin v. Löwis at Jul 30, 2002 at 10:27 am

    Roman Suzi <rnd at onego.ru> writes:

    > While reading the XML Schema docs, I found some useful extensions
    > to regular expressions, such as the ability to specify a class of
    > characters. For example,
    >
    > [\p{Lu}]
    >
    > will match any uppercase letter.
    >
    > Is this feature planned for Python's re module too?

    Python currently supports Unicode character classes by explicitly
    enumerating all characters, e.g.

    r=re.compile(u"[\u0400-\u04FF]")

    In addition, it extends the predefined categories (\d, \s, \w) to
    Unicode if the UNICODE flag is given:

    - \d (digit): Character has a 'digit value' property; covers all of Nd
      and most of No
    - \s (space): bidirectional type WS, B, S, or category Zs
    - \w (word): alpha (Ll, Lu, Lt, Lo, or Lm),
      decimal (has 'decimal value' property),
      digit,
      numeric (has 'numeric value' property),
      or '_'
    - (linebreak, currently not supported in sre_parse):
      Category Zl, or type B
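
To make the effect of the flag concrete, here is a small sketch written for present-day Python 3, which is an assumption on my part since the thread concerns Python 2.2. In Python 3, str patterns behave as if UNICODE were always given, and re.ASCII restricts \d, \s and \w to ASCII again, so the two columns show "with" and "without" Unicode-aware categories. The helper name `matches` is made up for this illustration.

```python
import re


def matches(pattern, ch, flags=0):
    """Return True if `pattern` matches the whole one-character string `ch`."""
    return re.fullmatch(pattern, ch, flags) is not None


ARABIC_INDIC_THREE = "\u0663"  # ARABIC-INDIC DIGIT THREE, category Nd
CYRILLIC_ZHE = "\u0416"        # CYRILLIC CAPITAL LETTER ZHE, category Lu
NO_BREAK_SPACE = "\u00a0"      # NO-BREAK SPACE, category Zs

for pattern, ch in ((r"\d", ARABIC_INDIC_THREE),
                    (r"\w", CYRILLIC_ZHE),
                    (r"\s", NO_BREAK_SPACE)):
    # Compare the Unicode-aware default with the ASCII-only behaviour.
    print(pattern, ascii(ch),
          "unicode:", matches(pattern, ch),
          "ascii-only:", matches(pattern, ch, re.ASCII))
```

Each of the three characters falls outside ASCII, so only the Unicode-aware match should succeed.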

    There has been talk about supporting the POSIX regular expression
    categories (alnum, cntrl, lower, space, alpha, digit, print, upper,
    blank, graph, punct, xdigit, plus any categories defined by LC_CTYPE);
    this is not implemented yet.

    So far, nobody has proposed to support Unicode categories in SRE. You
    can easily implement this yourself using unicodedata.category, e.g.

    import unicodedata, sys

    def gencategory(cat):
        start = end = None
        result = [u"["]
        for i in range(sys.maxunicode+1):
            c = unichr(i)
            if unicodedata.category(c) == cat:
                if start is None:
                    start = end = c
                else:
                    end = c
            elif start:
                # XXX: special-case ] and -
                if start == end:
                    result.append(start)
                else:
                    result.append(start + "-" + end)
                start = None
        result.append(u"]")
        return u"".join(result)

    print repr(gencategory("Lu"))

    It turns out that those categories are useless for XML, since the XML
    character classes (in XML 1.0) have been defined using a different
    Unicode version (XML uses the Unicode 2.0 database). The same appears
    to be the case for XML Schema: They use the Unicode 3.1 database;
    Python 2.2 has the Unicode 3.0 database.

    So to implement XML Schema, you probably have to parse the specific
    version of the Unicode database yourself, and construct the re class
    from that.
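
One way this could look, as a rough Python 3 sketch rather than a tested XML Schema implementation: read lines in the semicolon-separated UnicodeData.txt layout (code point, name, general category, ...), collect the ranges for one category, and turn them into a character class. The function names `category_ranges` and `ranges_to_class` are hypothetical, and the sample lines below are only a tiny illustrative excerpt in that layout; a real run would read the full file for the Unicode version you need.

```python
import re

# Illustrative excerpt in the UnicodeData.txt field layout (not a full database).
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FA5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]


def category_ranges(lines, cat):
    """Return sorted (first, last) code point ranges whose category is `cat`."""
    ranges = []
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            # Large blocks are listed as a First/Last pair of lines.
            range_start = code
            continue
        if name.endswith(", Last>") and range_start is not None:
            first = range_start
        else:
            first = code
        range_start = None
        if category != cat:
            continue
        if ranges and ranges[-1][1] + 1 == first:
            # Extend the previous range when code points are adjacent.
            ranges[-1] = (ranges[-1][0], code)
        else:
            ranges.append((first, code))
    return ranges


def ranges_to_class(ranges):
    """Render ranges as a regex class using \\UXXXXXXXX escapes."""
    def esc(cp):
        return "\\U%08x" % cp
    parts = [esc(a) if a == b else esc(a) + "-" + esc(b) for a, b in ranges]
    return "[" + "".join(parts) + "]"


lu = re.compile(ranges_to_class(category_ranges(SAMPLE, "Lu")))
print(category_ranges(SAMPLE, "Lu"), bool(lu.fullmatch("A")))
```

With a locally saved copy of the right database version, the same functions could be fed an open file object instead of `SAMPLE`, since they only iterate over lines.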

    Regards,
    Martin
  • Roman Suzi at Jul 30, 2002 at 12:10 pm

    On 30 Jul 2002, Martin v. Löwis wrote:

    > Roman Suzi <rnd at onego.ru> writes:
    >
    > So far, nobody has proposed to support Unicode categories in SRE. You
    > can easily implement this yourself using unicodedata.category, e.g.

    OK. Thanks.

    Probably, there should be pre-compiled categories somewhere in
    the standard library... Say, in the re module.

    > It turns out that those categories are useless for XML, since the XML
    > character classes (in XML 1.0) have been defined using a different
    > Unicode version (XML uses the Unicode 2.0 database). The same appears
    > to be the case for XML Schema: They use the Unicode 3.1 database;
    > Python 2.2 has the Unicode 3.0 database.

    Isn't Python one of the best choices for XML processing ;-)

    > So to implement XML Schema, you probably have to parse the specific
    > version of the Unicode database yourself, and construct the re class
    > from that.
    >
    > Regards,
    > Martin

    Sincerely yours, Roman A.Suzi
    --
    - Petrozavodsk - Karelia - Russia - mailto:rnd at onego.ru -
  • Martin v. Loewis at Jul 30, 2002 at 6:08 pm

    Roman Suzi <rnd at onego.ru> writes:

    > Probably, there should be pre-compiled categories somewhere in
    > the standard library... Say, in the re module.

    Feel free to submit a patch to SourceForge. A decision on that will be
    made by Fredrik Lundh, so you may want to contact him in advance.

    Regards,
    Martin

Discussion Overview
group: python-list @
categories: python
posted: Jul 30, '02 at 8:15a
active: Jul 30, '02 at 6:08p
posts: 4
users: 3
website: python.org
