FAQ
The regex below identifies words in all languages I tested, but not in
Hindi:

# -*- coding: utf-8 -*-

import re
pat = re.compile('^(\w+)$', re.U)
langs = ('English', '中文', 'हिन्दी')

for l in langs:
    m = pat.search(l.decode('utf-8'))
    print l, m and m.group(1)

Output:

English English
中文 中文
हिन्दी None
From this I assumed that the Hindi text contains punctuation or other
characters that prevent the word match. Now, even more baffling is
this:

pat = re.compile('^(\W+)$', re.U) # note: now \W

for l in langs:
    m = pat.search(l.decode('utf-8'))
    print l, m and m.group(1)

Output:

English None
中文 None
हिन्दी None

How can the Hindi be both not a word and "not not a word"??

Any clue would be much appreciated!

Best.

  • Peter Otten at Nov 28, 2008 at 4:29 pm

    Shiao wrote:

    The regex below identifies words in all languages I tested, but not in
    Hindi:

    # -*- coding: utf-8 -*-

    import re
    pat = re.compile('^(\w+)$', re.U)
    langs = ('English', '中文', 'हिन्दी')

    for l in langs:
        m = pat.search(l.decode('utf-8'))
        print l, m and m.group(1)

    Output:

    English English
    中文 中文
    हिन्दी None

    From this I assumed that the Hindi text contains punctuation or other
    characters that prevent the word match. Now, even more baffling is
    this:

    pat = re.compile('^(\W+)$', re.U) # note: now \W

    for l in langs:
        m = pat.search(l.decode('utf-8'))
        print l, m and m.group(1)

    Output:

    English None
    中文 None
    हिन्दी None

    How can the Hindi be both not a word and "not not a word"??

    Any clue would be much appreciated!

    It's not a word, but that doesn't mean that it consists entirely of
    non-alpha characters either. Here's what Python gets to see:
    >>> langs[2]
    u'\u0939\u093f\u0928\u094d\u0926\u0940'
    >>> from unicodedata import name
    >>> for c in langs[2]:
    ...     print repr(c), name(c), ["non-alpha", "ALPHA"][c.isalpha()]
    ...
    u'\u0939' DEVANAGARI LETTER HA ALPHA
    u'\u093f' DEVANAGARI VOWEL SIGN I non-alpha
    u'\u0928' DEVANAGARI LETTER NA ALPHA
    u'\u094d' DEVANAGARI SIGN VIRAMA non-alpha
    u'\u0926' DEVANAGARI LETTER DA ALPHA
    u'\u0940' DEVANAGARI VOWEL SIGN II non-alpha
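
    To see the same apparent paradox in miniature, here is a short sketch
    (the string u'a-b' is an added example, not from the thread): any
    string that mixes word and non-word characters matches neither
    ^(\w+)$ nor ^(\W+)$, because each anchored pattern requires every
    character to qualify.

    # -*- coding: utf-8 -*-
    import re
    s = u'a-b'  # mixes word characters with a non-word character
    print re.search(ur'^(\w+)$', s, re.U)  # None: '-' is not \w
    print re.search(ur'^(\W+)$', s, re.U)  # None: 'a' and 'b' are not \W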

    Peter
  • Jerry Hill at Nov 28, 2008 at 4:36 pm

    On Fri, Nov 28, 2008 at 10:47 AM, Shiao wrote:
    The regex below identifies words in all languages I tested, but not in
    Hindi:

    # -*- coding: utf-8 -*-

    import re
    pat = re.compile('^(\w+)$', re.U)
    langs = ('English', '中文', 'हिन्दी')
    I think the problem is that the Hindi Text contains both alphanumeric
    and non-alphanumeric characters. I'm not very familiar with Hindi,
    much less how it's held in unicode, but take a look at the output of
    this code:

    # -*- coding: utf-8 -*-
    import unicodedata as ucd

    langs = (u'English', u'中文', u'हिन्दी')
    for lang in langs:
        print lang
        for char in lang:
            print "\t %s %s (%s)" % (char, ucd.name(char), ucd.category(char))

    Output:

    English
    E LATIN CAPITAL LETTER E (Lu)
    n LATIN SMALL LETTER N (Ll)
    g LATIN SMALL LETTER G (Ll)
    l LATIN SMALL LETTER L (Ll)
    i LATIN SMALL LETTER I (Ll)
    s LATIN SMALL LETTER S (Ll)
    h LATIN SMALL LETTER H (Ll)
    中文
    中 CJK UNIFIED IDEOGRAPH-4E2D (Lo)
    文 CJK UNIFIED IDEOGRAPH-6587 (Lo)
    हिन्दी
    ह DEVANAGARI LETTER HA (Lo)
    ि DEVANAGARI VOWEL SIGN I (Mc)
    न DEVANAGARI LETTER NA (Lo)
    ् DEVANAGARI SIGN VIRAMA (Mn)
    द DEVANAGARI LETTER DA (Lo)
    ी DEVANAGARI VOWEL SIGN II (Mc)

    From that, we see that there are some characters in the Hindi string
    that aren't letters (they're not in unicode category L), but are
    instead marks (unicode category M).
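
    A sketch building on that category dump (the function name is an
    invention for illustration): treat a string as a word when every
    character is a letter (L*) or a mark (M*), sidestepping \w entirely.

    # -*- coding: utf-8 -*-
    import unicodedata as ucd

    def is_word(s):
        return all(ucd.category(c)[0] in 'LM' for c in s)

    print is_word(u'\u0939\u093f\u0928\u094d\u0926\u0940')  # True: the Hindi example
    print is_word(u'a-b')                                   # False: hyphen is Pd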
  • Terry Reedy at Nov 28, 2008 at 7:29 pm

    Jerry Hill wrote:
    On Fri, Nov 28, 2008 at 10:47 AM, Shiao wrote:
    The regex below identifies words in all languages I tested, but not in
    Hindi:

    # -*- coding: utf-8 -*-

    import re
    pat = re.compile('^(\w+)$', re.U)
    langs = ('English', '中文', 'हिन्दी')
    I think the problem is that the Hindi Text contains both alphanumeric
    and non-alphanumeric characters. I'm not very familiar with Hindi,
    much less how it's held in unicode, but take a look at the output of
    this code:

    # -*- coding: utf-8 -*-
    import unicodedata as ucd

    langs = (u'English', u'中文', u'हिन्दी')
    for lang in langs:
        print lang
        for char in lang:
            print "\t %s %s (%s)" % (char, ucd.name(char), ucd.category(char))

    Output:

    English
    E LATIN CAPITAL LETTER E (Lu)
    n LATIN SMALL LETTER N (Ll)
    g LATIN SMALL LETTER G (Ll)
    l LATIN SMALL LETTER L (Ll)
    i LATIN SMALL LETTER I (Ll)
    s LATIN SMALL LETTER S (Ll)
    h LATIN SMALL LETTER H (Ll)
    中文
    中 CJK UNIFIED IDEOGRAPH-4E2D (Lo)
    文 CJK UNIFIED IDEOGRAPH-6587 (Lo)
    हिन्दी
    ह DEVANAGARI LETTER HA (Lo)
    ि DEVANAGARI VOWEL SIGN I (Mc)
    न DEVANAGARI LETTER NA (Lo)
    ् DEVANAGARI SIGN VIRAMA (Mn)
    द DEVANAGARI LETTER DA (Lo)
    ी DEVANAGARI VOWEL SIGN II (Mc)

    From that, we see that there are some characters in the Hindi string
    that aren't letters (they're not in unicode category L), but are
    instead marks (unicode category M).
    Python 3.0 allows unicode identifiers. Mn and Mc characters are included
    in the set of allowed alphanumeric characters. 'Hindi' is a word in
    both its native characters and in Latin transliteration.

    http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords
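
    A quick Python 3 sketch of that point: the identifier grammar accepts
    the combining marks that 2.x re's \w rejects.

    s = '\u0939\u093f\u0928\u094d\u0926\u0940'  # 'हिन्दी'
    print(s.isidentifier())  # True in Python 3.0+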


    re is too restrictive in its definition of 'word'. I suggest that OP
    (original poster) Shiao file a bug report at http://bugs.python.org

    tjr
  • MRAB at Nov 28, 2008 at 7:41 pm

    Terry Reedy wrote:
    Jerry Hill wrote:
    On Fri, Nov 28, 2008 at 10:47 AM, Shiao wrote:
    The regex below identifies words in all languages I tested, but not in
    Hindi:

    # -*- coding: utf-8 -*-

    import re
    pat = re.compile('^(\w+)$', re.U)
    langs = ('English', '中文', 'हिन्दी')
    I think the problem is that the Hindi Text contains both alphanumeric
    and non-alphanumeric characters. I'm not very familiar with Hindi,
    much less how it's held in unicode, but take a look at the output of
    this code:

    # -*- coding: utf-8 -*-
    import unicodedata as ucd

    langs = (u'English', u'中文', u'हिन्दी')
    for lang in langs:
        print lang
        for char in lang:
            print "\t %s %s (%s)" % (char, ucd.name(char), ucd.category(char))

    Output:

    English
    E LATIN CAPITAL LETTER E (Lu)
    n LATIN SMALL LETTER N (Ll)
    g LATIN SMALL LETTER G (Ll)
    l LATIN SMALL LETTER L (Ll)
    i LATIN SMALL LETTER I (Ll)
    s LATIN SMALL LETTER S (Ll)
    h LATIN SMALL LETTER H (Ll)
    中文
    中 CJK UNIFIED IDEOGRAPH-4E2D (Lo)
    文 CJK UNIFIED IDEOGRAPH-6587 (Lo)
    हिन्दी
    ह DEVANAGARI LETTER HA (Lo)
    ि DEVANAGARI VOWEL SIGN I (Mc)
    न DEVANAGARI LETTER NA (Lo)
    ् DEVANAGARI SIGN VIRAMA (Mn)
    द DEVANAGARI LETTER DA (Lo)
    ी DEVANAGARI VOWEL SIGN II (Mc)

    From that, we see that there are some characters in the Hindi string
    that aren't letters (they're not in unicode category L), but are
    instead marks (unicode category M).
    Python 3.0 allows unicode identifiers. Mn and Mc characters are included
    in the set of allowed alphanumeric characters. 'Hindi' is a word in
    both its native characters and in Latin transliteration.

    http://docs.python.org/dev/3.0/reference/lexical_analysis.html#identifiers-and-keywords


    re is too restrictive in its definition of 'word'. I suggest that OP
    (original poster) Shiao file a bug report at http://bugs.python.org
    Should the Mc and Mn codepoints match \w in the re module even though
    u'हिन्दी'.isalpha() returns False (in Python 2.x, haven't tried Python
    3.x)? Issue 1693050 said no. Perhaps someone with knowledge of Hindi
    could suggest how Python should handle it. I wouldn't want the re module
    to say one thing and the rest of the language to say another! :-)
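
    A quick confirmation of the isalpha() point (Python 2 sketch, using
    the thread's Hindi example): the whole string fails even though its
    base letters pass individually.

    print u'\u0939\u093f\u0928\u094d\u0926\u0940'.isalpha()  # False: marks are not alpha
    print u'\u0939'.isalpha()  # True: DEVANAGARI LETTER HA alone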
  • Terry Reedy at Nov 28, 2008 at 9:15 pm

    MRAB wrote:

    Should the Mc and Mn codepoints match \w in the re module even though
    u'हिन्दी'.isalpha() returns False (in Python 2.x, haven't tried Python
    3.x)?
    Same. And to me, that is wrong. The condensation of vowel characters
    (which Hindi, etc., also have for words that begin with vowels) to 'vowel
    marks' attached to the previous consonant does not change their nature as
    indications of speech sounds. The difference is purely graphical.

    Issue 1693050 said no.
    The full url
    http://bugs.python.org/issue1693050
    would have been nice, but thank you for finding this. I searched but
    obviously not with the right word. In any case, this issue is still
    open. MAL is wrong about at least Mc and Mn. I will explain there also.
    Perhaps someone with knowledge of Hindi
    could suggest how Python should handle it.
    Recognize that vowel marks are parts of words, as it already does for identifiers.
    I wouldn't want the re module
    to say one thing and the rest of the language to say another! :-)
    I will add a note about .isalpha.

    Terry Jan Reedy
  • John Machin at Nov 28, 2008 at 11:05 pm

    On Nov 29, 2:47 am, Shiao wrote:
    The regex below identifies words in all languages I tested, but not in
    Hindi:
    pat = re.compile('^(\w+)$', re.U)
    ...
    m = pat.search(l.decode('utf-8'))
    [example snipped]
    From this I assumed that the Hindi text contains punctuation or other
    characters that prevent the word match.
    This appears to be a bug in Python, as others have pointed out. Two
    points not covered so far:

    (1) Instead of search() with pattern ^blahblah, use match() with
    pattern blahblah -- unless it has been fixed fairly recently, search()
    doesn't notice that the ^ means that it can give up when failure
    occurs at the first try; it keeps on trying futilely at the 2nd,
    3rd, .... positions.

    (2) "identifies words": \w+ (when fixed) matches a sequence of one or
    more characters that could appear *anywhere* in a word in any language
    (including computer languages). So it not only matches words, it also
    matches non-words like '123' and '0x000' and '0123_' and 10 viramas --
    in other words, you may need to filter out false positives. Also, in
    some languages (e.g. Chinese) a "word" consists of one or more
    characters and there is typically no spacing between "words"; \w+ will
    identify whole clauses or sentences.
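
    A short sketch of both points (Python 2; the test strings are added
    examples):

    import re
    pat = re.compile(ur'\w+$', re.U)
    # (1) match() anchors at position 0, so it fails fast where a
    #     '^'-prefixed search() would retry at every offset:
    print pat.match(u'English').group(0)  # u'English'
    # (2) \w+ also accepts non-words such as digit/underscore runs:
    print pat.match(u'0123_').group(0)    # u'0123_'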

    Cheers,
    John
  • MRAB at Nov 28, 2008 at 11:51 pm

    John Machin wrote:
    On Nov 29, 2:47 am, Shiao wrote:
    The regex below identifies words in all languages I tested, but not in
    Hindi:
    pat = re.compile('^(\w+)$', re.U)
    ...
    m = pat.search(l.decode('utf-8'))
    [example snipped]
    From this I assumed that the Hindi text contains punctuation or other
    characters that prevent the word match.
    This appears to be a bug in Python, as others have pointed out. Two
    points not covered so far:
    Well, not so much a bug as a lack of knowledge.
    (1) Instead of search() with pattern ^blahblah, use match() with
    pattern blahblah -- unless it has been fixed fairly recently, search()
    doesn't notice that the ^ means that it can give up when failure
    occurs at the first try; it keeps on trying futilely at the 2nd,
    3rd, .... positions.

    (2) "identifies words": \w+ (when fixed) matches a sequence of one or
    more characters that could appear *anywhere* in a word in any language
    (including computer languages). So it not only matches words, it also
    matches non-words like '123' and '0x000' and '0123_' and 10 viramas --
    in other words, you may need to filter out false positives. Also, in
    some languages (e.g. Chinese) a "word" consists of one or more
    characters and there is typically no spacing between "words"; \w+ will
    identify whole clauses or sentences.
    This is down to the definition of "word character". Should \w match Mc
    characters? Should \w match a single character or a non-combining
    character with any combining characters, ie just Lo or Lo, Lo+Mc,
    Lo+Mc+Mc, etc?
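
    A hypothetical sketch of the "together" reading (Python 2; building
    the character classes this way is an illustration, not the re module's
    actual behaviour): a word is one or more clusters, each a base letter
    followed by any number of combining marks.

    # -*- coding: utf-8 -*-
    import re, sys, unicodedata

    letters = u''.join(unichr(i) for i in xrange(sys.maxunicode + 1)
                       if unicodedata.category(unichr(i))[0] == 'L')
    marks = u''.join(unichr(i) for i in xrange(sys.maxunicode + 1)
                     if unicodedata.category(unichr(i))[0] == 'M')
    cluster_word = re.compile(u'(?:[%s][%s]*)+$' % (letters, marks))
    # The Hindi example parses as three letter-plus-marks clusters:
    print cluster_word.match(u'\u0939\u093f\u0928\u094d\u0926\u0940') is not None  # True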
  • John Machin at Nov 29, 2008 at 10:11 am

    On Nov 29, 10:51 am, MRAB wrote:
    John Machin wrote:
    On Nov 29, 2:47 am, Shiao wrote:
    The regex below identifies words in all languages I tested, but not in
    Hindi:
    pat = re.compile('^(\w+)$', re.U)
    ...
    m = pat.search(l.decode('utf-8'))
    [example snipped]
    From this I assumed that the Hindi text contains punctuation or other
    characters that prevent the word match.
    This appears to be a bug in Python, as others have pointed out. Two
    points not covered so far:
    Well, not so much a bug as a lack of knowledge.
    It's a bug. See below.
    (1) Instead of search() with pattern ^blahblah, use match() with
    pattern blahblah -- unless it has been fixed fairly recently, search()
    doesn't notice that the ^ means that it can give up when failure
    occurs at the first try; it keeps on trying futilely at the 2nd,
    3rd, .... positions.
    (2) "identifies words": \w+ (when fixed) matches a sequence of one or
    more characters that could appear *anywhere* in a word in any language
    (including computer languages). So it not only matches words, it also
    matches non-words like '123' and '0x000' and '0123_' and 10 viramas --
    in other words, you may need to filter out false positives. Also, in
    some languages (e.g. Chinese) a "word" consists of one or more
    characters and there is typically no spacing between "words"; \w+ will
    identify whole clauses or sentences.
    This is down to the definition of "word character".
    What is "This"? The two additional points I'm making have nothing to
    do with \w.
    Should \w match Mc
    characters? Should \w match a single character or a non-combining
    character with any combining characters, ie just Lo or Lo, Lo+Mc,
    Lo+Mc+Mc, etc?
    Huh? I thought it was settled. Read Terry Reedy's latest message. Read
    the bug report it points to (http://bugs.python.org/issue1693050),
    especially the contribution from MvL. To paraphrase a remark by the
    timbot, Martin reads Unicode tech reports so that we don't have to.
    However if you are a doubter or have insomnia, read http://unicode.org/reports/tr18/

    Cheers,
    John
  • Martin v. Löwis at Nov 29, 2008 at 11:06 am

    Huh? I thought it was settled. Read Terry Reedy's latest message. Read
    the bug report it points to (http://bugs.python.org/issue1693050),
    especially the contribution from MvL. To paraphrase a remark by the
    timbot, Martin reads Unicode tech reports so that we don't have to.
    However if you are a doubter or have insomnia, read http://unicode.org/reports/tr18/
    To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least
    annex C was added somewhere between revision 6 and 9, i.e. in early
    2004. Python's current definition of \w is a straightforward extension
    of the historical \w definition (of Perl, I believe), which,
    unfortunately, fails to recognize some of the Unicode subtleties.

    In any case, the desired definition is very well available in Python
    today - one just has to define a character class that contains all
    characters that one thinks \w should contain, e.g. with the code below.
    While the regular expression source becomes very large, the compiled
    form will be fairly compact, and efficient in lookup.

    Regards,
    Martin

    # UTR#18 says \w is
    # \p{alpha}\p{gc=Mark}\p{digit}\p{gc=Connector_Punctuation}
    #
    # In turn, \p{alpha} is \p{Alphabetic}, which, in turn
    # is Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic
    # Other_Alphabetic can be ignored: it is a fixed list of
    # characters from Mn and Mc, which are included, anyway
    #
    # \p{digit} is \p{gc=Decimal_Number}, i.e. Nd
    # \p{gc=Mark} is all Mark category, i.e. Mc, Me, Mn
    # \p{gc=Connector_Punctuation} is Pc
    def make_w():
        import unicodedata, sys
        w_chars = []
        for i in range(sys.maxunicode):
            c = unichr(i)
            if unicodedata.category(c) in \
                    ('Lu','Ll','Lt','Lm','Lo','Nl','Nd',
                     'Mc','Me','Mn','Pc'):
                w_chars.append(c)
        return u'['+u''.join(w_chars)+u']'

    import re
    re.compile(make_w())
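
    For completeness, a hedged usage sketch: anchoring the generated class
    reproduces the thread's original test, which now succeeds for the
    Hindi string.

    pat = re.compile(u'^(%s+)$' % make_w())
    print pat.match(u'\u0939\u093f\u0928\u094d\u0926\u0940') is not None  # True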
  • Terry Reedy at Nov 29, 2008 at 5:33 pm

    Martin v. Löwis wrote:

    To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least
    annex C was added somewhere between revision 6 and 9, i.e. in early
    2004. Python's current definition of \w is a straightforward extension
    of the historical \w definition (of Perl, I believe), which,
    unfortunately, fails to recognize some of the Unicode subtleties.
    I agree about not dumping on the past. When unicode support was added
    to re, it was a somewhat experimental advance over bytes-only re. Now
    that Python has spread to south Asia as well as east Asia, it is time to
    advance it further. I think this is especially important for 3.0, which
    will attract such users with the option of native identifiers. Re
    should be able to recognize Python identifiers as words. I care not
    whether the patch is called a fix or an update.

    I have no personal need for this at the moment but it just happens that
    I studied Sanskrit a bit some years ago and understand the script and
    could explain why at least some 'marks' are really 'letters'. There are
    several other south Asian scripts descended from Devanagari, and
    included in Unicode, that use the same or similar vowel mark system. So
    updating Python's idea of a Unicode word will help users of several
    languages and make it more of a world language.

    I presume that not viewing letter marks as part of words would affect
    Hebrew and Arabic also.

    I wonder if the current rule also affects European words with accents
    written as separate marks instead of as part of combined characters.
    For instance, if Martin's last name is written 'L' 'o' 'diaeresis mark'
    'w' 'i' 's' (6 chars) instead of 'L' 'o with diaeresis' 'w' 'i' 's' (5
    chars), is it still recognized as a word? (I don't know how to do the
    input to do the test.)

    I notice from the manual "All identifiers are converted into the normal
    form NFC while parsing; comparison of identifiers is based on NFC." If
    NFC uses precomposed accented letters, then the issue is finessed away
    for European words simply because Unicode includes combined characters
    for European scripts but not for south Asian scripts.
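
    A small Python 2 sketch of that question (anticipating John Machin's
    answer below): the decomposed spelling stops \w at the combining mark,
    but NFC normalization restores the full match.

    # -*- coding: utf-8 -*-
    import re, unicodedata
    w2 = u'Lo\u0308wis'                       # 'o' + COMBINING DIAERESIS
    print re.match(ur'\w+$', w2, re.U)        # None: U+0308 is Mn, not \w
    w1 = unicodedata.normalize('NFC', w2)     # u'L\xf6wis'
    print re.match(ur'\w+$', w1, re.U).group(0)  # u'L\xf6wis'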

    Terry Jan Reedy
  • MRAB at Nov 29, 2008 at 6:20 pm

    Terry Reedy wrote:
    Martin v. Löwis wrote:
    To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least
    annex C was added somewhere between revision 6 and 9, i.e. in early
    2004. Python's current definition of \w is a straightforward extension
    of the historical \w definition (of Perl, I believe), which,
    unfortunately, fails to recognize some of the Unicode subtleties.
    I agree about not dumping on the past. When unicode support was added
    to re, it was a somewhat experimental advance over bytes-only re. Now
    that Python has spread to south Asia as well as east Asia, it is time to
    advance it further. I think this is especially important for 3.0, which
    will attract such users with the option of native identifiers. Re
    should be able to recognize Python identifiers as words. I care not
    whether the patch is called a fix or an update.

    I have no personal need for this at the moment but it just happens that
    I studied Sanskrit a bit some years ago and understand the script and
    could explain why at least some 'marks' are really 'letters'. There are
    several other south Asian scripts descended from Devanagari, and
    included in Unicode, that use the same or similar vowel mark system. So
    updating Python's idea of a Unicode word will help users of several
    languages and make it more of a world language.

    I presume that not viewing letter marks as part of words would affect
    Hebrew and Arabic also.

    I wonder if the current rule also affects European words with accents
    written as separate marks instead of as part of combined characters. For
    instance, if Martin's last name is written 'L' 'o' 'diaeresis mark' 'w'
    'i' 's' (6 chars) instead of 'L' 'o with diaeresis' 'w' 'i' 's' (5
    chars), is it still recognized as a word? (I don't know how to do the
    input to do the test.)

    I notice from the manual "All identifiers are converted into the normal
    form NFC while parsing; comparison of identifiers is based on NFC." If
    NFC uses precomposed accented letters, then the issue is finessed away
    for European words simply because Unicode includes combined characters
    for European scripts but not for south Asian scripts.
    Does that mean that the re module will need to convert both the pattern
    and the text to be searched into NFC form first? And I'm still not clear
    whether \w, when used on a string consisting of Lo followed by Mc,
    should match Lo and then Mc (one codepoint at a time) or together (one
    character at a time, where a character consists of some base character
    codepoint possibly followed by modifier codepoints).

    I ask because I'm working on the re module at the moment.
  • Terry Reedy at Nov 29, 2008 at 10:43 pm

    MRAB wrote:
    Terry Reedy wrote:
    I notice from the manual "All identifiers are converted into the
    normal form NFC while parsing; comparison of identifiers is based on
    NFC." If NFC uses precomposed accented letters, then the issue is
    finessed away for European words simply because Unicode includes
    combined characters for European scripts but not for south Asian scripts.
    Does that mean that the re module will need to convert both the pattern
    and the text to be searched into NFC form first?
    The quote says that Python3 internally converts all identifiers in
    source code to NFC before compiling the code, so it can properly compare
    them. If this was purely an internal matter, this would not need to be
    said. I interpret the quote as a warning that a programmer who wants to
    compare a 3.0 string to an identifier represented as a string is
    responsible for making sure that *his* string is also in NFC. For instance:

    ident = 3
    ...
    if 'ident' in globals(): ...

    The second ident must be NFC even if the programmer prefers and
    habitually writes another form because, like it or not, the first one
    will be turned into NFC before insertion into the code object and later
    into globals().
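
    A small Python 3 sketch of that point (the decomposed spelling is an
    added example): a lookup string finds the identifier only in NFC form.

    import unicodedata
    name_nfd = 'Lo\u0308wis'  # decomposed spelling
    name_nfc = unicodedata.normalize('NFC', name_nfd)
    globals()[name_nfc] = 3
    print(name_nfd in globals())  # False: the forms differ codepoint by codepoint
    print(name_nfc in globals())  # True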

    So my thought is that re should take the strings as given, but that the
    re doc should warn about logically equal forms not matching. (Perhaps
    it does already; I have not read it in years.) If a text uses a
    different normalization form, which some surely will, the programmer is
    responsible for using the same in the re pattern.
    And I'm still not clear
    whether \w, when used on a string consisting of Lo followed by Mc,
    should match Lo and then Mc (one codepoint at a time) or together (one
    character at a time, where a character consists of some base character
    codepoint possibly followed by modifier codepoints).
    Programs that transform text to glyphs may have to read bundles of
    codepoints before starting to output, but my guess is that re should do
    the simplest thing and match codepoint by codepoint, assuming that is
    what it currently does. I gather that would just mean expanding the
    current definition of word char. But I would look at TR18 and see what
    Martin says.
    I ask because I'm working on the re module at the moment.
    Great. I *think* that the change should be fairly simple.

    Terry Jan Reedy
  • John Machin at Nov 29, 2008 at 10:41 pm

    On Nov 30, 4:33 am, Terry Reedy wrote:
    Martin v. Löwis wrote:
    To be fair to Python (and SRE),
    I was being unfair? In the context, "bug" == "needs to be changed";
    see below.
    SRE predates TR#18 (IIRC) - atleast
    annex C was added somewhere between revision 6 and 9, i.e. in early
    2004. Python's current definition of \w is a straight-forward extension
    of the historical \w definition (of Perl, I believe), which,
    unfortunately, fails to recognize some of the Unicode subtleties.
    I agree about not dumping on the past.
    Dumping on the past?? I used the term "bug" in the same sense as you
    did: "I suggest that OP (original poster) Shiao file a bug report at
    http://bugs.python.org".
    When unicode support was added
    to re, it was a somewhat experimental advance over bytes-only re. Now
    that Python has spread to south Asia as well as east Asia, it is time to
    advance it further. I think this is especially important for 3.0, which
    will attract such users with the option of native identifiers. Re
    should be able to recognize Python identifiers as words. I care not
    whether the patch is called a fix or an update.

    I have no personal need for this at the moment but it just happens that
    I studied Sanskrit a bit some years ago and understand the script and
    could explain why at least some 'marks' are really 'letters'. There are
    several other south Asian scripts descended from Devanagari, and
    included in Unicode, that use the same or similar vowel mark system. So
    updating Python's idea of a Unicode word will help users of several
    languages and make it more of a world language.

    I presume that not viewing letter marks as part of words would affect
    Hebrew and Arabic also.

    I wonder if the current rule also affects European words with accents
    written as separate marks instead of as part of combined characters.
    For instance, if Martin's last name is written 'L' 'o' 'diaeresis mark'
    'w' 'i' 's' (6 chars) instead of 'L' 'o with diaeresis' 'w' 'i' 's' (5
    chars), is it still recognized as a word? (I don't know how to do the
    input to do the test.)
    Like this:
    w1 = u"L\N{LATIN SMALL LETTER O WITH DIAERESIS}wis"
    w2 = u"Lo\N{COMBINING DIAERESIS}wis"
    w1
    u'L\xf6wis'
    w2
    u'Lo\u0308wis'
    import unicodedats as ucd
    ucd.category(u'\u0308')
    'Mn'
    u'\u0308'.isalpha()
    False
    regex = re.compile(ur'\w+', re.UNICODE)
    regex.match(w1).group(0)
    u'L\xf6wis'
    regex.match(w2).group(0)
    u'Lo'
  • Terry Reedy at Nov 29, 2008 at 11:13 pm
    John Machin wrote:

    John, nothing I wrote was directed at you. If you feel insulted, you
    have my apology. My intention was and is to get movement on an issue
    that was reported 20 months ago but which, because of a misunderstanding
    by the person who (I believe) rewrote re for unicode several years ago,
    has lain dead since, until it was re-reported (a bit more clearly) a
    week ago.
    Like this:
    w1 = u"L\N{LATIN SMALL LETTER O WITH DIAERESIS}wis"
    w2 = u"Lo\N{COMBINING DIAERESIS}wis"
    w1
    u'L\xf6wis'
    w2
    u'Lo\u0308wis'
    import unicodedats as ucd
    ucd.category(u'\u0308')
    'Mn'
    u'\u0308'.isalpha()
    False
    regex = re.compile(ur'\w+', re.UNICODE)
    regex.match(w1).group(0)
    u'L\xf6wis'
    regex.match(w2).group(0)
    u'Lo'
    Yes, thank you. FWIW, that confirms my suspicion.

    Terry
  • Martin v. Löwis at Nov 30, 2008 at 1:34 am

    John Machin wrote:
    On Nov 30, 4:33 am, Terry Reedy wrote:
    Martin v. Löwis wrote:
    To be fair to Python (and SRE),
    I was being unfair?
    No - sorry if I gave that impression.

    Regards,
    Martin
