Martin v. Löwis wrote:
To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least
annex C was added somewhere between revision 6 and 9, i.e. in early
2004. Python's current definition of \w is a straightforward extension
of the historical \w definition (of Perl, I believe), which,
unfortunately, fails to recognize some of the Unicode subtleties.
I agree about not dumping on the past. When Unicode support was added
to re, it was a somewhat experimental advance over bytes-only re. Now
that Python has spread to south Asia as well as east Asia, it is time to
advance it further. I think this is especially important for 3.0, which
will attract such users with the option of native identifiers. Re
should be able to recognize Python identifiers as words. I care not
whether the patch is called a fix or an update.
I have no personal need for this at the moment but it just happens that
I studied Sanskrit a bit some years ago and understand the script and
could explain why at least some 'marks' are really 'letters'. There are
several other south Asian scripts descended from Devanagari, and
included in Unicode, that use the same or similar vowel mark system. So
updating Python's idea of a Unicode word will help users of several
languages and make it more of a world language.
I presume that not viewing letter marks as part of words would affect
Hebrew and Arabic also.
I wonder if the current rule also affects European words with accents
written as separate marks instead of as part of combined characters.
For instance, if Martin's last name is written 'L' 'o' 'diaeresis mark'
'w' 'i' 's' (6 chars) instead of 'L' 'o with diaeresis' 'w' 'i' 's' (5
chars), is it still recognized as a word? (I don't know how to enter
the input needed to run the test.)
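One way to build that test input without a special keyboard is to use
escape sequences. The following is only a sketch: the variable names are
made up for illustration, and what the last findall() call returns
depends on how the installed re module treats combining marks, so no
result is claimed for it here.

```python
import re
import unicodedata

# The same name in two spellings (variable names are illustrative).
composed = "L\u00f6wis"      # 'o with diaeresis' as one code point (5 chars)
decomposed = "Lo\u0308wis"   # 'o' followed by U+0308 COMBINING DIAERESIS (6 chars)

print(len(composed), len(decomposed))

# U+0308 is a nonspacing mark (general category Mn), not a letter.
print(unicodedata.category("\u0308"))

# Compare what \w+ finds in each spelling; a split result for the
# decomposed form would mean the mark is not treated as a word character.
print(re.findall(r"\w+", composed))
print(re.findall(r"\w+", decomposed))

# NFC normalization folds the 6-char spelling into the 5-char one.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

If the two findall() results differ, normalizing to NFC before matching
would be one workaround for this particular word.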
I notice from the manual "All identifiers are converted into the normal
form NFC while parsing; comparison of identifiers is based on NFC." If
NFC uses accented letters, then the issue is finessed away for European
words simply because Unicode includes combined characters for
European scripts but not for south Asian scripts.
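The difference can be seen with unicodedata. This is a small sketch, not
a statement about how the parser itself works; describe() is a
hypothetical helper, and the Devanagari syllable 'ki' (KA plus VOWEL
SIGN I) is just one convenient example.

```python
import unicodedata

def describe(text):
    """Return (code point, general category) pairs for text after NFC."""
    nfc = unicodedata.normalize("NFC", text)
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in nfc]

# 'o' + COMBINING DIAERESIS: NFC has a precomposed letter to fall back on.
european = describe("o\u0308")

# DEVANAGARI LETTER KA + VOWEL SIGN I: there is no precomposed form,
# so the vowel sign stays a separate mark after NFC.
devanagari = describe("\u0915\u093f")

print(european)     # [('U+00F6', 'Ll')]
print(devanagari)   # [('U+0915', 'Lo'), ('U+093F', 'Mc')]
```

So after NFC the European example is a single letter, while the
Devanagari syllable still contains a character whose category is a mark
(Mc) rather than a letter.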
Terry Jan Reedy