Hi all,

As all the regular readers of this list know, perl uses two string
encodings internally. One is essentially latin1, and the other is
utf8-encoded unicode.

Unicode has an interesting feature that some people might not be
familiar with: the way it stipulates that an implementer handle
case-insensitive matches includes the requirement that in some
situations a single char can match several chars. The typical example
of this is \xDF, aka LATIN SMALL LETTER SHARP S, which when case folded
ends up as 'ss'.
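
For anyone who wants to see the fold for themselves, here is a minimal
sketch. It assumes a perl with the fc() builtin (5.16 or later); the
second codepoint, U+FB00 LATIN SMALL LIGATURE FF, is just one more
example I picked for illustration.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # fc needs perl 5.16 or later

# Print the full case fold of two codepoints whose fold is longer
# than a single character.
for my $char ("\xDF", "\x{FB00}") {
    my $folded = fc($char);
    printf "U+%04X folds to '%s' (%d characters)\n",
        ord($char), $folded, length($folded);
}
# U+00DF folds to 'ss' (2 characters)
# U+FB00 folds to 'ff' (2 characters)
```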

The two encodings have different semantics. Under latin1, \xDF case
insensitively matches only \xDF. Under utf8/unicode it also matches
'ss', as I already said.

Overall this isn't a problem. If the pattern or string is utf8/unicode
then perl uses unicode semantics and things work out pretty much as
one might expect (given one knows about the two encodings and their
differing semantics).

Now it turns out that there is a bug in the regex engine optimiser
related to \xDF in that the behaviour of

"\xDF"=~/\xDF/i

and

"ss"=~/\xDF/i

is not very predictable. Depending on whether the pattern or the
string is utf8, the pattern will match differently. One would assume
that unicode semantics would be obeyed when either the string or the
pattern was unicode, and that latin1 semantics (for lack of a better
term) would be followed only when neither was unicode.

Thus it would seem reasonable to expect that "ss" matches \xDF case
insensitively only when one or the other or both are unicode, and
that \xDF would match \xDF insensitively always. Except it doesn't. The
problem turns out to be minlen checking, and would apparently affect
ALL case-insensitive unicode matches where the fold-case version of a
codepoint is a multi-codepoint sequence.
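
A rough sketch of how one could check all four combinations of string
and pattern encoding. The helper names encoded() and match_matrix()
are made up for this example, and I am deliberately not listing
expected results, since what you see depends on the perl you run it
with.

```perl
use strict;
use warnings;

# Return a copy of $str in the requested internal encoding.
sub encoded {
    my ($str, $want_utf8) = @_;
    if ($want_utf8) { utf8::upgrade($str) } else { utf8::downgrade($str) }
    return $str;
}

# Try every combination of (string is utf8, pattern is utf8) and
# record whether the case-insensitive match succeeded.
sub match_matrix {
    my ($string, $pattern) = @_;
    my @rows;
    for my $str_utf8 (0, 1) {
        for my $pat_utf8 (0, 1) {
            my $s  = encoded($string,  $str_utf8);
            my $p  = encoded($pattern, $pat_utf8);
            my $ok = $s =~ /$p/i ? 1 : 0;
            push @rows, [$str_utf8, $pat_utf8, $ok];
        }
    }
    return @rows;
}

for my $string ("\xDF", "ss") {
    my $label = $string eq "ss" ? "ss" : '\xDF';
    for my $row (match_matrix($string, "\xDF")) {
        printf "string=%-4s str_utf8=%d pat_utf8=%d matched=%d\n",
            $label, @$row;
    }
}
```

If the expectation above held, every \xDF row would report a match,
and the "ss" rows would match everywhere except where neither side is
utf8.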

The problem is that the optimiser thinks that /\xDF/i under unicode is
really 'ss' and therefore that the minimum length of a string that can
match is 2. That obviously causes problems matching a latin1 \xDF,
which is only one byte. Amusingly, another bug in the regex engine
allows this to work out ok when the string is unicode. utf8 \xDF is
two bytes long, and the regex engine has some issues with the
distinction between "byte length" and "codepoint length", so it sees
the two bytes of the single codepoint as being sufficient length, and
then uses unicode folding to convert the string's \xDF to 'ss' and
everything works out. But this is a fluke. I'm positive that there are
other fold case scenarios where we can't rely on this bug saving the
day. If the fold case version was longer (in bytes) than the utf8
version of the original it would not work out.
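
To make the byte length versus codepoint length point concrete, here
is a small sketch comparing the two for a codepoint and its fold. The
helpers utf8_byte_length() and fold_lengths() are hypothetical names,
fc() needs perl 5.16 or later, and U+0390 (GREEK SMALL LETTER IOTA
WITH DIALYTIKA AND TONOS) is my own pick of a codepoint whose fold is
longer in bytes than the original.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # fc needs perl 5.16 or later

# Number of bytes the string occupies once encoded as UTF-8.
sub utf8_byte_length {
    my ($str) = @_;          # a copy, since utf8::encode works in place
    utf8::encode($str);
    return length $str;
}

# Character and UTF-8 byte lengths of a codepoint and of its full fold.
sub fold_lengths {
    my ($char) = @_;
    my $folded = fc($char);
    return (length($char),   utf8_byte_length($char),
            length($folded), utf8_byte_length($folded));
}

for my $char ("\xDF", "\x{390}") {
    printf "U+%04X: %d char, %d bytes; folded: %d chars, %d bytes\n",
        ord($char), fold_lengths($char);
}
# U+00DF: 1 char, 2 bytes; folded: 2 chars, 2 bytes
# U+0390: 1 char, 2 bytes; folded: 3 chars, 6 bytes
```

For \xDF the utf8 byte count happens to equal the fold's character
count, which is the coincidence described above. For U+0390 it does
not.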

This probably doesn't show up on too many people's radars, as most
times you would be matching against a string that is quite a bit
longer than the pattern. But for cases like the above there is
definitely a bug.

At this point the only solution I can think of is to disable minlen
checks when a character is encountered that folds to a multi-character
string.

That's a pretty big hammer for such a case, but it's about the best I
can think of.
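
Very roughly, the idea looks like this. To be clear, this is a toy
model in Perl of the rule and not the C code in the optimiser, and
minlen_for_literal() is a made-up name. It assumes fc() is available
(perl 5.16 or later) and treats a minlen of 0 as "do no length check".

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # fc needs perl 5.16 or later

# Toy model: minimum target length for a case-insensitive literal.
# Returns 0 (no minlen check) as soon as any character has a
# multi-character fold, otherwise the literal's length in characters.
sub minlen_for_literal {
    my ($literal) = @_;
    for my $char (split //, $literal) {
        return 0 if length(fc($char)) > 1;
    }
    return length($literal);
}

printf "%-8s minlen=%d\n", $_->[0], minlen_for_literal($_->[1])
    for ['abc', "abc"], ['\xDF', "\xDF"], ['a\xDFb', "a\xDFb"];
```

Only patterns containing such a character lose the optimisation.
Everything else keeps its normal minlen.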

Other ideas anyone?

cheers,
Yves

PS: Actually, I have to say the minlen/startclass optimisations are
pretty crufty and are clearly not properly unicode aware. There is a
serious need to completely rewrite study_chunk(), probably as several
routines, so that sanity can be restored. But that's a big project, one
that would probably be sufficiently large that it would need to be
funded by TPF, assuming somebody had time to do it at all.

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Discussion Overview
group: perl5-porters
categories: perl
posted: Apr 24, '07 at 9:38a
active: Apr 28, '07 at 10:17a
posts: 16
users: 5
website: perl.org