On 4/24/07, demerphq wrote:
The problem is that the optimiser thinks that /\xDF/i under unicode is
really 'ss' and therefore that the minimum length string that can
match is 2. Which obviously causes problems matching a latin-1 \xDF
which is only one byte. Amusingly another bug in the regex engine
allows this to work out ok when the string is unicode. utf8 \xDF is
two bytes long, and the regex engine has some issues with the
distinction between "byte length" and "codepoint length", so it sees
the two bytes of the single codepoint as being sufficient length, and
then uses unicode folding to convert the strings \xDF to 'ss' and
everything works out. But this is a fluke; I'm positive that there are
other fold case scenarios where we can't rely on this bug saving the
day. If the fold case version was longer (in bytes) than the utf8
version of the original it would not work out. [...]
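
To make the length mismatch concrete, here is a minimal sketch that prints the sizes involved. The helper names encoded_length and fold_length are made up for illustration, and the fold data comes from Unicode::UCD's casefold() rather than from the regex engine itself. U+0390 is included as an assumed example of the "longer fold" case: its UTF-8 form is two bytes but its full fold in CaseFolding.txt is three codepoints.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::UCD qw(casefold);

# Hypothetical helper: bytes needed to store codepoint $cp in $encoding.
sub encoded_length {
    my ($encoding, $cp) = @_;
    return length encode($encoding, chr $cp);
}

# Hypothetical helper: number of codepoints in the full case fold of $cp
# (1 if the codepoint has no case-folding entry at all).
sub fold_length {
    my ($cp) = @_;
    my $fold = casefold($cp) or return 1;
    my @parts = split ' ', $fold->{full};
    return scalar @parts;
}

printf "U+00DF: latin-1 %d byte(s), utf8 %d byte(s), fold %d codepoint(s)\n",
    encoded_length('ISO-8859-1', 0xDF),
    encoded_length('UTF-8', 0xDF),
    fold_length(0xDF);

printf "U+0390: utf8 %d byte(s), fold %d codepoint(s)\n",
    encoded_length('UTF-8', 0x390),
    fold_length(0x390);
```

For \xDF the two utf8 bytes happen to equal the fold length of 2, which is the accident described above; the one-byte latin-1 form falls short of it.
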
At this point the only solution I can think of is to disable minlen
checks when a character is encountered that folds to a multi-character [...]

Well, I have a better solution, it looks like. I've created a new regop,
FOLDCHAR, that will be used to handle the three problematic codepoints
properly. This way the regex engine doesn't see them as normal text, and
therefore the optimiser can do the right thing and everything works
out properly.
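
A regression check for the behaviour this is meant to guarantee might look roughly like the following. It is only a sketch: it does not look at FOLDCHAR or the optimiser directly, it just confirms that \xDF matches itself case-insensitively whether the pattern and the target are stored as bytes or upgraded to utf8. The helper name sharp_s_matches is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: match "\xDF" against /\xDF/i, optionally upgrading
# the pattern and/or the target string to perl's internal utf8 form first.
sub sharp_s_matches {
    my ($upgrade_pattern, $upgrade_target) = @_;
    my $pattern = "\xDF";
    my $target  = "\xDF";
    utf8::upgrade($pattern) if $upgrade_pattern;
    utf8::upgrade($target)  if $upgrade_target;
    return $target =~ /$pattern/i ? 1 : 0;
}

# Try all four combinations of byte-string and utf8 pattern/target.
for my $p (0, 1) {
    for my $t (0, 1) {
        printf "pattern %s, target %s: %s\n",
            ($p ? 'utf8' : 'bytes'),
            ($t ? 'utf8' : 'bytes'),
            (sharp_s_matches($p, $t) ? 'match' : 'NO MATCH');
    }
}
```

On a perl with the minlen problem fixed, all four combinations should report a match.
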

Sigh, so much trouble for one character. (The other two are just bonus material)

It's actually possible to detect codepoints that will have this problem,
so it's probably smart to put something in mktables that will detect
and warn if any new ones come up. Or we can just do it by hand when
updating the unicode data files.
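
As a rough sketch of what such a check could look like outside mktables, the following scans a codepoint range with Unicode::UCD's casefold() and reports anything whose full fold is more than one codepoint. The criterion (fold longer than one codepoint) is an assumption on my part, since the exact test used to pick the three codepoints is not spelled out here, and multi_char_folds is a made-up helper name.

```perl
use strict;
use warnings;
use Unicode::UCD qw(casefold);

# Hypothetical helper: return [codepoint, fold-length] pairs for every
# codepoint in $from..$to whose full case fold is longer than one codepoint.
sub multi_char_folds {
    my ($from, $to) = @_;
    my @found;
    for my $cp ($from .. $to) {
        my $fold = casefold($cp) or next;      # no fold entry: skip
        my @parts = split ' ', $fold->{full};  # e.g. "0073 0073"
        push @found, [$cp, scalar @parts] if @parts > 1;
    }
    return @found;
}

# Small range to keep the demo quick; widen to 0x10FFFF for a full scan.
printf "U+%04X folds to %d codepoints\n", @$_
    for multi_char_folds(0, 0x03FF);
```

A warning hook would then compare this list against the set of codepoints the regex engine already knows how to handle and complain about anything new.
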

Patch is attached.


perl -Mre=debug -e "/just|another|perl|hacker/"

group: perl5-porters
posted: Apr 24, '07 at 9:38a
active: Apr 28, '07 at 10:17a


