On Tue, Apr 24, 2007 at 11:37:52AM +0200, demerphq wrote:

> The problem is that the optimiser thinks that /\xDF/i under unicode is
> really 'ss' and therefore that the minimum length string that can
> match is 2. Which obviously causes problems matching a latin-1 \xDF
> which is only one byte. Amusingly, another bug in the regex engine
> allows this to work out OK when the string is unicode: UTF-8 \xDF is
> two bytes long, and the regex engine has some issues with the
> distinction between "byte length" and "codepoint length", so it sees
> the two bytes of the single codepoint as being of sufficient length, and
> then uses unicode folding to convert the string \xDF to 'ss' and
> everything works out. But this is a fluke; I'm positive that there are
> other fold-case scenarios where we can't rely on this bug saving the
> day. If the fold-case version were longer (in bytes) than the UTF-8
> version of the original, it would not work out.
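The mismatch is easy to reproduce outside the regex engine. Here is a minimal Python illustration (Python only because str.casefold exposes Unicode full case folding directly) of the three different "lengths" in play:

```python
# U+00DF LATIN SMALL LETTER SHARP S: one codepoint, but its full
# Unicode case fold is the two-character string 'ss', and its UTF-8
# encoding happens to be two bytes -- the coincidence described above.
sharp_s = '\xdf'

assert len(sharp_s) == 1                    # one codepoint
assert sharp_s.casefold() == 'ss'           # folds to two characters
assert len(sharp_s.encode('utf-8')) == 2    # two bytes as UTF-8
```

A minimum length computed from the folded form (2) is wrong for the one-codepoint latin-1 string, but accidentally agrees with the two-byte UTF-8 length.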
I encountered the same problem with kurila, in a somewhat worse form, because
there all lengths are in bytes. Thus even a single character can
fold to something longer. And indeed there are many places where the
minlen is used, and the minlen has to be correct, otherwise you get very
unpredictable results.
> This probably doesn't show up on too many people's radars, as most times
> you would be matching against a string that is quite a bit longer than
> the pattern. But for cases like the above there is definitely a bug.

> At this point the only solution I can think of is to disable minlen
> checks when a character is encountered that folds to a multi-character
> sequence.
You also have to include the characters which are in the same folding
class/group (can't remember the terminology) as the multi-character fold,
i.e. include 's' and 'S', because "\xDF" =~ /ss/i should also match.
In kurila I didn't bother with that and just always disabled minlen
checking when case folding is in effect.
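The characters that force this, and the folding group around "\xDF", can be enumerated from the Unicode fold data; a quick Python sketch (scanning only up to U+2000 for brevity):

```python
# Characters whose full case fold is longer than one character; these
# are exactly the ones for which a fold-aware minimum length must shrink.
multi_fold = [c for c in map(chr, range(0x2000)) if len(c.casefold()) > 1]
assert '\xdf' in multi_fold          # sharp s folds to 'ss'

# The folding group around "\xDF": all of these case-fold to 'ss',
# so /ss/i must match "\xDF" and vice versa.
assert {'\xdf'.casefold(), 'SS'.casefold(), 'sS'.casefold()} == {'ss'}
```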

Also, IIRC there are some fixed-length assumptions about EXACTF (sorry,
can't remember where), so just decreasing minlen might not work and
could break more things than it solves.
> That's a pretty big hammer for such a case, but it's about the best I
> can think of.

> Other ideas anyone?
If we want to get performance out of the engine, I think the most interesting
approach would be to "unfold" the string at compile time, something like:

/e/i -> /(e|E)/
/foo/i -> /(f|F)(o|O)(o|O)/
/ss/i -> /(((s|S)(s|S))|\xDF)/
/\xDF/i -> /ss/i -> /(((s|S)(s|S))|\xDF)/

Then let the trie, study, etc. optimisations do all the work.
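A sketch of that unfolding in Python (a hypothetical helper, not perl's implementation: it only takes simple upper/lower variants per character, plus whole-string multi-character folds found by scanning casefold, so it misses exotic single-character foldings like U+212A KELVIN SIGN):

```python
import re

def unfold(literal):
    """Expand a case-insensitive literal into an explicit alternation."""
    parts = []
    for ch in literal:
        # simple per-character variants, e.g. 's' -> (S|s)
        variants = sorted({ch, ch.lower(), ch.upper()})
        parts.append('(' + '|'.join(map(re.escape, variants)) + ')')
    core = ''.join(parts)

    # characters whose multi-character fold equals the whole literal's
    # fold, e.g. "\xDF" for 'ss' (and, in modern Unicode, also
    # U+1E9E LATIN CAPITAL LETTER SHARP S)
    folded = literal.casefold()
    extras = [re.escape(c) for c in map(chr, range(0x2000))
              if len(c.casefold()) > 1 and c.casefold() == folded]
    if extras:
        return '(?:' + core + '|' + '|'.join(extras) + ')'
    return core
```

With this, re.fullmatch(unfold('ss'), '\xdf') succeeds without any /i flag, and the resulting alternation is exactly the kind of structure a trie optimisation can chew on.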
> ps: Actually I have to say the minlen/startclass optimisations are
> pretty crufty and are clearly not properly unicode-aware. There is a
> serious need to completely rewrite study_chunk(), probably as several
> routines, so that sanity can be restored. But that's a big project, one
> that would probably be sufficiently large that it would need to be
> funded by TPF, assuming somebody had time to do it at all.

perl5-porters: posted Apr 24, '07 at 9:38a; last activity Apr 28, '07 at 10:17a


