On 11 November 2010 20:57, karl williamson wrote:
On 10 November 2010 06:24, karl williamson <email@example.com>
See the commit message in the attached patch.
Note that the lines above this are clearly wrong, as LOC may not be set
when the first node should match locale. It also means that a synthetic
start node could match with all of: locale, with default semantics, with
unicode semantics, and (soon) in just ascii; instead of these being
exclusive as in any other node. I need to think about how that can
Ilya is on record as saying that the code involved here is probably
buggy and could do with being reworked.
Actually this is a bug I introduced by not understanding the implications of
adding the /l modifier.
It is mind-boggling to me to realize that given that change and the future
ones for /u and /a, that a bit in the start class ANYOF bitmap could match
two or three different code points with three different folds.
What you say there rings alarm bells for me. "a bit in the start class
ANYOF bitmap could match two or three different code points with three
different folds" seems to me to suggest that you would not store
whatever codepoints could match that is both the folded and unfolder,
but instead only the original. That doesnt make sense to me.
The ANYOF bitmap can be "overly large", it doesnt have to be correct
in the sense that false positives are generally speaking acceptable,
as they will be rejected later. The intention here is to reduce the
number of times the main engine is used, so false-positives arent
really a big deal, they would just mean the optimization didn't help
us for one or two positions, and we had to pay a cost we would pay
So one bit in the bitmap should correspond to one char. There should
not be any folding during the processing of an ANYOF start-class, it
should be sufficient to do character by character lookup in the bitmap
for codepoints 0-255.
To illustrate, consider /(?l:\xC0)|(?u:\xC0)|(?a:\xC0)(?d:\xC0)/i.
If the optimizer created a start node for this, the bit for C0 would have
the following meanings:
1) whatever it does in the current locale. In an 8859-7 locale, it would be
GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS. I don't know what it
folds to in that locale, but it's not E0.
Lets forget the locale stuff. It should be deferred to run time, so
its a whole different kettle of fish.
2) LATIN CAPITAL LETTER A WITH GRAVE, which folds to E0
Ok, so we turn on bit C0 and E0 in the bit map.
3) Just C0, another nameless non-ASCII character, and it folds to itself.
So the bitmap would contain C0 and E0. Whats the problem here again? :-)
Also, there is another point here, currently we dont do startclass
optimisations for folded data. So why are you worrying about this?
I have long thought that if we changed the parsing process so the
first pass built a true parsetree of the regex and then we would do an
optimisation pass, and then a final transcribing pass then we would
overall have much cleaner code, and would have much much more options
in terms of optimization, and I think we would find it would make
proving some of our logic correct much easier.
Sounds good. When do you start? :)
perl -Mre=debug -e "/just|another|perl|hacker/"