demerphq wrote 2009-12-10 14:11 (+0100):
> See regexec.c and regcomp.c for the source of our mutual confusion.

Unfortunately I don't speak C. If I understood Perl's source, I would
probably have been much more specific when suggesting fixes, from the
beginning.

> > The first, ASCII-only, would be a mistake.
> No it wouldnt. There are no "unicode semantics" for POSIX.

That would be relevant if Perl had POSIX character classes.
However, the question of whether [[:xxx:]] is POSIX-like syntax, or an
actual POSIX character class, remains unanswered or at least unclear.
Certainly Perl's documentation isn't fully definitive. perlre mentions:
    1 => "POSIX character class syntax"
    2 => "POSIX character classes"
    3 => Equivalences to \p{} Unicode constructs
1 and 3 can both be true, but then 2 is not. This is how I (prefer to)
think of it.
However, it could also be that 1 and 2 are true, ruling 3 out. If I
understand correctly, that's how you see the matter.
The mere existence of exceptions to the POSIX standard in [:xxx:], and
the exclusion of [.xxx.] and [=xxx=] lead me to believe that it's just
syntax compatibility, and Perl is free to extend the class definitions
to meet more modern requirements, like acknowledging that é is indeed
alphanumeric. Even if the people who invented the original POSIX bracket
expressions failed to notice.

> Try matching all the legal codepoints against [^POSIX] and against [POSIX]
> And note all the cases where you have both matching. Then do it with
> the strings in unicode. Note all the errors.

I wish you had spent the same time trying to explain what happens if you
do try this. Would have saved me some time and failure, because I was
unable to reproduce the errors.

    perl -CO -le'(chr($_) =~ /[[:alnum:]]/) and (chr($_) =~ /[^[:alnum:]]/)
    and warn sprintf "U+%04x (%s)\n", $_, chr for 1..65000'

doesn't give me anything. It is likely, however, that I misinterpreted
your instructions. (Note: I have no idea which codepoints qualify as
legal for this purpose, so I used the arbitrary limit of 65000.)
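
Another possible reading of those instructions is to test each character
twice: once as a native byte string and once as a UTF-8 upgraded string.
chr() only yields a non-UTF-8 string for code points below 256, so that is
the only range where the two forms can differ. The following is a minimal
sketch under that assumption, not a confirmed reproduction of the test that
was meant; the helper names probe and compare are made up for illustration.

```perl
use strict;
use warnings;

# Match one string against the class and against its negation.
# Returns [class_matched, negated_class_matched] as 0/1 flags.
sub probe {
    my ($s) = @_;
    return [ ($s =~ /[[:alnum:]]/)  ? 1 : 0,
             ($s =~ /[^[:alnum:]]/) ? 1 : 0 ];
}

# Probe a code point as a native string and as a UTF-8 upgraded string.
sub compare {
    my ($cp) = @_;
    my $native   = chr $cp;
    my $upgraded = chr $cp;
    utf8::upgrade($upgraded);    # same characters, different internal encoding
    return (probe($native), probe($upgraded));
}

# Report code points where the two forms disagree, or where the class
# and its negation give the same answer for a single character.
for my $cp (1 .. 255) {
    my ($n, $u) = compare($cp);
    printf "U+%04X native=%d/%d upgraded=%d/%d\n", $cp, @$n, @$u
        if "@$n" ne "@$u" or $n->[0] == $n->[1] or $u->[0] == $u->[1];
}
```

Each reported line shows class/negation results for both forms; which code
points show up, if any, depends on the Perl version and the pragmas in
effect.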

> For me this debate is over, POSIX charclasses are not Unicode
> charclasses and any contortion to try to make them so is futile and
> doomed to screw stuff over.

The futile, doomed-to-screw-stuff-over attempt has been ongoing for
almost a decade. You suggest going back, I suggest going forward.
Unfortunately I don't understand the points you're making, except the
one about POSIX simply not having any notion of unicode. I'm okay with
a change that makes Perl's [:x:] charclasses fully POSIX compliant, but
then it needs to be done rigorously, and all Perl exceptions have to be
eradicated. This should then not be seen as a fix of any unicode bug,
but as a design/semantics change.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>