Grokbase

Juerd Waalboer (j...@convolution.nl)

User Information

Display Name: Juerd Waalboer
Partial Email Address: j...@convolution.nl
Posts:
118 total
84 in Perl 5 Porters
34 in qpsmtpd

5 Most Recent

1) Juerd Waalboer Re: Should Unicode semantics be the default for Latin1 characters in 5.12?
Perl 5 Porters
karl williamson skribis 2009-12-13 13:06 (-0700):
> I'm inclined to think not.

I'm very strongly inclined to think they should.

> As noted before, several CPAN modules that are in blead failed with
> this change.

Then let's work together to fix these modules in both 5.8 and 5.12.
Apparently a mere "use legacy" does this.

> Gerard has said that Kurila experienced these same module
> failures

Kurila has a WILDLY different string model.

> 2) I will submit a patch that just flips the default. People will for
> the first time not have to do a utf8::upgrade or a Unicode::Semantics
> all over the place to get the new effect. They can just do a 'no
> legacy' at the beginning of their program to get the effect, except,
> unfortunately, for modules outside their control.
> 3) We announce in perldelta, perhaps other places, that the plan is to
> flip the state in 5.14.
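
The utf8::upgrade workaround mentioned in the quoted text can be
illustrated with a minimal sketch. The helper name upper_of is made up
for this example, and the "no feature" line is only there to pin the
legacy default on Perls that know the unicode_strings feature (5.12 and
later); neither comes from the thread. On an ASCII platform, the native
string is expected to stay at U+00E9 while the upgraded one becomes
U+00C9.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: pin the legacy default (Perl 5.12+)

# Hypothetical helper: uppercase a single code point, optionally
# upgrading the string to Perl's internal UTF-8 representation first.
sub upper_of {
    my ($codepoint, $upgrade) = @_;
    my $s = chr $codepoint;
    utf8::upgrade($s) if $upgrade;   # changes the encoding, not the value
    return ord uc $s;
}

printf "native:   U+%04X\n", upper_of(0xE9, 0);
printf "upgraded: U+%04X\n", upper_of(0xE9, 1);
```

The two strings compare equal with eq, yet uc treats them differently,
which is the behaviour the proposed default flip is meant to remove.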

Will this automatically fix the CPAN modules, somehow?

I don't see why the change should be postponed by another few years.
Perl 5.10 will be around for everyone who's really dependent on the old
behaviour as the default, for a very long time, just as plenty of
businesses are still using 5.6 or even 5.005. Postponing the inevitable
just to buy people a little more time is a bad idea. A new non-bugfix
release is the perfect opportunity to introduce the change.

Besides, flipping a default essentially means that lots of code without
"use/no legacy" will break just as it will now.

If an in-between version is needed, then I think it would be better to
put that in the 5.10 series! Don't delay the progress of the whole of
Perl, because of one little issue that affects only people who haven't
been paying attention for years (even if that is, perhaps, the larger
part of the user base).
--
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
2) Juerd Waalboer Re: POSIX-like syntax or full compliancy? (Was: PATCH: partial [perl #58182] ...)
Perl 5 Porters
karl williamson skribis 2009-12-10 22:29 (-0700):
> It was our intention that 5.12 would use strict Posix definitions
> rigorously for all these,

In that case I'm entirely fine with the change, provided of course that
perldelta documents the change as such.

> except the perl made-up extension, [[:Word:]], which has no Posix
> definition.

Changing the bracket expressions to strict POSIX semantics is an
incompatible change. Why keep [:word:]? Not that I really mind, but
strict interpretation usually doesn't come with exceptions.
3) Juerd Waalboer POSIX-like syntax or full compliancy? (Was: PATCH: partial [perl #58182] ...)
Perl 5 Porters
demerphq skribis 2009-12-10 14:11 (+0100):
> See regexec.c and regcomp.c for the source of our mutual confusion.

Unfortunately I don't speak C. If I understood Perl's source, I would
probably have been much more specific when suggesting fixes, from the
beginning.

> > The first, ASCII-only, would be a mistake.
> No it wouldn't. There are no "unicode semantics" for POSIX.

That would be relevant if Perl had POSIX character classes.

However, the question of whether [[:xxx:]] is POSIX-like syntax, or an
actual POSIX character class, remains unanswered or at least unclear.

Certainly Perl's documentation isn't fully definitive. perlre mentions:

1 => "POSIX character class syntax"
2 => "POSIX character classes"
3 => Equivalences to \p{} Unicode constructs

1 and 3 can both be true, but then 2 is not. This is how I (prefer to)
think of it.

However, it could also be that 1 and 2 are true, ruling 3 out. If I
understand correctly, that's how you see the matter.

The mere existence of exceptions to the POSIX standard in [:xxx:], and
the exclusion of [.xxx.] and [=xxx=] lead me to believe that it's just
syntax compatibility, and Perl is free to extend the class definitions
to meet more modern requirements, like acknowledging that é is indeed
alphanumeric. Even if the people who invented the original POSIX bracket
expressions failed to notice.

> Try matching all the legal codepoints against [^POSIX] and against [POSIX]
> And note all the cases where you have both matching. Then do it with
> the strings in unicode. Note all the errors.

I wish you had spent the same time trying to explain what happens if you
do try this. Would have saved me some time and failure, because I was
unable to reproduce the errors.

perl -CO -le'(chr($_) =~ /[[:alnum:]]/) and (chr($_) =~ /[^[:alnum:]]/)
and warn sprintf "U+%04x (%s)\n", $_, chr for 1..65000'

doesn't give me anything. It is likely, however, that I misinterpreted
your instructions. (Note: I have no idea which codepoints qualify as
legal for this purpose, so I used the arbitrary limit of 65000.)
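
For readers who want to repeat the check, the one-liner can be written
out as a script. This is only a sketch of one reading of the quoted
instructions: the sub name overlaps and the choice to test each code
point both as a native and as an upgraded string are assumptions, not
something the thread specifies. Whether it reports anything will depend
on the Perl version in use.

```perl
use strict;
use warnings;

# Hypothetical helper: list [code point, upgraded?] pairs for which a
# one-character string matches both [[:alnum:]] and its negation.
sub overlaps {
    my ($first, $last) = @_;
    my @hits;
    for my $cp ($first .. $last) {
        for my $upgrade (0, 1) {
            my $s = chr $cp;
            utf8::upgrade($s) if $upgrade;
            push @hits, [$cp, $upgrade]
                if $s =~ /[[:alnum:]]/ && $s =~ /[^[:alnum:]]/;
        }
    }
    return @hits;
}

# The range is arbitrary, like the 65000 limit in the one-liner.
printf "U+%04X (%s)\n", $_->[0], $_->[1] ? 'upgraded' : 'native'
    for overlaps(1, 0x2FF);
```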

> For me this debate is over, POSIX charclasses are not Unicode
> charclasses and any contortion to try to make them so is futile and
> doomed to screw stuff over.

The futile, doomed to screw stuff over attempt has been ongoing for
almost a decade. You suggest going back, I suggest going forward.
Unfortunately I don't understand the points you're making, except the
one about POSIX simply not having any notion of unicode. I'm okay with
a change that makes Perl's [:x:] charclasses fully POSIX compliant, but
then it needs to be done rigorously, and all Perl exceptions have to be
eradicated. This should then not be seen as a fix of any unicode bug,
but as a design/semantics change.
4) Juerd Waalboer Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Perl 5 Porters
demerphq skribis 2009-12-10 13:23 (+0100):
> And, [[:word:]] is spelled [[:alnum:]].

juerd@lanova:~$ perl -le'print "foo" =~ /[[:word:]]/'
1

See perlre

> You cannot have both the current behaviour and non buggy implementation.

Fully agreed. That's certainly not what I'm after, either.

> Simply put I consider that:
> [^STUFF] matching the same code points as [STUFF] to be an irrefutable
> and overwhelming reason why the current behavior of POSIX charclass
> cannot be preserved.

What exactly do you mean by "current behaviour"?

To fix the issue that codepoints 128..255 are included depending on
internal encoding, there are two options:

- Ignore anything above 127
- Provide full unicode semantics.

The first, ASCII-only, would be a mistake.
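
The dependence on internal encoding described above can be made
visible with a small sketch. The helper name matches_alnum is invented
for this example, and the "no feature" line is an assumption added to
keep the legacy default in force on Perl 5.12 and later. On an ASCII
platform with no locale in effect, U+00E9 is expected to match only
after the string has been upgraded.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # assumption: pin the legacy default (Perl 5.12+)

# Hypothetical helper: does a one-character string match [[:alnum:]],
# with or without being upgraded to the internal UTF-8 representation?
sub matches_alnum {
    my ($codepoint, $upgrade) = @_;
    my $s = chr $codepoint;
    utf8::upgrade($s) if $upgrade;
    return $s =~ /[[:alnum:]]/ ? 1 : 0;
}

for my $cp (0x41, 0xE9) {
    printf "U+%04X native=%d upgraded=%d\n",
        $cp, matches_alnum($cp, 0), matches_alnum($cp, 1);
}
```

The ASCII-only option would make both columns 0 for U+00E9; full
Unicode semantics would make both 1.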

Perhaps there is other current behaviour that I am not aware of.
5) Juerd Waalboer Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Perl 5 Porters
karl williamson skribis 2009-12-09 12:11 (-0700):
> Since Yves is incommunicado, I took what he had done before Larry's veto
> and extended and modified it, adding an intermediate way. What that
> means is that anything that looks like [[:xxx:]] will match only in the
> ASCII range, or in the current locale, if set. I never heard any
> controversy about that part of the proposal, and it makes sense to me
> that a Posix construct should act like the Posix definition says to.

These "posix" constructs have for a long time been documented as
*equivalent* to \d, \s and \w, with two remarks: [[:space:]] also
includes \cK and [[:word:]] doesn't even exist in POSIX.
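
The documented equivalence can be checked for the ASCII range with a
short sketch. The helper name class_diff is made up for this example.
Note that the \cK remark is version dependent: since Perl 5.18, \s also
matches the vertical tab, so on newer Perls the [[:space:]] and \s pair
should show no difference either.

```perl
use strict;
use warnings;

# Hypothetical helper: return the code points in 0..127 on which two
# compiled patterns disagree.
sub class_diff {
    my ($re_a, $re_b) = @_;
    return grep {
        my $c = chr;
        ($c =~ $re_a ? 1 : 0) != ($c =~ $re_b ? 1 : 0)
    } 0 .. 127;
}

my @pairs = (
    [ '[[:digit:]] vs \d', qr/[[:digit:]]/, qr/\d/ ],
    [ '[[:space:]] vs \s', qr/[[:space:]]/, qr/\s/ ],
    [ '[[:word:]] vs \w',  qr/[[:word:]]/,  qr/\w/ ],
);

for my $pair (@pairs) {
    my ($label, $re_a, $re_b) = @$pair;
    my @diff = map { sprintf 'U+%04X', $_ } class_diff($re_a, $re_b);
    printf "%s: %s\n", $label, @diff ? "@diff" : 'no difference';
}
```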

Changing them is as bad as changing the metacharacters. Changing them to
break the equivalency might even be worse.

Also, note that perlre calls this "POSIX character class **syntax**"
(emphasis mine).

An even stronger argument is that perlre defines equivalence with
\p{...}, and explicitly mentions that these are Unicode constructs.
