Grokbase
x

Re: PATCH [perl #56444] delayed interpolation of \N{...} charnames escapes in regexes in perl 5.9.x and later causes breakage

View TopicPrint | Flat  Thread  Threaded
1) Dave Mitchell Thanks for this! I've had a look through and it seems generally ok. (Actually my eyes glazed over...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
On Sat, Feb 06, 2010 at 04:30:50PM -0700, karl williamson wrote:
> This patch should be carefully reviewed because it is to critical
> components of Perl, and I certainly am not an expert in the tokenizer,
> and also may not understand some underlying philosophies.

Thanks for this!

I've had a look through and it seems generally ok. (Actually my eyes
glazed over after a while, especially in the toke.c bits, so if someone
else wants to inspect this too, feel free!)

A few comments:

> but I decided to not expose it externally for
> now in case we find reason to change it.

At a quick glance, I think you mean that its exposed to the regex
compiler, but not to double-quoted strings in the lexer; i.e.

this is illegal:

    $x = "\N{U+11.22.33.44}"

but this works:

    $re = '\N{U+11.22.33.44}'
    /$re/;

So it's partially exposed (but undocumented) in 5.12?

Can it cope with a charhandler that returns a zero-length string?

regcurly should be defined as Perl_regcurly in regcomp.c; then add the p
flag to embed.fnc to make it possible to call it as 'regcurly' within
core. Otherwise you may get a name clash when perl is embedded/extended.

You've added a lot of useful explanatory text in perldelta and perldiag;
does some of this need to make its way into *re*.pod too?

> +=item Using just the first bytes returned by \N{}

Perhaps you should avoid using the word 'bytes' here as it has unfortunate
historical "bytes" verses "chars/unicode" connotations?

> For example I made some errors there fatal, whereas one could instead
> change an invalid input character into a UNICODE_REPLACEMENT and press
> ahead. I chose this because it was easier, and it's extremely unlikely
> that the change would be something that the user wanted. If we know at
> compile time that something won't work, why even try to execute? But
> perhaps that is contrary to standard practice.

I think being fatal in the lexer is good.

> I included enough tests to exercise every leg I could figure out a way
> to in the affected code. I didn't know where to put some of the tests
> for \N in a double quoted string

Looks like they should go in lib/charnames.t


--
The crew of the Enterprise encounter an alien life form which is
surprisingly neither humanoid nor made from pure energy.
-- Things That Never Happen in "Star Trek" #22
2) Joshua ben Jore rote: nts not nge =A0I time s Some errors are discoverable at compilation time but should only be...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
On Sat, Feb 6, 2010 at 3:30 PM, karl williamson <public@khwilliamson.com> w=
rote:
> A 'make regen' needs to be run on the changed embed.fnc.
>
> This patch should be carefully reviewed because it is to critical compone=
nts
> of Perl, and I certainly am not an expert in the tokenizer, and also may =
not
> understand some underlying philosophies.
>
> For example I made some errors there fatal, whereas one could instead cha=
nge
> an invalid input character into a UNICODE_REPLACEMENT and press ahead. =
=A0I
> chose this because it was easier, and it's extremely unlikely that the
> change would be something that the user wanted. =A0If we know at compile =
time
> that something won't work, why even try to execute? =A0But perhaps that i=
s
> contrary to standard practice.

Some errors are discoverable at compilation time but should only be
raised until runtime. Sometime in my recent memory Nick Clark fixed
constant folding so that exceptions detected at compilation time are
deferred to runtime:

1 / 0 # Compile-time exception
eval { 1 / 0 } # Run-time exception

Since the code being remarked on isn't in an eval{} I assume it's
proper to detect the error at compilation and throw it. It might be
useful to test that compile-time exceptions aren't being thrown when
the code is reasonably expected to fail at runtime.

Regenerated the patch after Ricardo's commits to the pod/* for
arrayref stuff caused a conflict. Also changed the function name to
Perl_regcurly and regen'd. Wouldn't build without those steps. Going
to poke at this a little, see if the behavior makes sense. You can
pull it from http://github.com/jbenjore/perl/tree/bug_56444. I won't
promise not to rebase that branch to follow blead.

Josh
3) karl williamson What I was doing was showing my ignorance or naivety about how easily one can figure out how to use...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
Dave Mitchell wrote:
> On Sat, Feb 06, 2010 at 04:30:50PM -0700, karl williamson wrote:
>> This patch should be carefully reviewed because it is to critical
>> components of Perl, and I certainly am not an expert in the tokenizer,
>> and also may not understand some underlying philosophies.
>
> Thanks for this!
>
> I've had a look through and it seems generally ok. (Actually my eyes
> glazed over after a while, especially in the toke.c bits, so if someone
> else wants to inspect this too, feel free!)
>
> A few comments:
>
>> but I decided to not expose it externally for
>> now in case we find reason to change it.
>
> At a quick glance, I think you mean that its exposed to the regex
> compiler, but not to double-quoted strings in the lexer; i.e.
>
> this is illegal:
>
>     $x = "\N{U+11.22.33.44}"
>
> but this works:
>
>     $re = '\N{U+11.22.33.44}'
>     /$re/;
>
> So it's partially exposed (but undocumented) in 5.12?

What I was doing was showing my ignorance or naivety about how easily
one can figure out how to use quoting methods to bypass most any
checking in Perl.  The possibility of your example just didn't occur to 
me.  So is it better to document it and say don't use it, or don't 
document it and have that mean it is subject to change?
>
> Can it cope with a charhandler that returns a zero-length string?

Yes, and there were pre-existing tests for that.  I should see what 
happens if the handler returns undef.  I would think that would be 
handled at a lower level.
>
> regcurly should be defined as Perl_regcurly in regcomp.c; then add the p
> flag to embed.fnc to make it possible to call it as 'regcurly' within
> core. Otherwise you may get a name clash when perl is embedded/extended.

Ok.  I see Josh has done that.  And, for efficiency would it be better 
to say:
  if (!e || (PL_lex_inpat && isDigit(*s) && regcurly(s)))

to avoid the function call overhead since that is the first thing
regcurly does, and the chances of changing quantifier syntax are small?
  That is what the original patch adding the new meaning of \N did, 
without using regcurly at all.  I decided to use regcurly to minimize 
the potential conflicts with existing charnames handlers that accept
names beginning with digits.

I do think we should put some restrictions on those names, and deprecate
those that don't meet them to avoid any future conflicts.  My guess is 
that Jesse won't want this in 5.12.  Right?

>
> You've added a lot of useful explanatory text in perldelta and perldiag;
> does some of this need to make its way into *re*.pod too?

I will do that.  Actually, I think I just removed stuff from delta.  I 
don't know if this would be one of the Selected Bug Fixes to highlight
there or not.  I also don't know if we need to note there that \N{U+...} 
now implies utf8.  The old implementation was inconsistent, with 
different rules of utf8ness based on the charnames handler and whether
it was in a [...] class or not.  None of that was documented.
>
>> +=item Using just the first bytes returned by \N{}
>
> Perhaps you should avoid using the word 'bytes' here as it has unfortunate
> historical "bytes" verses "chars/unicode" connotations?

I agree.  Originally I did mean bytes, then realized I shouldn't break 
apart whole characters, and didn't change the message.
>
>> For example I made some errors there fatal, whereas one could instead
>> change an invalid input character into a UNICODE_REPLACEMENT and press
>> ahead. I chose this because it was easier, and it's extremely unlikely
>> that the change would be something that the user wanted. If we know at
>> compile time that something won't work, why even try to execute? But
>> perhaps that is contrary to standard practice.
>
> I think being fatal in the lexer is good.
>
>> I included enough tests to exercise every leg I could figure out a way
>> to in the affected code. I didn't know where to put some of the tests
>> for \N in a double quoted string
>
> Looks like they should go in lib/charnames.t

In 5.12?
>
>

I'll wait to add the follow up patches, until we decide what all should
happen.
4) karl williamson My guess is that the reason it wouldn't compile wasn't the renaming, but because it needed to be...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
Joshua ben Jore wrote:
>[snip]
>
> Regenerated the patch after Ricardo's commits to the pod/* for
> arrayref stuff caused a conflict. Also changed the function name to
> Perl_regcurly and regen'd. Wouldn't build without those steps. Going
> to poke at this a little, see if the behavior makes sense. You can
> pull it from http://github.com/jbenjore/perl/tree/bug_56444. I won't
> promise not to rebase that branch to follow blead.
>
> Josh
>

My guess is that the reason it wouldn't compile wasn't the renaming, but
because it needed to be regen'd to compile.  I was following the new 
rule in perlrepository to not submit the regen'd files.  Perhaps the 
rule should be more explicit to say that the commit and email messages
should say not just that regening should be done but that it won't
compile until you regen.
5) karl williamson Is the need for this documented somewhere?
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
Dave Mitchell wrote:
> [snip]
>
> regcurly should be defined as Perl_regcurly in regcomp.c; then add the p
> flag to embed.fnc to make it possible to call it as 'regcurly' within
> core. Otherwise you may get a name clash when perl is embedded/extended.
>

Is the need for this documented somewhere?
6) Dave Mitchell This was related to optimisation, with the idea that optimisations (such as constant folding)...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
On Sun, Feb 07, 2010 at 08:15:54PM -0800, Joshua ben Jore wrote:
> On Sat, Feb 6, 2010 at 3:30 PM, karl williamson <public@khwilliamson.com> wrote:
> > A 'make regen' needs to be run on the changed embed.fnc.
> >
> > This patch should be carefully reviewed because it is to critical components
> > of Perl, and I certainly am not an expert in the tokenizer, and also may not
> > understand some underlying philosophies.
> >
> > For example I made some errors there fatal, whereas one could instead change
> > an invalid input character into a UNICODE_REPLACEMENT and press ahead.  I
> > chose this because it was easier, and it's extremely unlikely that the
> > change would be something that the user wanted.  If we know at compile time
> > that something won't work, why even try to execute?  But perhaps that is
> > contrary to standard practice.
>
> Some errors are discoverable at compilation time but should only be
> raised until runtime. Sometime in my recent memory Nick Clark fixed
> constant folding so that exceptions detected at compilation time are
> deferred to runtime:
>
> 1 / 0 # Compile-time exception
> eval { 1 / 0 } # Run-time exception

This was related to optimisation, with the idea that optimisations (such
as constant folding) shouldn't alter the expected pattern of execution:

$ perl580 -e '1/0; BEGIN { warn "not yet dead\n" }'
Illegal division by zero at -e line 1.
$ perl5100 -e '1/0; BEGIN { warn "not yet dead\n" }'
not yet dead
Illegal division by zero at -e line 1.

Since this isn't an optimisation, there's no reason for people *not*
to expect it to happen at compile time.

--
Lear: Dost thou call me fool, boy?
Fool: All thy other titles thou hast given away; that thou wast born with.
7) Dave Mitchell Barely, in perlguts under Internal Functions.
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
On Mon, Feb 08, 2010 at 07:19:40AM -0700, karl williamson wrote:
> Dave Mitchell wrote:
>> [snip]
>>
>> regcurly should be defined as Perl_regcurly in regcomp.c; then add the p
>> flag to embed.fnc to make it possible to call it as 'regcurly' within
>> core. Otherwise you may get a name clash when perl is embedded/extended.
>>
>
> Is the need for this documented somewhere?

Barely, in perlguts under Internal Functions.

--
Wesley Crusher gets beaten up by his classmates for being a smarmy git,
and consequently has a go at making some friends of his own age for a
change.
-- Things That Never Happen in "Star Trek" #18
8) Dave Mitchell I think we should briefly document it as existing but subject to change, and that it's for internal...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
On Sun, Feb 07, 2010 at 09:38:54PM -0700, karl williamson wrote:
> Dave Mitchell wrote:
>> On Sat, Feb 06, 2010 at 04:30:50PM -0700, karl williamson wrote:
>>> This patch should be carefully reviewed because it is to critical
>>> components of Perl, and I certainly am not an expert in the
>>> tokenizer, and also may not understand some underlying philosophies.
>>
>> Thanks for this!
>>
>> I've had a look through and it seems generally ok. (Actually my eyes
>> glazed over after a while, especially in the toke.c bits, so if someone
>> else wants to inspect this too, feel free!)
>>
>> A few comments:
>>
>>> but I decided to not expose it externally for
>>> now in case we find reason to change it.
>>
>> At a quick glance, I think you mean that its exposed to the regex
>> compiler, but not to double-quoted strings in the lexer; i.e.
>>
>> this is illegal:
>>
>>     $x = "\N{U+11.22.33.44}"
>>
>> but this works:
>>
>>     $re = '\N{U+11.22.33.44}'
>>     /$re/;
>>
>> So it's partially exposed (but undocumented) in 5.12?
>
> What I was doing was showing my ignorance or naivety about how easily
> one can figure out how to use quoting methods to bypass most any
> checking in Perl. The possibility of your example just didn't occur to
> me. So is it better to document it and say don't use it, or don't
> document it and have that mean it is subject to change?

I think we should briefly document it as existing but subject to change,
and that it's for internal use only.

>> regcurly should be defined as Perl_regcurly in regcomp.c; then add the p
>> flag to embed.fnc to make it possible to call it as 'regcurly' within
>> core. Otherwise you may get a name clash when perl is embedded/extended.
>
> Ok. I see Josh has done that. And, for efficiency would it be better
> to say:
> if (!e || (PL_lex_inpat && isDigit(*s) && regcurly(s)))
>
> to avoid the function call overhead since that is the first thing
> regcurly does, and the chances of changing quantifier syntax are small?
> That is what the original patch adding the new meaning of \N did,
> without using regcurly at all. I decided to use regcurly to minimize
> the potential conflicts with existing charnames handlers that accept
> names beginning with digits.

Well, having had another look at your patch, I don't think its doing what
you expect: at the point you call regcurly(), s already points to the char
*after* the '{', while regcurly expects s to point to the '{'. So I think
its always failing the test.

As for optimising, I don't think you need to worry about that. regcurly
will only get called once extra per literal "\N{" while compiling, and
that prefix is highly likely to trigger expensive charnames stuff anyway.

> I do think we should put some restrictions on those names, and deprecate
> those that don't meet them to avoid any future conflicts. My guess is
> that Jesse won't want this in 5.12.  Right?

What names are you referring to?

>> You've added a lot of useful explanatory text in perldelta and perldiag;
>> does some of this need to make its way into *re*.pod too?
>
> I will do that. Actually, I think I just removed stuff from delta. I
> don't know if this would be one of the Selected Bug Fixes to highlight
> there or not. I also don't know if we need to note there that \N{U+...}
> now implies utf8. The old implementation was inconsistent, with
> different rules of utf8ness based on the charnames handler and whether
> it was in a [...] class or not. None of that was documented.

Dunno. Perhaps someone lese has an opinion?

>>> For example I made some errors there fatal, whereas one could instead
>>> change an invalid input character into a UNICODE_REPLACEMENT and
>>> press ahead. I chose this because it was easier, and it's extremely
>>> unlikely that the change would be something that the user wanted.
>>> If we know at compile time that something won't work, why even try
>>> to execute? But perhaps that is contrary to standard practice.
>>
>> I think being fatal in the lexer is good.
>>
>>> I included enough tests to exercise every leg I could figure out a
>>> way to in the affected code. I didn't know where to put some of the
>>> tests  for \N in a double quoted string
>>
>> Looks like they should go in lib/charnames.t
>
> In 5.12?

charnames isn't a CPAN module, so I can't see any reason why not.


--
Lady Nancy Astor: If you were my husband, I would flavour your coffee
with poison.
Churchill: Madam - if I were your husband, I would drink it.
9) karl williamson Is it the case that one of certain mutually exclusive flags need to be used, like p vs m vs. s? If...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
Dave Mitchell wrote:
> On Mon, Feb 08, 2010 at 07:19:40AM -0700, karl williamson wrote:
>> Dave Mitchell wrote:
>>> [snip]
>>>
>>> regcurly should be defined as Perl_regcurly in regcomp.c; then add the p
>>> flag to embed.fnc to make it possible to call it as 'regcurly' within
>>> core. Otherwise you may get a name clash when perl is embedded/extended.
>>>
>> Is the need for this documented somewhere?
>
> Barely, in perlguts under Internal Functions.
>

Is it the case that one of certain mutually exclusive flags need to be
used, like p vs m vs. s?  If so, it seems to me that that should be 
enforced in embed.pl.  I did something a while back, and Rafael informed 
me that it I also had to have the A flag; I don't remember the details.
  It would have been nice for embed to inform me that when I changed the 
flag from static, I needed to move the entry to elsewhere in the file; I
had to debug that.
10) A. Pagaltzis Might the passes be because when regcurly fails, the bit it should examine gets passed through, and...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
* karl williamson <public@khwilliamson.com> [2010-02-09 02:45]:
> >Well, having had another look at your patch, I don't think its
> >doing what you expect: at the point you call regcurly(),
> >s already points to the char *after* the '{', while regcurly
> >expects s to point to the '{'. So I think its always failing
> >the test.
>
> Thanks. That's what prompted the revised patch now. But it's
> worrisome that it passed these pre-existing tests from
> t/re_tests, which I thought covered the situation. I'll have
> to look later why
> they passed, but don't have time now.
> \N{1} abbbbc y $& a
> \N{1} abbbbc y $-[0] 0
> \N{1} abbbbc y $+[0] 1
> \N{3,4} abbbbc y $& abbb
> \N{3,4} abbbbc y $-[0] 0
> \N{3,4} abbbbc y $+[0] 4

Might the passes be because when regcurly fails, the bit it
should examine gets passed through, and that this happens to be
the correct result in these test cases anyway? (If these tests
are all there are, this would be a classic case of testing only
for successes but not for desired failures.)

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
11) karl williamson I'm afraid there is a more fundamental problem here which makes me wonder about lots of other...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
Aristotle Pagaltzis wrote:
> * karl williamson <public@khwilliamson.com> [2010-02-09 02:45]:
>>> Well, having had another look at your patch, I don't think its
>>> doing what you expect: at the point you call regcurly(),
>>> s already points to the char *after* the '{', while regcurly
>>> expects s to point to the '{'. So I think its always failing
>>> the test.
>> Thanks. That's what prompted the revised patch now. But it's
>> worrisome that it passed these pre-existing tests from
>> t/re_tests, which I thought covered the situation. I'll have
>> to look later why
>> they passed, but don't have time now.
>> \N{1} abbbbc y $& a
>> \N{1} abbbbc y $-[0] 0
>> \N{1} abbbbc y $+[0] 1
>> \N{3,4} abbbbc y $& abbb
>> \N{3,4} abbbbc y $-[0] 0
>> \N{3,4} abbbbc y $+[0] 4
>
> Might the passes be because when regcurly fails, the bit it
> should examine gets passed through, and that this happens to be
> the correct result in these test cases anyway? (If these tests
> are all there are, this would be a classic case of testing only
> for successes but not for desired failures.)
>
> Regards,

I'm afraid there is a more fundamental problem here which makes me
wonder about lots of other apparently passing tests.

Here's what is happening as far as I've gotten:

re_tests is a data file used by a number of regex tests.  I've looked 
at one of them, regexp.t.  I can't find any use of charnames in it. 
When you run the first test by hand (on the code I originally submitted)
without a charnames, you get:

./perl -I../lib -E "'abbbbc' =~ m/\N{1}/"
Constant(\N{1}): $^H{charnames} is not defined at -e line 1, within pattern
Execution of -e aborted due to compilation errors.

The routine in toke.c that looks up character names never returns a
failure.  Instead, it returns the input, unchanged, after calling 
yyerror().

That means that the code that processes named characters in toke.c
thinks that the constant evaluated properly to 1, and turns that into
\N{U+31} and passes that on to regcomp.c which compiles it into an EXACT
node matching 1.

Then the fact that there was an error causes execution to be suppressed.

However, regexp.t wraps the above in a string eval, with some other
things.  That there is an error is somehow getting lost because of that, 
and execution goes ahead, and $& is being set to 'a'.  Both of these 
facts are very strange.  $@ is empty.
12) karl williamson Well it's not so bad as I thought. I figured out what is happening. The .t files turn these into...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
karl williamson wrote:
> Aristotle Pagaltzis wrote:
>> * karl williamson <public@khwilliamson.com> [2010-02-09 02:45]:
>>>> Well, having had another look at your patch, I don't think its
>>>> doing what you expect: at the point you call regcurly(),
>>>> s already points to the char *after* the '{', while regcurly
>>>> expects s to point to the '{'. So I think its always failing
>>>> the test.
>>> Thanks. That's what prompted the revised patch now. But it's
>>> worrisome that it passed these pre-existing tests from
>>> t/re_tests, which I thought covered the situation. I'll have
>>> to look later why
>>> they passed, but don't have time now.
>>> \N{1}    abbbbc    y    $&    a
>>> \N{1}    abbbbc    y    $-[0]    0
>>> \N{1}    abbbbc    y    $+[0]    1
>>> \N{3,4}    abbbbc    y    $&    abbb
>>> \N{3,4}    abbbbc    y    $-[0]    0
>>> \N{3,4}    abbbbc    y    $+[0]    4
>>
>> Might the passes be because when regcurly fails, the bit it
>> should examine gets passed through, and that this happens to be
>> the correct result in these test cases anyway? (If these tests
>> are all there are, this would be a classic case of testing only
>> for successes but not for desired failures.)
>>
>> Regards,
>
> I'm afraid there is a more fundamental problem here which makes me
> wonder about lots of other apparently passing tests.
> [snip]

Well it's not so bad as I thought.  I figured out what is happening. 
The .t files turn these into eg., the first one: "abbbbc" =~ m'\N{1}'.
The single quotes cause the pattern to be passed into the regex compiler
unchanged, unparsed for a named character.  The regex compiler looks at 
it and thinks, Oh, match a non-newline one time.  And so it does.

I'm not sure what to do about this, if anything.  If the input had 
instead been "abbbbc" =~ m'\N{SPACE}', it would have compiled as match a
  non-newline one time followed by the 7 characters { S P A C E }.  If 
you want a charname evaluated, the tokenizer can't be bypassed.  Perhaps 
other people can think of more implications.

I can add tests to another file that doesn't bypass the tokenizer to
verify the regcurly works in the tokenizer.  Anything else?
13) karl williamson I now realize that this patch, or any other, that moves things from the regex compiler to the lexer...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
karl williamson wrote:
> karl williamson wrote:
>> Aristotle Pagaltzis wrote:
>>> * karl williamson <public@khwilliamson.com> [2010-02-09 02:45]:
>>>>> Well, having had another look at your patch, I don't think its
>>>>> doing what you expect: at the point you call regcurly(),
>>>>> s already points to the char *after* the '{', while regcurly
>>>>> expects s to point to the '{'. So I think its always failing
>>>>> the test.
>>>> Thanks. That's what prompted the revised patch now. But it's
>>>> worrisome that it passed these pre-existing tests from
>>>> t/re_tests, which I thought covered the situation. I'll have
>>>> to look later why
>>>> they passed, but don't have time now.
>>>> \N{1}    abbbbc    y    $&    a
>>>> \N{1}    abbbbc    y    $-[0]    0
>>>> \N{1}    abbbbc    y    $+[0]    1
>>>> \N{3,4}    abbbbc    y    $&    abbb
>>>> \N{3,4}    abbbbc    y    $-[0]    0
>>>> \N{3,4}    abbbbc    y    $+[0]    4
>>>
>>> Might the passes be because when regcurly fails, the bit it
>>> should examine gets passed through, and that this happens to be
>>> the correct result in these test cases anyway? (If these tests
>>> are all there are, this would be a classic case of testing only
>>> for successes but not for desired failures.)
>>>
>>> Regards,
>>
>> I'm afraid there is a more fundamental problem here which makes me
>> wonder about lots of other apparently passing tests.
>> [snip]
>
> Well it's not so bad as I thought. I figured out what is happening. The
> .t files turn these into eg., the first one: "abbbbc" =~ m'\N{1}'. The
> single quotes cause the pattern to be passed into the regex compiler
> unchanged, unparsed for a named character. The regex compiler looks at
> it and thinks, Oh, match a non-newline one time. And so it does.
>
> I'm not sure what to do about this, if anything. If the input had
> instead been "abbbbc" =~ m'\N{SPACE}', it would have compiled as match a
> non-newline one time followed by the 7 characters { S P A C E }. If
> you want a charname evaluated, the tokenizer can't be bypassed. Perhaps
> other people can think of more implications.
>
> I can add tests to another file that doesn't bypass the tokenizer to
> verify the regcurly works in the tokenizer. Anything else?
>

I now realize that this patch, or any other, that moves things from the
regex compiler to the lexer will break code that uses single quoting of
the regex.
14) demerphq Hmm... In this particular case that is not what I would expect. I think the definition of "break"...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
On 12 February 2010 03:24, karl williamson <public@khwilliamson.com> wrote:
> I now realize that this patch, or any other, that moves things from the
> regex compiler to the lexer will break code that uses single quoting of the
> regex.

Hmm... In this particular case that is not what I would expect.

I think the definition of "break" here is a bit fuzzy.  What happens
with such constructs in code that predates the moving of charnames
style escapes to the regex engine? If what happens now is what
happened before then i would say that "break" is a little strong.

Unfortunately my access to internet right now is restricted  but i
will have proper access soon, so I will comment further when i can
actually try your patch.

cheers
Yves



--
perl -Mre=debug -e "/just|another|perl|hacker/"
15) karl williamson I think it's true that this reverts to long pre-existing behavior, so that the only code that would...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
demerphq wrote:
> On 12 February 2010 03:24, karl williamson <public@khwilliamson.com> wrote:
>> I now realize that this patch, or any other, that moves things from the
>> regex compiler to the lexer will break code that uses single quoting of the
>> regex.
>
> Hmm... In this particular case that is not what I would expect.
>
> I think the definition of "break" here is a bit fuzzy. What happens
> with such constructs in code that predates the moving of charnames
> style escapes to the regex engine? If what happens now is what
> happened before then i would say that "break" is a little strong.
>
> Unfortunately my access to internet right now is restricted but i
> will have proper access soon, so I will comment further when i can
> actually try your patch.
>
> cheers
> Yves
>
>
>

I think it's true that this reverts to long pre-existing behavior, so
that the only code that would break has been written since we changed
it.  But such code will break; but I see no other alternative.

There is another problem with the patch.  regcurly needs an E flag in 
embed.fnc in order for 'use re' to be able to see it.
16) demerphq IMO such breakage is acceptable. The charnames feature (general statement) was not well thought...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
On 12 February 2010 17:53, karl williamson <public@khwilliamson.com> wrote:
> demerphq wrote:
>>
>> On 12 February 2010 03:24, karl williamson <public@khwilliamson.com>
>> wrote:
>>>
>>> I now realize that this patch, or any other, that moves things from the
>>> regex compiler to the lexer will break code that uses single quoting of
>>> the
>>> regex.
>>
>> Hmm... In this particular case that is not what I would expect.
>>
>> I think the definition of "break" here is a bit fuzzy.  What happens
>> with such constructs in code that predates the moving of charnames
>> style escapes to the regex engine? If what happens now is what
>> happened before then i would say that "break" is a little strong.
>>
>> Unfortunately my access to internet right now is restricted  but i
>> will have proper access soon, so I will comment further when i can
>> actually try your patch.
>>
>> cheers
>> Yves
>>
>>
>>
> I think it's true that this reverts to long pre-existing behavior, so that
> the only code that would break has been written since we changed it.  But
> such code will break; but I see no other alternative.

IMO such breakage is acceptable. The charnames feature (general
statement) was not well thought through and now we are paying the
price. Really charnames is the problem here, and we should acknowledge
that IT is broken, and document how and live with it or fix it.

<rant>
Historically there has been a trend to add cool new freatures to Perl
without *really* thinking them through and we see it in particular in
the regex engine a lot. Examples are code blocks, qr// constructs, etc
having very deep and serious consequences that only come clear when
people look into fixing some bug that they have exposed.
</rant>

Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"
17) karl williamson I don't know how to fix it other than this. Perhaps we should propose that <rant> become an...
| +1 vote (Anchor)
[ Profile | Reply to group ] [ Flat  Thread  Threaded ]
demerphq wrote:
> On 12 February 2010 17:53, karl williamson <public@khwilliamson.com> wrote:
>> demerphq wrote:
>>> On 12 February 2010 03:24, karl williamson <public@khwilliamson.com>
>>> wrote:
>>>> I now realize that this patch, or any other, that moves things from the
>>>> regex compiler to the lexer will break code that uses single quoting of
>>>> the
>>>> regex.
>>> Hmm... In this particular case that is not what I would expect.
>>>
>>> I think the definition of "break" here is a bit fuzzy. What happens
>>> with such constructs in code that predates the moving of charnames
>>> style escapes to the regex engine? If what happens now is what
>>> happened before then i would say that "break" is a little strong.
>>>
>>> Unfortunately my access to internet right now is restricted but i
>>> will have proper access soon, so I will comment further when i can
>>> actually try your patch.
>>>
>>> cheers
>>> Yves
>>>
>>>
>>>
>> I think it's true that this reverts to long pre-existing behavior, so that
>> the only code that would break has been written since we changed it. But
>> such code will break; but I see no other alternative.
>
> IMO such breakage is acceptable. The charnames feature (general
> statement) was not well thought through and now we are paying the
> price. Really charnames is the problem here, and we should acknowledge
> that IT is broken, and document how and live with it or fix it.

I don't know how to fix it other than this.
>
> <rant>
> Historically there has been a trend to add cool new freatures to Perl
> without *really* thinking them through and we see it in particular in
> the regex engine a lot. Examples are code blocks, qr// constructs, etc
> having very deep and serious consequences that only come clear when
> people look into fixing some bug that they have exposed.
> </rant>

Perhaps we should propose that <rant> become an official HTML 7 tag.  I 
kind of like it :)

>
> Yves
>
>
spacer
View TopicPrint | Flat  Thread  Threaded
Home > Groups > Perl 5 Porters > Re: PATCH [perl #56444] delayed interpolation of \N{...} charnames escapes in regexes in perl 5.9.x and later causes breakage (17 posts)