FAQ
Just out of curiosity, which perl (if any) is doing the Right Thing
as regards the following code, which matches a char that case folds to two
chars:

# lc("\x{149}") => "\x{2bc}N"

print "ok PLAIN 1\n" if "\x{149}" =~ /\x{2bc}/i;
print "ok PLAIN 2\n" if "\x{149}" =~ /N/i;
print "ok PLAIN 3\n" if "\x{149}" =~ /\x{2bc}N/i;

print "ok ALT 1\n" if "\x{149}" =~ /\x{2bc}|ZZZZ/i;
print "ok ALT 2\n" if "\x{149}" =~ /N|ZZZZ/i;
print "ok ALT 3\n" if "\x{149}" =~ /\x{2bc}N|ZZZZ/i;


5.8.0,
5.13.0,
blead:

ok PLAIN 3
ok ALT 1
ok ALT 3

5.10.0,
5.10.1,
5.12.0:

ok PLAIN 1
ok PLAIN 3
ok ALT 1
ok ALT 3

(This is in the context me me trying to understand and fix the trie code
for [perl #74484] Regex causing exponential runtime+mem usage.)



--
"You're so sadly neglected, and often ignored.
A poor second to Belgium, When going abroad."
-- Monty Python, "Finland"

Search Discussions

  • H.Merijn Brand at Apr 27, 2010 at 4:39 pm

    On Tue, 27 Apr 2010 17:17:18 +0100, Dave Mitchell wrote:

    Just out of curiosity, which perl (if any) is doing the Right Thing
    as regards the following code, which matches a char that case folds to two
    chars:

    # lc("\x{149}") => "\x{2bc}N"

    print "ok PLAIN 1\n" if "\x{149}" =~ /\x{2bc}/i;
    print "ok PLAIN 2\n" if "\x{149}" =~ /N/i;
    print "ok PLAIN 3\n" if "\x{149}" =~ /\x{2bc}N/i;

    print "ok ALT 1\n" if "\x{149}" =~ /\x{2bc}|ZZZZ/i;
    print "ok ALT 2\n" if "\x{149}" =~ /N|ZZZZ/i;
    print "ok ALT 3\n" if "\x{149}" =~ /\x{2bc}N|ZZZZ/i;


    5.8.0,
    5.13.0,
    blead:

    ok PLAIN 3
    ok ALT 1
    ok ALT 3

    5.10.0,
    5.10.1,
    5.12.0:

    ok PLAIN 1
    ok PLAIN 3
    ok ALT 1
    ok ALT 3

    (This is in the context me me trying to understand and fix the trie code
    for [perl #74484] Regex causing exponential runtime+mem usage.)
    Just for the sake of completeness:

    === base/perl5.8.1
    === base/perl5.8.2
    === base/perl5.8.3
    === base/perl5.8.4
    === base/perl5.8.5
    === base/perl5.8.6
    === base/perl5.8.7
    === base/perl5.8.8
    === base/perl5.8.9
    ok PLAIN 3
    ok ALT 1
    ok ALT 3

    === base/perl5.10.0
    === base/perl5.10.1
    === base/perl5.11.0
    === base/perl5.11.1
    === base/perl5.11.2
    === base/perl5.11.3
    === base/perl5.11.4
    === base/perl5.11.5
    === base/perl5.12.0
    ok PLAIN 1
    ok PLAIN 3
    ok ALT 1
    ok ALT 3

    --
    H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
    using 5.00307 through 5.12 and porting perl5.13.x on HP-UX 10.20, 11.00,
    11.11, 11.23, and 11.31, OpenSuSE 10.3, 11.0, and 11.1, AIX 5.2 and 5.3.
    http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
    http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
  • Karl williamson at Apr 27, 2010 at 5:43 pm

    Dave Mitchell wrote:
    Just out of curiosity, which perl (if any) is doing the Right Thing
    as regards the following code, which matches a char that case folds to two
    chars:

    # lc("\x{149}") => "\x{2bc}N"

    print "ok PLAIN 1\n" if "\x{149}" =~ /\x{2bc}/i;
    print "ok PLAIN 2\n" if "\x{149}" =~ /N/i;
    print "ok PLAIN 3\n" if "\x{149}" =~ /\x{2bc}N/i;

    print "ok ALT 1\n" if "\x{149}" =~ /\x{2bc}|ZZZZ/i;
    print "ok ALT 2\n" if "\x{149}" =~ /N|ZZZZ/i;
    print "ok ALT 3\n" if "\x{149}" =~ /\x{2bc}N|ZZZZ/i;


    5.8.0,
    5.13.0,
    blead:

    ok PLAIN 3
    ok ALT 1
    ok ALT 3

    5.10.0,
    5.10.1,
    5.12.0:

    ok PLAIN 1
    ok PLAIN 3
    ok ALT 1
    ok ALT 3

    (This is in the context me me trying to understand and fix the trie code
    for [perl #74484] Regex causing exponential runtime+mem usage.)

    It appears to me that only case 3 should be matched in both instances.
    The Unicode rule is simply that two strings match case insensitively iff
    fold($s1) eq fold($s2).

    My guess is that the improvement came from my very recent patch:
    commit 7dcb3b25fc4113f0eeb68d0d3c47ccedd5ff3f2a
    Author: Karl Williamson (none)>
    Date: Tue Apr 13 21:25:36 2010 -0600

    * PATCH: [perl #72998] regex looping

    which causes a partially matched character (as is U+0149 in this
    instance) to not be a match. I don't know why it didn't fix the ALT 1
    case, except read the next paragraph:

    Let me say that the case insensitive matching in Perl of multi-char
    folded characters is badly broken, and I'm sure always has been. It's
    broken in many places. One big problem is that the optimizer doesn't
    understand this possibility. Yves came up with the FOLDCHAR regnode
    type to bypass the optimizer, but there are many more instances than it
    addresses that cause problems. And it doesn't address the case where
    the folded character is in the string to be matched, as opposed to be in
    the pattern.

    I started working on fixing the optimizer, but it was slow going. And I
    stopped working on that when Yves sent a message that he was working on
    a trie implementation of case insensitive matching. I had come to the
    conclusion that band-aids were not going to fix this properly.

    But, as I posted on this list not too long ago, there are semantic
    issues with the concept, as in this example that Nicholas called 'evil':
    "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i; >
    what would $1 and $2 be, and
    @LAST_MATCH_START, @LAST_MATCH_END?
  • Dave Mitchell at May 3, 2010 at 3:58 pm

    On Tue, Apr 27, 2010 at 11:42:53AM -0600, karl williamson wrote:
    Dave Mitchell wrote:
    Just out of curiosity, which perl (if any) is doing the Right Thing
    as regards the following code, which matches a char that case folds to two
    chars:
    [snip]
    It appears to me that only case 3 should be matched in both instances.
    The Unicode rule is simply that two strings match case insensitively iff
    fold($s1) eq fold($s2).

    My guess is that the improvement came from my very recent patch:
    commit 7dcb3b25fc4113f0eeb68d0d3c47ccedd5ff3f2a
    Author: Karl Williamson (none)>
    Date: Tue Apr 13 21:25:36 2010 -0600

    * PATCH: [perl #72998] regex looping

    which causes a partially matched character (as is U+0149 in this
    instance) to not be a match. I don't know why it didn't fix the ALT 1
    case, except read the next paragraph:

    Let me say that the case insensitive matching in Perl of multi-char
    folded characters is badly broken, and I'm sure always has been. It's
    broken in many places. One big problem is that the optimizer doesn't
    understand this possibility. Yves came up with the FOLDCHAR regnode
    type to bypass the optimizer, but there are many more instances than it
    addresses that cause problems. And it doesn't address the case where
    the folded character is in the string to be matched, as opposed to be in
    the pattern.

    I started working on fixing the optimizer, but it was slow going. And I
    stopped working on that when Yves sent a message that he was working on
    a trie implementation of case insensitive matching. I had come to the
    conclusion that band-aids were not going to fix this properly.

    But, as I posted on this list not too long ago, there are semantic
    issues with the concept, as in this example that Nicholas called 'evil':
    "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i;

    what would $1 and $2 be, and
    @LAST_MATCH_START, @LAST_MATCH_END?
    Ah, a can of worms!

    I've noticed some other similar things, e.g.:

    $ p -le 'print "matched" if "a\x{df}\x{100}" =~ /(aS|xx)S/i'
    $ p -le 'print "matched" if "a\x{df}\x{100}" =~ /(a|xx)SS/i'
    matched
    $

    It looks like the temporary fold buffer used within the TRIE: maybe should
    be made available to the rest of the S_regmatch loop ???

    Anyway, I decided while working on my trie fix (just committed), that I
    would just leave folding as-is, and let someone else worry about it some
    time!

    --
    The Enterprise is captured by a vastly superior alien intelligence which
    does not put them on trial.
    -- Things That Never Happen in "Star Trek" #10
  • Karl williamson at Jun 3, 2010 at 7:51 pm

    Dave Mitchell wrote:
    Just out of curiosity, which perl (if any) is doing the Right Thing
    as regards the following code, which matches a char that case folds to two
    chars:

    # lc("\x{149}") => "\x{2bc}N"

    print "ok PLAIN 1\n" if "\x{149}" =~ /\x{2bc}/i;
    print "ok PLAIN 2\n" if "\x{149}" =~ /N/i;
    print "ok PLAIN 3\n" if "\x{149}" =~ /\x{2bc}N/i;

    print "ok ALT 1\n" if "\x{149}" =~ /\x{2bc}|ZZZZ/i;
    print "ok ALT 2\n" if "\x{149}" =~ /N|ZZZZ/i;
    print "ok ALT 3\n" if "\x{149}" =~ /\x{2bc}N|ZZZZ/i;


    5.8.0,
    5.13.0,
    blead:

    ok PLAIN 3
    ok ALT 1
    ok ALT 3

    5.10.0,
    5.10.1,
    5.12.0:

    ok PLAIN 1
    ok PLAIN 3
    ok ALT 1
    ok ALT 3

    (This is in the context me me trying to understand and fix the trie code
    for [perl #74484] Regex causing exponential runtime+mem usage.)

    I looked at this a little bit more, enough to realize that I don't want
    to learn this area of the code unless necessary at some later point.
    Anyway, earlier I wrote that cases 3 were the only ones where it should
    match. And In blead, the problem that remains is tries, so that ALT 1
    gets matched.

    The problem lies in REXEC_TRIE_READ_CHAR or its callers. They aren't
    calling ibcmp_utf8, and so don't get the benefit of its patch that fixed
    the ALT 1 case in 5.12. Now I don't know if they should be calling
    ibcmp_utf8, but the bottom line is that they should somehow guarantee
    that a partial character isn't matched.

    Maybe it's best to wait for Yves' work on fold matching, but then I seem
    to say a lot that it's the answer to all our problems. After he
    finishes it, he'll be ready to tackle world hunger, and other trifling
    issues :)
  • Karl williamson at Jun 3, 2010 at 9:25 pm

    karl williamson wrote:
    Dave Mitchell wrote:
    Just out of curiosity, which perl (if any) is doing the Right Thing
    as regards the following code, which matches a char that case folds to
    two
    chars:

    # lc("\x{149}") => "\x{2bc}N"

    print "ok PLAIN 1\n" if "\x{149}" =~ /\x{2bc}/i;
    print "ok PLAIN 2\n" if "\x{149}" =~ /N/i;
    print "ok PLAIN 3\n" if "\x{149}" =~ /\x{2bc}N/i;

    print "ok ALT 1\n" if "\x{149}" =~ /\x{2bc}|ZZZZ/i;
    print "ok ALT 2\n" if "\x{149}" =~ /N|ZZZZ/i;
    print "ok ALT 3\n" if "\x{149}" =~ /\x{2bc}N|ZZZZ/i;


    5.8.0,
    5.13.0,
    blead:

    ok PLAIN 3
    ok ALT 1
    ok ALT 3

    5.10.0,
    5.10.1,
    5.12.0:

    ok PLAIN 1
    ok PLAIN 3
    ok ALT 1
    ok ALT 3

    (This is in the context me me trying to understand and fix the trie code
    for [perl #74484] Regex causing exponential runtime+mem usage.)

    I looked at this a little bit more, enough to realize that I don't want
    to learn this area of the code unless necessary at some later point.
    Anyway, earlier I wrote that cases 3 were the only ones where it should
    match. And In blead, the problem that remains is tries, so that ALT 1
    gets matched.

    The problem lies in REXEC_TRIE_READ_CHAR or its callers. They aren't
    calling ibcmp_utf8, and so don't get the benefit of its patch that fixed
    the ALT 1 case in 5.12. Now I don't know if they should be calling
    ibcmp_utf8, but the bottom line is that they should somehow guarantee
    that a partial character isn't matched.

    Maybe it's best to wait for Yves' work on fold matching, but then I seem
    to say a lot that it's the answer to all our problems. After he
    finishes it, he'll be ready to tackle world hunger, and other trifling
    issues :)
    I meant to say, rather, other comparatively easy issues.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupperl5-porters @
categoriesperl
postedApr 27, '10 at 4:17p
activeJun 3, '10 at 9:25p
posts6
users3
websiteperl.org

People

Translate

site design / logo © 2021 Grokbase