Hi all,

As all the regular readers of this list know, perl uses two string
encodings internally. One is essentially latin1, and the other is utf8
encoded unicode.

Unicode has an interesting feature that some people might not be
familiar with: the way it stipulates that an implementer handle
case-insensitive matches includes the requirement that in some situations
a single char can match several chars. The typical example of this is
\xDF, aka LATIN SMALL LETTER SHARP S, which when case folded ends up
as 'ss'.
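
As a quick illustration of that fold, here is a minimal sketch. It assumes Perl 5.16 or later, which postdates this thread and provides the fc function for full Unicode case folding.

```perl
use v5.16;    # assumption: 5.16+, which enables fc() and unicode_strings

my $sharp_s = "\xDF";          # LATIN SMALL LETTER SHARP S
my $folded  = fc($sharp_s);    # full case fold of that one character

# One character goes in, two come out.
printf "U+%04X folds to '%s' (%d character in, %d out)\n",
    ord($sharp_s), $folded, length($sharp_s), length($folded);
# prints: U+00DF folds to 'ss' (1 character in, 2 out)
```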

The two encodings have different semantics. Under latin1, \xDF case
insensitively matches only \xDF. Under utf8/unicode it also matches 'ss',
as I already said.

Overall this isn't a problem. If the pattern or string is utf8/unicode
then perl uses unicode semantics and things work out pretty much as
one might expect (given one knows about the two encodings and their
differing semantics).

Now it turns out that there is a bug in the regex engine optimiser
related to \xDF in that the behaviour of

"\xDF"=~/\xDF/i

and

"ss"=~/\xDF/i

is not very predictable. Depending on whether the pattern or the
string is utf8 the pattern will match differently. One would assume
that unicode semantics would be obeyed when either the string or
pattern was unicode, and that latin1 semantics (for lack of a better
term) would be followed only when neither was unicode.

Thus it would seem reasonable to expect that "ss" matches \xDF case
insensitively only when one or the other or both were unicode, and
that \xDF would match \xDF insensitively always. Except it doesn't. The
problem turns out to be minlen checking, and would apparently affect
ALL case-insensitive unicode matches where the fold-case version of a
codepoint is a multi-codepoint sequence.
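
The four string/pattern encoding combinations can be tried with a short script. This is only a sketch: fold_match_matrix is a made-up helper name, and utf8::upgrade / utf8::downgrade are used to force the internal representation. What each combination reports depends on the perl being run, so no results are claimed here.

```perl
use strict;
use warnings;

# Hypothetical helper: match one subject against /\xDF/i with every
# combination of internal encodings for the string and the pattern.
sub fold_match_matrix {
    my ($subject) = @_;
    my @rows;
    for my $str_is_utf8 (0, 1) {
        for my $pat_is_utf8 (0, 1) {
            my ($str, $pat) = ($subject, "\xDF");
            if ($str_is_utf8) { utf8::upgrade($str) } else { utf8::downgrade($str) }
            if ($pat_is_utf8) { utf8::upgrade($pat) } else { utf8::downgrade($pat) }
            my $matched = $str =~ /$pat/i ? 1 : 0;
            push @rows, [ $str_is_utf8, $pat_is_utf8, $matched ];
        }
    }
    return @rows;
}

for my $case ([ 'ss', "ss" ], [ '\xDF', "\xDF" ]) {
    my ($label, $subject) = @$case;
    for my $row (fold_match_matrix($subject)) {
        my ($str_is_utf8, $pat_is_utf8, $matched) = @$row;
        printf "%-4s =~ /\\xDF/i (str is %s, pat is %s): %s\n",
            $label,
            ($str_is_utf8 ? 'utf8' : 'latin'),
            ($pat_is_utf8 ? 'utf8' : 'latin'),
            ($matched ? 'match' : 'no match');
    }
}
```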

The problem is that the optimiser thinks that /\xDF/i under unicode is
really 'ss' and therefore that the minimum length string that can
match is 2. Which obviously causes problems matching a latin-1 \xDF,
which is only one byte. Amusingly another bug in the regex engine
allows this to work out ok when the string is unicode. utf8 \xDF is
two bytes long, and the regex engine has some issues with the
distinction between "byte length" and "codepoint length", so it sees
the two bytes of the single codepoint as being sufficient length, and
then uses unicode folding to convert the string's \xDF to 'ss' and
everything works out. But this is a fluke; I'm positive that there are
other fold case scenarios where we can't rely on this bug saving the
day. If the fold case version was longer (in bytes) than the utf8
version of the original it would not work out.
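
To make the byte-length versus codepoint-length distinction concrete, the sketch below prints both lengths for a character and for its fold. It assumes Perl 5.16+ for fc (not available when this was written); utf8_bytes is a made-up helper. U+0390 and U+03B0 are included because they are the Greek cases named later in the thread, and their folds are longer in bytes than the originals.

```perl
use v5.16;    # assumption: 5.16+ for fc()

# Hypothetical helper: length of a character string once encoded as UTF-8.
sub utf8_bytes {
    my ($chars) = @_;
    utf8::encode($chars);    # encodes the local copy only
    return length $chars;
}

for my $cp (0xDF, 0x390, 0x3B0) {
    my $char = chr $cp;
    my $fold = fc $char;
    printf "U+%04X: %d char, %d utf8 bytes; folded: %d chars, %d utf8 bytes\n",
        $cp, length($char), utf8_bytes($char),
        length($fold), utf8_bytes($fold);
}
```

For \xDF the UTF-8 form and the fold are both two bytes, which is the coincidence described above; in latin1 the same character is a single byte.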

This probably doesn't show up on too many people's radars as most times
you would be matching against a string that is quite a bit longer than
the pattern. But for cases like the above there is definitely a bug.

At this point the only solution I can think of is to disable minlen
checks when a character is encountered that folds to a multi-character
string.

That's a pretty big hammer for such a case, but it's about the best I
can think of.

Other ideas anyone?

cheers,
Yves

ps: Actually I have to say the minlen/startclass optimisations are
pretty crufty and are clearly not properly unicode aware. There is a
serious need to completely rewrite study_chunk(), probably as several
routines so that sanity can be restored. But that's a big project, one
that would probably be sufficiently large that it would need to be
funded by TPF, assuming somebody had time to do it at all.

--
perl -Mre=debug -e "/just|another|perl|hacker/"


  • Juerd Waalboer at Apr 24, 2007 at 10:03 am

    demerphq skribis 2007-04-24 11:37 (+0200):
    > One would assume that unicode semantics would be obeyed when either
    > the string or pattern was unicode, and that latin1 semantics (for lack
    > of a better term) would be followed only when neither was unicode.

    If I didn't know Perl, I would assume that it would always use Unicode
    semantics, or never, because I read somewhere that Perl only has one
    string type.

    > The problem is that the optimiser thinks that /\xDF/i under unicode is
    > really 'ss' and therefore that the minimum length string that can
    > match is 2.

    Ouch.

    > At this point the only solution I can think of is to disable minlen
    > checks when a character is encountered that folds to a multi-character
    > string.

    I think correctness is more important than performance, especially when
    it is needed for real world languages like German.
    --
    korajn salutojn,

    juerd waalboer: perl hacker <juerd@juerd.nl> <http://juerd.nl/sig>
    convolution: ict solutions and consultancy <sales@convolution.nl>
  • Demerphq at Apr 24, 2007 at 10:35 am

    On 4/24/07, Juerd Waalboer wrote:
    > demerphq skribis 2007-04-24 11:37 (+0200):
    > > One would assume that unicode semantics would be obeyed when either
    > > the string or pattern was unicode, and that latin1 semantics (for lack
    > > of a better term) would be followed only when neither was unicode.
    >
    > If I didn't know Perl, I would assume that it would always use Unicode
    > semantics, or never, because I read somewhere that Perl only has one
    > string type.
    >
    > > The problem is that the optimiser thinks that /\xDF/i under unicode is
    > > really 'ss' and therefore that the minimum length string that can
    > > match is 2.
    >
    > Ouch.
    >
    > > At this point the only solution I can think of is to disable minlen
    > > checks when a character is encountered that folds to a multi-character
    > > string.
    >
    > I think correctness is more important than performance, especially when
    > it is needed for real world languages like German.

    Turns out this bug affects Greek and German, three codepoints in total:

    00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
    0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
    03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS

    The fact that it doesn't affect any of the other 106 special case
    foldings in the unicode 5 spec is IMO a miracle perched on top of a
    bug perched on top of a melting ice-cream-cone.

    cheers,
    Yves

    --
    perl -Mre=debug -e "/just|another|perl|hacker/"
  • Gerard Goossen at Apr 24, 2007 at 1:40 pm

    On Tue, Apr 24, 2007 at 11:37:52AM +0200, demerphq wrote:
    [...]

    > The problem is that the optimiser thinks that /\xDF/i under unicode is
    > really 'ss' and therefore that the minimum length string that can
    > match is 2. Which obviously causes problems matching a latin-1 \xDF,
    > which is only one byte. Amusingly another bug in the regex engine
    > allows this to work out ok when the string is unicode. utf8 \xDF is
    > two bytes long, and the regex engine has some issues with the
    > distinction between "byte length" and "codepoint length", so it sees
    > the two bytes of the single codepoint as being sufficient length, and
    > then uses unicode folding to convert the string's \xDF to 'ss' and
    > everything works out. But this is a fluke; I'm positive that there are
    > other fold case scenarios where we can't rely on this bug saving the
    > day. If the fold case version was longer (in bytes) than the utf8
    > version of the original it would not work out.

    I encountered the same problem with kurila, a bit stronger because
    lengths there are in bytes. Thus even single characters would
    fold to multiple values. And indeed there are many places where the
    minlen is used, and the minlen has to be correct, otherwise you get very
    unpredictable results.

    > This probably doesn't show up on too many people's radars as most times
    > you would be matching against a string that is quite a bit longer than
    > the pattern. But for cases like the above there is definitely a bug.
    >
    > At this point the only solution I can think of is to disable minlen
    > checks when a character is encountered that folds to a multi-character
    > string.

    You also have to include the characters which are in the same folding
    class/group (can't remember the terminology) as the multi-character fold,
    i.e. include 's' and 'S', because "\xDF"=~/ss/i should also match.
    In kurila I didn't bother about that and just always disabled minlen
    checking with case folding.

    Also iirc there are some fixed length assumptions about EXACTF (sorry
    can't remember where), so just decreasing minlen might not work and
    might break more things than it would solve.

    > That's a pretty big hammer for such a case, but it's about the best I
    > can think of.
    >
    > Other ideas anyone?

    If we want to get performance out of the engine, I think the most interesting
    approach would be to "unfold" the string at compile time, something
    like:
    /e/i -> /(e|E)/
    /foo/i -> /(f|F)(o|O)(o|O)/
    /ss/i -> /(((s|S)(s|S))|\xDF)/
    /\xDF/i -> /ss/i -> /(((s|S)(s|S))|\xDF)/

    Then let trie, study, etc do all the optimisation.
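
The single-character part of that unfolding idea could look roughly like the sketch below. unfold_literal is a made-up name, it works on a plain literal string rather than a compiled pattern, and it deliberately ignores multi-character folds such as \xDF and 'ss', which are the hard part of the proposal.

```perl
use strict;
use warnings;

# Hypothetical helper: expand a literal into explicit case alternations.
# Only single-character lc/uc pairs are handled in this sketch.
sub unfold_literal {
    my ($literal) = @_;
    return join '', map {
        my ($lower, $upper) = (lc($_), uc($_));
        $lower eq $upper
            ? quotemeta($_)                                        # caseless character
            : '(' . quotemeta($lower) . '|' . quotemeta($upper) . ')';
    } split //, $literal;
}

print unfold_literal('e'),   "\n";    # (e|E)
print unfold_literal('foo'), "\n";    # (f|F)(o|O)(o|O)
```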
  • Demerphq at Apr 24, 2007 at 2:46 pm

    On 4/24/07, demerphq wrote:
    > The problem is that the optimiser thinks that /\xDF/i under unicode is
    > really 'ss' and therefore that the minimum length string that can
    > match is 2. Which obviously causes problems matching a latin-1 \xDF,
    > which is only one byte. Amusingly another bug in the regex engine
    > allows this to work out ok when the string is unicode. utf8 \xDF is
    > two bytes long, and the regex engine has some issues with the
    > distinction between "byte length" and "codepoint length", so it sees
    > the two bytes of the single codepoint as being sufficient length, and
    > then uses unicode folding to convert the string's \xDF to 'ss' and
    > everything works out. But this is a fluke; I'm positive that there are
    > other fold case scenarios where we can't rely on this bug saving the
    > day. If the fold case version was longer (in bytes) than the utf8
    > version of the original it would not work out. [...]
    >
    > At this point the only solution I can think of is to disable minlen
    > checks when a character is encountered that folds to a multi-character
    > string.

    Well, I have a better solution, it looks like. I've created a new regop
    FOLDCHAR that will be used to handle the three problematic codepoints
    properly. This way the regex engine doesn't see them as normal text and
    therefore the optimiser can do the right thing and everything works
    out properly.

    Sigh, so much trouble for one character. (The other two are just bonus material)

    It's actually possible to detect codepoints that will have this problem,
    so it's probably smart to put something in mktables that will detect
    and warn if any new ones come up. Or we can just do it by hand when
    updating the unicode data files.
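
One way such a scan could be approximated from plain Perl is sketched below. This is not how mktables works; it assumes Perl 5.16+ for fc, the helper name multi_char_folds is made up, and it only flags every BMP codepoint whose full fold is more than one character, which is a wider net than the three problem cases.

```perl
use v5.16;    # assumption: 5.16+ for fc()

# Hypothetical helper: BMP codepoints whose full case fold is longer
# than one character.
sub multi_char_folds {
    no warnings 'utf8';    # keep quiet about noncharacters
    my @found;
    for my $cp (0 .. 0xFFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        push @found, $cp if length(fc(chr $cp)) > 1;
    }
    return @found;
}

my @found = multi_char_folds();
printf "%d codepoints have multi-character folds\n", scalar @found;
printf "U+%04X\n", $_ for @found;
```

The count printed depends on the Unicode version bundled with the perl in use.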

    Patch is attached.

    cheers,
    Yves

    --
    perl -Mre=debug -e "/just|another|perl|hacker/"
  • Rafael Garcia-Suarez at Apr 26, 2007 at 10:24 am

    On 24/04/07, demerphq wrote:
    > Well, I have a better solution, it looks like. I've created a new regop
    > FOLDCHAR that will be used to handle the three problematic codepoints
    > properly. This way the regex engine doesn't see them as normal text and
    > therefore the optimiser can do the right thing and everything works
    > out properly.
    >
    > Sigh, so much trouble for one character. (The other two are just bonus material)

    Thanks, applied as #31081.
  • Jerry D. Hedden at Apr 26, 2007 at 1:17 pm
    > Thanks, applied as #31081.

    I get failures in t/op/pat.t with this patch on Cygwin Perl:

    ok 1903 - "ss"!~/\xDF/i (str is latin, pat is latin) Line 4376
    ok 1904 - "ss"=~/\xDF/i (str is latin, pat is utf8) Line 4371
    not ok 1905 - "ss"=~/\xDF/i (str is utf8, pat is latin) Line 4371
    # Failed test at line 4371
    ok 1906 - "ss"=~/\xDF/i (str is utf8, pat is utf8) Line 4371
    ok 1907 - "sS"!~/\xDF/i (str is latin, pat is latin) Line 4376
    ok 1908 - "sS"=~/\xDF/i (str is latin, pat is utf8) Line 4371
    not ok 1909 - "sS"=~/\xDF/i (str is utf8, pat is latin) Line 4371
    # Failed test at line 4371
    ok 1910 - "sS"=~/\xDF/i (str is utf8, pat is utf8) Line 4371
    ok 1911 - "Ss"!~/\xDF/i (str is latin, pat is latin) Line 4376
    ok 1912 - "Ss"=~/\xDF/i (str is latin, pat is utf8) Line 4371
    not ok 1913 - "Ss"=~/\xDF/i (str is utf8, pat is latin) Line 4371
    # Failed test at line 4371
    ok 1914 - "Ss"=~/\xDF/i (str is utf8, pat is utf8) Line 4371
    ok 1915 - "SS"!~/\xDF/i (str is latin, pat is latin) Line 4376
    ok 1916 - "SS"=~/\xDF/i (str is latin, pat is utf8) Line 4371
    not ok 1917 - "SS"=~/\xDF/i (str is utf8, pat is latin) Line 4371
    # Failed test at line 4371
    ok 1918 - "SS"=~/\xDF/i (str is utf8, pat is utf8) Line 4371
    ok 1919 - "\xDF"=~/\xDF/i (str is latin, pat is latin) Line 4371
    not ok 1920 - "\xDF"=~/\xDF/i (str is latin, pat is utf8) Line 4371
    # Failed test at line 4371
    ok 1921 - "\xDF"=~/\xDF/i (str is utf8, pat is latin) Line 4371
    ok 1922 - "\xDF"=~/\xDF/i (str is utf8, pat is utf8) Line 4371


    Could this be the result of something peculiar in my configuration:

    Summary of my perl5 (revision 5 version 9 subversion 5 patch 31081)
    configuration:
    Platform:
    osname=cygwin, osvers=1.5.24(0.15642), archname=cygwin-thread-multi-64int
    uname='cygwin_nt-5.0 pn100-02-2-054p 1.5.24(0.15642) 2007-01-31
    10:57 i686 cygwin '
    config_args='-de -Dusedevel -Dversiononly=no -Dinstallusrbinperl
    -Duse64bitint -Dusethreads -Uusemymalloc -A define:optimize=-O3 -pipe
    -funit-at-a-time -mtune=pentium4m -march=pentium4 -mfpmath=sse
    -mieee-fp -mmmx -msse -msse2 -A define:ld=/usr/bin/ld2 -A
    append:ccflags= -DPERL_DONT_CREATE_GVSV -DNO_MATHOMS'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
    Compiler:
    cc='gcc', ccflags ='-DPERL_USE_SAFE_PUTENV -U__STRICT_ANSI__
    -DPERL_DONT_CREATE_GVSV -DNO_MATHOMS -fno-strict-aliasing -pipe',
    optimize='-O3 -pipe -funit-at-a-time -mtune=pentium4m
    -march=pentium4 -mfpmath=sse -mieee-fp -mmmx -msse -msse2',
    cppflags='-DPERL_USE_SAFE_PUTENV -U__STRICT_ANSI__
    -DPERL_DONT_CREATE_GVSV -DNO_MATHOMS -fno-strict-aliasing -pipe'
    ccversion='', gccversion='3.4.4 (cygming special, gdc 0.12, using
    dmd 0.125)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8,
    Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
    Linker and Libraries:
    ld='/usr/bin/ld2', ldflags =' -Wl,--enable-auto-import -s -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib /lib
    libs=-lgdbm -ldl -lcrypt -lgdbm_compat
    perllibs=-ldl -lcrypt -lgdbm_compat
    libc=/usr/lib/libc.a, so=dll, useshrplib=true, libperl=libperl.a
    gnulibc_version=''
    Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' -s'
    cccdlflags=' ', lddlflags=' -s -L/usr/local/lib'


    Characteristics of this binary (from libperl):
    Compile-time options: MULTIPLICITY NO_MATHOMS PERL_DONT_CREATE_GVSV
    PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
    PERL_USE_SAFE_PUTENV USE_64_BIT_INT USE_ITHREADS
    USE_LARGE_FILES USE_PERLIO USE_REENTRANT_API
    Locally applied patches:
    DEVEL
    31081
    Built under cygwin
    Compiled at Apr 26 2007 08:31:52
    %ENV:
    PERLIO="perlio"
    CYGWIN="ntsec"
    @INC:
    lib
    /usr/lib/perl5/5.9/cygwin
    /usr/lib/perl5/5.9
    .
  • Demerphq at Apr 26, 2007 at 1:40 pm

    On 4/26/07, Jerry D. Hedden wrote:
    > > Thanks, applied as #31081.
    > I get failures in t/op/pat.t with this patch on Cygwin Perl: [snip]
    >
    > Could this be the result of something peculiar in my configuration:

    Not sure. The patch wasn't correct tho, I'm working on a follow up.

    Yves

    --
    perl -Mre=debug -e "/just|another|perl|hacker/"
  • Jerry D. Hedden at Apr 26, 2007 at 2:03 pm

    On 4/26/07, demerphq wrote:
    > On 4/26/07, Jerry D. Hedden wrote:
    > > > Thanks, applied as #31081.
    > > I get failures in t/op/pat.t with this patch on Cygwin Perl: [snip]
    > > Could this be the result of something peculiar in my configuration:
    > Not sure. The patch wasn't correct tho, I'm working on a follow up.

    Seems to have been fixed by 31085.
  • Rafael Garcia-Suarez at Apr 26, 2007 at 2:13 pm

    On 26/04/07, Jerry D. Hedden wrote:
    > Seems to have been fixed by 31085.

    But now t/uni/folds.t fails for me.
  • Jerry D. Hedden at Apr 26, 2007 at 2:27 pm

    On 4/26/07, Rafael Garcia-Suarez wrote:
    > On 26/04/07, Jerry D. Hedden wrote:
    > > Seems to have been fixed by 31085.
    > But now t/uni/folds.t fails for me.

    Me, too.
  • Demerphq at Apr 26, 2007 at 3:15 pm

    On 4/26/07, Jerry D. Hedden wrote:
    > On 4/26/07, Rafael Garcia-Suarez wrote:
    > > On 26/04/07, Jerry D. Hedden wrote:
    > > > Seems to have been fixed by 31085.
    > > But now t/uni/folds.t fails for me.
    > Me, too.

    Don't worry. I've got it under control. Patch will be ready when it's
    baked. Currently it's still in the mixing bowl. :-)

    yves

    --
    perl -Mre=debug -e "/just|another|perl|hacker/"
  • Demerphq at Apr 27, 2007 at 9:32 am

    Attached is a new patch that should fix things. It includes a pretty
    much completely rewritten and much more flexible regcharclass.pl,
    which in turn means a new regcharclass.h.

    The updated patch passes all tests on my machine. This time I
    doublechecked, then had some coffee and checked again. :-)

    I really have no idea what happened before; there's obviously no way I
    could have run a full test with the previous patch, yet I'm sure I
    did. The only thing I can think of is that I ran it in the wrong
    directory due to stupidity and didn't notice. :-(

    Anyway, sorry for the breakage folks, your regular service should
    resume with this patch.

    Oh, I updated the win32 Makefile to automatically regen regcharclass.h
    when Porting\regcharclass.pl has changed, and established a dependency
    from regcomp.obj and regexec.obj to regcharclass.h and regnodes.h
    (which is itself dependent on regcomp.sym). This ensures that the
    right things are rebuilt if any of the regex engine config data has
    been updated. As usual I've punted on modifying the other makefiles.
    (Not sure if it's relevant to anybody that isn't directly hacking on the
    regex engine anyway.) Actually, I'm not much of a makefile hacker so
    it's possible they aren't quite right. They seem to work fine here tho.

    cheers,
    Yves

    --
    perl -Mre=debug -e "/just|another|perl|hacker/"
  • Demerphq at Apr 27, 2007 at 2:10 pm

    Sigh. Third time lucky as they say.

    Yves

    --
    perl -Mre=debug -e "/just|another|perl|hacker/"
  • Rafael Garcia-Suarez at Apr 27, 2007 at 2:19 pm

    On 27/04/07, demerphq wrote:
    > Sigh. Third time lucky as they say.

    Thanks, applied as #31102.
  • Rafael Garcia-Suarez at Apr 28, 2007 at 10:17 am

    On 27/04/07, Rafael Garcia-Suarez wrote:
    > On 27/04/07, demerphq wrote:
    > > Sigh. Third time lucky as they say.
    > Thanks, applied as #31102.

    On the other hand, it now seems to trigger problems with locales where
    ß isn't represented by \xDF...

Discussion Overview
group: perl5-porters
categories: perl
posted: Apr 24, '07 at 9:38a
active: Apr 28, '07 at 10:17a
posts: 16
users: 5
website: perl.org
