On 4/24/07, Juerd Waalboer wrote:
> demerphq skribis 2007-04-24 11:37 (+0200):
> > One would assume that unicode semantics would be obeyed when either
> > the string or pattern was unicode, and that latin1 semantics (for lack
> > of a better term) would be followed only when neither were unicode.
> If I didn't know Perl, I would assume that it would always use Unicode
> semantics, or never, because I read somewhere that Perl only has one
> string type.
> > The problem is that the optimiser thinks that /\xDF/i under unicode is
> > really 'ss' and therefore that the minimum length string that can
> > match is 2. Ouch.
> > At this point the only solution I can think of is to disable minlen
> > checks when a character is encountered that folds to a multi-character
> > string.
> I think correctness is more important than performance, especially when
> it is needed for real world languages like German.
Turns out this bug affects Greek and German, three codepoints in total:

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
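
As a rough cross-check of the three entries above, the full case folding of each code point can be inspected outside Perl. The sketch below uses Python rather than Perl, on the assumption that Python's `str.casefold` applies the same Unicode full case folding data. The helper name `fold_info` is hypothetical and only for illustration.

```python
import unicodedata

# The three code points listed in the post.
CODEPOINTS = [0x00DF, 0x0390, 0x03B0]

def fold_info(cp):
    """Return (character name, folded code points as hex strings)."""
    ch = chr(cp)
    folded = ch.casefold()  # full case folding; may yield several characters
    return unicodedata.name(ch), [f"{ord(c):04X}" for c in folded]

for cp in CODEPOINTS:
    name, folded = fold_info(cp)
    # Prints one line per code point in a CaseFolding.txt-like layout.
    print(f"{cp:04X}; F; {' '.join(folded)}; # {name}")
```

Each of the three characters is a single code point but folds to two or three, which is why a length estimate based on the folded form can exceed the length of a string that should match.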

The fact that it doesn't affect any of the other 106 special case
foldings in the Unicode 5 spec is IMO a miracle perched on top of a
bug perched on top of a melting ice-cream-cone.
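
The minlen problem described in the quoted text can be illustrated with a toy model. This is only a sketch in Python, not the Perl regex optimiser: the functions `min_match_length`, `prefilter_rejects` and `safe_min_match_length` are hypothetical names, and the "safe" variant simply mirrors the workaround proposed in the thread (give up on the length check when any character folds to more than one character).

```python
def min_match_length(literal, unicode_fold):
    """Naive minlen: length of the folded literal under Unicode rules."""
    return len(literal.casefold()) if unicode_fold else len(literal)

def prefilter_rejects(subject, minlen):
    """Length prefilter: skip the real match if the subject is too short."""
    return len(subject) < minlen

def safe_min_match_length(literal):
    """Proposed workaround: disable minlen (return 0) when any
    character folds to a multi-character string."""
    if any(len(ch.casefold()) > 1 for ch in literal):
        return 0
    return len(literal)

pattern = "\u00df"  # LATIN SMALL LETTER SHARP S
subject = "\u00df"  # one character; should match the pattern

naive = min_match_length(pattern, unicode_fold=True)
print(naive)                                  # 2, because the fold is 'ss'
print(prefilter_rejects(subject, naive))      # True: wrongly rejected
print(prefilter_rejects(subject, safe_min_match_length(pattern)))  # False
```

The trade-off is the one raised in the thread: returning 0 loses the optimisation for such patterns but avoids rejecting strings that should match.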

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Discussion Overview
group: perl5-porters @
categories: perl
posted: Apr 24, '07 at 9:38a
active: Apr 28, '07 at 10:17a
posts: 16
users: 5
website: perl.org