On Fri, Aug 03, 2001 at 11:39:33AM +0100, Daniel P. Berrange wrote:
I'm in the process of converting my employer's perl applications
to use UTF-8 throughout and have come across a couple of
interesting bugs when working with UTF-8 strings and perl 5.7.2.
The first is in the Perl_mg_length function, which causes the
string length to be reported in bytes rather than characters,
even though the UTF-8 flag is set. I've attached a patch
(against 5.7.2) containing a fix & new test case for t/op/length.t
http://public.activestate.com/cgi-bin/perlbrowse
)
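For readers following the first report, here is a minimal sketch of the distinction between character length and byte length. It is not taken from the attached patch or from t/op/length.t; the sample string and variable names are illustrative assumptions, and it assumes a perl that ships the Encode module (5.8 or later) rather than 5.7.2 itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative string (an assumption, not from the original test case):
# U+263A needs three bytes in UTF-8, so the two counts differ.
my $str = "a\x{263A}b";

# length() on a character string counts characters.
my $chars = length($str);

# Encoding to UTF-8 octets first gives the byte count instead.
my $bytes = length(encode('UTF-8', $str));

print "characters: $chars, bytes: $bytes\n";
```

On a perl without the bug, this should report 3 characters and 5 bytes; the report describes a case where the byte count was returned even though the UTF-8 flag was set.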
The second, in the regex engine, causes '.' to match against
bytes rather than characters when using the /s modifier for
the regex match. I thought I had a suitable patch; unfortunately
it merely succeeded in breaking \C instead :-( I've attached
it anyway as it may help someone else develop a proper patch
for this problem. I've also attached a script to demo the problem.
Will investigate, thanks for the demo script. (The \C is Evil.)
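As a rough illustration of the second report, the sketch below shows what '.' under /s is expected to do with a UTF-8 string: match one whole character at a time, newline included. This is not the demo script attached to the original message; the input string and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Illustrative input (an assumption): two multi-byte characters
# separated by a newline.
my $str = "\x{263A}\n\x{263B}";

# With /s, '.' also matches "\n"; with /g in list context, every
# match of the capture group is returned.
my @matched = $str =~ /(.)/sg;

# Print code points rather than the characters themselves, so no
# output encoding layer is needed.
my $report = join ' ', map { sprintf 'U+%04X', ord } @matched;
print scalar(@matched), " matches: $report\n";
```

Matching on characters yields three matches (U+263A, U+000A, U+263B). Matching on bytes, as described in the report, would instead split each three-byte character into separate matches.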
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen