2009/11/13 karl williamson <public@khwilliamson.com>:
> Attached is a patch that adds code to almost fix the case changing portion
> of the "Unicode bug" (see
> http://rt.perl.org/rt3/Public/Bug/Display.html?id=58182). This is a
> significant portion of perltodo's UTF-8 revamp.

Thanks, applied as 00f254e235ff10d6223aa9a402ad5b7a85689829.

> What this means is that characters whose ordinals on ASCII machines are in
> the 128-255 range will have Unicode semantics as far as case changing goes
> regardless of whether they are encoded in utf8 or not. The reason I say it
> 'almost' fixes this is that any user-defined case mapping is still not
> called unless the scalar is in utf8. See below for a discussion on that.
>
> I am leaving the legs that implement this behavior disabled by default until
> the smokes show that the new stuff doesn't break anything; then I'll flip
> the bit to enable it. If you want to play with it in the meantime, you can
> say 'no legacy "unicode8bit"';

Prudence is good. (So currently in blead the legacy-unicode8bit pragma
has the reverse of the meaning it will have in 5.12.)

> I've proposed changing the name of this; so that may happen, but it doesn't
> affect the heart of the code, so I'm delivering it now; it's easy to change
> the name later.

I still like that name; the documentation might need improvements
though. When I read this line in the SYNOPSIS:

    use legacy ':5.10'; # Keeps semantics the same as in perl 5.10

as a perl user, I'm not certain of how that works, because it's not
clear that the doc itself depends on a version strictly greater than
5.10. Maybe it's not such a good idea to use versioned bundles like
feature.pm does.

> I don't understand Perl magic. So I have tried to avoid touching anything
> around that. But there are a couple comments from the old code that make me
> think I really don't understand what's going on in regard to that. I'm now
> thinking the comments are obsolete. If someone would care to look at them,
> they are at lines 3963 and 4227 in pp.c and both say the same thing:
> "Overloaded values may have toggled the UTF-8 flag on source, so we need to
> check DO_UTF8 again here". This doesn't make sense to me based on the code
> earlier in the functions they occur in, so I think they are wrong.

I'll look, but I don't promise to understand either.

> There are minor changes in several files: macros are created in headers to
> access the bit and look up the case change mappings. perl.h has two tables
> added with the mappings for all 256 Latin1 characters that give
> respectively, 1) their lowercased values; and 2) their upper and title cased
> values. I removed trailing blank space in all the submitted files.
>
> There are significant changes in the casing functions in pp.c to accommodate
> this new behavior. Basically, if the bit is set, the case change is looked
> up via these tables instead of the existing mechanism. If the bit is off or
> 'use locale' or 'use bytes' is in effect the existing mechanism is used.
>
> Complications arise because three characters in the latin1 range require
> special handling when title and uppercasing them. Two of them have the case
> change out-of-range, which means that the result has to be converted to
> utf8, and the other expands to two in-range characters. The code in the
> ucfirst() function was rearranged to compute the case change first, so as to
> know if the length of the result changes. If it doesn't change, and the
> scalar isn't read-only etc, then only the first character needs to be
> touched. Also, the uc() function has to cope with the possibility that in
> mid-stream it will have to decide to upgrade the result to utf8.
>
> I tried to make this efficient, and not slow things down from what they are
> now. It may actually be faster than the current implementation in general
> because it does a table look up, which avoids some tests. One test was
> added in the inner loop for upper casing, but then again the table lookup
> took out two tests.
>
> There are two things I started to implement, but left #ifdef'd out
> currently:
>
> One of them implements the context sensitive casing that Unicode defines.
> This is not implemented because I need more time to think about things; for
> one, Unicode has also recently revised their guidelines on this and I
> haven't looked at the new ones.
>
> The other would change the case of a utf8-encoded character in the Latin1
> range by using the built-in tables without having to go to the swash. It is
> disabled because it would break the ability of user-defined case mappings
> overriding the default behavior for characters in that range.
>
> Which brings us to the topic of these user-defined case-changing mappings.
> I just haven't gotten around to figuring out how to tell if such a function
> is in existence or not. Currently, they must be in main::. Since this is a
> very obscure corner of the language, I'm deferring fixing it until later.
> Rafael gave me one hint about how to figure out if such a mapping function
> exists, but I haven't pursued it yet. I understand that Zefram has been
> working on lexically-scoped subroutines, so that could affect this.

I agree with the deferring. Is anyone aware of any code using
user-defined case mappings, by the way?
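
For anyone following along, the uc() behaviour described in the quoted text (table lookup, three special characters, and a possible mid-stream upgrade to utf8) can be sketched in standalone C roughly as below. This is only an illustration under assumptions, not the code from the patch: the names upper_tbl, init_upper_tbl, put_utf8, upgrade_in_place and latin1_uc are made up for the example, and the table is filled at run time here, whereas the patch description says the tables are added to perl.h. The three special characters are taken to be U+00B5 (micro sign), U+00DF (sharp s) and U+00FF (y with diaeresis), per the Unicode case mappings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 256-entry table: simple one-to-one, in-range uppercase of
 * each Latin-1 code point. The three special characters map to themselves
 * here and are handled separately in latin1_uc(). */
static unsigned char upper_tbl[256];
static int tbl_ready = 0;

static void init_upper_tbl(void)
{
    for (int c = 0; c < 256; c++) {
        if (c >= 0x61 && c <= 0x7A)                     /* a-z */
            upper_tbl[c] = (unsigned char)(c - 0x20);
        else if (c >= 0xE0 && c <= 0xFE && c != 0xF7)   /* 0xF7 is the division sign */
            upper_tbl[c] = (unsigned char)(c - 0x20);
        else
            upper_tbl[c] = (unsigned char)c;
    }
    tbl_ready = 1;
}

/* Append code point cp (must be < 0x800) to buf at offset n as UTF-8. */
static size_t put_utf8(unsigned char *buf, size_t n, unsigned cp)
{
    if (cp < 0x80) {
        buf[n++] = (unsigned char)cp;
    } else {
        buf[n++] = (unsigned char)(0xC0 | (cp >> 6));
        buf[n++] = (unsigned char)(0x80 | (cp & 0x3F));
    }
    return n;
}

/* Re-encode the n Latin-1 bytes already in buf as UTF-8, in place.
 * Works backwards so no byte is overwritten before it has been read. */
static size_t upgrade_in_place(unsigned char *buf, size_t n)
{
    size_t extra = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] >= 0x80)
            extra++;
    size_t w = n + extra;
    for (size_t i = n; i-- > 0; ) {
        unsigned char c = buf[i];
        if (c >= 0x80) {
            buf[--w] = (unsigned char)(0x80 | (c & 0x3F));
            buf[--w] = (unsigned char)(0xC0 | (c >> 6));
        } else {
            buf[--w] = c;
        }
    }
    return n + extra;
}

/* Uppercase len Latin-1 bytes from src into dst (capacity >= 2 * len).
 * Returns the number of bytes written; *is_utf8 is set to 1 if the
 * output had to be upgraded to UTF-8 along the way, else 0. */
size_t latin1_uc(const unsigned char *src, size_t len,
                 unsigned char *dst, int *is_utf8)
{
    size_t n = 0;
    int utf8 = 0;

    assert(src != NULL && dst != NULL && is_utf8 != NULL);
    if (!tbl_ready)
        init_upper_tbl();

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c == 0xDF) {                /* sharp s expands to two in-range chars */
            dst[n++] = 'S';
            dst[n++] = 'S';
        } else if (c == 0xB5 || c == 0xFF) {
            /* Result is outside Latin-1: upgrade what we have so far. */
            if (!utf8) {
                n = upgrade_in_place(dst, n);
                utf8 = 1;
            }
            n = put_utf8(dst, n, c == 0xB5 ? 0x39Cu : 0x178u);
        } else if (utf8) {
            n = put_utf8(dst, n, upper_tbl[c]);
        } else {
            dst[n++] = upper_tbl[c];
        }
    }
    *is_utf8 = utf8;
    return n;
}
```

A caller would size dst at twice the input length, since every input byte yields at most two output bytes in either encoding. Strings without U+00B5 or U+00FF stay in Latin-1 and never pay for the upgrade.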