On Jul 22, 2014, at 8:55 PM, Karl Williamson wrote:

> We have a backwards compatibility problem here. Corrigendum 9 is controversial, and the wording has not been incorporated into the text of Unicode 7.0 because that hasn't been published yet (the data has, but not the text of the standard).
>
> Noncharacters are still supposed to be used only for internal purposes. The genesis of #9 was that ICU and CLDR were having trouble with off-the-shelf editors and version control systems rejecting their code that used them legitimately (though it appears that there are some poor design decisions involving their use).
>
> I sent a query about things to the Unicode mailing list some months ago, and it stirred up quite a bit of resentment about the #9 decision. It was made without public input, and during a single meeting, so there wasn't time to consider all the ramifications.

Huh. So much tempest!

> One of my points was that we have a gatekeeper that has kept non-characters out of input. Code that uses non-characters internally has relied on that gatekeeper to prevent conflicts. If we change the gatekeeper to allow noncharacters, there is a potential security hole. Even the people on the Unicode list that were the promulgators of the change given by #9 agree that any existing code that excludes noncharacters should not be changed to allow them.

Well, for now, for my purposes, I put this into our code:

     use constant PERL514 => $] >= 5.014;
     # ... later in that same file…
         unless (PERL514) {
             # Replace BMP noncharacters (U+FFFE, U+FFFF, U+FDD0..U+FDEF),
             # matched as raw UTF-8 bytes, with U+FFFD REPLACEMENT CHARACTER.
             $json =~ s/\xEF(?:\xBF[\xBF\xBE]|\xB7[\x90-\xAF])/\xEF\xBF\xBD/g;
         }

That fixes the immediate issue for us on 5.10.1 (Thanks RedHat!), and our code should keep working once we get on a more modern Perl, because JSON(::XS)? on 5.14 and higher is okay with noncharacters, even if `decode("UTF-8", $json)` isn’t.
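
For anyone wanting to try the substitution in isolation, here is a minimal sketch. The `scrub_noncharacters` name is hypothetical and not from our code. It also goes one step beyond the snippet above by matching the plane-end noncharacters (U+1FFFE through U+10FFFF) in their 4-byte UTF-8 form, which is an assumption about what a fuller filter might want rather than something we needed. It assumes the input is a byte string, not decoded characters.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every UTF-8 encoded noncharacter in a
# byte string with the bytes of U+FFFD REPLACEMENT CHARACTER (EF BF BD).
sub scrub_noncharacters {
    my ($bytes) = @_;
    $bytes =~ s{
        (?: \xEF (?: \xBF [\xBE\xBF]              # U+FFFE, U+FFFF
                   | \xB7 [\x90-\xAF] )           # U+FDD0 .. U+FDEF
          | (?: \xF0 [\x9F\xAF\xBF]               # planes 1 to 3
              | [\xF1-\xF3] [\x8F\x9F\xAF\xBF]    # planes 4 to 15
              | \xF4 \x8F )                       # plane 16
            \xBF [\xBE\xBF]                       # U+nFFFE, U+nFFFF
        )
    }{\xEF\xBF\xBD}gx;
    return $bytes;
}

# Sample input: a JSON-ish byte string carrying an encoded U+FFFF.
my $json  = qq({"id":"x\xEF\xBF\xBF;"});
my $clean = scrub_noncharacters($json);
printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $clean;
```

Working on raw bytes keeps this independent of how any particular Perl or Encode version treats noncharacters while decoding, which was the whole point of the workaround.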

As for where the “EF BF BF” is coming from: JavaScript and Flash running in a browser. Cool, right? FWIW, neither Java, JavaScript, nor Postgres complains about this noncharacter. I guess they tend to behave more like `decode_utf8`.

As I’ve solved my immediate problem, I’m fine to let you guys decide whether or not to change UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR. Do you want a ticket to track the issue, or is https://rt.perl.org/Public/Bug/Display.html?id=121937 sufficient? (I can add a comment there if you’d like, access controls allowing.)

Thanks for the detailed reply.



group: perl5-porters @
posted: Jul 16, '14 at 10:03p
active: Sep 19, '14 at 4:23p


