Porters,

I have a script:

     use v5.10;
     use warnings;
     use JSON;
     use Encode qw(encode_utf8 decode_utf8);

     my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
     my $parser = JSON->new->utf8;

     my $data = $parser->decode($json);
     say encode_utf8 $data->{FFONTS};

On Perl 5.12 and earlier, this dies:

     malformed UTF-8 character in JSON string, at character offset 23 (before "\x{ffff}ALIC"}")

It does not die on 5.14, which I assume is due to the addition of Unicode 6 support. But oddly, while JSON complains on 5.12 and earlier, Encode does not:

     use v5.10;
     use warnings;
     use JSON;
     use Encode qw(encode_utf8 decode_utf8);

     my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
     $json = decode_utf8 $json, Encode::FB_CROAK;

     my $parser = JSON->new;

     my $data = $parser->decode($json);
     say encode_utf8 $data->{FFONTS};

This dies with the same error from JSON.pm, but note that the call to decode_utf8() worked. I’m left wondering why JSON and Encode seem to disagree on the validity of those bytes as UTF-8 in Perl 5.12. Ideas?

Thanks,

David


  • Aristotle Pagaltzis at Jul 18, 2014 at 3:01 am
    Hi David,

    * David E. Wheeler [2014-07-17 00:05]:
    I have a script:

    use v5.10;
    use warnings;
    use JSON;
    use Encode qw(encode_utf8 decode_utf8);

    my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
    my $parser = JSON->new->utf8;

    my $data = $parser->decode($json);
    say encode_utf8 $data->{FFONTS};

    On Perl 5.12 and earlier, this dies:

    malformed UTF-8 character in JSON string, at character offset 23 (before "\x{ffff}ALIC"}")

    It does not die on 5.14, which I assume is due to the addition of
    Unicode 6 support.
    why do you assume that? As far as I can tell, Unicode 6 has no changes
    of any kind WRT U+FFFF.
    But oddly, while JSON complains on 5.12 and earlier, Encode does not:

    use v5.10;
    use warnings;
    use JSON;
    use Encode qw(encode_utf8 decode_utf8);

    my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
    $json = decode_utf8 $json, Encode::FB_CROAK;

    my $parser = JSON->new;

    my $data = $parser->decode($json);
    say encode_utf8 $data->{FFONTS};

    This dies with the same error from JSON.pm, but note that the call to
    decode_utf8() worked. I’m left wondering why JSON and Encode seem to
    disagree on the validity of those bytes as UTF-8 in Perl 5.12. Ideas?
    Sounds to me like it’s the behaviour of JSON that changes between 5.12
    and 5.14 rather than that of Encode?

    What I can say is that U+FFFF is a non-character, but EF BF BF is the
    correct encoding of that codepoint. Using decode_utf8(...) is short for
    decode("utf8", ...), which is completely permissive. As long as it can
    decode the octet sequence according to the UTF-8 encoding, it will not
    complain. In contrast, if you do decode("UTF-8", ...) then you will get
    charset checking too. And *that* *will* reject your attempt to smuggle
    a U+FFFF into the string.

    So that’s why Encode behaves as it does.
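
    To make that distinction concrete, here is a minimal sketch using the same
    EF BF BF bytes (behaviour as of the Encode versions discussed in this
    thread):

         use Encode ();

         my $bytes = "\xEF\xBF\xBF";   # the UTF-8 encoding of U+FFFF

         # "utf8": only the encoding form is checked, so this succeeds.
         my $loose = Encode::decode("utf8", $bytes, Encode::FB_CROAK);
         printf "loose decode ok: U+%04X\n", ord $loose;

         # "UTF-8": the interchange rules apply as well, so on these Encode
         # versions this dies with 'utf8 "\xFFFF" does not map to Unicode'.
         my $strict = eval { Encode::decode("UTF-8", $bytes, Encode::FB_CROAK) };
         print defined $strict ? "strict decode ok\n" : "strict decode died: $@";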

    Why does JSON go from rejecting to accepting the string if you go from
    5.12 to 5.14? That, I have no idea about. (Or maybe it goes from one
    to the other based on the version of JSON; you haven’t specified whether
    you have the same version of it installed in your 5.12 vs 5.14 perls.)

    Regards,
    --
    Aristotle Pagaltzis // <http://plasmasturm.org/>
  • David E. Wheeler at Jul 18, 2014 at 5:19 am

    On Jul 17, 2014, at 8:00 PM, Aristotle Pagaltzis wrote:

    Hi David,
    Hey Aristotle, many thanks for your reply. Super helpful.
    It does not die on 5.14, which I assume is due to the addition of
    Unicode 6 support.
    why do you assume that? As far as I can tell, Unicode 6 has no changes
    of any kind WRT U+FFFF.
    It was a guess.
    Sounds to me like it’s the behaviour of JSON that changes between 5.12
    and 5.14 rather than that of Encode?
    Yes.
    What I can say is that U+FFFF is a non-character, but EF BF BF is the
    correct encoding of that codepoint. Using decode_utf8(...) is short for
    decode("utf8", ...), which is completely permissive. As long as it can
    decode the octet sequence according to the UTF-8 encoding, it will not
    complain. In contrast, if you do decode("UTF-8", ...) then you will get
    charset checking too. And *that* *will* reject your attempt to smuggle
    a U+FFFF into the string.
    Ah, yes, quite right. I keep forgetting that utf8 is so permissive.
    So that’s why Encode behaves as it does.
    So this data came from a Java app, which serialized the string "HOLIDAYBOLDI\xEF\xBF\xBFALIC" into JSON. This tells me that our Java app needs to be a little more careful about what it considers UTF-8, and perhaps replace bogus characters/bytes. But I am unable to get it to choke on \uFFFF at all on Java 6 or 7. This does not throw an exception:

         "\uFFFF".getBytes("UTF-8");

    I Googled around a bit, and found this SO answer:

       http://stackoverflow.com/a/16619933/79202

    Which suggests that, according to [Corrigendum 9](http://www.unicode.org/versions/corrigendum9.html), reserved non-characters now *are* allowed to appear in a UTF-8 string. Which makes me think I will never be able to get the Java server to clean up its act. Should Perl, Encode, and JSON relax things a bit with regard to these characters, then?
    Why does JSON go from rejecting to accepting the string if you go from
    5.12 to 5.14? That, I have no idea about. (Or maybe it goes from one
    to the other based on the version of JSON; you haven’t specified whether
    you have the same version of it installed in your 5.12 vs 5.14 perls.)
    I used JSON 2.90 and JSON::XS 3.01 in all my tests.
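
    A quick probe along these lines (just a sketch) can be run under each perl
    to see which combination accepts or rejects the noncharacter bytes:

         use v5.10;
         use JSON;
         use Encode qw(encode_utf8);

         say "perl $], JSON $JSON::VERSION, backend ", JSON->backend;
         my $bytes = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
         my $data  = eval { JSON->new->utf8->decode($bytes) };
         say $data ? "accepted: " . encode_utf8($data->{FFONTS})
                   : "rejected: $@";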

    Best,

    David
  • David E. Wheeler at Jul 18, 2014 at 6:37 am

    On Jul 17, 2014, at 10:19 PM, David E. Wheeler wrote:

    Which suggests that, according to [Corrigendum 9](http://www.unicode.org/versions/corrigendum9.html), reserved non-characters now *are* allowed to appear in a UTF-8 string. Which makes me think I will never be able to get the Java server to clean up its act. Should Perl, Encode, and JSON relax things a bit with regard to these characters, then?
    Actually, now that I think about it, it seems that JSON on Perl 5.14 and higher has already relaxed that distinction. It’s only Encode that is still strict about non-characters.

    Best,

    David
  • Aristotle Pagaltzis at Jul 18, 2014 at 7:57 am
    Hi David,

    * David E. Wheeler [2014-07-18 08:40]:
    On Jul 17, 2014, at 10:19 PM, David E. Wheeler wrote:
    Which suggests that, according to [Corrigendum
    9](http://www.unicode.org/versions/corrigendum9.html), reserved
    non-characters now *are* allowed to appear in a UTF-8 string. Which
    makes me think I will never be able to get the Java server to clean
    up its act. Should Perl, Encode, and JSON relax things a bit with
    regard to these characters, then?
    Actually, now that I think about it, it seems that JSON on Perl 5.14
    and higher has already relaxed that distinction. It’s only Encode that
    is still strict about non-characters.
    there is a ticket about that:
    https://rt.perl.org/Public/Bug/Display.html?id=121937


    * David E. Wheeler [2014-07-18 07:20]:
    On Jul 17, 2014, at 8:00 PM, Aristotle Pagaltzis wrote:
    Why does JSON go from rejecting to accepting the string if you go
    from 5.12 to 5.14? That, I have no idea about. (Or maybe it goes
    from one to the other based on the version of JSON; you haven’t
    specified whether you have the same version of it installed in your
    5.12 vs 5.14 perls.)
    I used JSON 2.90 and JSON::XS 3.01 in all my tests.
    So that leaves the question open as it was: why does JSON.pm exhibit one
    behaviour under 5.12 and another under 5.14?


    Regards,
    --
    Aristotle Pagaltzis // <http://plasmasturm.org/>
  • David E. Wheeler at Jul 20, 2014 at 4:59 am

    On Jul 18, 2014, at 12:56 AM, Aristotle Pagaltzis wrote:

    there is a ticket about that:
    https://rt.perl.org/Public/Bug/Display.html?id=121937
    Ah, interesting. I had not run into that warning. What I ran into with Encode I now think should be changed:

         perl -MEncode -E 'say Encode::decode("UTF-8", "\xEF\xBF\xBF", Encode::FB_CROAK)'
         utf8 "\xFFFF" does not map to Unicode at /usr/local/lib/perl5/site_perl/5.20.0/darwin-thread-multi-2level/Encode.pm line 175.

    In fact it *does* map to Unicode, IIUC Corrigendum 9 correctly. I’ll file a bug with Dan.
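
    In the meantime, a possible workaround (only a sketch, not an Encode
    feature) is to decode with the permissive "utf8" codec and make the
    noncharacter decision yourself via the standard Unicode property:

         use Encode ();

         my $string = Encode::decode("utf8", "\xEF\xBF\xBF", Encode::FB_CROAK);
         warn "input contains a Unicode noncharacter\n"
             if $string =~ /\p{Noncharacter_Code_Point}/;
         printf "U+%04X\n", ord $string;   # prints U+FFFF
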
    So that leaves the question open as it was: why does JSON.pm exhibit one
    behaviour under 5.12 and another under 5.14?
    Yes, very curious.

    Best,

    David
  • David E. Wheeler at Jul 21, 2014 at 6:22 pm

    On Jul 19, 2014, at 9:58 PM, David E. Wheeler wrote:

    there is a ticket about that:
    https://rt.perl.org/Public/Bug/Display.html?id=121937
    Ah, interesting. I had not run into that warning. What I ran into with Encode I now think should be changed:

    perl -MEncode -E 'say Encode::decode("UTF-8", "\xEF\xBF\xBF", Encode::FB_CROAK)'
    utf8 "\xFFFF" does not map to Unicode at /usr/local/lib/perl5/site_perl/5.20.0/darwin-thread-multi-2level/Encode.pm line 175.

    In fact it *does* map to Unicode, IIUC Corrigendum 9 correctly. I’ll file a bug with Dan.
    I did so, here:

       https://rt.cpan.org/Ticket/Display.html?id=97358

    Dan replied to report that it’s UTF8_DISALLOW_ILLEGAL_INTERCHANGE from the Perl core that’s at fault:
    If it were a bug, it belongs to perl core because the strictness of UTF8 is #defined in the value of UTF8_DISALLOW_ILLEGAL_INTERCHANGE which is defined in perl core:

    http://perldoc.perl.org/perlapi.html#Unicode-Support

    In other words, Encode faithfully believes perl core in that respect. And I want to leave Encode that way. If it is to be fixed, it should be fixed by redefining UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR in perl core.

    ISTM that, given the change in Corrigendum 9, UTF8_DISALLOW_ILLEGAL_INTERCHANGE should exclude UTF8_DISALLOW_NONCHAR.

    Is this part of the same issue as that described in RT-97358? Or should I start a new issue?

    Best,

    David
  • Karl Williamson at Jul 23, 2014 at 3:55 am

    On 07/21/2014 12:22 PM, David E. Wheeler wrote:
    On Jul 19, 2014, at 9:58 PM, David E. Wheeler wrote:

    there is a ticket about that:
    https://rt.perl.org/Public/Bug/Display.html?id=121937
    Ah, interesting. I had not run into that warning. What I ran into with Encode I now think should be changed:

    perl -MEncode -E 'say Encode::decode("UTF-8", "\xEF\xBF\xBF", Encode::FB_CROAK)'
    utf8 "\xFFFF" does not map to Unicode at /usr/local/lib/perl5/site_perl/5.20.0/darwin-thread-multi-2level/Encode.pm line 175.

    In fact it *does* map to Unicode, IIUC Corrigendum 9 correctly. I’ll file a bug with Dan.
    I did so, here:

    https://rt.cpan.org/Ticket/Display.html?id=97358

    Dan replied to report that it’s UTF8_DISALLOW_ILLEGAL_INTERCHANGE from the Perl core that’s at fault:
    If it were a bug, it belongs to perl core because the strictness of UTF8 is #defined in the value of UTF8_DISALLOW_ILLEGAL_INTERCHANGE which is defined in perl core:

    http://perldoc.perl.org/perlapi.html#Unicode-Support

    In other words, Encode faithfully believes perl core in that respect. And I want to leave Encode that way. If it is to be fixed, it should be fixed by redefining UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR in perl core.

    ISTM that, given the change in Corrigendum 9, UTF8_DISALLOW_ILLEGAL_INTERCHANGE should exclude UTF8_DISALLOW_NONCHAR.

    Is this part of the same issue as that described in RT-97358? Or should I start a new issue?

    Best,

    David
    We have a backwards compatibility problem here. Corrigendum 9 is
    controversial, and the wording has not been incorporated into the text
    of Unicode 7.0 because that hasn't been published yet (the data has, but
    not the text of the standard).

    Noncharacters are still supposed to be used only for internal purposes.
    The genesis of #9 was that ICU and CLDR were having trouble with
    off-the-shelf editors and version control systems rejecting their code
    that used them legitimately (though it appears that there are some poor
    design decisions involving their use).

    I sent a query about things to the Unicode mailing list some months ago,
    and it stirred up quite a bit of resentment about the #9 decision. It
    was made without public input, and during a single meeting, so there
    wasn't time to consider all the ramifications.

    One of my points was that we have a gatekeeper that has kept
    non-characters out of input. Code that uses non-characters internally
    has relied on that gatekeeper to prevent conflicts. If we change the
    gatekeeper to allow noncharacters, there is a potential security hole.
    Even the people on the Unicode list that were the promulgators of the
    change given by #9 agree that any existing code that excludes
    noncharacters should not be changed to allow them.
  • David E. Wheeler at Jul 23, 2014 at 5:38 am

    On Jul 22, 2014, at 8:55 PM, Karl Williamson wrote:

    We have a backwards compatibility problem here. Corrigendum 9 is controversial, and the wording has not been incorporated into the text of Unicode 7.0 because that hasn't been published yet (the data has, but not the text of the standard).

    Noncharacters are still supposed to be used only for internal purposes. The genesis of #9 was that ICU and CLDR were having trouble with off-the-shelf editors and version control systems rejecting their code that used them legitimately (though it appears that there are some poor design decisions involving their use).

    I sent a query about things to the Unicode mailing list some months ago, and it stirred up quite a bit of resentment about the #9 decision. It was made without public input, and during a single meeting, so there wasn't time to consider all the ramifications.
    Huh. So much tempest!
    One of my points was that we have a gatekeeper that has kept non-characters out of input. Code that uses non-characters internally has relied on that gatekeeper to prevent conflicts. If we change the gatekeeper to allow noncharacters, there is a potential security hole. Even the people on the Unicode list that were the promulgators of the change given by #9 agree that any existing code that excludes noncharacters should not be changed to allow them.
    Well, for now, for my purposes, I put this into our code:

         use constant PERL514 => $] >= 5.014;
         # ... later in that same file…
             unless (PERL514) {
                 # Replace noncharacters with the UNICODE REPLACEMENT character.
                 $json =~ s/\xEF(?:\xBF[\xBF\xBE]|\xB7[\x90-\xAF])/\xEF\xBF\xBD/g;
             }

    Which fixes the immediate issue for us on 5.10.1 (Thanks RedHat!) and should allow it to keep working once we get on a more modern Perl. This is because JSON(::XS)? on 5.14 and higher is okay with noncharacters, even if `decode("UTF-8", $json)` isn’t.
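
    Applied to the original example from the top of this thread, the
    substitution looks like this (a sketch; the REPLACEMENT CHARACTER ends up
    where U+FFFF was):

         use v5.10;
         use JSON;
         use Encode qw(encode_utf8);

         my $json = qq{{"FFONTS":"HOLIDAYBOLDI\xEF\xBF\xBFALIC"}};
         $json =~ s/\xEF(?:\xBF[\xBF\xBE]|\xB7[\x90-\xAF])/\xEF\xBF\xBD/g;
         my $data = JSON->new->utf8->decode($json);   # no longer dies on 5.12
         say encode_utf8 $data->{FFONTS};             # HOLIDAYBOLDI\x{FFFD}ALIC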

    As for where the “EF BF BF” is coming from: JavaScript and Flash running in a browser. Cool, right? FWIW, neither Java, JavaScript, nor Postgres complains about this noncharacter. I guess they tend to behave more like `decode_utf8`.

    As I’ve solved my immediate problem, I’m fine to let you guys decide whether or not to change UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR. Do you want a ticket to track the issue, or is https://rt.perl.org/Public/Bug/Display.html?id=121937 sufficient (I can add a comment there if you’d like, access controls allowing)?

    Thanks for the detailed reply.

    Best,

    David
  • David E. Wheeler at Sep 19, 2014 at 12:28 am

    On Jul 22, 2014, at 10:38 PM, David E. Wheeler wrote:

    As I’ve solved my immediate problem, I’m fine to let you guys decide whether or not to change UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR. Do you want a ticket to track the issue, or is https://rt.perl.org/Public/Bug/Display.html?id=121937 sufficient (I can add a comment there if you’d like, access controls allowing)?
    Karl, what say you?

    Best,

    David
  • Karl Williamson at Sep 19, 2014 at 1:00 am

    On 09/18/2014 06:28 PM, David E. Wheeler wrote:
    On Jul 22, 2014, at 10:38 PM, David E. Wheeler wrote:

    As I’ve solved my immediate problem, I’m fine to let you guys decide whether or not to change UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR. Do you want a ticket to track the issue, or is https://rt.perl.org/Public/Bug/Display.html?id=121937 sufficient (I can add a comment there if you’d like, access controls allowing)?
    Karl, what say you?

    Best,

    David
    Background: It turns out that Corrigendum #9 is controversial in
    the Unicode community. It was done during the course of a single
    meeting, and not subjected to the usual public review. The wording of
    the Standard in regards to this has not been finalized.

    We cannot just change this. It would open up security holes.
    Applications likely have been written assuming Non-characters will not
    be in the input, and thus are usable as sentinels, without fear of
    encountering one from user-data. If we were to make this change that
    would no longer be true, and a long-standing module could silently be
    exposed to an attack.

    The feedback from Unicode on this was unanimous, even from the people
    who were the ones who pushed for #9. If you have an existing library
    (as essentially we do) that excluded non-chars, you have to continue to
    exclude them to prevent security holes from opening up.

    The way out of this is to have some API to tell Encode that
    non-characters are acceptable.
  • David E. Wheeler at Sep 19, 2014 at 4:08 pm

    On Sep 18, 2014, at 5:59 PM, Karl Williamson wrote:

    The way out of this is to have some API to tell Encode that non-characters are acceptable.
    Encode-only? Is there a way to do it with the IO layers? Or is that just Encode, too?

    Dan, should we re-open this bug to request an interface for telling UTF8_DISALLOW_ILLEGAL_INTERCHANGE to exclude UTF8_DISALLOW_NONCHAR?

       https://rt.cpan.org/Ticket/Display.html?id=97358#txn-1388686

    Thanks,

    David
  • Karl Williamson at Sep 19, 2014 at 4:23 pm

    On 09/19/2014 10:08 AM, David E. Wheeler wrote:
    Encode-only? Is there a way to do it with the IO layers? Or is that just Encode, too?
    I keep hoping we will soon get a new :utf8 layer that will allow this
    sort of thing.
