I'm sorry, but this ticket should be filed against the CPAN module
Encode rather than Perl itself, and what you are requesting is an
enhancement, not a bug. The URL for entering Encode tickets is:
https://rt.cpan.org/Public/Bug/Report.html?Queue=Encode

Note that U+FDD3 and the other non-character code points are not
strictly "illegal", but they are "illegal for open interchange", and so
actually Encode is working as required by the Unicode standard, since it
could be getting this from open interchange. What you are really asking
for is a new input method that does not do strict checking for these
code points, perhaps with "lax" in its name. It is important to do the
strict checking by default, as one of the non-character code points is
U+FFFE, which could be used in UTF-16 as the first character in an
attack to cause the platform to think that the byte order is swapped
from what it actually is.

--Karl Williamson
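
The byte-order concern above can be illustrated with a minimal sketch. It assumes only core Perl; `is_noncharacter` is a hypothetical helper written for this illustration, not a function provided by perl or Encode. It shows that the same two bytes read as the BOM U+FEFF in one byte order and as the noncharacter U+FFFE in the other.

```perl
use strict;
use warnings;

# Hypothetical helper: true for the 66 Unicode noncharacters, that is
# U+FDD0..U+FDEF plus the last two code points of each of the 17 planes
# (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
sub is_noncharacter {
    my ($cp) = @_;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return 1 if $cp <= 0x10FFFF && ($cp & 0xFFFE) == 0xFFFE;
    return 0;
}

# The two bytes FF FE, read as one 16-bit unit in each byte order.
my $bytes            = "\xFF\xFE";
my $as_big_endian    = unpack 'n', $bytes;    # 'n': 16-bit big-endian
my $as_little_endian = unpack 'v', $bytes;    # 'v': 16-bit little-endian

printf "big-endian:    U+%04X, noncharacter: %s\n",
    $as_big_endian, is_noncharacter($as_big_endian) ? 'yes' : 'no';
printf "little-endian: U+%04X, noncharacter: %s\n",
    $as_little_endian, is_noncharacter($as_little_endian) ? 'yes' : 'no';
```

A decoder that silently accepted U+FFFE at the start of UTF-16 input would have no way to notice that it had guessed the byte order wrong.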


  • Andrew Pimlott at Jan 14, 2011 at 10:33 pm
    I will refile the issue with Encode. However, specifying the encoding
    of filehandles is effectively a perl core feature. So I think it's
    important to have intuitive and consistent behavior, assuming the user
    doesn't know anything about Encode behind the curtains. I would
    suggest:

    - that all encodings treat these characters in the same way as the perl
    core
    - that the default behavior be a warning (with the same text as the core
    warning--currently it is a little different)
    - that the warning be controlled by 'use warnings'

    The last requires that the warning flags be passed on to Encode. This
    is more work but is more consistent to the programmer than adding a
    "lax" encoding name. Also, when he looks up the error in perldiag(1),
    that's what he is told to do.

    Interesting point about the security implication. But U+FEFF could as
    well be used maliciously, but it is accepted. And an attack might be
    routed through UTF-8 or another encoding, so by the same reasoning it
    should be an error there. (It might even be routed through a perl
    program that constructs strings with chr(), so chr(0xFDD0) should be an
    error.)

    And it's a little strong to say that Encode is "working as required",
    because Unicode says "Applications are free to use any of these
    noncharacter code points internally", and Encode doesn't know anything
    about the context of the program. (I probably biased this bug report
    against myself by calling these characters "illegal", following perl's
    terminology, instead of "noncharacters" ;-) )

    Andrew
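
The suggestion that the diagnostic be controlled by 'use warnings' relies on warning categories being lexically scoped. The following is a minimal sketch of that idea, assuming a perl on which writing a noncharacter to a `:utf8` handle is what triggers the diagnostic; which operations warn, and the exact warning text, differ between perl versions, so no output is shown. `write_nonchar` is a hypothetical helper invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: write U+FDD0 to an in-memory :utf8 handle and
# return the bytes produced. When $quiet is true, the print happens in a
# block where the 'utf8' warning category is switched off.
sub write_nonchar {
    my ($quiet) = @_;
    my $buffer  = '';
    open my $out, '>:utf8', \$buffer or die "open failed: $!";
    my $code_point = 0xFDD0;
    my $char       = chr $code_point;
    if ($quiet) {
        no warnings 'utf8';    # lexically scoped: only this block
        print {$out} $char;
    }
    else {
        print {$out} $char;
    }
    close $out or die "close failed: $!";
    return $buffer;
}

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

for my $quiet (0, 1) {
    @warnings = ();
    my $bytes = write_nonchar($quiet);
    my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    printf "quiet=%d bytes=%s warnings=%d\n", $quiet, $hex, scalar @warnings;
}
```

Either way the same bytes are written; the pragma only decides whether the programmer is told about it.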
  • Karl Williamson at Jan 22, 2011 at 9:24 pm

    On 01/14/2011 03:15 PM, Andrew Pimlott wrote:
    > I will refile the issue with Encode. However, specifying the encoding
    > of filehandles is effectively a perl core feature. So I think it's
    > important to have intuitive and consistent behavior, assuming the user
    > doesn't know anything about Encode behind the curtains.
    I'm sorry that you're being exposed to Perl's internal organizational
    structure. Our excuse would be that it's essentially an all volunteer
    organization, with no funding to do things "right." (But I've noticed
    that this is true as well for companies that would have the resources to
    avoid this, should they choose.) In any event, Encode is a CPAN module
    that exists and is maintained independently of the Perl core, and works
    on any number of Perl versions. The code that causes the behavior you
    don't like is in Encode. The functionality that Encode provides has
    been deemed so useful that it's now shipped with the core; and, as is
    often the case, the fact that it's independent only shows up as an issue
    when there is a flaw.

    As an aside, UTF-16 parallels UTF-8, in that if you use Encode with that
    spelling of the encoding, you get the same strict behavior you do with
    UTF-16. It's just that there are alternative spellings for the 8-bit
    encoding that give laxer behavior. I am of the opinion that those
    spellings should be different than they are, and involve something like
    "lax" to indicate to the user that they are getting that.

    Security is serious business. I'm unhappy that Perl, including the
    modules it ships with, has too many instances of not treating it that
    way. I view it as unconscionable.

    > I would suggest:
    >
    > - that all encodings treat these characters in the same way as the perl
    > core
    > - that the default behavior be a warning (with the same text as the core
    > warning--currently it is a little different)
    > - that the warning be controlled by 'use warnings'
    >
    > The last requires that the warning flags be passed on to Encode. This
    > is more work but is more consistent to the programmer than adding a
    > "lax" encoding name. Also, when he looks up the error in perldiag(1),
    > that's what he is told to do.
    The default behavior can't be just a warning when a server is facing the
    wide world of hackers. The current behavior of failing when warnings
    are enabled has been considered a bug, and is now changed in 5.13.9.
    > Interesting point about the security implication. But U+FEFF could as
    > well be used maliciously, but it is accepted. And an attack might be
    > routed through UTF-8 or another encoding, so by the same reasoning it
    > should be an error there. (It might even be routed through a perl
    > program that constructs strings with chr(), so chr(0xFDD0) should be an
    > error.)
    I'm not sure if you were being ironic here, as it is self-contradictory;
    so I have to assume you were being straight. You said in an earlier
    post that the non-characters are legal internally. So chr() has to
    accept them. I don't see how U+FEFF or encoding them in UTF-8 could be
    security implications.
    > And it's a little strong to say that Encode is "working as required",
    > because Unicode says "Applications are free to use any of these
    > noncharacter code points internally", and Encode doesn't know anything
    > about the context of the program. (I probably biased this bug report
    > against myself by calling these characters "illegal", following perl's
    > terminology, instead of "noncharacters" ;-) )
    Since Encode doesn't know anything about the context, it has to assume
    the worst case. To do otherwise is to leave users unknowingly open to
    attack.

    There was no biasing involved. I've been trying to root out all
    instances of calling these "illegal" in the 5.13.X series.
  • Andrew Pimlott at Jan 30, 2011 at 7:00 pm

    Excerpts from Karl Williamson's message of Sat Jan 22 13:16:45 -0800 2011:
    > I'm sorry that you're being exposed to Perl's internal organizational
    > structure.
    No prob about that. I've filed the bug with Encode:
    https://rt.cpan.org/Public/Bug/Display.html?id=64788

    I'm just suggesting that some coordination with the core perl maintainers
    is warranted, since Encode is so closely integrated.
    > As an aside, UTF-16 parallels UTF-8, in that if you use Encode with that
    > spelling of the encoding, you get the same strict behavior you do with
    > UTF-16.
    I don't think that's accurate--see the original example I posted:

    binmode(STDIN, ':encoding(UTF-8)');
    while (<STDIN>) { }

    Input is EF B7 93, which decodes to U+FDD3, a noncharacter. There is no
    diagnostic. (Unless this is changed in a recent dev version.)
    > The default behavior can't be just a warning when a server is facing the
    > wide-world of hackers.
    That's a fair point, but consider the other side: I write code that is
    correct according to Unicode and my application's semantics. One day,
    my application fails because Encode surprisingly considers some valid
    input illegal. It's a judgement call, and I don't think the best default
    policy is obvious. A warning is a reasonable compromise, IMO.

    Even better would be to document the situation clearly:

    As a security precaution, the following encodings consider Unicode
    "noncharacters" to be malformed. If you want to decode Unicode
    noncharacters, ...
    >> Interesting point about the security implication. But U+FEFF could as
    >> well be used maliciously, but it is accepted. And an attack might be
    >> routed through UTF-8 or another encoding, so by the same reasoning it
    >> should be an error there. (It might even be routed through a perl
    >> program that constructs strings with chr(), so chr(0xFDD0) should be an
    >> error.)
    > I'm not sure if you were being ironic here, as it is self-contradictory;
    > so I have to assume you were being straight. You said in an earlier
    > post that the non-characters are legal internally. So chr() has to
    > accept them.
    I'd put it differently: perl and Encode have no idea what data is
    "internal". Input being read from a filehandle might well be internal,
    eg. from a file the application itself produced. An argument to chr()
    might well be external, eg a number supplied by a potentially malicious
    agent. Saying Encode::decode() should be more strict than chr() is
    merely a rough heuristic.

    By the way, to be clear, chr(0xFDD0) does throw a warning, so someone
    already decided that this should be checked here. I say, the best thing
    for the programmer is to be consistent: Have one way to say whether
    noncharacters should be errors, warnings, or non-issues, and honor it
    everywhere. The "no warnings 'utf8'" pragma would be the natural way to
    do this.
    > I don't see how U+FEFF or encoding them in UTF-8 could be
    > security implications.
    Oops, I got that one wrong. But according to my understanding, U+FFFE
    is the only character with this security implication. So this rationale
    should not be used to brand other noncharacters "illegal".
    > Since Encode doesn't know anything about the context, it has to assume
    > the worst case. To do otherwise is to leave users unknowingly open to
    > attack.
    I appreciate your caution. But it's a judgement call as to how paranoid
    you should be, when you may cause unexpected errors in valid programs.
    "has to assume the worst case" is an extreme philosophy.
    > There was no biasing involved. I've been trying to root out all
    > instances of calling these "illegal" in the 5.13.X series.
    Great--clear diagnostics and documentation will make these issues less
    trouble for everyone.

    Andrew
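
The two spellings discussed in this message can be compared directly on the input EF B7 93. This is a minimal sketch, assuming the Encode module is installed; `try_decode` is a hypothetical helper written for this illustration. Whether the strict spelling accepts or rejects the noncharacter depends on the Encode version in use, so the script reports what it observes rather than promising a result.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: attempt to decode $bytes with the named encoding
# and describe the outcome. FB_CROAK turns malformed input into an
# exception; LEAVE_SRC keeps Encode from modifying the source buffer.
sub try_decode {
    my ($encoding, $bytes) = @_;
    my $string = eval {
        Encode::decode($encoding, $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $string
        ? sprintf('decoded to U+%04X', ord $string)
        : "rejected: $@";
}

my $input = "\xEF\xB7\x93";    # UTF-8 byte sequence for U+FDD3

# 'UTF-8' is the strict spelling, 'utf8' the lax one.
for my $encoding ('UTF-8', 'utf8') {
    my $outcome = try_decode($encoding, $input);
    $outcome =~ s/\s+\z//;     # trim the newline carried by die messages
    print "$encoding: $outcome\n";
}
```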
  • Aristotle Pagaltzis at Jan 31, 2011 at 6:34 am

    * Andrew Pimlott [2011-01-30 20:10]:
    > That's a fair point, but consider the other side: I write
    > code that is correct according to Unicode and my application's
    > semantics. One day, my application fails because Encode
    > surprisingly considers some valid input illegal. It's
    > a judgement call, and don't think the best default policy
    > is obvious. A warning is a reasonable compromise, IMO.
    I don’t think compromises are desirable here. And programs that
    do something useful invariably shuffle data through both internal
    and external interfaces, so it must be possible to affect each
    interface individually, rather than via some action-at-a-distance
    quasi-/global setting like toggling warnings that affects them
    all. It should be possible to specifically request strict or lax
    treatment of input per interface (eg. by parametrising I/O layers
    differently).

    And that is already the plan for the future. (The :utf8 layer
    will be safe by default and there’ll be ways to request decoding
    that differ.) Other inconsistencies with insufficient checking
    (or non-checking!) of input such as with :encoding(UTF-8) should
    get cleared up over time. That Perl isn’t very clean in this area
    is known. Karl Williamson has been putting tremendous energy into
    scrubbing it all down for the last year or so.

    For things like `chr` the situation is more difficult since perl
    can’t tell which operation belongs to what interface. The warning
    we have is probably the only feasible solution.

    Regards,
    --
    Aristotle Pagaltzis // <http://plasmasturm.org/>
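
The idea of requesting strict or lax treatment per interface, rather than through a global switch, might look like the following. This is a minimal sketch and not how perl's I/O layers or Encode implement it: `decode_input` and `is_noncharacter` are hypothetical helpers, and the policy names 'strict' and 'lax' are assumptions made for this illustration. The noncharacter check is done in application code after a lax decode, so the result does not depend on how a particular Encode version treats these code points.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true for the 66 Unicode noncharacters.
sub is_noncharacter {
    my ($cp) = @_;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return 1 if $cp <= 0x10FFFF && ($cp & 0xFFFE) == 0xFFFE;
    return 0;
}

# Hypothetical helper: decode $bytes under a policy chosen by the caller.
#   'lax'    accepts noncharacters (e.g. a file the program wrote itself)
#   'strict' dies on noncharacters (e.g. data arriving from the network)
sub decode_input {
    my ($bytes, $policy) = @_;
    my $string = Encode::decode('utf8', $bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
    if ($policy eq 'strict') {
        for my $char (split //, $string) {
            my $cp = ord $char;
            die sprintf("noncharacter U+%04X in strict input\n", $cp)
                if is_noncharacter($cp);
        }
    }
    return $string;
}

my $bytes    = "\xEF\xB7\x93";    # UTF-8 byte sequence for U+FDD3
my $internal = decode_input($bytes, 'lax');
my $external = eval { decode_input($bytes, 'strict') };

print "lax:    ", (defined $internal ? "accepted\n" : "rejected\n");
print "strict: ", (defined $external ? "accepted\n" : "rejected: $@");
```

Each call site states its own policy, so an internal interface and an external one can coexist in the same program without either inheriting the other's setting.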

Discussion Overview
group: perl5-porters
category: perl
posted: Jan 13, '11 at 9:58p
active: Jan 31, '11 at 6:34a
posts: 5
users: 4
website: perl.org