Perl has never handled multi-byte locales, including utf8 ones. But it
appears that more and more locales come as utf8 variants, such as
ZA.utf8. I did some research on the standards behind them, and it
appears that because of various objections, including that they weren't
general enough, nothing was ever fully approved.

It's a lot of work to handle multi-byte locales in general, but Perl
already knows how to handle Unicode utf8. This leads to my proposal: If
under "use locale", a locale name ends in '.utf8', then Perl treats it
for purposes of cytpe-only as regular Unicode. We would not actually
inspect the locale's rules, but use the Unicode ones, as if the locale
were a properly specified and implemented Unicode locale.

For purposes of other things that Perl does with locale, such as the
decimal point separator, Perl would use the locale rules, just as currently.
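
A minimal sketch of what this would look like to a user (the behavior is
proposed, not current; the locale name is just an example):

    use POSIX qw(setlocale LC_ALL);
    use locale;

    setlocale(LC_ALL, "fr_FR.utf8");       # name ends in '.utf8'

    my $name = "L\x{e9}on";                # e-acute is a single character
    print "word\n" if $name =~ /^\w+$/;    # ctype: Unicode rules would apply
    print uc($name), "\n";                 # casing: Unicode rules would apply
    printf "%.2f\n", 3.14;                 # LC_NUMERIC still honored: "3,14"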

The advantages to a user are that Perl would start to accept this
single, common multi-byte locale encoding, and they would get correct results.

Since we don't currently handle utf8 locales, I don't know of any downsides
for the user.


  • Leon Timmermans at Jun 22, 2011 at 11:06 pm

    On Wed, Jun 22, 2011 at 11:27 PM, Karl Williamson wrote:
    Perl has never handled multi-byte locales, including utf8 ones.  But it
    appears that more and more locales come as utf8 variants, such as ZA.utf8. I
    did some research on the standards behind them, and it appears that because
    of various objections, including that they weren't general enough, nothing
    was ever fully approved.

    It's a lot of work to handle multi-byte locales in general, but Perl already
    knows how to handle Unicode utf8.  This leads to my proposal: If under "use
    locale", a locale name ends in '.utf8', then Perl treats it for purposes of
    cytpe-only as regular Unicode.  We would not actually inspect the locale's
    rules, but use the Unicode ones, as if the locale were a properly specified
    and implemented Unicode locale.

    For purposes of other things that Perl does with locale, such as the decimal
    point separator, Perl would use the locale rules, just as currently.

    The advantages to a user are that Perl would start to accept this single,
    common multi-byte locale encoding, and they would get correct results.

    Since we don't currently handle utf8 locales, I don't know of any downsides
    for the user.
    This actually sounds like something resembling sanity, something
    uncommon in both locales and unicode. I like it.

    Leon
  • Zefram at Jun 26, 2011 at 8:33 am

    Karl Williamson wrote:
    It's a lot of work to handle multi-byte locales in general, but Perl
    already knows how to handle Unicode utf8. This leads to my proposal: If
    under "use locale", a locale name ends in '.utf8', then Perl treats it
    for ctype purposes only, as regular Unicode.
    This sounds wrong: it'll be a source of double-encoding bugs.
    Locale-encoded text input will, for a UTF-8 locale, be a sequence
    of octets obeying UTF-8 syntactic rules. If you treat those octets
    as Unicode characters, using Perl's aliasing of octets to characters
    U+00 to U+ff, then they'll look like very strange character sequences
    (with lots of C1 controls), on which case folding (for example) won't
    give locale-correct results. Outputting Unicode text will often not
    generate correct locale-encoded text output.

    We should discourage the use of locale-encoded strings within Perl space.
    We should encourage decoding on input, encoding on output, and using
    native Unicode representation in the middle. To this end, there should
    be a PerlIO layer :locale, which {de,en}codes according to the locale's
    preferred encoding. The locale's encoding may perfectly well be UTF-8,
    and in *this* context we can handle it in an entirely regular manner,
    on a par with ISO-8859-*.

    -zefram
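
    A sketch of that discipline using only core modules, pending such a
    layer (I18N::Langinfo's CODESET item reports the locale's charset):

        use POSIX qw(setlocale LC_CTYPE);
        use I18N::Langinfo qw(langinfo CODESET);

        setlocale(LC_CTYPE, "");                # adopt the user's locale
        my $codeset = langinfo(CODESET);        # e.g. "UTF-8" or "ISO-8859-1"
        binmode STDIN,  ":encoding($codeset)";  # decode on input
        binmode STDOUT, ":encoding($codeset)";  # encode on output
        # In between, strings are native Unicode characters.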
  • Leon Timmermans at Jun 26, 2011 at 11:17 am

    On Sun, Jun 26, 2011 at 10:33 AM, Zefram wrote:
    We should discourage the use of locale-encoded strings within Perl space.
    We should encourage decoding on input, encoding on output, and using
    native Unicode representation in the middle.  To this end, there should
    be a PerlIO layer :locale, which {de,en}codes according to the locale's
    preferred encoding.  The locale's encoding may perfectly well be UTF-8,
    and in *this* context we can handle it in an entirely regular manner,
    on a par with ISO-8859-*.
    There is such a module on CPAN, but it's currently broken by design. I
    think I just fixed that in my repo, though that involved a complete
    rewrite. I think this module may be a good candidate for core in 5.16.

    Leon
  • Rafael Garcia-Suarez at Jun 26, 2011 at 1:11 pm

    On 26 June 2011 13:17, Leon Timmermans wrote:
    On Sun, Jun 26, 2011 at 10:33 AM, Zefram wrote:
    We should discourage the use of locale-encoded strings within Perl space.
    We should encourage decoding on input, encoding on output, and using
    native Unicode representation in the middle.  To this end, there should
    be a PerlIO layer :locale, which {de,en}codes according to the locale's
    preferred encoding.  The locale's encoding may perfectly well be UTF-8,
    and in *this* context we can handle it in an entirely regular manner,
    on a par with ISO-8859-*.
    There is such a module on CPAN, but it's currently broken by design. I
    think I just fixed that in my repo, though that involved a complete
    rewrite. I think this module may be a good candidate for core in 5.16.
    I just uploaded PerlIO::locale 0.07 with Leon's changes to CPAN.
  • Gisle Aas at Jun 26, 2011 at 8:32 pm

    On Jun 26, 2011, at 15:10, Rafael Garcia-Suarez wrote:
    On 26 June 2011 13:17, Leon Timmermans wrote:
    On Sun, Jun 26, 2011 at 10:33 AM, Zefram wrote:
    We should discourage the use of locale-encoded strings within Perl space.
    We should encourage decoding on input, encoding on output, and using
    native Unicode representation in the middle. To this end, there should
    be a PerlIO layer :locale, which {de,en}codes according to the locale's
    preferred encoding. The locale's encoding may perfectly well be UTF-8,
    and in *this* context we can handle it in an entirely regular manner,
    on a par with ISO-8859-*.
    There is such a module on CPAN, but it's currently broken by design. I
    think I just fixed that in my repo, though that involved a complete
    rewrite. I think this module may be a good candidate for core in 5.16.
    I just uploaded PerlIO::locale 0.07 with Leon's changes to CPAN.
    My attempt at this is the Encode::Locale module[1]. It also tries to handle
    the fact that on Windows the console "locale" might be different, and that
    the "locale" to use for file names has divergent rules.

    That module has also accumulated some hacks to deal with mappings where there
    isn't a straight correspondence between what langinfo() returns and the encoding
    names Encode knows about.

    --Gisle


    [1] http://search.cpan.org/dist/Encode-Locale/lib/Encode/Locale.pm
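
    A sketch of how those distinctions surface in Encode::Locale's
    interface, per its documentation (the file name is an example):

        use Encode::Locale;   # registers "locale", "locale_fs",
        use Encode;           # "console_in" and "console_out"

        # Filesystem names may need a different encoding than stream text:
        my $fn_octets = Encode::encode("locale_fs", "caf\x{e9}.txt");
        open my $fh, "<", $fn_octets or warn "open: $!";

        # Console output on Windows may differ from the locale encoding:
        binmode STDOUT, ":encoding(console_out)";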
  • Karl Williamson at Jun 26, 2011 at 6:46 pm

    On 06/26/2011 05:17 AM, Leon Timmermans wrote:
    On Sun, Jun 26, 2011 at 10:33 AM, Zefram wrote:
    We should discourage the use of locale-encoded strings within Perl space.
    We should encourage decoding on input, encoding on output, and using
    native Unicode representation in the middle. To this end, there should
    be a PerlIO layer :locale, which {de,en}codes according to the locale's
    preferred encoding. The locale's encoding may perfectly well be UTF-8,
    and in *this* context we can handle it in an entirely regular manner,
    on a par with ISO-8859-*.
    There is such a module on CPAN, but it's currently broken by design. I
    think I just fixed that in my repo, though that involved a complete
    rewrite. I think this module may be a good candidate for core in 5.16.
    +1 to this, though the documentation needs beefing up. I was left
    wondering how good the locale guessing is, what happens if it guesses
    wrong, and whether it always gives a warning or failure in such
    circumstances. I also wonder about tainting issues. And there should
    be a way to load it automatically.
  • Karl Williamson at Jun 26, 2011 at 7:37 pm

    On 06/26/2011 02:33 AM, Zefram wrote:
    Karl Williamson wrote:
    It's a lot of work to handle multi-byte locales in general, but Perl
    already knows how to handle Unicode utf8. This leads to my proposal: If
    under "use locale", a locale name ends in '.utf8', then Perl treats it
    for ctype purposes only, as regular Unicode.
    This sounds wrong: it'll be a source of double-encoding bugs.
    Locale-encoded text input will, for a UTF-8 locale, be a sequence
    of octets obeying UTF-8 syntactic rules. If you treat those octets
    as Unicode characters, using Perl's aliasing of octets to characters
    U+00 to U+ff, then they'll look like very strange character sequences
    (with lots of C1 controls), on which case folding (for example) won't
    give locale-correct results. Outputting Unicode text will often not
    generate correct locale-encoded text output.

    We should discourage the use of locale-encoded strings within Perl space.
    We should encourage decoding on input, encoding on output, and using
    native Unicode representation in the middle. To this end, there should
    be a PerlIO layer :locale, which {de,en}codes according to the locale's
    preferred encoding. The locale's encoding may perfectly well be UTF-8,
    and in *this* context we can handle it in an entirely regular manner,
    on a par with ISO-8859-*.

    -zefram
    I think you're misunderstanding my proposal, or I don't grok what you're
    saying. People still use locales to get, e.g., the proper date format
    for their location. But they can't currently do that if the locale is
    utf8, because regex matching and casing don't work well with those. I
    was proposing something that would fix that (and not have the downsides
    that you think it does), but rather than push that for now, a much
    easier-to-implement proposal that would work for most people would be to
    modify the locale pragma to take a single parameter, something like either

    use locale "NO_CTYPE";

    or

    use locale "utf8";

    in which the parameter indicates that the user promises that they won't
    use a non-utf8 locale, and so Perl can ignore CTYPE for locale purposes,
    and in fact that strings can be assumed to be encoded as Unicode
    characters. That would mean that a regex compiled under such a pragma
    would automatically have /u instead of /l. Casing would also assume
    Unicode characters.

    The consequence is that if the user used a non-UTF-8 locale or switched
    at run-time to such, Perl wouldn't notice in regards to matching and
    casing. I don't think that's a big loss.

    What I was originally proposing would work well with the :locale I/O
    layer. This restricts it, but would work for most practical situations,
    and the original proposal could be implemented later, if desired.
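
    A sketch of how that pragma might behave (the parameter name and the
    semantics are proposed, not implemented):

        use locale "NO_CTYPE";    # hypothetical parameter

        my $re = qr/\w+/;                 # would compile with /u, not /l
        my $up = uc "r\x{e9}sum\x{e9}";   # Unicode casing, not locale casing

        # LC_TIME, LC_NUMERIC, etc. would still follow the locale:
        use POSIX qw(strftime);
        print strftime("%A %d %B %Y", localtime), "\n";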
  • Zefram at Jun 26, 2011 at 10:32 pm

    Karl Williamson wrote:
    People still use locales to get, e.g., the proper date format,
    for their location. But they can't currently do that if the locale is
    utf8 because regex matching and casing don't work well with those.
    Are you saying that LC_TIME et al aspects of locales don't work if the
    locale's character set can't be handled? If so, I support making LC_TIME
    et al work independently of character encoding.
    in which the parameter indicates that the user promises that they won't
    use a non-utf8 locale, and so Perl can ignore CTYPE for locale purposes,
    and in fact that strings can be assumed to be encoded as Unicode
    characters.
    In this you appear to be restating the fundamental brokenness that I
    initially perceived. If the locale prefers UTF-8 encoding, that means
    we're liable to see UTF-8-encoded text in the input. For example,
    the orange one's name, which includes an e-acute character (U+e9), will
    appear as the string "L\xc3\xa9on". Ideally we'd like the /l modifier in
    this situation to make "\xc3\xa9" match as a single alphabetic character,
    and match /\xc3\x89/ if we also apply the /i modifier. As I understand
    it, you are proposing that /l behave identically to /u, and thus that
    we treat the "L\xc3\xa9on" string as containing an A-tilde (U+c3) and
    copyright symbol (U+a9), which in character classes and case-insensitive
    matching will give a very different effect.

    The only locale charset for which we can ignore locale encoding is
    Latin-1. This is because Latin-1 encoding and decoding, as viewed
    by Perl, is the identity function (modulo encoding range errors).
    UTF-8 encoding and decoding are distinctly non-identity operations.
    What I was originally proposing would work well with the :locale I/O
    layer.
    Surely any particular behaviour for regexp character classes will be
    utterly broken by any change in the encoding regime that generates
    the strings on which it operates. /u works naturally with :locale, by
    virtue of having character strings always represented in native Unicode
    form where visible to the program. Any form of locale-encoded-string
    handling, such as /l, however, is fundamentally predicated on *not*
    decoding inputs that are expected to be locale-encoded.

    -zefram
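
    The hazard is demonstrable today without any locale machinery (a
    sketch; Encode is core):

        use Encode qw(decode);

        my $octets = "L\xc3\xa9on";              # UTF-8 octets, never decoded
        print "word\n" if $octets =~ /^\w+$/u;   # fails: \xa9 is taken as the
                                                 # copyright sign, not a letter
        my $chars = decode("UTF-8", $octets);    # "L\x{e9}on", the intended text
        print "word\n" if $chars =~ /^\w+$/u;    # matches: e-acute is a word char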
  • Karl Williamson at Jun 27, 2011 at 4:03 am

    On 06/26/2011 04:32 PM, Zefram wrote:
    Karl Williamson wrote:
    People still use locales to get, e.g., the proper date format,
    for their location. But they can't currently do that if the locale is
    utf8 because regex matching and casing don't work well with those.
    Are you saying that LC_TIME et al aspects of locales don't work if the
    locale's character set can't be handled? If so, I support making LC_TIME
    et al work independently of character encoding.
    I am essentially saying that, and that is what this proposal was meant
    to address.
    in which the parameter indicates that the user promises that they won't
    use a non-utf8 locale, and so Perl can ignore CTYPE for locale purposes,
    and in fact that strings can be assumed to be encoded as Unicode
    characters.
    In this you appear to be restating the fundamental brokenness that I
    initially perceived. If the locale prefers UTF-8 encoding, that means
    we're liable to see UTF-8-encoded text in the input. For example,
    the orange one's name, which includes an e-acute character (U+e9), will
    appear as the string "L\xc3\xa9on". Ideally we'd like the /l modifier in
    this situation to make "\xc3\xa9" match as a single alphabetic character,
    and match /\xc3\x89/ if we also apply the /i modifier. As I understand
    it, you are proposing that /l behave identically to /u, and thus that
    we treat the "L\xc3\xa9on" string as containing an A-tilde (U+c3) and
    copyright symbol (U+a9), which in character classes and case-insensitive
    matching will give a very different effect.

    The only locale charset for which we can ignore locale encoding is
    Latin-1. This is because Latin-1 encoding and decoding, as viewed
    by Perl, is the identity function (modulo encoding range errors).
    UTF-8 encoding and decoding are distinctly non-identity operations.
    What I was originally proposing would work well with the :locale I/O
    layer.
    Surely any particular behaviour for regexp character classes will be
    utterly broken by any change in the encoding regime that generates
    the strings on which it operates. /u works naturally with :locale, by
    virtue of having character strings always represented in native Unicode
    form where visible to the program. Any form of locale-encoded-string
    handling, such as /l, however, is fundamentally predicated on *not*
    decoding inputs that are expected to be locale-encoded.
    Currently, under locale, the user is warranting that the strings are
    correctly encoded in the specified locale. If the real encoding is
    Hebrew and we are told that it is Greek, the results are almost certain
    to be wrong. My proposal had nothing to do with input/output. "use
    locale" as far as documented has no current effect on that. What I was
    saying was that under utf8 locales, which are currently documented as
    not working, the regex engine and the casing functions would assume that
    their strings were properly Unicode-encoded. It's up to the user to get
    them that way. The :locale layer would be a convenient way to do this.
    So, sure, if the string is in utf8 but the utf8 flag is not set (or
    vice versa), the results will be wrong. The proposal is not an
    end-to-end solution where suddenly "use locale" takes on more meaning
    than it currently does. It is "if you are in a utf8 locale, and you've
    arranged things so that the strings are Unicode-encoded, then operations
    on them will work correctly", which is not the case currently.
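
    A sketch of that arrangement using the PerlIO::locale module mentioned
    upthread (the ctype semantics remain as proposed, not current; the file
    name is an example):

        use PerlIO::locale;   # provides the :locale layer
        use locale;

        open my $in, "<:locale", "data.txt" or die "open: $!";
        while (my $line = <$in>) {
            # $line is decoded per the locale's charset; under the
            # proposal, ctype operations on it would use Unicode rules.
            print "matched\n" if $line =~ /\w/;
        }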
  • Zefram at Jun 27, 2011 at 11:01 am

    Karl Williamson wrote:
    Currently, under locale, the user is warranting that the strings are
    correctly encoded in the specified locale. ...
    under utf8 locales, which are currently documented as
    not working, the regex engine and the casing functions would assume that
    their strings were properly Unicode-encoded.
    So you're proposing that the meaning of "use locale", with respect to the
    expected encoding of strings, be completely different for UTF-8 locales
    from what it is for other locales. I oppose this. If the programmer is
    working with strings in native Unicode form, ey should declare this with
    "use feature 'unicode_strings'" or equivalent, not with "use locale".

    -zefram
  • Karl Williamson at Jun 27, 2011 at 3:36 pm

    On 06/27/2011 05:01 AM, Zefram wrote:
    Karl Williamson wrote:
    Currently, under locale, the user is warranting that the strings are
    correctly encoded in the specified locale. ...
    under utf8 locales, which are currently documented as
    not working, the regex engine and the casing functions would assume that
    their strings were properly Unicode-encoded.
    So you're proposing that the meaning of "use locale", with respect to the
    expected encoding of strings, be completely different for UTF-8 locales
    from what it is for other locales. I oppose this. If the programmer is
    working with strings in native Unicode form, ey should declare this with
    "use feature 'unicode_strings'" or equivalent, not with "use locale".

    -zefram
    It appears to me that you've got it completely backwards. In all cases,
    the programmer is warranting that the string is correctly encoded in the
    specified locale. It's just that UTF-8 locales ARE in native Unicode
    form. The expected encoding for a UTF-8 locale is its encoding, just as the
    expected encoding for any other locale is its encoding. I think you're
    throwing red herrings at this proposal; I don't know how to explain it
    more clearly.

    Right now the programmer has a choice: 1) to manipulate strings properly
    with those locales by using the unicode_strings feature; or 2) to get
    proper LC_TIME, etc. handling by using locale. The programmer cannot
    currently get both.

    To get them the ability to do both, simplest to implement is the :locale
    layer which converts all I/O so that internally things are native
    Unicode, and "use locale 'NO_CTYPE'" which divorces LC_CTYPE from the
    rest of locale handling, so that those remain, but native Unicode is
    used for string operations.
  • Zefram at Jun 27, 2011 at 4:07 pm

    Karl Williamson wrote:
    In all cases,
    the programmer is warranting that the string is correctly encoded in the
    specified locale. It's just that UTF-8 locales ARE in native Unicode
    form.
    This sounds wrong. Perl's native format for Unicode characters is
    the *decoded* form. Any encoded form (except Latin-1 where the encoding
    is identity) is not native. (I'm referring to Perl-visible encoding,
    of course, not the internal encoding.)
    I don't know how to explain it more clearly.
    Try specific examples. Taking the example I used in my previous message,
    suppose Acme inputs his name in the locale-appropriate manner, in
    an ISO-646-FR locale and in a UTF-8 locale. Perl will read the name
    from STDIN and store the input without altering it, thus yielding a
    locale-encoded string. Then consider what character semantics a /u
    regexp would assign to it.

                             ISO-646-FR       UTF-8
                             ----------       -----
    octets on STDIN          4c 7b 6f 6e      4c c3 a9 6f 6e
    Perl-visible string      "L{on"           "L\xc3\xa9on"
    how /u interprets it     open brace       A-tilde, copyright

    In both cases, Perl's native representation of the name is "L\xe9on",
    which is different from the locale-encoded representation. In both cases,
    /u perceives a character or characters other than the desired e-acute.

    As I understand it, you are claiming that in the UTF-8 case /u will give
    the correct character semantics to a locale-encoded string. I claim that
    it will be just as incorrect as it would be with an ISO-646-FR locale.

    If I have misunderstood your proposal, please explain with a worked
    example.

    -zefram
  • Karl Williamson at Jun 28, 2011 at 12:09 am

    On 06/27/2011 10:07 AM, Zefram wrote:
    Karl Williamson wrote:
    In all cases,
    the programmer is warranting that the string is correctly encoded in the
    specified locale. It's just that UTF-8 locales ARE in native Unicode
    form.
    This sounds wrong. Perl's native format for Unicode characters is
    the *decoded* form. Any encoded form (except Latin-1 where the encoding
    is identity) is not native. (I'm referring to Perl-visible encoding,
    of course, not the internal encoding.)
    I don't know how to explain it more clearly.
    Try specific examples. Taking the example I used in my previous message,
    suppose Acme inputs his name in the locale-appropriate manner, in
    an ISO-646-FR locale and in a UTF-8 locale. Perl will read the name
    from STDIN and store the input without altering it, thus yielding a
    locale-encoded string. Then consider what character semantics a /u
    regexp would assign to it.

                             ISO-646-FR       UTF-8
                             ----------       -----
    octets on STDIN          4c 7b 6f 6e      4c c3 a9 6f 6e
    Perl-visible string      "L{on"           "L\xc3\xa9on"
    how /u interprets it     open brace       A-tilde, copyright

    In both cases, Perl's native representation of the name is "L\xe9on",
    which is different from the locale-encoded representation. In both cases,
    /u perceives a character or characters other than the desired e-acute.

    As I understand it, you are claiming that in the UTF-8 case /u will give
    the correct character semantics to a locale-encoded string. I claim that
    it will be just as incorrect as it would be with an ISO-646-FR locale.

    If I have misunderstood your proposal, please explain with a worked
    example.

    -zefram
    Perhaps you are forgetting about the -C option.

    If perl is called with the -C option, it will take the appropriate
    action based on the user's locale. It distinguishes between utf8 and
    non-utf8 locales, adding a UTF8 layer automatically if called for.

    I tested your example under a utf8 locale. The string that is read in
    is in fact the 6 octets: \x4c\xc3\xa9\x6f\x6e\x0a (the test had a
    trailing \n), but since it is marked as encoded in utf8 format, it is
    interpreted correctly as the 5 characters \x4c\xe9\x6f\x6e\x0a.
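
    For anyone wanting to reproduce the test, something along these lines
    (the script and locale names are examples; see perlrun for -C):

        # Run as:  printf 'L\303\251on\n' | LC_ALL=fr_FR.utf8 perl -CSL test.pl
        my $line = <STDIN>;          # -CSL added :utf8, so this is decoded
        chomp $line;
        printf "%d characters\n", length $line;   # 4 characters, from 5 octets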

    In a non-utf8 locale, the -C option should read in the octets you
    mentioned for the non-utf8 case, and we can hope that the platform's
    locale software interprets it correctly.

    My original proposal allows the "-C + use locale" combination to work
    correctly for utf8 locales. Right now it doesn't, because /l doesn't
    work correctly for them.

    The "use locale 'NO_CTYPE'" proposal allows a :locale layer to work and
    still get LC_TIME, etc.
  • Zefram at Jun 28, 2011 at 12:00 pm

    Karl Williamson wrote:
    Perhaps you are forgetting about the -C option.
    Ah, this is what I was missing. You didn't mention -C before.
    If perl is called with the -C option, it will take the appropriate
    action based on the user's locale.
    Not in the general case. The "appropriate action based on the user's
    locale" would involve the :locale layer. -C doesn't do that: the only
    layer it ever applies is the :utf8 layer. Its locale awareness consists
    of making the addition of the :utf8 layer conditional on whether the
    locale is a UTF-8 locale.

    So in this respect Perl is already treating locales inconsistently,
    specifically treating UTF-8 locales in a qualitatively different way
    from other locales. Your proposal to treat UTF-8 locales differently
    in regexps now makes some sense. Your model is that all I/O must take
    place through a :utf8-or-identity layer mediated by -CL. For non-UTF-8
    locales strings will internally be locale-encoded, but for UTF-8 locales
    strings will internally be decoded (native Unicode). In this situation it
    would be sane for regexps to operate on the locale charset for non-UTF-8
    locales but on native Unicode for UTF-8 locales.

    I have two areas of concern about this scheme. Firstly about the
    practicality of the -C option, and secondly about what happens when -C
    is not used.

    -C does both too much and too little. It does too much because, where it
    is applied to streams, its effect is global and so applies whether the
    code opening/using a stream is expecting it or not. This would be fine
    if all streams ever opened were for text, and text were always encoded
    according to the prevailing standard of the host system. But not only
    are these conditions not true, they're not even close to being true.
    Unix involves quite a lot of binary files. A default :utf8 layer is only
    OK if the whole program, including all loaded modules, is expecting it,
    either by virtue of doing only text I/O or by taking explicit action to
    squash the default layer for binary I/O.
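
    Squashing the default layer for a binary stream looks like this (a
    sketch; the file name is an example):

        open my $fh, "<", "image.png" or die "open: $!";
        binmode $fh, ":raw";   # strips any default text layers, e.g. :utf8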

    Meanwhile, -C does too little because it only affects streams and @ARGV.
    To maintain a consistent picture, any I/O by other routes has to apply
    a matching layer. That's not too difficult when the layer required is
    just UTF-8 encoding, but in the -CL case it needs to be UTF-8 encoding
    or nothing, conditional on the locale. Do we even have a name for the
    UTF-8-encode-or-nothing transform? It's looking rather as though a
    program using -C, and especially -CL, needs to be very aware of that
    option; it's not something you can freely turn on for a transparent
    effect.
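
    Done by hand, the UTF-8-encode-or-nothing transform would look
    something like this (a sketch, using the core I18N::Langinfo module):

        use I18N::Langinfo qw(langinfo CODESET);
        use Encode qw(encode);

        my $payload = "caf\x{e9}";
        my $octets  = langinfo(CODESET) =~ /\AUTF-?8\z/i
                    ? encode("UTF-8", $payload)   # UTF-8 encode ...
                    : $payload;                   # ... or nothing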

    If -CL or a similarly favoured option were to apply a default :locale I/O
    layer, rather than :utf8-or-nothing, this would ameliorate the problem
    with non-stream routes of I/O. It would mean that the {en,de}coding that
    the program needs to apply to other kinds of I/O is consistently locale
    charset {en,de}coding. (For which there ought to be {en,de}code_locale()
    functions, and may well already be.) But it wouldn't help with the
    general problem of needing to know whether there is a default text
    transformation on I/O, nor of having to selectively disable it.
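
    Encode::Locale, mentioned earlier in the thread, provides essentially
    this pair via its registered "locale" encoding name:

        use Encode::Locale;
        use Encode qw(encode decode);

        my $octets = "caf\xc3\xa9";              # as read under a UTF-8 locale
        my $chars  = decode(locale => $octets);  # decode per the locale charset
        my $bytes  = encode(locale => $chars);   # encode per the locale charset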

    So all in all -C, and especially -CL, seems lacking in the usability
    department. This raises the obvious question of whether we should be
    encouraging its use. The proposal for /l to conditionally behave like
    /u would constitute an encouragement to use -CL.

    And this brings us to the fundamental issue that -C is an option.
    Not only is it not the default, it's not necessary or even particularly
    useful in dealing with either UTF-8 or locales. It is normal for
    Unicode-processing, locale-aware programs to not use -C (or especially
    -CL). The correctness of the proposed /l behaviour is predicated
    entirely on the use of -CL, and so enshrining that /l behaviour would
    force the use of this complicated mechanism where it might reasonably
    be otherwise eschewed. When the maybe-locale-encoded mechanism is not
    being used, the /l behaviour would be wrong and cause subtle bugs.

    As I've previously said, I think we should encourage the use of :locale
    and /u, which would make /l irrelevant. The question about the proper
    behaviour of /l is relevant only for programs that do not do the :locale
    decode-on-input encode-on-output dance. Do these programs do the -CL
    conditional-default-{en,de}coding, or do they not {en,de}code their I/O
    at all? I don't have numbers for this, but I'm inclined to expect that
    they'd favour the traditional consistently-locale-encoded arrangement.

    -zefram
  • Karl Williamson at Dec 10, 2013 at 9:27 pm
    I haven't given up on this proposal. To refresh your memory, the
    proposal is for Perl to check if the current locale is a UTF-8 one, and
    if so, treat strings for LC_CTYPE purposes as strings are normally treated
    in Perl, without looking at the actual locale data. This works because
    UTF-8 is an underlying Perl string data type. The original thread was at
    http://markmail.org/message/q4vorzd2xcxbm43y

    I reiterated this proposal in the discussion of
    https://rt.perl.org/rt3/Ticket/Display.html?id=117787

    (which this would fix) and got no responses. I have a branch which has
    it mostly implemented, but the bitrot needs to be cleaned up.

    I have a further proposal: to use wcsxfrm(), on machines that have it,
    for LC_COLLATE. Unicode publishes high-quality POSIX
    locale definitions, and this would use them to avoid the need and
    slowdown from using Unicode::Collate for many cases.
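
    For comparison, the pure-Perl path whose slowdown this would avoid
    (Unicode::Collate::Locale is the locale-aware front end to
    Unicode::Collate):

        use Unicode::Collate::Locale;

        my $collator = Unicode::Collate::Locale->new(locale => "fr");
        my @sorted   = $collator->sort("\x{e9}clair", "eau", "escargot");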

    To summarize the proposal.
    When Perl does a locale-sensitive operation within the scope of 'use
    locale' it would check if the locale is a UTF-8 one or not. If not, it
    would behave as it currently does. Under a UTF-8 locale, for LC_CTYPE
    operations, it would behave as if it weren't under 'use locale'. Thus
    the LC_CTYPE operations within a UTF-8 locale are indistinguishable from
    non-locale operations. (This means there's not much to implement, as we
    are just using existing code paths for the most part.) For LC_COLLATE
    operations under UTF-8 locales, the wide character transform would be
    used on platforms where it is available. This is slower than the
    existing code, but gives much better results; currently things just
    don't work at all under these locales, as Tom Christiansen has lamented.

    To be clear, Perl has never said it supports non-8bit locales, so this
    is an enhancement. But on Linux, at least, these days most of our users
    seem to be using these unsupported locales, so it seems right that we
    should support them, especially as the implementation cost is not high.
  • Tom Christiansen at Dec 10, 2013 at 10:48 pm
    Karl Williamson wrote things
    that sound reasonable.

    I'd like to try it out, see what it costs, breaks, etc.

    thanks,

    --tom
  • Ricardo Signes at Dec 24, 2013 at 3:14 am
    * Tom Christiansen [2013-12-10T17:48:34]
    Karl Williamson <public@khwilliamson.com> wrote things
    that sound reasonable.

    I'd like to try it out, see what it costs, breaks, etc.
    Same here, but I'm hopeful.

    --
    rjbs
  • Karl Williamson at Jan 28, 2014 at 6:16 am

    On 12/23/2013 08:14 PM, Ricardo Signes wrote:
    * Tom Christiansen [2013-12-10T17:48:34]
    Karl Williamson <public@khwilliamson.com> wrote things
    that sound reasonable.

    I'd like to try it out, see what it costs, breaks, etc.
    Same here, but I'm hopeful.
    The LC_CTYPE changes are now in blead,
    31f05a37c4e9c37a7263491f2fc0237d836e1a80

    The LC_COLLATE changes, where the real performance penalty will be, are
    probably going to have to wait until 5.21.
  • Paul Marquess at Dec 11, 2013 at 12:26 pm
    If I'm reading the proposal correctly it covers the contents of files only
    and does not include the encoding applied to filenames on a filesystem?

    Paul
  • Karl Williamson at Dec 11, 2013 at 3:55 pm

    On 12/11/2013 05:26 AM, Paul Marquess wrote:
    If I'm reading the proposal correctly it covers the contents of files only
    and does not include the encoding applied to filenames on a filesystem?

    Paul
    Correct. I believe the filesystem issues are a much more intractable
    problem. Ideas welcome.
  • Dr.Ruud at Dec 11, 2013 at 5:45 pm

    On 2013-12-11 16:55, Karl Williamson wrote:
    On 12/11/2013 05:26 AM, Paul Marquess wrote:

    If I'm reading the proposal correctly it covers the contents of files
    only
    and does not include the encoding applied to filenames on a filesystem?
    Correct. I believe the filesystem issues are a much more intractable
    problem. Ideas welcome.
    Use a module per variant. There are also length restrictions, reserved
    characters, reserved names, etc.

    Many details:
    http://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations
    http://nedbatchelder.com/blog/201106/filenames_with_accents.html
    http://www.j3e.de/linux/convmv/man/
    http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

    --
    Ruud
  • Leon Timmermans at Dec 11, 2013 at 7:52 pm

    On Wed, Dec 11, 2013 at 4:55 PM, Karl Williamson wrote:

    Correct. I believe the filesystem issues are a much more intractable
    problem. Ideas welcome.
    Most importantly, different operating systems have different ideas on
    octets versus characters. Unix is generally completely byte based, though
    OSX does some strange non-standard normalization. Windows does UTF-16. Try
    making sense of that!

    Leon
  • Paul Marquess at Dec 11, 2013 at 11:45 pm

    From: Karl Williamson
    On 12/11/2013 05:26 AM, Paul Marquess wrote:
    If I'm reading the proposal correctly it covers the contents of files
    only and does not include the encoding applied to filenames on a
    filesystem?
    Paul
    Correct. I believe the filesystem issues are a much more intractable
    problem. Ideas welcome.

    Yep, unfortunately that's my understanding as well.

    Makes it very difficult to automagically do the right thing when you need to
    know if the filesystem uses utf8 (or any other encoding for that matter).

    Sorry for sidetracking the thread.

    Paul
