Locales: An Analysis

Chip Salzenberg
Feb 4, 2000 at 4:35 am
OK, I think I get 'use byte' vs. 'use utf8' vs. neither. But, there
are still locales. To illustrate the issues, let me get concrete here
for a minute. (Please, no construction jokes.)

Consider the apparently straightforward:
@a = sort @b;

AFAIK, in the absence of pragmas, Perl 5.6 will compare the elements
of @b character-by-character, regardless of whether the characters in
question have one- or multi-byte representations. This behavior
should be well-defined even if some elements of @b are Unicode and
some aren't. (I'm glad that this is the default.)
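
A minimal sketch of that default, with illustrative data: comparison
is by code point, whether a character is stored in one byte or
several.

    use strict;
    use warnings;

    my @b = ("abc", "zebra", "\x{e9}tude");   # "\x{e9}" is e-acute, U+00E9
    my @a = sort @b;                          # character by character,
                                              # by code point, not by locale
    print "$_\n" for @a;    # abc, zebra, étude  (0xE9 sorts above 'z')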

Now, from the sublime to the ridiculous:

C locales define various character characteristics. But they are so
limited as to be very difficult to use. More to the point, they are
defined in terms of the C type 'char', so there can be no support for
multi-byte character encodings unless your 'char' type is larger than
one byte. Furthermore, C makes no provision for cross-locale
processing -- you can't meaningfully collate e.g. English and Hebrew.
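
For contrast, here is a sketch of the C-locale machinery Perl already
exposes (the locale name is an assumption; it must be installed on
your system):

    use strict;
    use warnings;
    use POSIX qw(setlocale LC_COLLATE);

    setlocale(LC_COLLATE, "en_US.ISO8859-1")   # assumed locale name
        or warn "locale not available\n";

    my @words  = ("zebra", "r\xE9sum\xE9", "apple");   # one-byte chars only
    my @sorted = do { use locale; sort @words };       # cmp obeys LC_COLLATE
    print "$_\n" for @sorted;   # é collates near 'e', not above 'z'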

Therefore, if C<sort> is told to use locales, it must consider all
strings to contain characters from the current (run-time) locale.

Now comes the key bit: It's entirely possible for characters, even
characters considered to be from the current locale, to be encoded in
a number of ways. In other words: STRING ENCODING AND CHARACTER SET
ARE ORTHOGONAL. So I propose that OPs' string encoding states and
character set states be orthogonal in a user-visible way.

(Granted, C locales are limited to one-byte characters, so anything
with ord() > 255 has no place in a locale-charset string. But Topaz
uses C++, and C++ locales apply not only to 'char's, but also to
'wchar_t's -- wide characters. AFAIK, there is no technical obstacle
to a C++ implementation's providing Unicode-compatible locales.
Besides, some intrepid user may want to use a non-Unicode character
set while obeying each string's current encoding.)

So: The _string_encoding_ state of each OP must be one of these:

0. the default -- follow each string's current encoding
1. "use byte" -- all strings are one-byte
2. "use utf8" -- all strings are UTF-8 (*not* necessarily Unicode!)

And the _character_set_ state of each OP must be one of these:

0. the default -- characters are Latin-1, UTF-8 is Unicode
1. "use locale" -- characters are $ENV{LANG} (set at runtime)

If you just want to stop here, then please consider the above as a
proposed specification for the interaction of UTF-8 and locales.
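
To make the orthogonality concrete, here is how the two states would
combine under this proposal (hypothetical code: these combined
semantics are the proposal itself, not shipped behavior):

    {
        use byte;       # proposed: encoding state 1, all strings one-byte
        use locale;     # charset state 1: characters come from $ENV{LANG}
        @a = sort @b;   # byte-wise comparison with locale collation
    }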

{{ NEW FEATURE ALERT }}

Seeing the above list of pragmas triggers my generalization reflex.
So, how about this:

0. C<no encoding> == the default
1. C<use encoding 'utf8'> == C<use utf8>
2. C<use encoding 'byte'> == C<use byte>

Combined with this:

0. C<no charset> == the default
1. C<use charset 'locale'> == C<use locale>

This interface would also provide a hook for any encodings we might
support in future:
use encoding 'byte2big';  == force two-byte big-endian characters,
                             without forcing their charset
or:
use encoding 'byte4big';  == force four-byte big-endian characters
use charset 'iso10646';   == force ISO 10646 (Unicode superset)

So, what do you think?
--
Chip Salzenberg - a.k.a. - <chip@valinux.com>
"He's Mr. Big of 'Big And Tall' fame." // MST3K

29 responses

  • Ilya Zakharevich at Feb 4, 2000 at 6:36 am

    On Thu, Feb 03, 2000 at 08:35:04PM -0800, Chip Salzenberg wrote:
    So: The _string_encoding_ state of each OP must be one of these:

    0. the default -- follow each string's current encoding
    1. "use byte" -- all strings are one-byte
    2. "use utf8" -- all strings are UTF-8 (*not* necessarily Unicode!)

    And the _character_set_ state of each OP must be one of these:

    0. the default -- characters are Latin-1, UTF-8 is Unicode
    1. "use locale" -- characters are $ENV{LANG} (set at runtime)
    Too complicated; very few advantages (if any).

    It was discussed already (and many times). The results as I remember
    them: locale-ness should be addressed during i/o operations (here i/o
    is understood in a wide sense). `use locale' (or better, its
    equivalent) should forget about C locales altogether; it should just
    be a hint for (default?) i/o conversions to Unicode.

    Now: please look around. What nice properties of your favorite locale
    will be lost by interpreting it as a hint to Unicode conversion?

    Ilya
  • Chip Salzenberg at Feb 4, 2000 at 8:12 am

    According to Ilya Zakharevich:
    On Thu, Feb 03, 2000 at 08:35:04PM -0800, Chip Salzenberg wrote:
    So: The _string_encoding_ state of each OP must be one of these:
    0. the default -- follow each string's current encoding
    1. "use byte" -- all strings are one-byte
    2. "use utf8" -- all strings are UTF-8 (*not* necessarily Unicode!)

    And the _character_set_ state of each OP must be one of these:
    0. the default -- characters are Latin-1, UTF-8 is Unicode
    1. "use locale" -- characters are $ENV{LANG} (set at runtime)
    Too complicated
    For users, or for us?
    very few advantages (if any).
    Granted there are not likely to be many users for anything but utf8
    and byte. Further arguments get into language design, which I've
    found not to be my strongest skill....
    locale-ness should be addressed during i/o operations (here i/o is
    understood in wide sense).
    Agreed that this sort of conversion will be vital; it's where
    Unicode::Map8 (or its equivalent) comes into play. But there
    are other issues....
    `use locale' (or better, its equivalent) should better forget about
    C locales. [...] Now: please look around. What nice properties of
    your favorite locale will be lost by interpreting it as a hint to
    Unicode conversion?
    We have to at least provide effective access to C locale features like
    collation. Surely _some_ people are using C locales for e.g. sorting
    words in Russian dictionary order. And of course those people won't
    like having a language feature pulled out from under them.
    --
    Chip Salzenberg - a.k.a. - <chip@valinux.com>
    "He's Mr. Big of 'Big And Tall' fame." // MST3K
  • Jarkko Hietaniemi at Feb 4, 2000 at 8:23 am

    `use locale' (or better, its equivalent) should better forget about
    C locales. [...] Now: please look around. What nice properties of
    your favorite locale will be lost by interpreting it as a hint to
    Unicode conversion?
    We have to at least provide effective access to C locale features like
    collation. Surely _some_ people are using C locales for e.g. sorting
    And numeric formatting, so that "3,1415927" is an approximation of pi.

    And that %[aAbB] of POSIX::strftime continue to say ma, måndag, Dez,
    décembre as appropriate. Or is that last one décembre...?
    words in Russian dictionary order. And of course those people won't
    And, yes, collation, too. Us Finns would get really mad if
    'ä' were sorted after 'a' and before 'b', for example...

    The locales are a major pain (I, if anyone, know that) but
    unfortunately they solve some problems adequately well.
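
    The behaviors named above, sketched with POSIX locales (the Finnish
    locale name is an assumption; it must be installed on the system):

        use strict;
        use warnings;
        use POSIX qw(setlocale strftime LC_ALL);

        setlocale(LC_ALL, "fi_FI.ISO8859-1")   # assumed locale name
            or warn "fi_FI not installed\n";
        {
            use locale;
            printf "%.7f\n", 3.1415927;   # "3,1415927": LC_NUMERIC decimal comma
            print strftime("%A %B", localtime), "\n";   # localized names
            my @s = sort "ab", "\xE4a", "ba";   # Finnish collates "ä" after "z",
            print "@s\n";                       # not between "a" and "b"
        }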

    The only clean way I see out of the locale mess is to replace them
    with a clean *Perl-specific* API: what I mean by this is that Perl
    distribution itself would contain the necessary locale definitions
    (collation, numeric and date formatting, character classes (from
    Unicode), et cetera), and the perl core would then use those
    definitions.
    like having a language feature pulled out from under them.
    --
    Chip Salzenberg - a.k.a. - <chip@valinux.com>
    "He's Mr. Big of 'Big And Tall' fame." // MST3K
    --
    $jhi++; # http://www.iki.fi/jhi/
    # There is this special biologist word we use for 'stable'.
    # It is 'dead'. -- Jack Cohen
  • Chip Salzenberg at Feb 4, 2000 at 8:34 am

    According to Jarkko Hietaniemi:
    The only clean way I see out of the locale mess is to replace them
    with a clean *Perl-specific* API ...
    I'll do you one better:

    We create Locales The Way They Should Have Been as a library with
    complete bindings for at least C, C++, Perl, Java, and (your favorite
    language here). I'm sure that some existing project has something we
    could build on. Perl can be the primary guinea pig.

    Then comes the good part: We release the library. It becomes a de
    facto standard throughout the software world. We thus eradicate the
    horror of ANSI locales from the collective memory of the net.

    "To dream the impossible dream...."

    PS: It's been done once before, with time zones.
    --
    Chip Salzenberg - a.k.a. - <chip@valinux.com>
    "He's Mr. Big of 'Big And Tall' fame." // MST3K
  • Jarkko Hietaniemi at Feb 4, 2000 at 8:40 am

    Chip Salzenberg writes:
    According to Jarkko Hietaniemi:
    The only clean way I see out of the locale mess is to replace them
    with a clean *Perl-specific* API ...
    I'll do you one better:
    I like being bested like this.
    We create Locales The Way They Should Have Been as a library with
    complete bindings for at least C, C++, Perl, Java, and (your favorite
    language here). I'm sure that some existing project has something we
    ..., Python, Tcl, at least.
    could build on. Perl can be the primary guinea pig.
    Then comes the good part: We release the library. It becomes a de
    facto standard throughout the software world. We thus eradicate the
    horror of ANSI locales from the collective memory of the net.
    "To dream the impossible dream...."
    I'm all for this, if somebody just gives me 30 extra hours for each day
    to work on it.
    PS: It's been done once before, with time zones.
    --
    $jhi++; # http://www.iki.fi/jhi/
    # There is this special biologist word we use for 'stable'.
    # It is 'dead'. -- Jack Cohen
  • Ilya Zakharevich at Feb 4, 2000 at 5:49 pm

    On Fri, Feb 04, 2000 at 10:22:58AM +0200, Jarkko Hietaniemi wrote:
    The only clean way I see out of the locale mess is to replace them
    with a clean *Perl-specific* API: what I mean by this is that Perl
    distribution itself would contain the necessary locale definitions
    (collation, numeric and date formatting, character classes (from
    Unicode), et cetera), and the perl core would then use those
    definitions.
    All we need is a mapping byte => Unicode. All the rest can be deduced
    by some C tests, right?

    Ilya
  • Jarkko Hietaniemi at Feb 4, 2000 at 5:54 pm

    Ilya Zakharevich writes:
    On Fri, Feb 04, 2000 at 10:22:58AM +0200, Jarkko Hietaniemi wrote:
    The only clean way I see out of the locale mess is to replace them
    with a clean *Perl-specific* API: what I mean by this is that Perl
    distribution itself would contain the necessary locale definitions
    (collation, numeric and date formatting, character classes (from
    Unicode), et cetera), and the perl core would then use those
    definitions.
    All we need is a mapping byte => Unicode. All the rest can be deduced
    by some C tests, right?
    The whole of Perl 5 is "deduced by some C tests"... :-) Please be more
    specific in your proposal(?).
    Ilya
    --
    $jhi++; # http://www.iki.fi/jhi/
    # There is this special biologist word we use for 'stable'.
    # It is 'dead'. -- Jack Cohen
  • Ilya Zakharevich at Feb 4, 2000 at 6:18 pm

    On Fri, Feb 04, 2000 at 07:53:46PM +0200, Jarkko Hietaniemi wrote:
    The only clean way I see out of the locale mess is to replace them
    with a clean *Perl-specific* API: what I mean by this is that Perl
    distribution itself would contain the necessary locale definitions
    (collation, numeric and date formatting, character classes (from
    Unicode), et cetera), and the perl core would then use those
    definitions.
    All we need is a mapping byte => Unicode. All the rest can be deduced
    by some C tests, right?
    The whole of Perl 5 is "deduced by some C tests"... :-) Please be more
    specific in your proposal(?).
    The stuff to do by hand is the tables I mentioned. Once you know
    them, you can convert any locale-specific rules to Unicode.

    Say, to collate 'foo1bar', where f, o, b, a, r are locale-representable
    Unicode chars and 1 is not: convert 'foo' to the locale, collate it,
    and do the same with 'bar'. Keep '1' AS IS, sorting above any
    locale-accessible stuff.

    Is it more clear now?
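
    A rough sketch of that scheme (everything here is hypothetical;
    %uni2byte stands for the inverse of the by-hand byte => Unicode
    table):

        use strict;
        use warnings;
        use POSIX qw(strxfrm);

        # Build a sort key: locale-representable runs collate with the
        # locale's own weights (via strxfrm); anything else sorts after
        # all locale-representable text, in code-point order.
        # (%$uni2byte, a Unicode-char => locale-byte map, is hypothetical.)
        sub collation_key {
            my ($str, $uni2byte) = @_;
            my $in  = join "", keys %$uni2byte;   # locale-representable chars
            my $key = "";
            for my $run ($str =~ /([\Q$in\E]+|[^\Q$in\E]+)/g) {
                if ($run =~ /^[\Q$in\E]/) {
                    my $bytes = join "", map { $uni2byte->{$_} } split //, $run;
                    $key .= "\x00" . strxfrm($bytes);
                }
                else {
                    $key .= "\x01" . join "", map { pack "N", ord } split //, $run;
                }
            }
            return $key;
        }

        # usage:
        # my @sorted = sort { collation_key($a, \%map) cmp collation_key($b, \%map) }
        #              @strings;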

    Ilya
  • Ilya Zakharevich at Feb 4, 2000 at 5:48 pm

    On Fri, Feb 04, 2000 at 12:12:25AM -0800, Chip Salzenberg wrote:
    So: The _string_encoding_ state of each OP must be one of these:
    0. the default -- follow each string's current encoding
    1. "use byte" -- all strings are one-byte
    2. "use utf8" -- all strings are UTF-8 (*not* necessarily Unicode!)

    And the _character_set_ state of each OP must be one of these:
    0. the default -- characters are Latin-1, UTF-8 is Unicode
    1. "use locale" -- characters are $ENV{LANG} (set at runtime)
    Too complicated
    For users, or for us?
    This is an interesting question indeed. For us: definitely. But for
    users? Maybe too... Anyone having ideas?

    Ilya
  • Larry Wall at Feb 4, 2000 at 6:33 pm
    Ilya Zakharevich writes:
    : On Fri, Feb 04, 2000 at 12:12:25AM -0800, Chip Salzenberg wrote:
    : > > > So: The _string_encoding_ state of each OP must be one of these:
    : > > > 0. the default -- follow each string's current encoding
    : > > > 1. "use byte" -- all strings are one-byte
    : > > > 2. "use utf8" -- all strings are UTF-8 (*not* necessarily Unicode!)
    : > > >
    : > > > And the _character_set_ state of each OP must be one of these:
    : > > > 0. the default -- characters are Latin-1, UTF-8 is Unicode
    : > > > 1. "use locale" -- characters are $ENV{LANG} (set at runtime)
    : > >
    : > > Too complicated
    : >
    : > For users, or for us?
    :
    : This is an interesting question indeed. For us: definitely. But for
    : users? Maybe too... Anyone having ideas?

    Perl made the non-mistake of waiting to get invented until the IEEE
    standardized floating point. I think it should also make the non-mistake
    of using the emerging UTF-8 bandwagon as the lens through which to
    view legacy character sets. I believe this is a place where Perl should
    strive to make things as simple as possible for the programmer, but no
    simpler. If we can move the complexity to the interfaces rather than
    to the opcodes, Perl stays conceptually cleaner. Then as the interfaces
    of the world converge on UTF-8, Perl's interfaces become cleaner.

    That being said, as Tim pointed out, a large portion of the world is
    slightly discriminated against by UTF-8, and may settle on UTF-16 for
    their own internal communications, even if international traffic tends
    to settle on UTF-8. If that comes to pass, we can make a version of
    Perl for those countries that swaps out the UTF-8 implementation for
    something else, and as long as the interfaces are well specified, we
    can do it transparently, more or less.

    But I think we should be prepared for a long period in which the
    internal communications are done in the legacy character set, and
    international in UTF-8. There's not much psychological reason for
    people to switch from one 16-bit character set to another, actually.

    Larry
  • Larry Wall at Feb 4, 2000 at 7:28 am
    Chip Salzenberg writes:
    : So: The _string_encoding_ state of each OP must be one of these:
    :
    : 0. the default -- follow each string's current encoding
    : 1. "use byte" -- all strings are one-byte
    : 2. "use utf8" -- all strings are UTF-8 (*not* necessarily Unicode!)

    There is no 2.

    : And the _character_set_ state of each OP must be one of these:
    :
    : 0. the default -- characters are Latin-1, UTF-8 is Unicode
    : 1. "use locale" -- characters are $ENV{LANG} (set at runtime)

    I would actually like to avoid locales if at all possible. They are
    not the right approach to sorting or much of anything else. I recommend
    the Unicode Consortium reports for a thorough discussion of what's needed
    at the higher levels of abstraction.

    : {{ NEW FEATURE ALERT }}
    :
    : Seeing the above list of pragmas triggers my generalization reflex.
    : So, how about this:
    :
    : 0. C<no encoding> == the default
    : 1. C<use encoding 'utf8'> == C<use utf8>

    Again, that doesn't seem to do what you think anymore.

    : 2. C<use encoding 'byte'> == C<use byte>
    :
    : Combined with this:
    :
    : 0. C<no charset> == the default
    : 1. C<use charset 'locale'> == C<use locale>

    "Use locale!?! Slowly I turned...step by step...inch by inch..."

    : This interface would also provide a hook for any encodings we might
    : support in future:
    : use encoding 'byte2big'; == force two-byte big-endian characters,
    : without forcing their charset
    : or:
    : use encoding 'byte4big'; == force four-byte big-endian characters
    : use charset 'iso10646'; == force ISO 10646 (Unicode superset)

    Not really a superset anymore, unless you're into defining your own
    characters outside of U+10FFFF.

    : So, what do you think?

    Nothing we are doing precludes doing that eventually, should we happen
    to find it interesting some day, which does not seem to be today.
    UTF-8 is taking over the world, and is quite capable of representing
    all of ISO-10646 already. It's also already capable of representing
    other character sets with not much tweaking.

    As for other encodings, I'm just not terribly interested in rewriting
    all the opcodes to support them all simultaneously. That's what
    Unicode is supposed to be getting us away from, after all.

    Earlier I indicated that mostly only Asians will be interested in "OEM"
    character sets, but I have to back up a bit and admit that I could be
    wrong about that. It's possible we might apply the same legacy character
    set processing to handling I/O channels that want a "legacy" of UTF-16
    or UCS-4.

    I think if we ever do support fixed-width wide characters in Perl
    internally, we might just jump straight to 32 bits. I'd love to forget
    all that characters-fit-in-16-bits-except-when-they-don't crapola. Not
    to mention the fact that some character codes can't be encoded in it
    because they're reserved for half a surrogate character, so it's
    entirely Unicode specific.

    The more I see of UTF-16 the better I dislike it. BOMs away...

    [Tell us what you really think, Larry.]

    At any rate, in the unlikely event that we do ever go with fixed-width
    characters internally, I suspect we'll try to make it as transparent as
    possible, like we're doing now with UTF-8. But I really think the
    interfaces are where the battle will be fought, and I think UTF-8 will
    prevail there in the long run, despite the early example of Java.
    Linux is going to be all UTF-8, and between Linux and Java I think Java
    will find itself being wagged.

    Larry
  • Chip Salzenberg at Feb 4, 2000 at 8:25 am

    According to Larry Wall:
    Chip Salzenberg writes:
    : So: The _string_encoding_ state of each OP must be one of these:
    : 0. the default -- follow each string's current encoding
    : 1. "use byte" -- all strings are one-byte
    : 2. "use utf8" -- all strings are UTF-8 (*not* necessarily Unicode!)
    There is no 2.
    <reads perllocale.pod>
    Oh.
    I have shamed p5p by not R'ing TFM.
    For penance, I shall re-implement Perl in C++.
    I would actually like to avoid locales if at all possible.
    I'm allowed to drop only deprecated features. Should C<use locale> be
    documented as being, eventually, doomed? Or shall we just mention it
    in the release notes? }:-)
    : use charset 'iso10646'; == force ISO 10646 (Unicode superset)
    Not really a superset anymore, unless you're into defining your own
    characters outside of U+10FFFF.
    I don't understand... Could someone point me to a description of the
    current Unicode <-> ISO 10646 relationship?
    I think if we ever do support fixed-width wide characters in Perl
    internally, we might just jump straight to 32 bits.
    "SUBSCRIBE"
    --
    Chip Salzenberg - a.k.a. - <chip@valinux.com>
    "He's Mr. Big of 'Big And Tall' fame." // MST3K
  • Larry Wall at Feb 4, 2000 at 5:08 pm
    Chip Salzenberg writes:
    : > Not really a superset anymore, unless you're into defining your own
    : > characters outside of U+10FFFF.
    :
    : I don't understand... Could someone point me to a description of the
    : current Unicode <-> ISO 10646 relationship?

    Well, http://www.unicode.org/unicode/standard/principles.html says:

    The Unicode Standard is the universal character encoding standard used
    for representation of text for computer processing. It is fully
    compatible with the International Standard ISO/IEC 10646-1:1993, and
    contains all the same characters and encoding points as ISO/IEC 10646.
    The Unicode Standard also provides additional information about the
    characters and their use. Any implementation that is conformant to
    Unicode is also conformant to ISO/IEC 10646.

    Of course, that's the Unicode Consortium spin. Offhand, I don't know
    whether the ISO folks have a different spin.

    Larry
  • Russ Allbery at Feb 4, 2000 at 1:06 pm

    Larry Wall writes:

    At any rate, in the unlikely event that we do ever go with fixed-width
    characters internally, I suspect we'll try to make it as transparent as
    possible, like we're doing now with UTF-8. But I really think the
    interfaces are where the battle will be fought, and I think UTF-8 will
    prevail there in the long run, despite the early example of Java. Linux
    is going to be all UTF-8, and between Linux and Java I think Java will
    find itself being wagged.
    FWIW, from the standards front, the next revision of the news standards
    will almost certainly be standardizing on UTF-8 as the character set for
    headers (headers being particularly tricky since while you can use MIME to
    specify a character set for the body, doing the same thing for the headers
    requires relying on header order, which isn't a good idea), mostly based
    on the expectation that the next revision of mail will do the same thing.

    If we're right about where mail is going, and we haven't had objections
    from the IETF at least so far, that's a pretty big chunk of text-based
    Internet protocol that's going to UTF-8.

    --
    Russ Allbery (rra@stanford.edu) <URL:http://www.eyrie.org/~eagle/>
  • Larry Wall at Feb 4, 2000 at 5:23 pm
    Russ Allbery writes:
    : FWIW, from the standards front, the next revision of the news standards
    : will almost certainly be standardizing on UTF-8 as the character set for
    : headers (headers being particularly tricky since while you can use MIME to
    : specify a character set for the body, doing the same thing for the headers
    : requires relying on header order, which isn't a good idea), mostly based
    : on the expectation that the next revision of mail will do the same thing.

    Just glancing through my spam mailbox, I'd say that the vast majority of
    international messages either misuse or fail to use MIME.

    : If we're right about where mail is going, and we haven't had objections
    : from the IETF at least so far, that's a pretty big chunk of text-based
    : Internet protocol that's going to UTF-8.

    Well, I hope they enforce it. We're starting to get all sorts of
    gobbledygook in the subjects of mail messages. I'd love it if mailers
    rejected messages whose headers contain illegal UTF-8 sequences.

    Larry
  • Tom Christiansen at Feb 4, 2000 at 7:16 pm

    Well, I hope they enforce it. We're starting to get all sorts of
    gobbledygook in the subjects of mail messages. I'd love it if mailers
    rejected messages whose headers contain illegal UTF-8 sequences.
    That's not too hard to do. :-)

    --tom
  • Larry Wall at Feb 4, 2000 at 7:23 pm
    Tom Christiansen writes:
    : >Well, I hope they enforce it. We're starting to get all sorts of
    : >gobbledygook in the subjects of mail messages. I'd love it if mailers
    : >rejected messages whose headers contain illegal UTF-8 sequences.
    :
    : That's not too hard to do. :-)

    Technologically, yes. But culturally we have to get buy-in from
    everyone who currently sends Latin-1 in headers.

    At the moment I am personally running a compromise solution. If a
    subject has more than 50% high-bit characters in the subject, it goes
    straight into my spam mailbox without trying any of the other heuristics.
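
    That rule as a couple of lines of Perl (the variable holding the
    subject header is an assumption):

        my $highbit = ($subject =~ tr/\x80-\xff//);   # $subject is assumed
        my $spam    = length($subject) && $highbit > length($subject) / 2;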

    Larry
  • Tom Christiansen at Feb 4, 2000 at 7:26 pm

    At the moment I am personally running a compromise solution. If a
    subject has more than 50% high-bit characters in the subject, it goes
    straight into my spam mailbox without trying any of the other heuristics.
    That's probably a good idea. I'll add it to my looks_like_spam
    incoming mailer function. But in practice, these tend to get
    +spamboxed anyway out of other heuristics.

    --tom
  • Ilya Zakharevich at Feb 4, 2000 at 9:38 pm

    On Fri, Feb 04, 2000 at 11:21:36AM -0800, Larry Wall wrote:
    Tom Christiansen writes:
    : >Well, I hope they enforce it. We're starting to get all sorts of
    : >gobbledygook in the subjects of mail messages. I'd love it if mailers
    : >rejected messages whose headers contain illegal UTF-8 sequences.
    :
    : That's not too hard to do. :-)

    Technologically, yes. But culturally we have to get buy-in from
    everyone who currently sends Latin-1 in headers.
    Or KOI-8. Or win-125..

    Ilya
  • Johan Vromans at Feb 5, 2000 at 10:58 am

    Larry Wall writes:

    If a subject has more than 50% high-bit characters in the subject,
    it goes straight into my spam mailbox without trying any of the
    other heuristics.
    I use 'more than 5 high-bit characters in a row'. If so, the message
    is immediately dumped in the bit bucket. It doesn't even get a chance
    to end up in the spam box. Using this criterion I hardly ever see
    Chinese (et al.) mail anymore. And I have yet to find a 'serious'
    message in my bit bucket ;-)
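
    Johan's test as a regex (the exact pattern is an assumption; "more
    than 5 in a row" means six or more):

        my $junk = $subject =~ /[\x80-\xff]{6,}/;   # $subject as above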

    -- Johan
  • Larry Wall at Feb 5, 2000 at 4:19 pm
    Johan Vromans writes:
    : Larry Wall <larry@wall.org> writes:
    :
    : > If a subject has more than 50% high-bit characters in the subject,
    : > it goes straight into my spam mailbox without trying any of the
    : > other heuristics.
    :
    : I use 'more than 5 high-bit characters in a row'.

    That might work better than 50%, thanks.

    : If so, the message
    : is immediately dumped in the bit bucket. It doesn't even get a chance
    : to end up in the spam box. Using this criterion I hardly ever see
    : Chinese (et al.) mail anymore.

    Well, I oversimplified what I do. I actually peruse the subjects in my
    spam mailbox daily because occasionally my friends send me things that
    get classified as spam. All it takes is for the message to contain too
    many words in all caps, like QUICK, and GUARANTEE, and DBI.

    Or just too many mentions of unpleasant subjects like sex and money. :-)

    But I'll also throw a message all the way into the bit bucket if it
    looks like gobbledygook (ignoring headers, MIME, HTML, and such).

    if (not $suppress) {    # Find Chinese spam.
        my $tmp = $body;
        $tmp =~ s/=([0-9A-F]{2})/chr hex $1/eg;        # decode quoted-printable
        $tmp =~ s/^Content.*\n//mg;                    # drop MIME header lines
        $tmp =~ s/^This is a MIME.*\n*//mg;            # drop MIME preamble
        $tmp =~ s/----.*\n*//g;                        # drop separator lines
        $tmp =~ s/====.*\n*//g;
        $tmp =~ s/\*\*\*\*.*\n*//g;
        $tmp =~ s/\s+/ /g;                             # collapse whitespace
        $tmp =~ s/<script>.*?<\/script>\s*//sg;        # strip scripts, tags,
        $tmp =~ s/<[^>]*>\s*//g;                       # entities, and URLs
        $tmp =~ s/&\w+;\s*//g;
        $tmp =~ s{http://[-.\w/]*\s*}{};
        $engbytes = $tmp =~ tr/ ,.;:'"()a-zA-Z0-9//;   # "English-looking" bytes
        $allbytes = length($tmp);
        $suppress = int(100 * $engbytes / $allbytes) . "%"
            if $engbytes < $allbytes / 2;              # mostly not: suppress it
    }

    : And I have yet to find a 'serious' message in my bit bucket ;-)

    Maybe you haven't found one, but I have. I'm paranoid enough that I
    can even read my bit bucket (though I seldom do).

    Actually, that's not quite true--what I can always read is the original
    mailbox. I normally read my mail with trn, and the code above just
    prevents the transfer of the message from my normal mailbox to the news
    system. But I keep a copy of everything that comes in. So if you want
    something backed up forever, just mail it to me. :-)

    Larry
  • Tim Bray at Feb 4, 2000 at 5:27 pm
    [some horribly-unstructured random data points]
    At 12:25 AM 2/4/00 -0800, Chip Salzenberg wrote:
    : use charset 'iso10646'; == force ISO 10646 (Unicode superset)
    Not really a superset anymore, unless you're into defining your own
    characters outside of U+10FFFF.
    I don't understand... Could someone point me to a description of the
    current Unicode <-> ISO 10646 relationship?
    It appears in one of the appendices of the [excellent, go buy it from
    unicode.org] Unicode spec. Essentially, they are the same spec, but this
    is achieved by an elaborate parallel structure of committees and
    working groups who always magically and independently do the same thing;
    of course many of the people serve in both processes.

    There is one conceptual difference; 10646 says in theory you can have
    2^31 characters. Unicode only recognizes 2^16 + 2^20 (BMP + 16 expansion
    planes). I wonder if in the year 2345, they'll be cursing the short-sighted
    21st-century Unicode morons whose 17 planes didn't leave room for the
    dialects of the Lesser Magellanic cloud worlds. Well, do like Larry says
    and use 4 bytes and that should get us through most of the millennium.

    I feel that one of the nice things about using perl is that you shouldn't
    have to worry about things like UTF-16's [rather reasonable I think]
    extension mechanism, or about the hideous bit-packing bogosities of UTF-8,
    which are only defensible in a world whose basic technical infrastructure
    depends heavily on strlen() and strcpy(), but that's the world we happen to
    live in. Note that UTF-8 is kinda bigoted in that us pink-skinned roundeyes
    get to store most of our characters in one byte per, leaving it to the other
    75% of the world's population to pay the price, in extra bytes, for
    generality.

    BTW, should ord($c) return different values depending on whether or not
    I've said "use utf8;"?

    It should be noted that over in Java-land, UTF-16 is more or less the
    native dialect, and UTF-8 is a royal pain in the butt to deal with. Sigh.

    Also, there are lots of non-Asian non-Unicode non-8859 character sets;
    probably the best known is KOIsomething for Cyrillic.

    Over in XML-land, including the increasingly popular XML::Parser
    module, the data can come in in a variety of flavors, but the
    programmer only sees Unicode, and if you write it out you find you've
    written UTF-8. Having all programmers see only Unicode all the time is
    a big enough win, even though when programmers first see it you
    tend to get some whining about Stupid MS Code Page Tricks not working.
    On the other hand, the data magically transmogrifying itself from
    JIS or EBCDIC or whatever to UTF-8 as a result of a trip through perl
    is kinda off-putting... pardon the digression.
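
    A sketch of that behavior with XML::Parser (the file name is
    illustrative):

        use XML::Parser;

        # Char handlers always receive UTF-8, whatever encoding the
        # document itself declared.
        my $p = XML::Parser->new(Handlers => {
            Char => sub {
                my ($expat, $text) = @_;
                print $text;
            },
        });
        $p->parsefile("doc.xml");   # illustrative file; input may be
                                    # Latin-1, Shift-JIS, UTF-16, ...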

    Using Unicode takes care of some, but by no means all, of the
    collation-sequence problems that locales used to help with. Hmm.

    There are some general-purpose international character-munging libraries
    out there, the best known being GNU iconv and ICU from IBM alphaworks.
    ICU is C++. They each depend on meg after meg of tables so you can
    wire in support for various legacy encodings. If you could skip the
    legacy-encoding stuff and just do collating and other locale-ish stuff,
    you could do it relatively compactly. What do Locales Done The Right
    Way need to do? -T.
  • Larry Wall at Feb 4, 2000 at 5:53 pm
    Tim Bray writes:
    : BTW, should ord($c) return different values depending on whether or not
    : I've said "use utf8;"?

    The short answer is no.

    The medium answer is that you'll have to say "use byte" if you want ord($c)
    to return the first byte rather than the first character.

    The long answer is that we're phasing out the experimental "use utf8"
    declaration. We might possibly keep it to cause interfaces within a
    lexical scope to default to utf8, but it will no longer force opcodes
    to behave monomorphically. All the builtin operators will be
    polymorphic and know how to deal with mixed 8-bit and utf8
    representations of Unicode. (The choice will be data driven, just as
    conversions between numeric and string are currently data driven.) The
    only monomorphic declaration will be "use byte", which is useful to get
    back to the old 8-bit semantics in a particular lexical scope.
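
    The difference, sketched (note: the pragma eventually shipped under
    the name "bytes"; "use byte" is the name used in this thread):

        my $s = "\x{263a}";                          # U+263A: 3 bytes as UTF-8
        my $first_char = ord($s);                    # 9786: the first character
        my $first_byte = do { use byte; ord($s) };   # 226 (0xE2): the first byte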

    Larry
  • Gurusamy Sarathy at Feb 4, 2000 at 6:33 pm

    On Fri, 04 Feb 2000 09:52:20 PST, Larry Wall wrote:
    The long answer is that we're phasing out the experimental "use utf8"
    declaration.
    The status as of 640 is that only two things are affected by
    C<use utf8>: interpretation of literals/identifiers in the source text;
    and how REs are compiled. Both should go away.

    Having it affect the interpretation of identifiers is a bit bogus,
    since high-bit chars have never been allowed in them before, so
    we could just always interpret them as utf8.

    Treating literals as utf8 is a bit of a compatibility issue, but
    I think we should get around that by treating the lex input stream
    as any other discipline. IOW, default PL_rsfp to byte mode,
    and let users push a utf8/utf16/whatever discipline on it if they
    wanna. (This would apply to identifiers as well.)

    Converting the RE code to compile down to polymorphic ops still needs a
    bit of work, by my reckoning. Ilya, you hearing me? :-)


    Sarathy
    gsar@ActiveState.com
  • Larry Wall at Feb 4, 2000 at 6:58 pm
    Gurusamy Sarathy writes:
    : Treating literals as utf8 is a bit of a compatibility issue, but
    : I think we should get around that by treating the lex input stream
    : as any other discipline. IOW, default PL_rsfp to byte mode,
    : and let users push a utf8/utf16/whatever discipline on it if they
    : wanna. (This would apply to identifiers as well.)

    They can always push a discipline explicitly, but I think we should just
    auto-recognize it by default. Start with a generic discipline that
    doesn't commit, then when you see a high bit, look and see if you've
    got illegal utf8. If you do, it's not utf8. If you don't, it is
    99.99% certain (in Latin-1 countries) to be utf8, and you can look ahead
    some more if you want to be more certain. In non-Latin-1 countries
    you probably have to push an explicit discipline anyway to tell it which
    of many character sets you're using if you're not using utf8, so it
    should still default to utf8 if it sees legal utf8.

    If you'd like to think of it in stronger terms, the script is in utf8
    until proven otherwise. Just start with the utf8 discipline by default
    and make it smart enough to recover from errors by switching to an
    8-bit/binary discipline if we haven't already committed too much to utf8.

    This will also be a useful discipline for files of unknown provenance,
    I think, but we probably wouldn't make it the default discipline for
    ordinary files. It might possibly be the default utf8 discipline,
    though there might be some call for a utf8_darnit discipline that would
    puke on non-utf8 rather than trying to switch.
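
    A sketch of the "legal utf8" test described here (the helper is
    hypothetical, and simplified: it does not reject every overlong
    form, which is enough for a heuristic):

        sub looks_like_utf8 {   # hypothetical helper
            my ($bytes) = @_;
            return $bytes =~ /^(?: [\x00-\x7f]                 # ASCII
                                 | [\xc2-\xdf][\x80-\xbf]      # 2-byte form
                                 | [\xe0-\xef][\x80-\xbf]{2}   # 3-byte form
                                 | [\xf0-\xf4][\x80-\xbf]{3}   # 4-byte form
                               )*\z/x;
        }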

    Larry
  • Ilya Zakharevich at Feb 4, 2000 at 9:37 pm

    On Fri, Feb 04, 2000 at 10:33:53AM -0800, Gurusamy Sarathy wrote:

    Converting the RE code to compile down to polymorphic ops still needs a
    bit of work, by my reckoning. Ilya, you hearing me? :-)
    Only for some value of "hearing"...

    Ilya
  • Bart Schuller at Feb 5, 2000 at 12:37 pm

    On Fri, Feb 04, 2000 at 09:21:04AM -0800, Tim Bray wrote:
    It should be noted that over in Java-land, UTF-16 is more or less the
    native dialect, and UTF-8 is a royal pain in the butt to deal with. Sigh.
    I was just reading up on the Java Native Interface and there they don't
    talk about UTF-16, but only about their own version of UTF-8 (sigh).
    So it's not as bad as it seems when interfacing with C.

    http://java.sun.com/products/jdk/1.2/docs/guide/jni/spec/types.doc.html#16542

    --
    The idea is that the first face shown to people is one they can readily
    accept - a more traditional logo. The lunacy element is only revealed
    subsequently, via the LunaDude. [excerpted from the Lunatech Identity Manual]
  • Larry Wall at Feb 5, 2000 at 4:44 pm
    Bart Schuller writes:
    : On Fri, Feb 04, 2000 at 09:21:04AM -0800, Tim Bray wrote:
    : > It should be noted that over in Java-land, UTF-16 is more or less the
    : > native dialect, and UTF-8 is a royal pain in the butt to deal with. Sigh.
    :
    : I was just reading up on the Java Native Interface and there they don't
    : talk about UTF-16, but only about their own version of UTF-8 (sigh).
    : So it's not as bad as it seems when interfacing with C.
    :
    : http://java.sun.com/products/jdk/1.2/docs/guide/jni/spec/types.doc.html#16542

    Indeed, JPL (Java Perl Lingo) uses JNI for communicating with Java.
    Everything just naturally comes out in UTF-8. More or less.

    So to spin it towards UTF-8ism, we can see that even Java has hedged
    its bets on whether UTF-8 will be the preferred interface.

    (Actually, you can get at the UTF-16 through JNI as well, but of course
    it looks like an array to C. (Not that the UTF-8 doesn't :-))

    Larry
  • Ilya Zakharevich at Feb 5, 2000 at 8:07 pm

    Bart Schuller writes:
    On Fri, Feb 04, 2000 at 09:21:04AM -0800, Tim Bray wrote:
    It should be noted that over in Java-land, UTF-16 is more or less the
    native dialect, and UTF-8 is a royal pain in the butt to deal with. Sigh.
    I was just reading up on the Java Native Interface and there they don't
    talk about UTF-16, but only about their own version of UTF-8 (sigh).
    But this format looks compatible with Perl's input (which does not
    check that utf8 data has "minimal possible" encoding).

    On output one needs some equivalent of

    { use byte; s/\0/\xc0\x80/g; }   # JNI's "modified UTF-8" writes NUL as C0 80


    Ilya