
On Thursday, Jan 9, 2003, at 06:37 Asia/Tokyo, Jarkko Hietaniemi wrote:
> Your analysis seems correct. But is it necessary to set ${^ENCODING}
> to *anything* if we are in a Unicode encoding? With the below patch
>
> --- perl/ext/Encode/encoding.pm.~1~ Wed Jan 8 23:34:35 2003
> +++ perl/ext/Encode/encoding.pm Wed Jan 8 23:34:35 2003
> @@ -28,8 +28,9 @@
>          require Carp;
>          Carp::croak("Unknown encoding '$name'");
>      }
> -    unless ($arg{Filter}){
> -        ${^ENCODING} = $enc; # this is all you need, actually.
> +    unless ($arg{Filter}) {
> +        ${^ENCODING} = $enc # this is all you need, actually.
> +            unless $name =~ /^(?:utf-?(?:8|16|32)|ucs-?(?:2|4))(?:[bl]e)?$/i;
>          $HAS_PERLIO or return 1;
>          for my $h (qw(STDIN STDOUT)){
>              if ($arg{$h}){
>
> this works okay:
>
> $ env LC_ALL=en_US.UTF-8 ./perl -Ilib -e 'print chr(128)' | env ./perl -Ilib -Mencoding=utf8 -le '$a=<STDIN>;printf "%x\n", ord($a)'
> 80
> $
>
> (Without the patch we get the U+FFFD in UTF-8, as shown before.)

Thanks, applied now. Now I have enough improvements/fixes in my
repository to $Encode::VERSION++. New version will be released soon.

Dan the Encode Maintainer
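
To make the effect of the patch's pattern concrete, here is a minimal sketch that runs it against a few encoding names. The helper name `skips_caret_encoding` and the sample names are assumptions made for this illustration; only the regex itself comes from the patch. Note that the pattern also accepts a name such as `utf8le`, which does not appear to be a real Encode encoding.

```perl
use strict;
use warnings;

# Pattern copied verbatim from the patch above.
my $unicode_re = qr/^(?:utf-?(?:8|16|32)|ucs-?(?:2|4))(?:[bl]e)?$/i;

# Hypothetical helper (not part of encoding.pm): returns 1 when the
# patch would leave ${^ENCODING} unset for the given name, else 0.
sub skips_caret_encoding {
    my ($name) = @_;
    return $name =~ $unicode_re ? 1 : 0;
}

# Sample names chosen for illustration only.
for my $name (qw(utf8 UTF-8 utf-16le UCS-2BE iso-8859-1 euc-jp utf8le)) {
    printf "%-12s %s\n", $name,
        skips_caret_encoding($name) ? 'skip ${^ENCODING}' : 'set ${^ENCODING}';
}
```

With these inputs, only `iso-8859-1` and `euc-jp` would still have ${^ENCODING} set.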


  • Dan Kogai at Jan 10, 2003 at 12:09 pm
    Porters,

    This is the first update to Encode in 2003. Available as:

    http://www.dan.co.jp/~dankogai/Encode-1.84.tar.gz and CPAN

    And here are Changes

    $Revision: 1.84 $ $Date: 2003/01/10 12:00:16 $
    ! encoding.pm
      ${^ENCODING} is no longer set for utf so encoding is no longer fun :)
      (That is to prevent duplicate encoding first by IO then ${^ENCODING})
      Message-Id: <20030108213737.GK331043@lyta.hut.fi>
    ! Unicode/Unicode.xs
      %_ fixes saves the resulting .so .05% smaller, by NC
      Message-Id: <20021226225709.GF284@Bagpuss.unfortu.net>
    ! Encode.pm
      Silence Encode on undef, by Andreas
      Message-Id: <m3smwrohd1.fsf@k242.linux.bogus>
      Message-Id: <m3of7fo7np.fsf@k242.linux.bogus>
    ! Unicode/Unicode.xs
      s/regognised/recognised/ . British spelling left intact to pay
      respect to two British Nicks :)
      Message-Id: <20021203020454.GK2274@kosh.hut.fi>

    A happy new year with a happy encoding.

    Dan the Encode Maintainer
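
    The "duplicate encoding" mentioned in the Changes entry can be
    illustrated with Encode's own encode() and decode(). This is only a
    rough sketch of what a double conversion does to chr(128), not a
    reproduction of the actual code path through the IO layer and
    ${^ENCODING}; the helper hex_dump is a hypothetical name introduced
    for this example.

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode decode);

    # Hypothetical helper: render each character of a string as hex.
    sub hex_dump {
        my ($str) = @_;
        return join ' ', map { sprintf('%02X', ord($_)) } split(//, $str);
    }

    my $char  = chr(128);                 # the character U+0080
    my $once  = encode('UTF-8', $char);   # correct UTF-8 bytes for U+0080
    my $twice = encode('UTF-8', $once);   # each byte encoded again as a character

    print hex_dump($once),  "\n";         # C2 80
    print hex_dump($twice), "\n";         # C3 82 C2 80

    # Decoding the correct bytes once recovers U+0080. A lone 0x80 byte is
    # not valid UTF-8, so decode() substitutes U+FFFD by default.
    my $good = decode('UTF-8', $once);
    my $bad  = decode('UTF-8', "\x80");
    printf "%X\n", ord($good);            # 80
    printf "%X\n", ord($bad);             # FFFD
    ```

    The last line shows the same U+FFFD that turns up in the unpatched
    test earlier in the thread.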
  • Jarkko Hietaniemi at Jan 10, 2003 at 4:12 pm

    >> $ env LC_ALL=en_US.UTF-8 ./perl -Ilib -e 'print chr(128)' | env ./perl -Ilib -Mencoding=utf8 -le '$a=<STDIN>;printf "%x\n", ord($a)'
    >> 80
    >> $
    >>
    >> (Without the patch we get the U+FFFD in UTF-8, as shown before.)
    > Thanks, applied now. Now I have enough improvements/fixes in my
    > repository to $Encode::VERSION++. New version will be released soon.

    You might want to rewrite the regex matching the UTFs and UCSs a bit,
    though, to match the set of Unicode encodings really supported by Encode,
    I don't think what I gave is quite right. Also, this:

    $ ./perl -Ilib -Mencoding=utf8 -e 'print chr(128)' | ./perl -Ilib -Mencoding=utf8 -le '$a=<STDIN>;printf "%x\n", ord($a)'

    is a better test. With the fix you get '80', without you'll get
    malformed UTF-8.

    --
    Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
    biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
  • Dan Kogai at Jan 10, 2003 at 5:22 pm

    On Saturday, Jan 11, 2003, at 01:12 Asia/Tokyo, Jarkko Hietaniemi wrote:
    > You might want to rewrite the regex matching the UTFs and UCSs a bit,
    > though, to match the set of Unicode encodings really supported by
    > Encode,

    That crossed my mind RIGHT AFTER I released 1.84. It should've
    been more explicit -- that is, instead of a regex we may as well use an
    explicit match, i.e.

    my %NO_CARAT_ENCODING = map {$_=>1}
        qw(UCS-2BE
           UCS-2LE
           UTF-16
           UTF-16BE
           UTF-16LE
           UTF-32
           UTF-32BE
           UTF-32LE
           utf8);

    #....

    ${^ENCODING} = $enc unless $NO_CARAT_ENCODING{$enc->$name};

    # end of example

    We should also check perlio-savviness (so UTF-16 and UTF-32 are
    ruled out).

    > I don't think what I gave is quite right.

    Well, IMHO it is okay enough, but I like the explicit version better.

    > Also, this:
    >
    > $ ./perl -Ilib -Mencoding=utf8 -e 'print chr(128)' | ./perl -Ilib -Mencoding=utf8 -le '$a=<STDIN>;printf "%x\n", ord($a)'
    >
    > is a better test. With the fix you get '80', without you'll get
    > malformed UTF-8.

    Then let's include the new test suite for 1.85. *.t files welcome.

    Dan the Encode Maintainer
  • Jarkko Hietaniemi at Jan 10, 2003 at 6:21 pm

    > my %NO_CARAT_ENCODING = map {$_=>1}

    s/RAT/RET/ (and s/\$_/lc $_/, or pre-lc the qw keys).

    > ${^ENCODING} = $enc unless $NO_CARAT_ENCODING{$enc->$name};

    lc($enc->$name)

    > Then let's include the new test suite for 1.85. *.t files welcome.

    I've got something unfinished lying around...

    --
    Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
    biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
  • Jarkko Hietaniemi at Jan 11, 2003 at 1:03 pm

    On Fri, Jan 10, 2003 at 08:21:51PM +0200, Jarkko Hietaniemi wrote:
    >> my %NO_CARAT_ENCODING = map {$_=>1}
    > s/RAT/RET/ (and s/\$_/lc $_/, or pre-lc the qw keys).
    >> ${^ENCODING} = $enc unless $NO_CARAT_ENCODING{$enc->$name};
    > lc($enc->$name)

    Duh. Since we want both 'utf8' and 'utf-8' to work, maybe we should
    stick to the regex after all.

    --
    Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
    biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
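
    The trade-off discussed above can be sketched as follows: a lookup
    table with lowercased keys handles case differences, but it still
    misses the hyphenated spelling 'utf-8', which the regex accepts. This
    sketch assumes the lookup is done on the name as typed by the user
    rather than on a canonical name returned by the encoding object, and
    it uses the corrected spelling %NO_CARET_ENCODING.

    ```perl
    use strict;
    use warnings;

    # Keys pre-lowercased, as suggested in the thread; the list of names
    # is the one from the earlier message.
    my %NO_CARET_ENCODING = map { ( lc($_) => 1 ) }
        qw(UCS-2BE UCS-2LE UTF-16 UTF-16BE UTF-16LE
           UTF-32 UTF-32BE UTF-32LE utf8);

    # Regex from the patch applied in Encode 1.84.
    my $unicode_re = qr/^(?:utf-?(?:8|16|32)|ucs-?(?:2|4))(?:[bl]e)?$/i;

    # Sample spellings chosen for illustration only.
    for my $name (qw(utf8 UTF8 utf-8 UTF-16LE)) {
        my $by_hash  = $NO_CARET_ENCODING{ lc($name) } ? 'yes' : 'no';
        my $by_regex = $name =~ $unicode_re           ? 'yes' : 'no';
        printf "%-9s hash=%-3s regex=%s\n", $name, $by_hash, $by_regex;
    }
    ```

    Only 'utf-8' is treated differently by the two approaches: the table
    has no key for it, while the regex matches.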
  • Jarkko Hietaniemi at Jan 10, 2003 at 6:45 pm
    The included enc_utf8.t should at least get us started.

    --
    Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
    biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
  • Jarkko Hietaniemi at Jan 10, 2003 at 9:33 pm
    Some little email gremlins are apparently growing fat from eating my
    outgoing attachments, so here's a resubmit of the suggested enc_utf8.t
    test file, inlined.

    BEGIN {
        require Config; import Config;
        if ($Config{'extensions'} !~ /\bEncode\b/) {
            print "1..0 # Skip: Encode was not built\n";
            exit 0;
        }
        unless (find PerlIO::Layer 'perlio') {
            print "1..0 # Skip: PerlIO was not built\n";
            exit 0;
        }
        if (ord("A") == 193) {
            print "1..0 # encoding pragma does not support EBCDIC platforms\n";
            exit(0);
        }
    }

    use encoding 'utf8';

    my @c = (127, 128, 255, 256);

    print "1.." . (scalar @c + 1) . "\n";

    my @f;

    for my $i (0..$#c) {
        push @f, "f$i";
        open(F, ">f$i") or die "$0: failed to open 'f$i' for writing: $!";
        binmode(F, ":utf8");
        print F chr($c[$i]);
        close F;
    }

    my $t = 1;

    for my $i (0..$#c) {
        open(F, "<f$i") or die "$0: failed to open 'f$i' for reading: $!";
        binmode(F, ":utf8");
        my $c = <F>;
        my $o = ord($c);
        print $o == $c[$i] ? "ok $t\n" : "not ok $t # $o != $c[$i]\n";
        $t++;
    }

    my $f = "f4";

    push @f, $f;
    open(F, ">$f") or die "$0: failed to open '$f' for writing: $!";
    binmode(F, ":raw"); # Output raw bytes.
    print F chr(128); # Output illegal UTF-8.
    close F;
    open(F, $f) or die "$0: failed to open '$f' for reading: $!";
    binmode(F, ":encoding(utf-8)");
    {
        local $^W = 1;
        local $SIG{__WARN__} = sub { $a = shift };
        eval { <F> }; # This should get caught.
    }
    print $a =~ qr{^utf8 "\\x80" does not map to Unicode} ?
        "ok $t\n" : "not ok $t: $a\n";

    END {
        1 while unlink @f;
    }


    --
    Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
    biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
  • Nicholas Clark at Jan 10, 2003 at 9:55 pm

    On Fri, Jan 10, 2003 at 11:33:04PM +0200, Jarkko Hietaniemi wrote:
    > Some little email gremlins are apparently growing fat from eating my
    > outgoing attachments, so here's a resubmit of the suggested enc_utf8.t

    Step 1: steal attachments

    Step 2: ???

    Step 3: Profit!

    Nicholas Clark

Discussion Overview
group: perl5-porters
category: perl
posted: Jan 10, '03 at 11:53a
active: Jan 11, '03 at 1:03p
posts: 9
users: 3
website: perl.org
