[Perl-unicode] Re: Converting UTF-EBCDIC to UTF-8

Brian DePradine
Mar 18, 2003 at 3:33 pm

SADAHIRO Tomoyuki wrote:
> Thank you for your report.
>
> I was careless about a trap on non-ASCII platforms,
> namely that ("a" eq "\x61") is not true there.
>
> So the failed tests are fixed, and some tests have been added.
> Ver. 0.20 is available here:
>
> [TAR-GZ, HTML-ized POD]
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Transform-0.20.tar.gz
> http://homepage1.nifty.com/nomenclator/perl/Unicode-Transform.html
>
> SADAHIRO Tomoyuki

I have now run this latest version on z/OS with the following results:

/defects/brian/unicode/Unicode-Transform-0.20:>make test
PERL_DL_NONLAZY=1 /defects/brian/nonthreaded/perl-5.8.0/perl "-I/defects/brian/nonthreaded/perl-5.8.0/lib" "-I/defects/brian/nonthreaded/perl-5.8.0/lib" "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/handler....ok
t/test.......FAILED tests 25, 27
Failed 2/28 tests, 92.86% okay
Failed Test Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
t/test.t                  28    2   7.14%  25 27
Failed 1/2 test scripts, 50.00% okay. 2/38 subtests failed, 94.74% okay.
make: *** [test_dynamic] Error 121

I also had the warnings similar to the following when compiling on z/OS:

WARNING CCN3196 ./Transform.c:21 Initialization between types "unsigned int(*)(unsigned char*,unsigned int,unsigned int*)" and "unsigned long(*)(unsigned char*,unsigned int,unsigned int*)" is not allowed.
FSUM3065 The COMPILE step ended with return code 4.

I got rid of these warnings by changing the return type of all the
ord_in_* functions from UV to STRLEN in the file unitrans.h. I
chose this because the return type of the app_in_* functions was
already STRLEN and they didn't generate warnings.

Brian

13 responses

  • SADAHIRO Tomoyuki at Mar 18, 2003 at 5:56 pm
    Thank you.

    I want to know which UTF8-flagged string (one for which SvUTF8(sv) is true)
    is equivalent to "\xef\xb9\xbf",
    but I may be misunderstanding the usage of pack('U*', LIST).

    Are numbers < 256 in LIST
    Unicode code points or native ASCII/EBCDIC code points?

    Here is a draft of tests 25..28.
    I guess these should succeed if pack('U') takes Unicode code points,
    but I might still be wrong.

    ### TESTS START
    $utf8_fe7f_upgraded = ord("A") != 0x41
        ? pack('U*', 213, 190, 215)   # EBCDIC "\xef\xb9\xbf"
        : pack('U*', 239, 185, 191);  # ASCII "\xef\xb9\xbf"

    $utf8_fe7f_bytes = pack('C*', 239, 185, 191);

    print "\x{fe7f}" eq utf8_to_unicode($utf8_fe7f_upgraded)
        ? "ok" : "not ok", " 25\n";

    print "\x{fe7f}" eq utf8_to_unicode($utf8_fe7f_bytes)
        ? "ok" : "not ok", " 26\n";

    print $utf8_fe7f_upgraded eq unicode_to_utf8("\x{fe7f}")
        ? "ok" : "not ok", " 27\n";

    print $utf8_fe7f_bytes eq unicode_to_utf8("\x{fe7f}")
        ? "ok" : "not ok", " 28\n";
    ### TESTS END
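
For readers on an ASCII platform, the distinction these draft tests probe can be sketched as follows. This is only an illustration and not part of Unicode::Transform: it relies on core pack, utf8::is_utf8 and Encode::decode, and the variable names are invented here. On an EBCDIC platform the numbers given to pack('U*', ...) may be interpreted differently, which is the open question in this message.

```perl
use strict;
use warnings;
use Encode ();

# Assumes an ASCII platform, i.e. ord("A") == 0x41.
my $upgraded = pack 'U*', 239, 185, 191;   # three characters, UTF8 flag on
my $bytes    = pack 'C*', 239, 185, 191;   # the same three values as plain octets

# eq compares characters, so the internal flag alone does not make them differ.
my $same_chars = $upgraded eq $bytes;

# Read as UTF-8, the octets EF B9 BF encode the single code point U+FE7F.
my $decoded = Encode::decode('UTF-8', $bytes);

printf "flag on: %d / %d, equal: %d, decoded: U+%04X\n",
    utf8::is_utf8($upgraded) ? 1 : 0,
    utf8::is_utf8($bytes)    ? 1 : 0,
    $same_chars              ? 1 : 0,
    ord $decoded;
```

The two strings differ only in their internal representation, which is why tests 25/27 and 26/28 exercise the flagged and unflagged forms separately.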

    And many thanks for discovering the STRLEN/UV type mismatches, too.

    SADAHIRO Tomoyuki


  • Brian DePradine at Mar 21, 2003 at 11:51 am

    SADAHIRO Tomoyuki wrote:
    > ### TESTS START
    > $utf8_fe7f_upgraded = ord("A") != 0x41
    >     ? pack('U*', 213, 190, 215)   # EBCDIC "\xef\xb9\xbf"
    >     : pack('U*', 239, 185, 191);  # ASCII "\xef\xb9\xbf"
    >
    > $utf8_fe7f_bytes = pack('C*', 239, 185, 191);
    >
    > print "\x{fe7f}" eq utf8_to_unicode($utf8_fe7f_upgraded)
    >     ? "ok" : "not ok", " 25\n";
    >
    > print "\x{fe7f}" eq utf8_to_unicode($utf8_fe7f_bytes)
    >     ? "ok" : "not ok", " 26\n";
    >
    > print $utf8_fe7f_upgraded eq unicode_to_utf8("\x{fe7f}")
    >     ? "ok" : "not ok", " 27\n";
    >
    > print $utf8_fe7f_bytes eq unicode_to_utf8("\x{fe7f}")
    >     ? "ok" : "not ok", " 28\n";
    > ### TESTS END

    I have replaced test cases 25-28 with the ones listed above, and the
    results were as follows:

    /defects/brian/unicode/Unicode-Transform-0.20:>make test
    PERL_DL_NONLAZY=1 /defects/brian/nonthreaded/perl-5.8.0/perl "-I/defects/brian/nonthreaded/perl-5.8.0/lib" "-I/defects/brian/nonthreaded/perl-5.8.0/lib" "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
    t/handler....ok
    t/test.......ok
    All tests successful.
    Files=2, Tests=38, 3 wallclock secs ( 0.37 cusr + 0.12 csys = 0.49 CPU)

    One more thing though. I am writing an XS module myself and I would like
    to convert C strings from UTF-EBCDIC to UTF-8 and back. Your module
    works well from Perl space, but do you have any ideas of how I can do
    this from C?

    Thanks
    Brian
  • Mark Lewellen at Mar 21, 2003 at 3:47 pm
    Hi-
    I'm looking for recommendations on how to warn about and record problems
    with ill-formed data. Specifically, I'm reading in Big5 data from multiple
    files and converting it to Perl's utf8, and some of the Big5 double-byte
    combinations are illegal (they appear to be user-defined special symbols).
    I'd like to be able to write code to handle lines with ill-formed data.
    So, if I start with code like:

    open( IN_FH, '<:encoding(big5)', $inputFile ) or die...
    while( $line = <IN_FH> ) {

    or

    open( IN_FH, $inputFile ) or die...
    while( $line = decode('big5', <IN_FH> ) ) {

    I'd like to add logic such as:

    if( <$line has an error> )
        record the line number and file name
        record the error and the entire line
        map error to user-defined character (dependent on error) and process
        the modified line

    Could I get recommendations on how to do this? Thanks-

    Mark

    PS The STDERR "does not map to Unicode" warning on my version (5.8.0)
    lists only the input file's line number; is it possible to add the input
    file name as well?
  • SADAHIRO Tomoyuki at Mar 21, 2003 at 5:51 pm
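    Here is an example, though it is still rough.

    use Encode;
    # ERRORLOG is assumed to be a filehandle already opened for writing.
    open( IN_FH, $inputFile ) or die;
    while( $line = <IN_FH> ) {
        # FB_CROAK makes decode() die on the first byte sequence it cannot map.
        eval { $line = decode('big5', $line, Encode::FB_CROAK ) };
        if ($@) {
            warn "ill-formed line at line $. in $inputFile.\n";
            printf ERRORLOG "File %s (line %d): %s", $inputFile, $., $line;
            # $line (in big-5, here) should be \n-terminated.
            # Decode again without a CHECK argument so that decoding
            # continues past the error.
            $line = decode('big5', $line );
        }
        # From here on, $line holds decoded (utf8) text.
    }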
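
A possible variation on the loop above, for the "map error to user-defined character" part of the question: Encode releases newer than the one discussed in this thread document that CHECK may be a code reference, which is called for each byte that cannot be mapped and returns the replacement text. The sketch below assumes such an Encode version; the helper name decode_and_log, the log format and the "<FF>"-style placeholder are all hypothetical choices made for illustration. It uses the 'ascii' encoding with a 0xFF byte so that it does not depend on which Big5 bytes a particular mapping table defines.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode one line and log every byte Encode cannot map.
# Assumes an Encode version that accepts a code reference as CHECK.
sub decode_and_log {
    my ($encoding, $octets, $file, $lineno, $log) = @_;
    return Encode::decode($encoding, $octets, sub {
        my @bad = @_;    # ordinal value(s) of the offending byte(s)
        my $hex = join '', map { sprintf '%02X', $_ } @bad;
        push @$log, "$file line $lineno: cannot map byte(s) 0x$hex";
        return "<$hex>"; # ASCII-only placeholder chosen for this sketch
    });
}

my @log;
my $text = decode_and_log('ascii', "ab\xFFcd\n", 'sample.txt', 3, \@log);
print $text;
print "$_\n" for @log;
```

Because the callback sees the offending byte value, the replacement can depend on the error, and the file name and line number are recorded without parsing the warning text.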

    P.S. Another problem.
    How can it be determined whether a given user-defined character
    (UDC hereafter) is single-byte or double-byte?

    The file big5-eten.ucm does not say
    how to determine the length in bytes of an unmapped UDC.

    Of course (though I don't know whether it's easy or not),
    you can define a *new* encoding, Big-5 plus mappings for the
    UDCs at whatever code points you choose, by preparing a new .ucm file.
    This may avoid the errors caused by the appearance of UDCs.

    SADAHIRO Tomoyuki


  • David Graff at Mar 21, 2003 at 10:23 pm

    SADAHIRO Tomoyuki said:
    > P.S. Another problem. How can it be determined whether that
    > user-defined character (UDC hereafter) is single-byte or double-byte?
    >
    > The file big5-eten.ucm does not contain how to determine the character
    > length in bytes for an unmapped UDC.

    As I understand it, the "parsing" rules for big5 involve stepping
    through the character stream one byte at a time, and:

    - if the byte just taken is 7-bit ASCII (hi-bit clear), you have one
    complete character (*); otherwise:

    - when the byte just taken is in the range [\xA1-\xFE], you have the
    first half of a 16-bit big5 character, and you need to get the next
    byte as well; if that next byte is in the range [\x40-\x7E\xA1-\xFE],
    then you now have a complete big5 code point

    - an initial byte in the range [\x80-\xA0\xFF] is presumably some form
    of noise, and should be discarded; likewise, when expecting the second
    byte of a big5 character, a byte in the range [\x00-\x3F\x7F-\xA0\xFF]
    is also noise, and presumably both this byte and the one preceding it
    should be discarded. (**)

    footnotes:

    (*) If reading a plain text file, you would of course expect (hope) that
    the ASCII codes are limited to just white-space and [\x21-\x7E] (and
    maybe \x07 "bell") -- i.e. no nulls, deletes, backspaces, EOT, etc;
    still, if these occur, they should behave as ASCII for purposes of
    parsing the characters.

    (**) I'm really just guessing about what sort of action should be taken
    when a stream violates the rules; discarding one or two bytes at a time
    when they happen to be out of bounds should be the "safest" approach.
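
The stepping rules above can be sketched as a small scanner. This is a minimal illustration of the rules as stated in this message, not of how Encode parses Big5; the function name scan_big5 and the token labels are invented here, and treating a lead byte at the very end of the input as noise is an extra assumption the rules do not cover.

```perl
use strict;
use warnings;

# Split a byte string into [kind, bytes] tokens: 'ascii', 'big5' or 'noise'.
sub scan_big5 {
    my ($octets) = @_;
    my @tokens;
    my $pos = 0;
    my $len = length $octets;
    while ($pos < $len) {
        my $lead = ord substr($octets, $pos, 1);
        if ($lead < 0x80) {
            # 7-bit byte: one complete character
            push @tokens, ['ascii', chr $lead];
            $pos += 1;
        }
        elsif ($lead >= 0xA1 && $lead <= 0xFE) {
            if ($pos + 1 >= $len) {
                # Assumption: a lead byte with nothing after it is noise.
                push @tokens, ['noise', chr $lead];
                $pos += 1;
                next;
            }
            my $trail = ord substr($octets, $pos + 1, 1);
            my $valid = ($trail >= 0x40 && $trail <= 0x7E)
                     || ($trail >= 0xA1 && $trail <= 0xFE);
            # An invalid trail byte condemns both bytes, as suggested above.
            push @tokens, [ $valid ? 'big5' : 'noise', substr($octets, $pos, 2) ];
            $pos += 2;
        }
        else {
            # 0x80-0xA0 or 0xFF in lead position
            push @tokens, ['noise', chr $lead];
            $pos += 1;
        }
    }
    return @tokens;
}

my @tokens = scan_big5("A\xA4\x40\x88\xA1\x20");
print join(' ', map { $_->[0] } @tokens), "\n";
```

Note that discarding both bytes after a bad trail byte also swallows a following space or newline, which is one reason to treat that part of the rule as tentative.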

    There is still the issue that those rules map out a very large range of
    potential code points, many of which are not in fact used or defined in
    Chinese. Also, there must be some number of big5 code points that are
    used/defined (at least by some big5 applications), but are not mapped to
    Unicode. How Perl "decode()" handles these cases may be a problem where
    developers still have some work to do to fix things...

    Dave Graff
  • SADAHIRO Tomoyuki at Mar 22, 2003 at 1:15 am

    David Graff wrote:
    > SADAHIRO Tomoyuki said:
    > > P.S. Another problem. How can it be determined whether that
    > > user-defined character (UDC hereafter) is single-byte or double-byte?
    > >
    > > The file big5-eten.ucm does not contain how to determine the character
    > > length in bytes for an unmapped UDC.
    >
    > As I understand it, the "parsing" rules for big5 involve stepping
    > through the character stream one byte at a time, and:
    >
    > - if the byte just taken is 7-bit ASCII (hi-bit clear), you have one
    > complete character (*); otherwise:
    >
    > - when the byte just taken is in the range [\xA1-\xFE], you have the
    > first half of a 16-bit big5 character, and you need to get the next
    > byte as well; if that next byte is in the range [\x40-\x7E\xA1-\xFE],
    > then you now have a complete big5 code point
    >
    > - an initial byte in the range [\x80-\xA0\xFF] is presumably some form
    > of noise, and should be discarded; likewise, when expecting the second
    > byte of a big5 character, a byte in the range [\x00-\x3F\x7F-\xA0\xFF]
    > is also noise, and presumably both this byte and the one preceding it
    > should be discarded. (**)

    Right, but such noise may be due to confusion
    with CP-950 or Big-5 HKSCS (or others?).
    They map some characters in the lead-byte area \x81-\xA0.
    We can use decode 'cp950' or decode 'big5-hkscs' for those, though.

    Well, the problem is possibly that "big-5" has many, many variants.
    (cf. http://i18n.linux.org.tw/openi18n/big5/index_en.html )

    > footnotes: (snip)
    > There is still the issue that those rules map out a very large range of
    > potential code points, many of which are not in fact used or defined in
    > Chinese. Also, there must be some number of big5 code points that are
    > used/defined (at least by some big5 applications), but are not mapped to
    > Unicode. How Perl "decode()" handles these cases may be a problem where
    > developers still have some work to do to fix things...
    >
    > Dave Graff

    For example, Microsoft defines a mapping
    of extended UDCs (EUDC) to the Private Use Area (PUA) in Unicode.
    This mapping can be computed algorithmically as follows.

    sub eudc2pua { # E000..F848
        my $cp = shift;

        if ($cp =~ /^([\x81-\x8D])([\x40-\x7E\xA1-\xFE])/) { # EEB8..F6B0
            my $le = ord($1);
            my $tr = ord($2);
            return 0xeeb8 +
                ($le - 0x81) * 0x9D + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
        }
        if ($cp =~ /^([\x8E-\xA0])([\x40-\x7E\xA1-\xFE])/) { # E311..EEB7
            my $le = ord($1);
            my $tr = ord($2);
            return 0xe311 +
                ($le - 0x8e) * 0x9D + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
        }
        if ($cp =~ /^\xC6([\xA1-\xFE])/) { # F6B1..F70E
            my $tr = ord($1);
            return 0xf6b1 + $tr - 0xA1;
        }
        if ($cp =~ /^([\xC7\xC8])([\x40-\x7E\xA1-\xFE])/) { # F70F..F848
            my $le = ord($1);
            my $tr = ord($2);
            return 0xf70f +
                ($le - 0xc7) * 0x9D + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
        }
        if ($cp =~ /^([\xFA-\xFE])([\x40-\x7E\xA1-\xFE])/) { # E000..E310
            my $le = ord($1);
            my $tr = ord($2);
            return 0xe000 +
                ($le - 0xfa) * 0x9d + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
        }
        return;
    }


    sub pua2eudc {
        my $uv = shift;
        if (0xe000 <= $uv && $uv <= 0xe310) {
            $uv -= 0xe000;
            my $tr = $uv % 0x9D + 0x40;
            return pack 'CC', int($uv/0x9D) + 0xFA,
                $tr + ($tr > 0x7E ? 0x22 : 0);
        }
        if (0xe311 <= $uv && $uv <= 0xeeb7) {
            $uv -= 0xe311;
            my $tr = $uv % 0x9D + 0x40;
            return pack 'CC', int($uv/0x9D) + 0x8E,
                $tr + ($tr > 0x7E ? 0x22 : 0);
        }
        if (0xeeb8 <= $uv && $uv <= 0xf6b0) {
            $uv -= 0xeeb8;
            my $tr = $uv % 0x9D + 0x40;
            return pack 'CC', int($uv/0x9D) + 0x81,
                $tr + ($tr > 0x7E ? 0x22 : 0);
        }
        if (0xf6b1 <= $uv && $uv <= 0xf70e) {
            $uv -= 0xf6b1;
            return pack 'CC', 0xC6, $uv + 0xA1;
        }
        if (0xf70f <= $uv && $uv <= 0xf848) {
            $uv -= 0xf70f;
            my $tr = $uv % 0x9D + 0x40;
            return pack 'CC', int($uv/0x9D) + 0xC7,
                $tr + ($tr > 0x7E ? 0x22 : 0);
        }
        return;
    }
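
The arithmetic in the two subs above follows one pattern: each lead byte owns a row of 0x9D (157) trail bytes, 0x40-0x7E followed by 0xA1-0xFE, and each lead-byte range starts at a fixed PUA code point. A table-driven restatement, written only to make that pattern visible and to allow a round-trip check, might look like the sketch below. The names @BLOCKS, trail_to_index, eudc_to_pua and pua_to_eudc are invented for this sketch; the ranges and base values are the ones given in the code above.

```perl
use strict;
use warnings;

# [first lead byte, last lead byte, first PUA code point]
my @BLOCKS = (
    [0xFA, 0xFE, 0xE000],
    [0x8E, 0xA0, 0xE311],
    [0x81, 0x8D, 0xEEB8],
    [0xC7, 0xC8, 0xF70F],
);
my $ROW = 0x9D;   # trail bytes per lead byte: 0x40-0x7E plus 0xA1-0xFE

# Position of a trail byte within its row (0..156); empty if invalid.
sub trail_to_index {
    my ($tr) = @_;
    return $tr - 0x40 if $tr >= 0x40 && $tr <= 0x7E;
    return $tr - 0x62 if $tr >= 0xA1 && $tr <= 0xFE;
    return;
}

sub eudc_to_pua {
    my ($octets) = @_;
    return if length($octets) < 2;
    my ($le, $tr) = unpack 'CC', $octets;
    # Lead byte 0xC6 is a half row: only trail bytes 0xA1-0xFE are EUDC.
    if ($le == 0xC6) {
        return unless $tr >= 0xA1 && $tr <= 0xFE;
        return 0xF6B1 + $tr - 0xA1;
    }
    my $idx = trail_to_index($tr);
    return unless defined $idx;
    for my $blk (@BLOCKS) {
        my ($lo, $hi, $base) = @$blk;
        return $base + ($le - $lo) * $ROW + $idx if $le >= $lo && $le <= $hi;
    }
    return;
}

sub pua_to_eudc {
    my ($cp) = @_;
    return pack 'CC', 0xC6, $cp - 0xF6B1 + 0xA1
        if $cp >= 0xF6B1 && $cp <= 0xF70E;
    for my $blk (@BLOCKS) {
        my ($lo, $hi, $base) = @$blk;
        my $last = $base + ($hi - $lo + 1) * $ROW - 1;
        next unless $cp >= $base && $cp <= $last;
        my $off = $cp - $base;
        my $idx = $off % $ROW;
        my $tr  = $idx < 0x3F ? 0x40 + $idx : 0x62 + $idx;
        return pack 'CC', $lo + int($off / $ROW), $tr;
    }
    return;
}

# Round-trip check over the whole E000..F848 range.
my $mismatches = grep { eudc_to_pua(pua_to_eudc($_)) != $_ } 0xE000 .. 0xF848;
print "round-trip mismatches: $mismatches\n";
```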

    P.S. This EUDC mapping *was* available from Microsoft typography
    ( http://www.microsoft.com/typography/default.asp ),
    but that file has been deleted. I don't know the reason,
    but I guess it may be because the mapping was an older version
    than the one now distributed under www.unicode.org/Public/MAPPINGS.

    However the fact that the leading byte range
    for CP-950 is \x81-\xfe is shown in
    http://www.microsoft.com/globaldev/reference/dbcs/950.htm
    (additional leadbytes are identified by a darker gray background)
    and in
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT

    SADAHIRO Tomoyuki
  • Mark Lewellen at Mar 25, 2003 at 4:55 am
    I often encounter lower-ASCII codes mixed in with Big5 text, which is fine
    and straightforward to handle. However, a problem arises when upper-ASCII
    bytes occasionally occur outside of the Big5 range. When such a byte
    occurs, it is probably an error or part of a user-defined character.
    However, it appears that Encode DOES NOT display warnings for these but
    rather maps individual upper-ASCII bytes to conventional characters such as
    Roman letters with diacritics commonly found in European languages.
    (It appears that Encode displays warnings for characters that are within
    the Big5 range but do not have a mapping to Unicode, perhaps because
    these code points are not used in Big5 itself.)

    Is there a way to cause Encode to display warnings for upper-ASCII bytes
    outside of the Big5 range when converting from Big5 to Unicode? If not,
    could the developers consider this for a future fix?

    Mark

  • Dan Kogai at Mar 25, 2003 at 5:08 am

    On Tuesday, Mar 25, 2003, at 13:59 Asia/Tokyo, Mark Lewellen wrote:
    > Is there a way to cause Encode to display warnings for upper ascii
    > outside of the Big5 range when converting from Big5 to Unicode? If
    > not, could the developers consider this for a future fix?

    Use the optional 3rd argument to decode().

    $utf8 = decode("Big5" => $big5); # ill-formed chars are mapped to U+FFFD
    $utf8 = decode("Big5" => $big5, Encode::FB_WARN); # same but warnings issued

    see "Handling Malformed Data" of "perldoc Encode" for how to use the
    3rd argument.
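
To make the effect of the 3rd argument concrete, here is a small sketch comparing three CHECK settings. It uses the 'ascii' encoding and a 0xFF byte rather than Big5, so that the result does not depend on which bytes a given Big5 table maps; the variable names are arbitrary. One caveat worth knowing: as the Encode documentation describes it, FB_QUIET stops at the first error and leaves the unprocessed bytes in the source variable, and FB_WARN behaves like FB_QUIET plus a warning, so neither substitutes U+FFFD and carries on the way the default does.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "a\xFFb";   # 0xFF is not an ASCII byte

# CHECK omitted: the bad byte becomes U+FFFD and decoding continues.
my $src     = $octets;
my $default = Encode::decode('ascii', $src);

# FB_QUIET: decoding stops at the bad byte; $src keeps the unprocessed rest.
$src = $octets;
my $partial = Encode::decode('ascii', $src, Encode::FB_QUIET);
my $rest    = $src;

# FB_CROAK: decoding dies with a "does not map to Unicode" message.
$src = $octets;
my $error = eval { Encode::decode('ascii', $src, Encode::FB_CROAK); 1 } ? '' : $@;

printf "default: %d chars, partial: '%s', rest: %d byte(s)\n",
    length $default, $partial, length $rest;
print "croak: $error";
```

Because a numeric CHECK value can modify the source string in place, the sketch decodes a copy each time.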

    Dan the Encode Maintainer
  • SADAHIRO Tomoyuki at Mar 25, 2003 at 12:51 pm
    Well, would this be right?

    I'm not sure of the status of the single-byte range
    for Big-5, though.

    diff -urN ucm~/big5-eten.ucm ucm/big5-eten.ucm
    --- ucm~/big5-eten.ucm Thu Jan 23 23:21:00 2003
    +++ ucm/big5-eten.ucm Tue Mar 25 21:43:00 2003
    @@ -137,38 +137,6 @@
    <U007E> \x7E |0 # TILDE
    <U007F> \x7F |0 # DELETE
    <U0080> \x80 |0 # <control>
    -<U0081> \x81 |0 # <control>
    -<U0082> \x82 |0 # BREAK PERMITTED HERE
    -<U0083> \x83 |0 # NO BREAK HERE
    -<U0084> \x84 |0 # <control>
    -<U0085> \x85 |0 # NEXT LINE
    -<U0086> \x86 |0 # START OF SELECTED AREA
    -<U0087> \x87 |0 # END OF SELECTED AREA
    -<U0088> \x88 |0 # CHARACTER TABULATION SET
    -<U0089> \x89 |0 # CHARACTER TABULATION WITH JUSTIFICATION
    -<U008A> \x8A |0 # LINE TABULATION SET
    -<U008B> \x8B |0 # PARTIAL LINE DOWN
    -<U008C> \x8C |0 # PARTIAL LINE UP
    -<U008D> \x8D |0 # REVERSE LINE FEED
    -<U008E> \x8E |0 # SINGLE SHIFT TWO
    -<U008F> \x8F |0 # SINGLE SHIFT THREE
    -<U0090> \x90 |0 # DEVICE CONTROL STRING
    -<U0091> \x91 |0 # PRIVATE USE ONE
    -<U0092> \x92 |0 # PRIVATE USE TWO
    -<U0093> \x93 |0 # SET TRANSMIT STATE
    -<U0094> \x94 |0 # CANCEL CHARACTER
    -<U0095> \x95 |0 # MESSAGE WAITING
    -<U0096> \x96 |0 # START OF GUARDED AREA
    -<U0097> \x97 |0 # END OF GUARDED AREA
    -<U0098> \x98 |0 # START OF STRING
    -<U0099> \x99 |0 # <control>
    -<U009A> \x9A |0 # SINGLE CHARACTER INTRODUCER
    -<U009B> \x9B |0 # CONTROL SEQUENCE INTRODUCER
    -<U009C> \x9C |0 # STRING TERMINATOR
    -<U009D> \x9D |0 # OPERATING SYSTEM COMMAND
    -<U009E> \x9E |0 # PRIVACY MESSAGE
    -<U009F> \x9F |0 # APPLICATION PROGRAM COMMAND
    -<U00A0> \xA0 |0 # NO-BREAK SPACE
    <U00A7> \xA1\xB1 |0
    <U00A8> \xC6\xD8 |0
    <U00AF> \xA1\xC2 |0
    @@ -178,11 +146,6 @@
    <U00D7> \xA1\xD1 |0
    <U00F7> \xA1\xD2 |0
    <U00F8> \xC8\xFB |0
    -<U00FA> \xFA |0 # LATIN SMALL LETTER U WITH ACUTE
    -<U00FB> \xFC |0 # LATIN SMALL LETTER U WITH CIRCUMFLEX
    -<U00FD> \xFD |0 # LATIN SMALL LETTER Y WITH ACUTE
    -<U00FE> \xFE |0 # LATIN SMALL LETTER THORN
    -<U00FF> \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
    <U014B> \xC8\xFC |0
    <U0153> \xC8\xFA |0
    <U0250> \xC8\xF6 |0
    diff -urN ucm~/big5-hkscs.ucm ucm/big5-hkscs.ucm
    --- ucm~/big5-hkscs.ucm Thu Jan 23 23:21:02 2003
    +++ ucm/big5-hkscs.ucm Tue Mar 25 21:37:10 2003
    @@ -136,13 +136,6 @@
    <U007E> \x7E |0 # TILDE
    <U007F> \x7F |0 # DELETE
    <U0080> \x80 |0 # <control>
    -<U0081> \x81 |0 # <control>
    -<U0082> \x82 |0 # BREAK PERMITTED HERE
    -<U0083> \x83 |0 # NO BREAK HERE
    -<U0084> \x84 |0 # <control>
    -<U0085> \x85 |0 # NEXT LINE
    -<U0086> \x86 |0 # START OF SELECTED AREA
    -<U0087> \x87 |0 # END OF SELECTED AREA
    <U00A7> \xA1\xB1 |0
    <U00A8> \xC6\xD8 |0
    <U00AF> \xA1\xC2 |0
    @@ -171,7 +164,6 @@
    <U00F9> \x88\x7B |0
    <U00FA> \x88\x79 |0
    <U00FC> \x88\xA2 |0
    -<U00FF> \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
    <U0100> \x88\x56 |0
    <U0101> \x88\x67 |0
    <U0112> \x88\x5A |0

    Regards,
    SADAHIRO Tomoyuki
  • Dan Kogai at Mar 25, 2003 at 2:04 pm
    Autrijus (and Porters),

    I think you are following this thread, but in case you are not,
    Sadahiro-san proposes that some extraneous (and presumably unneeded)
    control characters in \x80-\xA0 in the big5-eten map be removed to solve
    problems that arise in certain circumstances.
    Since these control characters just duplicate those at \x00-\x20, I
    think it is a good idea to go for it (and do the same to
    big5-hkscs.ucm). But I am not as sure about Big5 as you are, so please
    check whether the proposal is right.
    If you affirm the idea, I'll $Encode::VERSION++.

    Dan the Encode Maintainer
  • SADAHIRO Tomoyuki at Mar 25, 2003 at 2:06 pm
    Example:

    use Encode qw(:all);
    $big5 = "\x88\x71"; # U+00ED in Big-5 HKSCS
    $utf8 = decode("Big5" => $big5, Encode::FB_WARN);
    print $ascii = encode("ascii" => $utf8, Encode::FB_XMLCREF);

    before the patch (no warning)
    &#x88;q
    after the patch (warned)
    big5-eten "\x88" does not map to Unicode at D:/perl/bp581/lib/Encode.pm line 156.

    The message is not 'big5-eten "\x88\x71" does not map to Unicode..',
    of course, since big5-eten.ucm does not define "\x88\x71"
    as a double-byte char, though that may be what one would expect.

    Regards,
    SADAHIRO Tomoyuki

  • Mark Lewellen at Mar 25, 2003 at 4:07 pm
    Hi all-
    I want to clarify what I was trying to say:

    Dan Kogai wrote:
    > Use the optional 3rd argument to decode().
    >
    > $utf8 = decode("Big5" => $big5); # ill-formed chars are mapped to U+FFFD
    > $utf8 = decode("Big5" => $big5, Encode::FB_WARN); # same but warnings issued
    >
    > see "Handling Malformed Data" of "perldoc Encode" for how to use the
    > 3rd argument.

    I don't think FB_WARN or FB_CROAK catches the type of malformed data
    I was describing (upper ascii outside of the Big5 range).

    If I understand correctly, though, SADAHIRO Tomoyuki and Dan Kogai
    proposed correcting this by removing single-byte upper-ascii characters
    from the \x80-\xA0 range in the big5-eten map (and big5-hkscs). Is this
    correct? If so, should the other GB and Big5 maps be checked so that
    single-byte upper-ascii mappings can be removed in the same way?

    SADAHIRO Tomoyuki wrote:
    > after the patch (warned)
    > big5-eten "\x88" does not map to Unicode at D:/perl/bp581/lib/Encode.pm line 156.
    >
    > The message is not 'big5-eten "\x88\x71" does not map to Unicode..',
    > of course (big5-eten.ucm does not define "\x88\x71"
    > as a double-byte char), that may be what is expected, though.

    The 2-byte version ("\x88\x71") would be a more helpful warning to me.
    Although in an earlier email you accurately pointed out that it may be
    ambiguous what type of error exists in such a case, displaying the
    subsequent byte helps both for determining what the error is and for
    locating it in the original text. Additionally, it would be helpful to
    specify the text source (i.e. file name) in the warning message, if
    possible.

    Mark
  • SADAHIRO Tomoyuki at Mar 21, 2003 at 4:57 pm
    Brian DePradine wrote:
    > (test draft snipped)
    > I have replace test cases 25 - 28 with the ones listed above. And the
    > results were as follows:
    >
    > /defects/brian/unicode/Unicode-Transform-0.20:>make test
    > PERL_DL_NONLAZY=1 /defects/brian/nonthreaded/perl-5.8.0/perl "-I/defects/brian/nonthreaded/perl-5.8.0/lib" "-I/defects/brian/nonthreaded/perl-5.8.0/lib" "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
    > t/handler....ok
    > t/test.......ok
    > All tests successful.
    > Files=2, Tests=38, 3 wallclock secs ( 0.37 cusr + 0.12 csys = 0.49 CPU)

    Thank you for testing.
    A newer version with these tests will be released soon on CPAN.

    > One more thing though. I am writing a XS module myself and I would like
    > to convert C strings from UTF-EBCDIC to UTF-8 and back. Your module
    > works well from Perl space, but do you have any ideas of how I can do
    > this from C?
    >
    > Thanks
    > Brian

    How about the usual way to call a Perl subroutine from XS,
    as described in perlcall.pod?
    (cf. http://www.perldoc.com/perl5.8.0/pod/perlcall.html )

    Strings are then passed and returned as SVs
    rather than as C strings.

    SADAHIRO Tomoyuki
