Porters,

Perl now accepts UTF-16 source.

% perl foo-u.pl
5.008005
% hexdump -C foo-u.pl
00000000  00 70 00 72 00 69 00 6e  00 74 00 20 00 24 00 5d  |.p.r.i.n.t. .$.]|
00000010  00 2c 00 20 00 22 00 5c  00 6e 00 22 00 3b 00 0a  |.,. .".\.n.".;..|
00000020

So far so good.

But once the source spans two or more lines the camel falls into
indigestion, even when the extra line is empty.

% perl foo-u.pl
Bareword found where operator expected at foo-u.pl line 2, near ""
(Missing operator before r?)
Bareword found where operator expected at foo-u.pl line 2, near ""
(Missing operator before i?)
Bareword found where operator expected at foo-u.pl line 2, near ""
(Missing operator before n?)
Bareword found where operator expected at foo-u.pl line 2, near ""
(Missing operator before t?)
Scalar found where operator expected at foo-u.pl line 2, near ""
(Missing operator before ?)
syntax error at foo-u.pl line 2, near ""
Unmatched right square bracket at foo-u.pl line 2, at end of line
Execution of foo-u.pl aborted due to compilation errors.
% hexdump -C foo-u.pl
00000000  00 0a 00 70 00 72 00 69  00 6e 00 74 00 20 00 24  |...p.r.i.n.t. .$|
00000010  00 5d 00 2c 00 20 00 22  00 5c 00 6e 00 22 00 3b  |.].,. .".\.n.".;|
00000020  00 0a                                             |..|
00000022

It's just that perl does not correctly handle the UTF-16 newline in the
script (\x00\x0a in UTF-16BE). I doubt the usefulness of UTF-16
scripts, but we should obviously fix this so it works as advertised.

Dan the Encode Maintainer


  • Jarkko Hietaniemi at Sep 10, 2004 at 5:58 am
    As Dan noticed, it seems that I didn't much test my UTF-16 patch
    (#22832) :-( Pretty much all multiline stuff seems to be not working
    with UTF-16 scripts, be it a single logical line split across several
    physical ones, or multiple logical lines, or a multiline string
    constant. Even "foo\n" seems to be misparsed (as "\\n"). It seems
    that in many places toke.c just happily does a s++, without caring
    much about any possible input filtering being in place.

    I don't have the time to look into this for any foreseeable future,
    I am afraid :-/ I'd like to leave you with a quote from more than six
    years ago:

    From memory, I got to yylex() and then blacked out for a month. When
    I came to, I had tattoos in a language I can't read on parts of my
    body I can't see without a mirror, and I keep getting postcards
    covered in lipstick from someone named Yuri.
    -- gnat
    (http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/1998-01/msg00378.html)

    --
    Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
    biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
  • Nick Ing-Simmons at Sep 15, 2004 at 3:49 pm

    Jarkko Hietaniemi writes:
    > As Dan noticed, it seems that I didn't much test my UTF-16 patch
    > (#22832) :-( Pretty much all multiline stuff seems to be not working
    > with UTF-16 scripts, be it a single logical line split across several
    > physical ones, or multiple logical lines, or a multiline string
    > constant. Even "foo\n" seems to be misparsed (as "\\n"). It seems
    > that in many places toke.c just happily does a s++, without caring
    > much about any possible input filtering being in place.

    s++ should be fine if something is "decoding" UTF-16 into internal UTF-8
    form before toke.c gets at it?

  • Jarkko Hietaniemi at Sep 15, 2004 at 5:13 pm

    Nick Ing-Simmons wrote:
    > Jarkko Hietaniemi <jhi@iki.fi> writes:
    >> As Dan noticed, it seems that I didn't much test my UTF-16 patch
    >> (#22832) :-( Pretty much all multiline stuff seems to be not working
    >> with UTF-16 scripts, be it a single logical line split across several
    >> physical ones, or multiple logical lines, or a multiline string
    >> constant. Even "foo\n" seems to be misparsed (as "\\n"). It seems
    >> that in many places toke.c just happily does a s++, without caring
    >> much about any possible input filtering being in place.
    >
    > s++ should be fine if something is "decoding" UTF-16 into internal UTF-8
    > form before toke.c gets at it?

    Well, yes. But that's not quite how the UTF-16 filtering works, I found
    out... there is a stage between the raw read() and the UTF-8-fication,
    and buffers ended up with UTF-8 followed by UTF-16. A couple of days ago
    I submitted to NickC and Rafael a hacky patch which could go at least
    to blead. The patch allows *most* of the minitest to pass with any of the
    UTF-16[BEB?] (there's a new "minitest.utf16" make target).

    Still waiting for postcards from Yuri.

    --
    Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special
    biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

Discussion Overview
group: perl5-porters
categories: perl
posted: Sep 10, '04 at 12:29a
active: Sep 15, '04 at 5:13p
posts: 4
users: 3
website: perl.org
