Hello list,

While looking into some Unicode/UTF-8 encoding problems, I discovered
something strange.

URLs containing a Unicode character are not decoded correctly by
the Unicode::Encoding plugin.

Here's the simple test case:

1) Create the application and cd into it:
% catalyst.pl MyApp
% cd MyApp

2) Add the plugin Unicode::Encoding in lib/MyApp.pm

3) Replace the 'sub index { ...}' in 'lib/MyApp/Controller/Root.pm' with
the following code:

--------- B< ---------

sub index : Regex('^test$')
{
    my ( $self, $c, $parameter ) = @_;

    $c->response->body( 'length = ' . length($parameter) );
}

--------- B< ---------

4) Add a test script "t/04encoding.t" like this:

--------- B< ---------

#!/usr/bin/perl

use strict;
use warnings;

use Test::More tests => 4;
use Test::Deep;

use HTTP::Status;
use HTTP::Request;
use Data::Dumper;

BEGIN { use_ok 'Catalyst::Test', 'MyApp' }
BEGIN { use_ok 'MyApp::Controller::Root' }

foreach my $u ('http://localhost/test/%E3%81%8B',
               "http://localhost/test/\x{304b}" )
{
    my $request = HTTP::Request->new(
        'GET' => $u, [ 'Content-Type' => 'text/html; charset=utf8', ],
    );
    print $request->as_string();
    my $response = request( $request );
    is( $response->content, 'length = 1', 'length = 1' );
}

--------- B< ---------

5) Run the test script:

% perl t/04encoding.t

The first call gives the correct answer 'length = 1' because the three
UTF-8 octets (E3 81 8B, i.e. U+304B) were decoded correctly to one character.

The second call will give the wrong answer 'length = 3'.
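
To make the expected result concrete, here is a minimal sketch of the decoding the first call relies on. The helper name unescape_and_decode is hypothetical and is not part of Catalyst or the plugin; it only illustrates that the three escaped octets form a single character once they are decoded as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Catalyst): turn %XX escapes into raw
# octets, then decode those octets as UTF-8 into Perl characters.
sub unescape_and_decode {
    my ($escaped) = @_;
    ( my $octets = $escaped ) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode( 'UTF-8', $octets );
}

my $octets = "\xE3\x81\x8B";                    # what %E3%81%8B stands for
my $char   = unescape_and_decode('%E3%81%8B');

printf "octets:  length %d\n", length($octets);                   # length 3
printf "decoded: length %d, U+%04X\n", length($char), ord($char); # length 1, U+304B
```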

Please note that the statement "print $request->as_string()" prints
the same HTTP header in both cases:
GET http://localhost/test/%E3%81%8B
Content-Type: text/html; charset=utf8

My 2 cents: Further investigation led me to
Catalyst::Plugin::Unicode::Encoding::prepare_action().
The problem is that the second URL above already arrives as a utf8-flagged
string, which means that "Encode::is_utf8( $_ )" in that method returns true
and the plugin does nothing.
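
The effect of such a check can be reproduced outside Catalyst. The sketch below is an assumption about the mechanism, not the plugin's actual code: decode_unless_flagged is a hypothetical stand-in for a guard that skips values whose utf8 flag is already set, and utf8::upgrade is used to simulate a string that holds the three undecoded octets but carries the flag.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical guard, modelled on the check described above: values that
# already carry the internal utf8 flag are returned untouched.
sub decode_unless_flagged {
    my ($value) = @_;
    return $value if is_utf8($value);
    return decode( 'UTF-8', $value );
}

my $octets  = "\xE3\x81\x8B";    # three raw octets, flag off
my $flagged = $octets;
utf8::upgrade($flagged);         # same three characters, flag now on

for my $case ( [ 'flag off', $octets ], [ 'flag on', $flagged ] ) {
    my ( $label, $value ) = @$case;
    printf "%-8s is_utf8=%d length after guard=%d\n",
        $label,
        ( is_utf8($value) ? 1 : 0 ),
        length( decode_unless_flagged($value) );
}
```

With the flag off the guard decodes the octets and the length is 1; with the flag on the value passes through unchanged and the length stays 3, which matches the two results of the test script.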

Before I do anything silly, I'd like a second opinion from the list.

Is this fixable? Is Catalyst the problem here? I think not. According
to the bug report against URI (Ticket #43859, "should be _utf8_off -ed raw
data before URI encoding",
https://rt.cpan.org/Ticket/Display.html?id=43859), the problem may be
within URI.

But maybe it's possible to fix this issue in the test suite of Catalyst.

Any thoughts?

--
So long... Fuzz


  • Bill Moseley at Mar 4, 2011 at 5:33 am
    Does this help?
    On Thu, Mar 3, 2011 at 2:38 PM, Erik Wasser wrote:

    use Encode qw(encode_utf8);   # must be loaded in the test script

    foreach my $u ('http://localhost/test/%E3%81%8B',
                   "http://localhost/test/\x{304b}" )
    {
        my $request = HTTP::Request->new(
            'GET' => encode_utf8($u),   # the only change: pass the URL as octets
            [ 'Content-Type' => 'text/html; charset=utf8', ],
        );
        print $request->as_string();
        my $response = request( $request );
        is( $response->content, 'length = 1', 'length = 1' );
    }
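
As a rough illustration of what encode_utf8 changes in the suggestion above, the following standalone sketch compares the character string with its encoded form. It only uses the Encode module and makes no claim about what HTTP::Request or URI do with either value.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode_utf8 is_utf8);

my $url_chars  = "http://localhost/test/\x{304b}";   # character string, flag on
my $url_octets = encode_utf8($url_chars);            # UTF-8 byte string, flag off

# U+304B is one character but three octets, so the encoded form is
# two units longer than the character string.
printf "characters: flag=%d length=%d\n",
    ( is_utf8($url_chars) ? 1 : 0 ), length($url_chars);
printf "octets:     flag=%d length=%d\n",
    ( is_utf8($url_octets) ? 1 : 0 ), length($url_octets);
```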
    --
    Bill Moseley
    [email protected]
  • Tamás Eisenberger at Mar 4, 2011 at 7:26 am
    Hi!

    Yes, using encode_utf8 makes the test work.

    But anyway, this looks like a problem with the test, because we have
    tests that compare the entire captures / arguments / params strings with
    their originals, and if these tests pass, the lengths of the strings must
    be OK!

    So Erik, can you please review your test, or explain a real-world
    situation where you are facing this problem?

    I actually use utf8 strings in URLs now without problems :)
    --
    Eisenberger Tamás <[email protected]>
    _______________________________________________
    List: [email protected]
    Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
    Searchable archive: http://www.mail-archive.com/[email protected]/
    Dev site: http://dev.catalyst.perl.org/
  • Francisco Obispo at Mar 4, 2011 at 7:45 am
    I believe what's happening is that Catalyst decodes the UTF-8 string into Perl's internal format, and that particular example works for you because encode_utf8 forces the string back into UTF-8 octets.

    This is a script I wrote and use to test Unicode issues:

    #!/usr/bin/env perl
    use Encode;

    #my @list = Encode->encodings(q{:all});

    #printf("Encodings Available:\n");
    #map {printf("\t-%s\n",$_)}@list;


    foreach (@ARGV) {
        printf( "Word is %s\n", $_ );
        my $i = 0;
        my $string = decode( 'utf8', $_ );
        my @chr = split( q{}, $string );
        printf( "Length decoded is %d\n", length( decode_utf8($_) ) );
        printf( "Length as bytes is %d\n", length($_) );
        map {
            printf( '%d] +U%.4X - %2$04d - %s' . "\n",
                ++$i, ord($_), encode_utf8($_) )
        } @chr;
    }

    To get the correct length, I have to decode the UTF-8 string into Perl's internal format; otherwise length() just counts bytes:

    $ ./test_unicode.pl español
    Word is español
    Length decoded is 7
    Length as bytes is 8
    1] +U0065 - 0101 - e
    2] +U0073 - 0115 - s
    3] +U0070 - 0112 - p
    4] +U0061 - 0097 - a
    5] +U00F1 - 0241 - ñ
    6] +U006F - 0111 - o
    7] +U006C - 0108 - l

    As you can see, Perl's length() counts either characters or bytes, depending on whether the string has been decoded.

    So if you don't decode the string, string functions such as split() give disastrous results.
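
The split() point can be shown in a few lines. This is a minimal sketch using the same word as the run above, with the UTF-8 octets written out explicitly so that it does not depend on the terminal or source encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $bytes = "espa\xC3\xB1ol";         # "español" as UTF-8 octets
my $chars = decode_utf8($bytes);      # the same word as Perl characters

my @from_bytes = split //, $bytes;    # 8 pieces: the n-tilde is cut in two
my @from_chars = split //, $chars;    # 7 pieces: one per character

printf "bytes: %d, characters: %d\n", scalar @from_bytes, scalar @from_chars;
printf "fifth character: U+%04X\n", ord( $from_chars[4] );   # U+00F1
```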

    Hope this helps.

    Francisco

    Francisco Obispo
    Hosted@ Programme Manager
    email: [email protected]
    Phone: +1 650 423 1374 || INOC-DBA *3557* NOC
    Key fingerprint = 532F 84EB 06B4 3806 D5FA 09C6 463E 614E B38D B1BE
  • Erik Wasser at Mar 4, 2011 at 3:13 pm

    On 03/04/2011 08:26 AM, Eisenberger Tamás wrote:

    So Erik, can you please review your test, or explain a real-world
    situation where you are facing this problem?

    I was trying to add some utf8 tests for my controller. Calling the
    running Catalyst instance from the command line worked, but calling it
    from the test did not.

    I was wondering about the difference between the two cases,
    because the string used was the same, so I expected the same result.

    Two things will fix this:
    1) Understand the issue (string with utf8 flag on/off)
    2) Use Encode::encode() in your .t files

    I don't know how real-world this is. B-)
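
One way to apply point 2 in a .t file is to build the request path from octets before it reaches HTTP::Request. The sketch below is only an illustration: escape_path is a hypothetical helper, not something provided by Catalyst::Test or URI, and it escapes every octet outside the unreserved characters and "/".

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical test helper: encode a character string as UTF-8 and
# percent-escape every octet outside the unreserved set and "/".
sub escape_path {
    my ($path) = @_;
    my $octets = encode( 'UTF-8', $path );
    $octets =~ s{([^A-Za-z0-9\-._~/])}{sprintf '%%%02X', ord $1}ge;
    return $octets;
}

print escape_path("/test/\x{304b}"), "\n";   # prints /test/%E3%81%8B
```

The result is plain ASCII, so the utf8 flag of the original string no longer matters once the URL is passed on.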

    --
    So long... Fuzz

Discussion Overview
group: catalyst
categories: catalyst, perl
posted: Mar 3, '11 at 10:38p
active: Mar 4, '11 at 3:13p
posts: 5
users: 4
website: catalystframework.org
irc: #catalyst

People

Translate

site design / logo © 2023 Grokbase