[Catalyst] mod_perl converts latin1 to utf8 !?
I'm a new Catalyst user and this is my first Catalyst project, so
please forgive my ignorance.

I am developing a simple Catalyst application using MySQL, FormFu and
TT. I have everything encoded in ISO-8859-1: the data in MySQL, the
Perl files, the FormFu .yml files, and the TT templates. (I am even
running in an ISO-8859-1 locale, for what it's worth.)

Following the advice in an earlier thread ("Charset best practice" -
<http://www.mail-archive.com/catalyst@lists.scsys.co.uk/msg00284.html>),
I have overridden the process() method of Catalyst::View::TT with a
version that does $c->response->content_type('text/html;
charset=iso-8859-1');
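
An override along those lines might look like the following (a minimal sketch only, assuming a hypothetical app named MyApp; the actual view class and method body from the referenced thread may differ):

```perl
package MyApp::View::TT;
use strict;
use warnings;
use base 'Catalyst::View::TT';

# Hypothetical sketch: force the response charset before delegating
# to the normal TT rendering.
sub process {
    my ($self, $c) = @_;
    $c->response->content_type('text/html; charset=iso-8859-1');
    return $self->SUPER::process($c);
}

1;
```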

Running under the test server (myapp_server.pl), everything works
perfectly. Non-ASCII characters (like æøå) are displayed correctly,
and they end up stored correctly in MySQL if I enter them in forms.

But under mod_perl, characters are displayed as their two-byte UTF-8
encoding (for instance 'æ' becomes 'Ã¦'), and values entered in forms
are stored in MySQL that way - and are "doubly" UTF-8 encoded when
they are displayed in the browser again.

Without the modified process() method, non-ASCII characters are displayed
correctly initially under mod_perl, but characters entered in forms
are still stored in MySQL as UTF-8, and are shown that way when
displayed in the browser again.

Thus it would seem that mod_perl converts latin1 to utf8 when sending
to the browser, but not back again when receiving forms.

(Under the test server, without the modified process(), non-ASCII
characters are displayed as invalid characters. This is natural,
since we are telling the browser that the document is UTF-8 encoded,
while sending it ISO-8859-1 characters.)
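
(The "doubly encoded" effect is easy to reproduce with Encode alone; a minimal sketch, with 'æ' standing in for any latin-1 character:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "\x{E6}";                # U+00E6, 'æ', one character
my $once = encode('UTF-8', $text);  # two octets: 0xC3 0xA6 -- "Ã¦" if read as latin-1

# If something mistakes those UTF-8 octets for latin-1 text and
# encodes them to UTF-8 again, each octet doubles:
my $twice = encode('UTF-8', decode('iso-8859-1', $once));  # four octets

printf "once: %d bytes, twice: %d bytes\n", length($once), length($twice);
```

Round-tripping the bytes through a latin-1 decode and a second UTF-8 encode is exactly the symptom described above.)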

What can I do about this? Can I tell mod_perl to "leave my characters
alone"? Have I made an error somewhere?

--
Regards,
Bjørn-Helge Mevik


  • Dami Laurent (PJ) at Dec 22, 2008 at 8:22 am

    -----Original Message-----
    From: Bjørn-Helge Mevik
    Sent: Sunday, 21 December 2008 23:52
    To: catalyst@lists.scsys.co.uk
    Subject: [Catalyst] mod_perl converts latin1 to utf8 !? ....

    > Thus it would seem that mod_perl converts latin1 to utf8 when sending
    > to the browser, but not back again when receiving forms.

    Mod_perl probably does nothing to your encoding, but Apache might interfere.

    a) did you configure your Apache as follows ?

    AddDefaultCharset iso-8859-1

    b) try to look at the HTTP traffic (using Firebug or Fiddler2), to see if there are some other charset=... headers generated by some component in your chain.

    c) try some static latin1 pages in Apache htdocs to see if they are rendered correctly.

    Hope this helps,

    Good luck, Laurent Dami
  • Bjørn-Helge Mevik at Dec 26, 2008 at 8:03 pm

    Dami Laurent (PJ) wrote:

    > Mod_perl probably does nothing to your encoding, but Apache might interfere.

    Hm. That's a thought...

    > a) did you configure your Apache as follows ?
    >
    > AddDefaultCharset iso-8859-1

    I've now tried that, but it had no effect on the encoding.

    > b) try to look at the HTTP traffic (using Firebug or Fiddler2), to
    > see if there are some other charset=... headers generated by some
    > component in your chain.

    I should probably start using tools like that -- so far I've only used
    telnet for looking at the HTTP traffic. :-) Anyway, there is only one
    "Content-Type: text/html; charset=" HTTP header, and it conforms to
    the setting in the process() method of Catalyst::View::TT.

    (I've also tried adding an http-equiv="Content-Type" meta tag to the
    <head>, but to no avail.)

    > c) try some static latin1 pages in Apache htdocs to see if they are
    > rendered correctly.

    I've tried static latin1 and utf8 pages, and they are rendered
    correctly: Apache does not change the encoding of the characters. If
    the page contains an http-equiv="Content-Type" meta tag, it
    is respected; otherwise Apache looks at the characters and sets the
    HTTP Content-Type header correctly.

    Further, I wrote a small module with a handler() and ran it under
    mod_perl (outside the Catalyst application):

    ===============
    package Enctest;
    use strict;
    use warnings;
    use Encode;
    #use utf8;

    sub handler {
        my $r = shift;
        my $A = "<p>æøå</p>";
        $r->content_type('text/html; charset=iso8859-1');
        #$r->content_type('text/html; charset=utf-8');
        $r->print("<html>$A");
        $r->print(encode('ISO-8859-1', $A));
        $r->print(encode('UTF-8', $A) . "</html>");
        return 0;
    }

    1;
    ===============

    I tested all combinations of
    - Storing the file as latin1 vs. utf8
    - With and without "use utf8;"
    - charset iso8859-1 vs. utf-8

    In all combinations, Apache+mod_perl faithfully reproduced the bytes
    that, as far as I understand, Perl should output in the different
    print()s.
    From this it would seem that Apache and mod_perl do not recode the
    characters. Perhaps it could be something that TT does when run under
    mod_perl (as this does not happen under the development server)?

    --
    Bjørn-Helge Mevik
  • Jonathan Rockway at Dec 22, 2008 at 8:53 pm

    On Sun, Dec 21, 2008 at 11:52:27PM +0100, Bjørn-Helge Mevik wrote:

    > I am developing a simple Catalyst application using MySQL, FormFu and
    > TT. I have everything encoded in ISO-8859-1: the data in MySQL, the
    > Perl files, the FormFu .yml files, and the TT templates. (I am even
    > running in an ISO-8859-1 locale, for what it's worth.)
    >
    > Following the advice in an earlier thread ("Charset best practice" -
    > <http://www.mail-archive.com/catalyst@lists.scsys.co.uk/msg00284.html>),
    > I have overridden the process() method of Catalyst::View::TT with a
    > version that does $c->response->content_type('text/html;
    > charset=iso-8859-1');
    >
    > ...
    >
    > What can I do about this? Can I tell mod_perl to "leave my characters
    > alone"? Have I made an error somewhere?

    Sort of. Working with latin-1, or any character encoding, is just
    like working with UTF-8. When you work with UTF-8, you need to say
    something like this:

    my $data = Encode::decode('utf8', $raw_data);
    process($data);
    print Encode::encode('utf8', $data);

    We decode the binary data into text, then we work with the text, then
    we encode the text back into binary data for the wire (or the user's
    xterm, in this case).

    When you are working with iso-8859-1, you need to do exactly the same
    thing. Everything you read needs to be decoded, and everything you
    need to write needs to be encoded. (In this case, it is sort of an
    uphill battle since most people use Unicode now, and that is the "code
    path" with the most testing. There are probably places that helpfully
    treat your latin-1 as utf-8, which is definitely incorrect of them.)

    Decoding is probably a no-op, so focus on the encoding part. Take a
    look at Catalyst::Plugin::Unicode (specifically finalize_body, I
    think), and change the Encode::encode('utf-8', ...) to
    Encode::encode('iso-8859-1', ...)

    I bet this will solve your problem, but if not, let us know what code
    you tried and where, and we will try to help you some more.

    Regards,
    Jonathan Rockway
  • Aristotle Pagaltzis at Dec 22, 2008 at 11:55 pm

    * Jonathan Rockway [2008-12-22 22:00]:
    > my $data = Encode::decode('utf8', $raw_data);
    > process($data);
    > print Encode::encode('utf8', $data);

    Use `UTF-8`, not `utf8`. The lowercase non-dash version will
    perform purely the integer representation conversion but will
    not do any validity checks, such as whether an octet sequence
    actually decodes to a valid codepoint or if it is even
    well-formed, so it could be used to hide XSS or other injection
    attacks.

    It's annoying that Perl makes the lazy choice the wrong one.
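
    The difference is observable with a surrogate codepoint (a minimal sketch; the lax 'utf8' decoder's behaviour is version-dependent in detail, but it accepts sequences that strict 'UTF-8' rejects):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xED\xA0\x80";   # UTF-8-style encoding of U+D800, a UTF-16 surrogate

# Lax 'utf8' performs only the representation conversion and hands
# back the (invalid) surrogate codepoint unchecked:
my $lax = decode('utf8', $bytes);
printf "lax:    U+%04X\n", ord $lax;

# Strict 'UTF-8' validates, and (with the default CHECK value)
# substitutes U+FFFD for the malformed input:
my $strict = decode('UTF-8', $bytes);
printf "strict: U+%04X\n", ord $strict;
```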

    Regards,
    --
    Aristotle Pagaltzis // <http://plasmasturm.org/>
  • Bjørn-Helge Mevik at Dec 28, 2008 at 11:17 am

    Jonathan Rockway wrote:

    > When you are working with iso-8859-1, you need to do exactly the same
    > thing. Everything you read needs to be decoded, and everything you
    > need to write needs to be encoded. (In this case, it is sort of an
    > uphill battle since most people use Unicode now, and that is the "code
    > path" with the most testing. There are probably places that helpfully
    > treat your latin-1 as utf-8, which is definitely incorrect of them.)
    I actually do not care which encoding is used for storing or
    displaying my data. I chose iso-8859-1 because I'm used to it, and I
    thought it would be easier than using UTF-8. (I saw quite a few emails
    asking how to handle UTF-8, so I guessed it wouldn't be
    straightforward. Also, my first attempt at following the advice in
    <http://dev.catalystframework.org/wiki/gettingstarted/tutorialsandhowtos/using_unicode> was unsuccessful.)
    > Decoding is probably a no-op, so focus on the encoding part. Take a
    > look at Catalyst::Plugin::Unicode (specifically finalize_body, I
    > think), and change the Encode::encode('utf-8', ...) to
    > Encode::encode('iso-8859-1', ...)
    I tried modifying Catalyst::Plugin::Unicode the following way:
    062016150213:/usr/share/perl5/Catalyst/Plugin# diff Unicode.pm.orig Unicode.pm
    3a4
    > use Encode qw(encode decode);
    22c23
    < utf8::encode( $c->response->{body} );
    ---
    > encode('ISO-8859-1', $c->response->{body} );
    38c39
    < utf8::decode($_) for ( ref($value) ? @{$value} : $value );
    ---
    > Encode::decode('ISO-8859-1', $_) for ( ref($value) ? @{$value} : $value );
    When running under the development server, this seemed to be a no-op:
    everything still worked perfectly.

    Under mod_perl, it was almost a no-op as well. The only difference
    was that when entering non-ASCII letters in a form field and storing
    it, the entered characters were now correctly handled -- however, any
    _existing_ non-ASCII character was now stored in the database as
    UTF-8.
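
    One detail worth checking in the diff above: utf8::encode() and utf8::decode() modify their argument in place, while Encode's encode()/decode() return the converted string and (with the default CHECK value) leave the argument untouched -- so calling them in void context is itself a no-op. A minimal illustration of the difference:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text   = "caf\x{E9}";            # 4 characters, ends in U+00E9
my $octets = encode('UTF-8', $text); # returns 5 octets: "caf" . "\xC3\xA9"

encode('UTF-8', $text);              # void context: result thrown away
print length($text), "\n";           # still 4 -- $text was not modified

utf8::encode($text);                 # the in-place variant
print length($text), "\n";           # now 5 -- $text holds the octets
```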


    In an attempt to "stick to the broad path", I tried using the
    _unmodified_ Catalyst::Plugin::Unicode (and removed my modified
    process() method of TT). (I still have mysql and all files in latin1,
    though.) Now both Apache/mod_perl and the development server work
    identically (which is progress, I think :-): All characters are
    displayed correctly (as UTF-8), but non-ASCII characters entered into
    a form get stored as UTF-8 in mysql.

    So perhaps my best bet now is to try and get my data properly encoded
    on the way to mysql?

    --
    Bjørn-Helge Mevik
  • Marius Kjeldahl at Dec 28, 2008 at 2:40 pm
    I've followed a similar path to yours, and the best choice eventually
    for my projects was to make sure everything is utf8.

    Having said this, the specific bug that bit me in the ass was Firefox
    3.0 messing up character set encodings for anything using Ajax-style
    calls. Fortunately, 3.1 fixed it.

    Just mentioning this in case it may be related to your struggles.

    Regards,

    Marius K.

    Bjørn-Helge Mevik wrote:

    [snip]

    > So perhaps my best bet now is to try and get my data properly encoded
    > on the way to mysql?
  • Bjørn-Helge Mevik at Dec 29, 2008 at 1:09 pm

    Marius Kjeldahl wrote:

    > I've followed a similar path to yours, and the best choice eventually
    > for my projects was to make sure everything is utf8.

    Thanks. I might follow that path eventually. For the time being,
    I've found a hack that at least works on my development server
    (A.K.A. my home computer :-).

    --
    Bjørn-Helge Mevik
  • Zbigniew Lukasiak at Dec 29, 2008 at 8:44 am
    On Sun, Dec 28, 2008 at 12:17 PM, Bjørn-Helge Mevik wrote:

    snip snip
    > I tried modifying Catalyst::Plugin::Unicode the following way:

    [snip]

    > When running under the development server, this seemed to be a no-op:
    > everything still worked perfectly.
    >
    > Under mod_perl, it was almost a no-op as well. The only difference
    > was that when entering non-ASCII letters in a form field and storing
    > it, the entered characters were now correctly handled -- however, any
    > _existing_ non-ASCII character was now stored in the database as
    > UTF-8.
    Here is my wild guess of what happened: in some circumstances the
    internal representation of Perl strings can be latin1 - and if you
    don't encode it when writing to the database you'll get latin1 in the
    database - but for the most common case the internal representation
    will be utf8 - and that you'll have in the db when writing to it
    without any encoding. In theory you should not rely on that - because
    it is *internal representation*. You need to encode every output
    (and decode every input) that comes from the Perl program to the
    outside world - including the database. For each output (input) you do
    it separately and you can use different encoding (like UTF-8 for the
    web pages and Latin-1 for the DB). That said, I don't know much
    about the practical side of that - for my work I just always use UTF-8
    and pg_enable_utf8.
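
    The point above -- one decoded string inside the program, a separate encode step per output channel -- looks like this in isolation (a sketch; the channel names are purely illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode);

# One piece of decoded text inside the program:
my $name = "Bj\x{F8}rn";                     # 5 characters, U+00F8 in the middle

# Each output channel gets its own encode call, and the encodings
# need not agree with each other:
my $for_web = encode('UTF-8',      $name);   # 6 octets for the HTTP response
my $for_db  = encode('iso-8859-1', $name);   # 5 octets for a latin1 database
```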
  • Bjørn-Helge Mevik at Dec 29, 2008 at 1:07 pm

    Zbigniew Lukasiak wrote:

    > Here is my wild guess of what happened: in some circumstances the
    > internal representation of Perl strings can be latin1 - and if you
    > don't encode it when writing to the database you'll get latin1 in the
    > database - but for the most common case the internal representation
    > will be utf8 - and that you'll have in the db when writing to it
    > without any encoding.
    This is my guess as well. With the C::P::Unicode, everything that
    comes into the app from the browser seems to have UTF-8 as internal
    representation, and everything that comes from mysql seems to have
    ISO-8859-1 representation.

    Thus I've found a hack that seems to work for me: I use
    on_connect_do => [ "set character_set_client = 'utf8'" ] in
    connect_info. This tells mysql to expect UTF-8 from the client.
    ("set names 'utf8'" would also set the output to UTF-8, so I can't use
    that). It is recommended to also set mysql_enable_utf8 => 1, but I
    still haven't seen any effect of that setting (my DBD::mysql is 4.008,
    so it should be new enough).

    This will probably break when I move the app to the production
    server. :-)
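
    In model configuration terms, the hack described above would sit roughly here (a sketch only; the dsn, user and password are placeholders, and the exact connect_info layout accepted by Catalyst::Model::DBIC::Schema varies by version):

```perl
# MyApp::Model::DB -- hypothetical configuration sketch
__PACKAGE__->config(
    schema_class => 'MyApp::Schema',
    connect_info => {
        dsn               => 'dbi:mysql:database=myapp',   # placeholder
        user              => 'myapp_user',                 # placeholder
        password          => 'secret',                     # placeholder
        mysql_enable_utf8 => 1,
        # Tell MySQL to expect UTF-8 from the client, without also
        # switching result sets to UTF-8 (as "set names 'utf8'" would):
        on_connect_do     => [ "set character_set_client = 'utf8'" ],
    },
);
```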
    > In theory you should not rely on that - because
    > it is *internal representation*. You need to encode every output
    > (and decode every input) that comes from the Perl program to the
    > outside world - including the database. For each output (input) you do
    > it separately and you can use different encoding (like UTF-8 for the
    > web pages and Latin-1 for the DB).
    I heartily agree. Unfortunately, so far I haven't been able to figure
    out how to get the proper encode()/decode() calls in place when using
    Catalyst::Model::DBIC::Schema.

    --
    Bjørn-Helge Mevik

Discussion Overview
group: catalyst
categories: catalyst, perl
posted: Dec 21, '08 at 10:52p
active: Dec 29, '08 at 1:09p
posts: 10
users: 6
website: catalystframework.org
irc: #catalyst
