Getting really frustrated with mod_perl2's apparent inability to
properly read UTF8 input.

Here's my mod_perl2 setup:

   Apache 2.2.[something]
   mod_perl 2.0.7 (or nearly that)
   ModPerl::Registry
   Perl "script" with CGI.pm

Very early in my app:

   ## ensure utf8 CGI params:
   $CGI::PARAM_UTF8 = 1;

   binmode STDIN, ":utf8";
   binmode STDOUT, ":utf8";
   binmode STDERR, ":utf8";

This works fine in CGI mode: when I ask for $foo = $cgi->param('foo'),
DBI::data_string_desc($foo) shows a UTF8 string with the proper
discrepancy between bytes and chars.

But when I try to run it under mod_perl, the returned string appears
to be the raw ascii bytes, and definitely not utf8. Of course, when I
store that in the database (using DBD::Pg), the "latin-1" is encoded
to "utf-8", and I get a bunch of weird chars on the output.

Has anyone managed to round-trip UTF8 from form to database and back
using a setup similar to this?

I suspect part of the problem is this in CGI.pm:

     'read_from_client' => <<'END_OF_FUNC',
     # Read data from a file handle
     sub read_from_client {
     my($self, $buff, $len, $offset) = @_;
     local $^W=0; # prevent a warning
     return $MOD_PERL
         ? $self->r->read($$buff, $len, $offset)
             : read(\*STDIN, $$buff, $len, $offset);
     }
     END_OF_FUNC

Since I binmode STDIN, the non-$MOD_PERL works ok here. What's the
equivalent of $r->read() that marks the incoming stream as UTF8, so I
get chars instead of bytes? Or can I just read(\*STDIN) in mod_perl2
as well? (I know that was supported at one point...)
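
For reference, the byte/character discrepancy mentioned above can be seen without CGI.pm or DBI. This is only a minimal sketch using core modules; the sample string and the helper name `describe` are made up for illustration and are not part of the setup described in this post.

```perl
use strict;
use warnings;
use bytes ();                 # load bytes::length without enabling byte semantics
use Encode qw(decode);

# Hypothetical helper: summarise how Perl currently sees a string.
sub describe {
    my ($s) = @_;
    return sprintf "utf8_flag=%d chars=%d bytes=%d",
        (utf8::is_utf8($s) ? 1 : 0), length($s), bytes::length($s);
}

my $raw     = "caf\xC3\xA9";           # UTF-8 octets as they arrive on the wire
my $decoded = decode('UTF-8', $raw);   # the same text as a character string

print describe($raw),     "\n";   # utf8_flag=0 chars=5 bytes=5
print describe($decoded), "\n";   # utf8_flag=1 chars=4 bytes=5
```

The second line is the "proper discrepancy": fewer characters than bytes. The first is what raw, undecoded input looks like.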



--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix consulting, Technical writing, Comedy, etc. etc.
Still trying to think of something clever for the fourth line of this .sig


  • André Warnier at Sep 3, 2014 at 9:17 am
    Hi Randal.


    I share your frustration, as I have been dealing for a long time with multi-lingual web
    applications, using perl and mod_perl.

    First, a very top-level comment: the basic problem here is the incompleteness of the HTTP
    RFCs, and the lack of proper support for international character sets, even today.
    When a browser POSTs the contents of the <input> elements of a <form> to a server,
    there is a set of rules which, in principle, determine the character set in which
    this content is encoded. The problem is that these rules are arcane, often
    confusing, and in addition regularly flouted by different browser makes and versions (not
    to mention the umpteen non-browser proprietary HTTP clients).

    For example, when a browser sends the content of a form with the "multipart/form-data"
    "enctype", the content of each form parameter is sent as a separate section, in a form
    similar to the parts of a multi-part RFC-822 email. In theory, each of these parts should
    have its own "content-type" header, and if it is text, it should also contain a "charset"
    attribute indicating the corresponding data's encoding.
    (And if it doesn't, by virtue of the HTTP RFCs, it should be ISO-8859-1, which is still
    the default HTTP character set today; quite ridiculous, but so it is.)

    But the sad reality is that browsers don't do that, and so in practice, in many cases,
    the server-side application is reduced to "guessing".

    From experience more than from definite code knowledge, I have to suppose that this kind of
    confusion sometimes also hits developers of modules such as CGI.pm and mod_perl, so that
    over the years, things have tended to vary from one version to another (versions of
    browsers, versions of perl, versions of mod_perl, versions of CGI.pm). Maybe also because
    of all the reasons above, there is just no "right" way of handling this, so CGI.pm always
    returns "bytes" (and libapreq2 may do things differently).

    In the end, rather than trying to follow the latest developments all the time and
    continuously patch my programs because of all this, I have resorted to some "defensive
    programming" techniques for interpreting <form>-posted data, which have been
    working fine for me for the last few years. It may well be that they are total
    overkill, but in practice they have saved me a lot of time not spent wondering why the
    data in some application suddenly started to show up as "A tilde" followed by some bizarre
    graphic sign (or, conversely, as a question mark embedded in a lozenge).

    (Even logging this stuff and trying to figure out what is going on is a pain, because you
    have to figure out first in what encoding you are logging, and second in what encoding you
    are viewing your logs.)

    The methodology I follow is as follows:

    1) all HTML <form> pages of the applications should have a tag like:
    <meta http-equiv="Content-Type" content="text/html; charset=.....">
    2) all <form> elements in the page should have the attributes
    enctype="multipart/form-data"
    accept-charset="....." (the same as above)
    The above 2 things do not really guarantee anything, but at least they establish some
    "baseline" which helps in interpreting the rest (and slapping users when they change their
    browser settings).

    3) all forms contain a hidden text <input> like
    <input type="hidden" name="my-UTF8-check" value="AÜÖ.."> (some known sequence of
    "diacritics" characters guaranteed to have a different byte length between ISO-8859-x and
    UTF-8 encoding)

    The point of this one is :
    - all "your" forms have this parameter, so when you receive some posted data, you can
    reasonably assume that it is one of "your" forms that sent it.
    - if the browser sends the data in iso-8859-1, this string will be a certain length in
    bytes, and similarly for UTF-8. You can measure that length in a "use bytes;" section of
    the cgi-bin script. And you can also just compare this with some carefully-crafted string
    constant.

    Then, on the server side, I have some code which systematically checks which is the
    encoding that is *really* seen by the program (cgi-bin script or mod_perl module) for
    these form input elements (using various clues from the server configuration, and the
    above received hidden form parameter).
    And when this code "knows" the received encoding, it then systematically "sets" or not the
    perl "utf8" flag for these received cgi->param("x") values before actually using them (or
    encode/decode's them as appropriate).
    The point here being that the rest of your script can assume that all the param values are
    UTF-8 encoded, and known as such by Perl; and be done with it all.
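
The hidden-field check described above could look roughly like the following. This is only a sketch under stated assumptions: the helper names `guess_form_encoding` and `normalise_params` are hypothetical, the parameters are passed in as a plain hash of octet strings rather than fetched through CGI.pm, and only UTF-8 and ISO-8859-1 are considered.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Known probe text: "A", U-umlaut, O-umlaut (the hidden field's value).
my $PROBE = "A\x{DC}\x{D6}";

# Hypothetical helper: given the probe exactly as received (octets),
# name the encoding the client appears to have used, or return nothing.
sub guess_form_encoding {
    my ($received) = @_;
    return unless defined $received;
    return 'UTF-8'      if $received eq encode('UTF-8',      $PROBE);
    return 'ISO-8859-1' if $received eq encode('ISO-8859-1', $PROBE);
    return;
}

# Hypothetical helper: decode every value using the detected encoding, so
# the rest of the program only ever sees character strings.
# Assumes all values arrive as undecoded octets.
sub normalise_params {
    my ($params) = @_;
    my $enc = guess_form_encoding($params->{'my-UTF8-check'})
        or die "unrecognised form encoding\n";
    my %out;
    $out{$_} = decode($enc, $params->{$_}) for keys %$params;
    return \%out;
}

my $from_utf8   = normalise_params({ 'my-UTF8-check' => "A\xC3\x9C\xC3\x96", name => "Andr\xC3\xA9" });
my $from_latin1 = normalise_params({ 'my-UTF8-check' => "A\xDC\xD6",         name => "Andr\xE9" });
print "same text either way\n" if $from_utf8->{name} eq $from_latin1->{name};
```

Comparing against a constant is the "carefully-crafted string" variant; measuring the byte length of the probe would work equally well for telling these two encodings apart.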

    I'm not saying that this is the cleverest and most elegant and most efficient way of
    dealing with this, nor that it is the answer you were looking for.
    But it's helped me sleep better for quite a while now.
  • Cosimo Streppone at Sep 3, 2014 at 9:23 am

    On 09/03/2014 11:17 AM, André Warnier wrote:

    3) all forms contain a hidden text <input> like
    <input type="hidden" name="my-UTF8-check" value="AÜÖ.."> (some known
    sequence of "diacritics" characters guaranteed to have a different byte
    length between ISO-8859-x and UTF-8 encoding)
    [...]
    But it's helped me sleep better for quite a while now.
    This is brilliant :-)
    Thanks André.

    --
    Cosimo
  • Dr James A Smith at Sep 3, 2014 at 9:34 pm
    I encode a "pound sign" as a parameter, which indicates whether the
    content is UTF-8, UCS or latin-1 - and this seems to resolve most of the
    issues... It did take a lot of effort to fix issues with utf8, and there
    are a lot of these - between form -> post; between requests if storing
    data in sessions; between script and database; etc...

    I do not, however, use CGI.pm, but use APR instead, which I know works (and
    may be less error prone)

    James

  • Randal L. Schwartz at Sep 3, 2014 at 7:39 pm
    "André" == André Warnier writes:

    André> The methodology I follow is as follows :

    André> 1) all html <form> pages of the applications should have a tag like :
    André> <meta content-type="text/html; charset=.....">
    André> 2) all <forms> in the page should have the attributes
    André> enctype="application/form-data"
    André> accept-charset="....." (the same as above)

    I've pretty much got success with CGI (and CGI.pm) doing the things I
    listed above. So this isn't needed. I'm not having problems with the
    browser, Apache, or Perl, or RDBO, or Postgresql. (Even that took a bit
    of work to get working, and so I think none of those are the issue.)

    What I need to know is what is mod_perl doing differently? Does it not
    respect binmode STDIN, ":utf8"? Apparently not. So if you know of a
    way to get mod_perl to "fix" reading from the browser properly, I'm
    interested in that.

  • Torsten Förtsch at Sep 4, 2014 at 8:21 am

    On 03/09/14 21:38, Randal L. Schwartz wrote:
    What I need to know is what is mod_perl doing differently? Does it not
    respect binmode STDIN, ":utf8"? Apparently not. So if you know of a
    way to get mod_perl to "fix" reading from the browser properly, I'm
    interested in that.
    Something along these lines:

    use Apache2::RequestIO ();
    use Encode ();
    BEGIN {
         my $orig=\&Apache2::RequestRec::read;
         *Apache2::RequestRec::read=sub {
             my ($r, $buf, $len, $offset)=@_;
             my $_buf;
             my $rc=$r->$orig($_buf, $len);
             substr($buf, $offset, undef, Encode::decode_utf8 $_buf);
             return $rc;
         };
    }

    It's a bit more complicated than that because $_buf may end in the
    middle of a character. But you can catch that and read a few more bytes.
    Also, not sure if you expect the return value to be in octets or characters.
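
The "buffer ends in the middle of a character" case can be handled by carrying the undecoded tail over to the next read. The sketch below follows the FB_QUIET idiom from the Encode documentation and uses an in-memory list of chunks instead of $r->read; `decode_chunks` is a hypothetical name, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode a list of octet chunks, carrying any incomplete trailing
# sequence over to the next chunk.
sub decode_chunks {
    my @chunks  = @_;
    my $pending = '';   # octets read but not yet decoded
    my $text    = '';   # decoded characters so far
    for my $chunk (@chunks) {
        $pending .= $chunk;
        # With FB_QUIET, decode() stops at the first sequence it cannot
        # decode and leaves the unprocessed octets in $pending.
        $text .= decode('UTF-8', $pending, Encode::FB_QUIET);
    }
    return ($text, $pending);
}

my $octets = encode('UTF-8', "caf\x{E9} \x{263A}");   # 9 octets
my @chunks = unpack '(a4)*', $octets;                 # cuts both non-ASCII characters in half

my ($text, $leftover) = decode_chunks(@chunks);
printf "%d characters decoded, %d octets left over\n",
    length $text, length $leftover;   # 6 characters decoded, 0 octets left over
```

Note that genuinely invalid input (as opposed to a merely truncated sequence) would stay in the pending buffer forever, so real code would need to check for leftover octets at the end of the request body.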

    Though, I wouldn't go this way. I'd either try to force CGI.pm to read
    from STDIN and use the perl-script handler
    (http://perl.apache.org/docs/2.0/user/config/config.html#C_perl_script_). This
    pushes a PerlIO layer to STDIN so that you can read from STDIN. On top
    of that you can push :utf8 then.

    The other way I'd prefer over the hack above is to patch CGI.pm to
    convert the data after it has read it. You can even do that in your
    application. Many applications I have seen have a separate step to
    sanitize the input. That would be the place to do that. However, then
    you have to watch out for upload fields.
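
A sanitising step of that kind might look like the sketch below. It is an illustration only: `decode_text_params` is a hypothetical helper, the parameters are passed in as a plain list rather than read from CGI.pm, multi-valued parameters are ignored, and upload fields are assumed to show up as references (filehandles), which are passed through untouched.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sanitising step: decode plain scalar values as UTF-8 and
# leave anything that is a reference (e.g. an upload filehandle) alone.
sub decode_text_params {
    my (%raw) = @_;
    my %clean;
    for my $name (keys %raw) {
        my $value = $raw{$name};
        $clean{$name} = ref $value
            ? $value
            # FB_CROAK: die on malformed input; LEAVE_SRC: do not modify $value
            : decode('UTF-8', $value, Encode::FB_CROAK | Encode::LEAVE_SRC);
    }
    return %clean;
}

my $upload = \*STDIN;   # stand-in for an upload filehandle
my %params = decode_text_params(city => "Z\xC3\xBCrich", file => $upload);
printf "city has %d characters\n", length $params{city};   # city has 6 characters
```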

    So, there is no really simple solution. And I don't think this will be
    "fixed" in modperl because $r has no such concept as an IO layer. The
    closest thing httpd/modperl has to offer is an input filter. But that
    won't help you here because brigades are handled mainly by httpd which
    knows only about octets. You don't want to change the data itself. You
    want to change the data's metadata.

    Torsten
  • Randal L. Schwartz at Sep 4, 2014 at 7:45 pm
    "Torsten" == Torsten Förtsch writes:
    Torsten> Though, I wouldn't go this way. I'd either try to force CGI.pm to read
    Torsten> from STDIN and use the perl-script handler
    Torsten> (http://perl.apache.org/docs/2.0/user/config/config.html#C_perl_script_). This
    Torsten> pushes a PerlIO layer to STDIN so that you can read from STDIN. On top
    Torsten> of that you can push :utf8 then.

    Yeah, just coded that. In a BEGIN block in my app, I monkey-patched
    read_from_client:

    BEGIN {
       ## monkey-patch CGI.pm so we can get proper utf8 handling
       require CGI;
       CGI::_compile_all(qw(
                     read_from_client
                                  ));
       # warn "defined &CGI::read_from_client is ", 0 + defined &CGI::read_from_client;

       ## moose 'around' would be nice here. :)
       my $read_from_client = \&CGI::read_from_client;
       no warnings 'redefine';
       *CGI::read_from_client = sub {
         local $CGI::MOD_PERL = $CGI::MOD_PERL;
         warn "prior MOD_PERL is $CGI::MOD_PERL";
         if (our $USE_STDIN_FOR_MOD_PERL) {
           $CGI::MOD_PERL = 0;
         }
         warn "after MOD_PERL is $CGI::MOD_PERL";
         goto &$read_from_client;
       }
    }

    And in my toplevel, I now do this:

    sub activate {
       my $self = shift;

       require Carp;
       local $SIG{__DIE__} = \&Carp::confess;

       ## ensure utf8 CGI params:
       local $CGI::PARAM_UTF8 = 1;
       ## and disable mod_perl handling during read_from_client
       local our $USE_STDIN_FOR_MOD_PERL = 1;

       binmode STDIN, ":utf8";
       binmode STDOUT, ":utf8";
       binmode STDERR, ":utf8";

       return $self->SUPER::activate(@_);
    }

    (This is my CGI::Prototype-based code, from the CPAN...)

    I'm properly getting the $CGI::MOD_PERL set to 0, which forces
    read from STDIN (via $r) instead of the native STDIN. In theory. In
    practice, even though I've done a binmode STDIN, I'm still getting raw
    bytes from read(\*STDIN...), not utf8-tagged strings.

    Not sure what to do next. Still frustrated.

    Why can't the world just use ASCII? :)

    (I even tried binmode STDIN, "encoding(utf8)" just now as well.)

  • Randal L. Schwartz at Sep 4, 2014 at 8:46 pm
    "Randal" == Randal L Schwartz writes:
    Randal> Yeah, just coded that. In a BEGIN block in my app, I monkey-patched
    Randal> read_from_client:

    And then I've also tried to monkey-patch ->read just as you said.

    On the first read, an empty string is apparently returned, which fails
    something higher in CGI.pm. Ugh.

    Update:

    This monkey patch works:

       *Apache2::RequestRec::read = sub {
         warn "READ CALLED";
         goto &$orig;
       }

    Although it doesn't do any decoding. When I replace the body of that
    with your code, I'm getting these zero-byte reads. Even this fails:

         my ($r, $buff, $len, $offset)=@_;
         # my $_buff;
         # my $rc = $r->$orig($_buff, $len);
         my $rc = $r->$orig($buff, $len, $offset);
         # warn "BEFORE: ", DBI::data_string_desc($_buff);
         # utf8::decode($_buff);
         # warn "AFTER: ", DBI::data_string_desc($_buff);
         # substr($buff, $offset, undef, $_buff);
         # warn "AFTER: ", DBI::data_string_desc($buff);
         return $rc;

    which should be the same as your code, still without the utf8 decoding.
    Still getting 0 bytes though.
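
One plausible explanation for the zero-byte reads, offered as a guess rather than a confirmed diagnosis: `my ($r, $buff, $len, $offset) = @_;` copies the caller's buffer, so the original read fills the copy and the caller's variable stays empty. `goto &$orig` passes @_ through untouched, which would explain why that variant works. The sketch below shows the difference using a stand-in reader; `fake_read` is hypothetical and is not the mod_perl API.

```perl
use strict;
use warnings;

# Stand-in for a read() that fills its second argument in place.
sub fake_read {
    my $len = $_[2];
    $_[1] = substr("hello world", 0, $len);   # writes into the caller's scalar
    return length $_[1];
}

# Wrapper that copies @_ first: the data lands in the private copy $buff.
sub read_via_copy {
    my ($r, $buff, $len) = @_;
    return fake_read($r, $buff, $len);
}

# Wrapper that passes the aliased element on: the caller's buffer is filled.
sub read_via_alias {
    return fake_read($_[0], $_[1], $_[2]);
}

my ($copy_buf, $alias_buf) = ('', '');
my $rc_copy  = read_via_copy(undef,  $copy_buf,  5);
my $rc_alias = read_via_alias(undef, $alias_buf, 5);
print "copy:  rc=$rc_copy buffer='$copy_buf'\n";     # copy:  rc=5 buffer=''
print "alias: rc=$rc_alias buffer='$alias_buf'\n";   # alias: rc=5 buffer='hello'
```

If that is what is happening, a decoding wrapper would need to write its result into `$_[1]` rather than into a lexical copy.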

  • Michael Schout at Sep 8, 2014 at 5:57 pm

    On 9/2/14, 4:19 PM, Randal L. Schwartz wrote:

    ## ensure utf8 CGI params:
    $CGI::PARAM_UTF8 = 1;
    Sorry to chime in late on this, but part of the problem with CGI.pm and
    UTF-8 is that PARAM_UTF8 gets clobbered by a cleanup handler that CGI.pm
    itself registers if it's running under mod_perl.

    This caused major headaches for me at one time until I figured this out.

    You have to make sure to set $CGI::PARAM_UTF8 early, and FOR EVERY
    REQUEST, because if you just set it globally (e.g.: in a startup perl
    script), then it only works for the first request.
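
The effect can be illustrated without Apache. In this sketch `simulated_cleanup` merely stands in for the reset described above (the real one lives inside CGI.pm and is not reproduced here), and `handle_request` is a hypothetical per-request entry point.

```perl
use strict;
use warnings;

# Stand-in for the per-request cleanup that resets the flag.
sub simulated_cleanup { $CGI::PARAM_UTF8 = 0 }

# Hypothetical request handler: optionally sets the flag itself, records
# what the request would see, then lets the cleanup run.
sub handle_request {
    my ($set_per_request) = @_;
    $CGI::PARAM_UTF8 = 1 if $set_per_request;
    my $seen = $CGI::PARAM_UTF8;
    simulated_cleanup();
    return $seen;
}

$CGI::PARAM_UTF8 = 1;                                  # set once, "at startup"
my @startup_only = map { handle_request(0) } 1 .. 3;
my @per_request  = map { handle_request(1) } 1 .. 3;

print "startup only: @startup_only\n";   # startup only: 1 0 0
print "per request:  @per_request\n";    # per request:  1 1 1
```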

    Regards,
    Michael Schout
  • André Warnier at Sep 8, 2014 at 7:17 pm

    Hi.
    Just an addendum to the discussion :

    There are really two distinct approaches to this issue, and they work at different levels:

    1) is to "fix" CGI.pm so that it delivers the parameters in the way which you expect.
    As shown by the previous valuable and technical contributions, this generally works, but
    it requires a certain level of expertise; and it is not necessarily backward compatible with
    all versions of mod_perl and CGI.pm.

    2) is to take whatever CGI.pm does deliver to the calling script or module, and use a
    couple of tricks and some additional code in that script or module, to ensure that
    whatever CGI.pm delivers under whatever mod_perl version, the receiving script or module
    always knows in the end what it is dealing with.
    That is the method which I presented early in the discussion.
    As stated in that contribution, it is not necessarily the most elegant or efficient way to
    deal with the issue, but it has the advantage of working always, no matter which version
    of CGI.pm and/or mod_perl is in use.

    The real crux of the matter is this, in my view : as things stand today in terms of
    protocol and RFCs, there is no real way for CGI.pm (or any comparable framework) to be
    *sure* of the encoding of the data sent by a browser or another HTTP client agent. Even
    the RFCs do not really provide a way by which this can be enforced. (*)

    So if you are sure of what the client is sending, and the matter consists of *forcing*
    CGI.pm to always communicate POST (or GET) data as UTF-8 encoded and utf8-marked (or the
    opposite) to the calling script/module, then method 1 will work, and it is more elegant
    and probably more efficient than method 2.

    But if the matter consists of ensuring that the receiving code in the script/module which
    handles the data submitted by the HTTP client is resilient and "does the right thing"
    whatever the submitted data really was, then in my opinion method 2 is better.
    (But that's only my opinion of the moment, and I stand ready to be corrected).

    (*) and if you believe this not to be true, please send me some references about it,
    because I am really interested. It might save me some code in all my web-facing applications.

Discussion Overview
group: modperl
categories: modperl, perl
posted: Sep 2, '14 at 9:20p
active: Sep 8, '14 at 7:17p
posts: 10
users: 6
website: perl.apache.org
