
hernan gonzalez writes:
The issue is that psql tries (apparently) to convert to UTF8
(even when it plans to output the raw text, LATIN9 in this case)
just to compute the length of the field, to build the table.
And for this computation it (apparently) relies on the string
routines with their own locale, instead of the DB or client encoding.
I didn't believe this, since I know perfectly well that the formatting
code doesn't rely on any OS-supplied width calculations. But when I
tested it out, I found I could reproduce Hernan's problem on Fedora 11.
Some tracing showed that the problem is here:

fprintf(fout, "%.*s", bytes_to_output,
this_line->ptr + bytes_output[j]);

As the variable name indicates, psql has carefully calculated the number
of *bytes* it wants to print. However, it appears that glibc's printf
code interprets the parameter as the number of *characters* to print,
and to determine what's a character it assumes the string is in the
environment LC_CTYPE's encoding. I haven't dug into the glibc code to
check, but it's presumably barfing because the string isn't valid
according to UTF8 encoding, and then failing to print anything.

It appears to me that this behavior violates the Single Unix Spec,
which says very clearly that the count is a count of bytes:
http://www.opengroup.org/onlinepubs/007908799/xsh/fprintf.html
However, I'm quite sure that our chances of persuading the glibc boys
that this is a bad idea are zero. I think we're going to have to
change the code to not rely on %.*s here. Even without the charset
mismatch in Hernan's example, we'd be printing the wrong amount of
data anytime the LC_CTYPE charset is multibyte. (IOW, the code should
do the wrong thing with forced-line-wrap cases if LC_CTYPE is UTF8,
even if client_encoding is too; anybody want to check?)

The above coding is new in 8.4, but it's probably not the only use of
%.*s --- we had better go looking for other trouble spots, too.

regards, tom lane
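
As an illustration of "not relying on %.*s", here is a minimal sketch, not the actual psql change. It assumes the goal is simply to emit a known number of bytes with no character interpretation at all; the helper name put_bytes is hypothetical. fwrite() copies bytes to the stream as they are, so LC_CTYPE does not enter into it.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (the name is illustrative only): write exactly n
 * bytes of buf to f.  fwrite() treats the data as raw bytes, so the
 * result does not depend on the locale's idea of a "character".
 * Returns the number of bytes actually written.
 */
size_t put_bytes(FILE *f, const char *buf, size_t n)
{
    assert(f != NULL && buf != NULL);
    return fwrite(buf, 1, n, f);
}
```

A call such as put_bytes(fout, this_line->ptr + bytes_output[j], bytes_to_output) would then stand in for the fprintf() call quoted above, assuming bytes_to_output is non-negative.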


  • Hgonzalez at May 8, 2010 at 1:49 am
    However, it appears that glibc's printf
    code interprets the parameter as the number of *characters* to print,
    and to determine what's a character it assumes the string is in the
    environment LC_CTYPE's encoding.

    Well, I myself have trouble believing that :-)
    This would be nasty... Are you sure?

    I couldn't reproduce that.
    I made a quick test, passing a UTF-8 encoded string
    (5 bytes corresponding to 4 Unicode chars: "niño"),
    and my glibc (same Fedora 12) seems to count bytes,
    as it should.

    #include <stdio.h>
    int main(void) {
    char s[] = "ni\xc3\xb1o";
    printf("|%.*s|\n",5,s);
    return 0;
    }

    This, compiled with gcc 4.4.3 and run with my root locale (utf8),
    did not pad a blank, i.e. it worked as expected.

    Hernán
  • Hernan gonzalez at May 8, 2010 at 2:31 am
    Sorry about an error in my previous example (I mixed up width and precision).
    But the conclusion is the same - it works on bytes:

    #include <stdio.h>
    int main(void) {
    char s[] = "ni\xc3\xb1o"; /* 5 bytes, 4 UTF-8 chars */
    printf("|%*s|\n",6,s); /* this should pad a blank */
    printf("|%.*s|\n",4,s); /* this should eat a char */
    return 0;
    }

    [root@myserv tmp]# ./a.out | od -t cx1
    0000000 | n i 303 261 o | \n | n i 303 261 | \n
    7c 20 6e 69 c3 b1 6f 7c 0a 7c 6e 69 c3 b1 7c 0a


    Hernán


  • Tom Lane at May 8, 2010 at 1:53 pm

    hernan gonzalez writes:
    Sorry about an error in my previous example (mixed width and precision).
    But the conclusion is the same - it works on bytes:
    This example works like that because it's running in C locale always.
    Try something like this:

    #include<stdio.h>
    #include<locale.h>

    int main () {
    char s[] = "ni\xc3qo"; /* 5 bytes , not valid utf8 */

    setlocale(LC_ALL, "");
    printf("|%.*s|\n",3,s);
    return 0;
    }


    I get different (and undesirable) effects depending on LANG.

    regards, tom lane
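
One way to observe the effect from inside a program is to look at the return value of the printf-family call. The following is a minimal sketch under that assumption; the wrapper name format_prefix is made up for illustration. Whether a negative value is actually returned for a badly encoded string depends on the C library version and the LC_CTYPE in effect, so treat it as a probe rather than a guarantee.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical wrapper (illustrative name): format at most prec bytes of
 * s into dst using "%.*s" and hand back snprintf()'s return value.
 * A negative result means the library reported an output or encoding
 * error; in that case dst is set to the empty string.
 */
int format_prefix(char *dst, size_t dstsz, const char *s, int prec)
{
    assert(dst != NULL && s != NULL && dstsz > 0);
    int n = snprintf(dst, dstsz, "%.*s", prec, s);
    if (n < 0)
        dst[0] = '\0';      /* library signalled failure */
    return n;
}
```

Calling this after setlocale(LC_ALL, "") with the "ni\xc3qo" string from the test program above, under different LANG settings, should show whether the library at hand rejects the string.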
  • Hernan gonzalez at May 8, 2010 at 5:08 pm
    Wow, you are right, this is bizarre...

    And it's not that glibc intends to compute the length in Unicode chars;
    it actually counts bytes (plain C chars), as it should, for computing
    field widths...
    But, for some strange reason, when there is some width calculation involved
    it tries to parse the char[] using the locale encoding (when there's no point
    in doing it!) and if it fails, it silently truncates the printf output.
    So it seems more a glibc bug to me than an interpretation issue (bytes vs chars).
    I posted some details on Stack Overflow:
    http://stackoverflow.com/questions/2792567/printf-field-width-bytes-or-chars

    BTW, I understand that PostgreSQL uses locale semantics in the server code.
    But is this really necessary/appropriate on the client (psql) side?
    Couldn't we stick with the C locale here?

    --
    Hernán J. González
    http://hjg.com.ar/
  • Hgonzalez at May 8, 2010 at 10:51 pm
    Well, I finally found some related, rather old, issues in Bugzilla (glibc):

    http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530
    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308
    http://sources.redhat.com/bugzilla/show_bug.cgi?id=649

    The last explains why they do not consider it a bug:

    ISO C99 requires for %.*s to only write complete characters that fit below
    the precision number of bytes. If you are using say UTF-8 locale, but
    ISO-8859-1 characters as shown in the input file you provided, some of the
    strings are not valid UTF-8 strings, therefore sprintf fails with -1
    because of the encoding error. That's not a bug in glibc.

    It's clear, though it's also rather ugly from a specification point of
    view (we must count raw bytes for the width field, but must also decode
    the UTF-8 chars to find character boundaries). I guess we must live with that.

    Hernán J. González
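
The rule quoted from the glibc bug report (only complete characters that fit within the precision, counted in bytes) can be sketched as follows. This is an illustration of that rule under the current LC_CTYPE, not glibc's code; the function name complete_prefix_len is hypothetical, while mbrlen() is the standard C routine for measuring one multibyte character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (illustrative name): return how many leading bytes
 * of s, at most max_bytes, consist of complete characters in the current
 * LC_CTYPE encoding.  Returns (size_t) -1 if an invalid byte sequence is
 * met first, which corresponds to the "encoding error" case.
 */
size_t complete_prefix_len(const char *s, size_t max_bytes)
{
    mbstate_t st;
    size_t avail = 0;
    size_t used = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);          /* initial conversion state */

    /* Never look past the terminating NUL or the byte limit. */
    while (avail < max_bytes && s[avail] != '\0')
        avail++;

    while (used < avail) {
        size_t n = mbrlen(s + used, avail - used, &st);
        if (n == (size_t) -1)
            return (size_t) -1;         /* invalid sequence */
        if (n == (size_t) -2 || n == 0)
            break;                      /* character cut off by the limit */
        used += n;
    }
    return used;
}
```

With a UTF-8 LC_CTYPE selected through setlocale(), a limit that falls in the middle of a two-byte character makes the loop stop before that character, and a stray ISO-8859-1 byte yields the error return; results in other locales depend on how the C library defines their encoding.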
  • Tom Lane at May 9, 2010 at 1:24 am

    hgonzalez@gmail.com writes:
    http://sources.redhat.com/bugzilla/show_bug.cgi?id=649
    The last explains why they do not consider it a bug:
    ISO C99 requires for %.*s to only write complete characters that fit below
    the precision number of bytes. If you are using say UTF-8 locale, but
    ISO-8859-1 characters as shown in the input file you provided, some of the
    strings are not valid UTF-8 strings, therefore sprintf fails with -1
    because of the encoding error. That's not a bug in glibc.
    Yeah, that was about the position I thought they'd take.

    So the bottom line here is that we're best off to avoid %.*s because
    it may fail if the string contains data that isn't validly encoded
    according to libc's idea of the prevailing encoding. I think that
    means the patch I committed earlier is still a good idea, but the
    comments need a bit of adjustment. Will fix.

    regards, tom lane
  • Tom Lane at May 9, 2010 at 2:19 am

    hernan gonzalez writes:
    BTW, I understand that PostgreSQL uses locale semantics in the server code.
    But is this really necessary/appropriate on the client (psql) side?
    Couldn't we stick with the C locale here?
    As far as that goes, I think we have to turn on that machinery in order
    to have gettext() work (i.e., to have localized error messages).

    regards, tom lane

Discussion Overview
group: pgsql-hackers @
categories: postgresql
posted: May 7, '10 at 11:46p
active: May 9, '10 at 2:19a
posts: 8
users: 2
website: postgresql.org...
irc: #postgresql

2 users in discussion

Tom Lane: 4 posts Hgonzalez: 4 posts
