A question about DBD::Pg and client encoding came up on the Rose::DB
list and now I'm a bit more curious about how DBD::Pg handles
different client encodings. Feel free to point me to any archive if
this is an old topic.

I assume the "pg_enable_utf8" feature was added in 1.22:

1.22 Wed Mar 26 22:33:44 EST 2003 (subversion r6993)
- Add utf8 support [Dominic Mitchell <[email protected]>]


I use the pg_enable_utf8 option -- my database is in utf8 (and thus my
client encoding defaults to utf8). All seems to work well. In my
applications I also decode all input data from other sources (e.g.
templates), and call encode() for all output (except data going to the
database).
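
For concreteness, here is a minimal sketch of that setup; the DSN,
credentials, and the notes table are placeholders, and the sample bytes
stand in for data read from a template:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Encode qw(decode encode);

# Connect with pg_enable_utf8 turned on; DSN and credentials are
# placeholders.
my $dbh = DBI->connect(
    'dbi:Pg:dbname=mydb', 'someuser', 'somepass',
    { RaiseError => 1, AutoCommit => 1, pg_enable_utf8 => 1 },
);

# Data from other sources (templates, files, form input) gets decoded
# into Perl characters on the way in ...
my $raw_bytes = "r\xC3\xA9sum\xC3\xA9";        # UTF-8 bytes, e.g. from a template
my $text      = decode('UTF-8', $raw_bytes);   # now 6 Perl characters

# ... and encoded back to bytes on the way out, except for data bound
# to the database, which is handed to DBD::Pg as characters.
print encode('UTF-8', $text), "\n";
$dbh->do('INSERT INTO notes (body) VALUES (?)', undef, $text);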

From my *limited* knowledge of DBD::Pg it seems the pg_enable_utf8
option simply forces the utf8 flag on data read from the database
*IF* that data looks like a valid utf8 string. Does that happen on
all column data, or only columns that are text?

Are there any other actions by DBD::Pg for supporting character
encodings?

I didn't see that DBD::Pg does any encoding of data, yet I do not see
any "Wide Character in %s" errors when writing to the db, so I assume
DBD::Pg is sending data to the db in a way that avoids that message.


I'm curious about forcing the utf8 flag because binary data might look
like a valid utf8 string without actually being text (not that setting
the utf8 flag will do any damage to non-text data, as far as I'm aware).
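
To illustrate the concern, here is a tiny, contrived example: two
arbitrary bytes that happen to form a valid UTF-8 sequence, so a
validity check alone cannot distinguish binary data from text:

use strict;
use warnings;
use Encode qw(decode);

# Two arbitrary "binary" bytes that also happen to be a valid UTF-8
# sequence (0xC3 0xA9 encodes U+00E9, "é").
my $bytes = "\xC3\xA9";

# Encode::decode with FB_CROAK only throws on *invalid* UTF-8, so this
# binary-looking data sails through as the single character "é".
my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
printf "decoded ok: %s, length %d\n",
       defined $chars ? 'yes' : 'no',
       defined $chars ? length $chars : 0;
# Prints: decoded ok: yes, length 1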


I think the "proper" approach would be to decode using the
client_encoding on read from the db on text columns and likewise
encode to the client encoding on write back to the db. But, perhaps
there is a reason that approach was not taken.

Can anyone fill me in on the state of character support in DBD::Pg?


Thanks,



--
Bill Moseley
[email protected]
Sent from my iMutt

  • Greg Sabino Mullane at Sep 8, 2008 at 9:25 pm

    > From my *limited* knowledge of DBD::Pg it seems the pg_enable_utf8
    > option simply forces the utf8 flag on data read from the database
    > *IF* that data looks like a valid utf8 string. Does that happen on
    > all column data, or only columns that are text?
    All text-like columns: CHAR, TEXT, BPCHAR, VARCHAR
    > Are there any other actions by DBD::Pg for supporting character
    > encodings?
    Not really.
    I think the "proper" approach would be to decode using the
    client_encoding on read from the db on text columns and likewise
    encode to the client encoding on write back to the db. But, perhaps
    there is a reason that approach was not taken.
    I don't honestly remember why things are like they are at the moment,
    but we certainly may be doing the things the wrong way. :) Maybe you can
    expand the above paragraph into a more formal set of proposed rules.
    When I get some time, I'll devote some cycles to this.

    - --
    Greg Sabino Mullane [email protected]
    End Point Corporation
    PGP Key: 0x14964AC8 200809081724
    http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
  • Bill Moseley at Sep 9, 2008 at 3:42 am

    On Mon, Sep 08, 2008 at 09:25:17PM -0000, Greg Sabino Mullane wrote:
    I think the "proper" approach would be to decode using the
    client_encoding on read from the db on text columns and likewise
    encode to the client encoding on write back to the db. But, perhaps
    there is a reason that approach was not taken.
    I don't honestly remember why things are like they are at the moment,
    but we certainly may be doing the things the wrong way. :) Maybe you can
    expand the above paragraph into a more formal set of proposed rules.
    When I get some time, I'll devote some cycles to this.
    Ok, how formal would you need?

    I guess my proposal would be to perhaps have a flag that when set
    causes DBD::Pg to read the client_encoding after making a connection.
    Then use that encoding with Encode::encode() and Encode::decode() when
    moving data between Perl and Pg.

    The implementation details are a bit more sketchy. ;)

    For one thing, I'm not sure if the client_encoding returned from Pg
    would match with the encoding names used by Perl.

    Then, would also need to know what data should and should not be
    decoded and encoded. I guess everything except binary (and numeric?)
    column data.
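
    A rough sketch of that proposal done by hand in application code
    today; the encoding-name map is deliberately partial, and the people
    table, DSN, and credentials are made up:

    use strict;
    use warnings;
    use DBI;
    use Encode qw(find_encoding);

    # A few illustrative mappings from PostgreSQL client_encoding names
    # to Perl Encode names; a real implementation would need the full list.
    my %pg_to_perl = (
        UTF8      => 'UTF-8',
        LATIN1    => 'ISO-8859-1',
        WIN1252   => 'cp1252',
        SQL_ASCII => 'ascii',
    );

    my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'someuser', 'somepass',
                           { RaiseError => 1 });

    # Ask the server what client_encoding is in effect for this connection.
    my ($pg_enc) = $dbh->selectrow_array('SHOW client_encoding');
    my $perl_enc = $pg_to_perl{ uc $pg_enc }
        or die "No Encode mapping known for client_encoding '$pg_enc'";
    my $encoder = find_encoding($perl_enc)
        or die "Perl does not know encoding '$perl_enc'";

    # Decode text columns by hand on the way out of the database.
    my $sth = $dbh->prepare('SELECT name FROM people');
    $sth->execute;
    while (my ($name_bytes) = $sth->fetchrow_array) {
        my $name = $encoder->decode($name_bytes);   # bytes -> Perl characters
        # ... use $name ...
    }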

    But, the current system seems to work fine -- someone can simply use
    utf8 client encoding and set pg_enable_utf8 (regardless of their
    database's encoding). And it's unlikely that Perl's internal
    representation of character data will change from utf8 anytime soon
    so just forcing the utf8 flag is fine.


    --
    Bill Moseley
    [email protected]
    Sent from my iMutt
  • David E. Wheeler at Sep 9, 2008 at 6:10 am

    On Sep 8, 2008, at 20:42, Bill Moseley wrote:

    > Ok, how formal would you need?

    > I guess my proposal would be to perhaps have a flag that when set
    > causes DBD::Pg to read the client_encoding after making a connection.
    > Then use that encoding with Encode::encode() and Encode::decode() when
    > moving data between Perl and Pg.
    Frankly, I think that this should be a part of the DBI. But note that
    some databases have different encodings on different columns.
    > The implementation details are a bit more sketchy. ;)

    > For one thing, I'm not sure if the client_encoding returned from Pg
    > would match with the encoding names used by Perl.
    Should be do-able.
    > Then, would also need to know what data should and should not be
    > decoded and encoded. I guess everything except binary (and numeric?)
    > column data.
    I think only text types and text-like types (Greg, how does DBD::Pg
    determine this, currently? I'd want CITEXT data to be converted to
    UTF-8, too; is there some way to tell it what types should be utf8?)
    > But, the current system seems to work fine -- someone can simply use
    > utf8 client encoding and set pg_enable_utf8 (regardless of their
    > database's encoding). And it's unlikely that Perl's internal
    > representation of character data will change from utf8 anytime soon
    > so just forcing the utf8 flag is fine.
    Yep.

    Best,

    David
  • Greg Sabino Mullane at Sep 9, 2008 at 2:44 pm

    > I think only text types and text-like types (Greg, how does DBD::Pg
    > determine this, currently? I'd want CITEXT data to be converted to
    > UTF-8, too; is there some way to tell it what types should be utf8?)
    As far as stuff coming out of the database, it's only the four text-like
    types I mentioned earlier. See line 3329 of dbdimp.c. We might want to
    make that an exclusion check, and/or go global as mentioned below.

    Now that I've had some time to recall things, I think the primary reason
    for not so much automagicness is simply a question of efficiency. Parsing
    every string coming out of the database for "utf-8ness" is expensive. Also
    expensive is checking client_encoding, although libpq at least tracks
    that for us, so it's not as bad as it first looks.

    So the next question is, why don't we just flip the utf8 flag on for
    all strings coming back from the database? What are the drawbacks?

    I need to brush up on my unicode foo, but let's keep the discussion going,
    I'd love to see this solved in a way that limits or removes the need
    for things like setting specific utf8 flags via the database handle.

    - --
    Greg Sabino Mullane [email protected]
    End Point Corporation
    PGP Key: 0x14964AC8 200809091043
    http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
  • David E. Wheeler at Sep 9, 2008 at 8:26 pm

    On Sep 9, 2008, at 07:44, Greg Sabino Mullane wrote:

    > I need to brush up on my unicode foo, but let's keep the discussion going,
    > I'd love to see this solved in a way that limits or removes the need
    > for things like setting specific utf8 flags via the database handle.
    For me, I'd love to be able to tell it extra datatypes to convert to
    utf8. The basic four are great, but since there are a lot of contrib
    and pgFoundry modules that have custom data types, some of which
    support multibyte text, I think it's worthwhile to be able to tell it
    the types (or extra types) on a per-connection basis.
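
    In the meantime the workaround is easy enough, if tedious; a sketch,
    assuming an existing $dbh with client_encoding UTF8 and a made-up
    users table whose email column is CITEXT:

    use Encode qw(decode);

    # Stopgap until DBD::Pg knows about extra text-like types: decode the
    # CITEXT column by hand after fetching.
    my $sth = $dbh->prepare('SELECT email FROM users');
    $sth->execute;
    while (my ($email_bytes) = $sth->fetchrow_array) {
        my $email = decode('UTF-8', $email_bytes);   # bytes -> Perl characters
        # ... use $email ...
    }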

    But also ask in dbi-dev; I believe that Tim Bunce has some ideas about
    this stuff, as well.

    Best,

    David
  • Bill Moseley at Sep 10, 2008 at 12:17 am

    On Tue, Sep 09, 2008 at 02:44:07PM -0000, Greg Sabino Mullane wrote:

    > Now that I've had some time to recall things, I think the primary reason
    > for not so much automagicness is simply a question of efficiency. Parsing
    > every string coming out of the database for "utf-8ness" is expensive. Also
    > expensive is checking client_encoding, although libpq at least tracks
    > that for us, so it's not as bad as it first looks.

    > So the next question is, why don't we just flip the utf8 flag on for
    > all strings coming back from the database? What are the drawbacks?
    By all strings you mean the current list:

    PG_CHAR
    PG_TEXT
    PG_BPCHAR
    PG_VARCHAR

    And above you mean just set the utf8 flag and not check that it's
    valid utf8?

    Seems reasonable. If a user sets pg_enable_utf8 then that would mean
    that client_encoding is utf8, too. Therefore, for the types above, we
    know it's already encoded as utf8 (well, assuming PG converts that set
    of columns to utf8).

    But, yes, blindly setting the utf8 flag can cause problems for all
    columns. Besides Perl likely blowing up on invalid utf8 sequences,
    things like length() would be wrong.
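
    A small illustration of that last point, assuming the bytes actually
    arrived in Latin-1 rather than UTF-8 (Encode::_utf8_on stands in for
    what a blind flag-flip would do):

    use strict;
    use warnings;
    use Encode qw(decode);

    # Latin-1 bytes for "café"; the final byte, 0xE9, is not valid UTF-8
    # on its own.
    my $latin1_bytes = "caf\xE9";

    # Decoding from the real encoding yields 4 characters, as expected.
    my $decoded = decode('ISO-8859-1', $latin1_bytes);
    print length($decoded), "\n";    # 4

    # Blindly flipping the utf8 flag on the same bytes (which is what
    # "just set the flag" amounts to) marks data as UTF-8 that is not;
    # Perl will typically warn "Malformed UTF-8 character" once the
    # string is used, and character-oriented operations like length()
    # can no longer be trusted.
    my $mislabeled = $latin1_bytes;
    Encode::_utf8_on($mislabeled);
    my $n = length $mislabeled;      # warns about malformed UTF-8 on modern perls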


    Postgresql's client_encoding allows for a different encoding than the
    database encoding. I assume that also means that Postgresql must
    decide what columns need to be converted between encodings.

    Seems like if DBD::Pg knew that information (what columns were
    candidates for re-encoding by Postgresql) then DBD::Pg could then
    simply set the utf8 flag on those and not bother with calling
    is_utf8_string(). That's assuming that client encoding is utf8, of
    course. That would help if there were other data types that were
    indeed character data but not the listed types above.

    But, I have no idea how Postgresql actually decides what columns to
    re-encode.
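
    For what it's worth, both settings are easy to inspect from DBD::Pg;
    a quick check, assuming an existing $dbh:

    # The database (server) encoding and the per-connection client
    # encoding can differ; the server converts text between them on the wire.
    my ($server_enc) = $dbh->selectrow_array('SHOW server_encoding');
    my ($client_enc) = $dbh->selectrow_array('SHOW client_encoding');
    print "server_encoding=$server_enc client_encoding=$client_enc\n";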


    I guess the only other client encodings supported would be 8859-1
    (and ascii, of course) with pg_enable_utf8 off.






    --
    Bill Moseley
    [email protected]
    Sent from my iMutt
  • Steve Haslam at Sep 9, 2008 at 7:43 pm

    On Mon, Sep 08, 2008 at 08:42:06PM -0700, Bill Moseley wrote:
    > On Mon, Sep 08, 2008 at 09:25:17PM -0000, Greg Sabino Mullane wrote:

    > > > I think the "proper" approach would be to decode using the
    > > > client_encoding on read from the db on text columns and likewise
    > > > encode to the client encoding on write back to the db. But, perhaps
    > > > there is a reason that approach was not taken.
    > > I don't honestly remember why things are like they are at the moment,
    > > but we certainly may be doing the things the wrong way. :) Maybe you can
    > > expand the above paragraph into a more formal set of proposed rules.
    > > When I get some time, I'll devote some cycles to this.
    > Ok, how formal would you need?

    > I guess my proposal would be to perhaps have a flag that when set
    > causes DBD::Pg to read the client_encoding after making a connection.
    > Then use that encoding with Encode::encode() and Encode::decode() when
    > moving data between Perl and Pg. [...]
    > database's encoding). And it's unlikely that Perl's internal
    > representation of character data will change from utf8 anytime soon
    > so just forcing the utf8 flag is fine.
    So why try to support encodings other than utf8 anyway? IMHO having
    DBD::Pg setting the client_encoding to 'utf8' if pg_enable_utf8 is
    specified would make more sense. I suspect that trying to support
    other encodings is a bit of a minefield, and tbh I'm a bit puzzled as
    to what exactly the gain would be.
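
    Until DBD::Pg does that itself, an application can do it by hand
    right after connecting; a one-liner, assuming an existing $dbh:

    # Force the session's client_encoding to UTF8 so that pg_enable_utf8's
    # assumption holds regardless of the database encoding.
    $dbh->do("SET client_encoding TO 'UTF8'");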

    SRH
    --
    Steve Haslam Reading, UK [email protected]
    maybe the human race deserves to be wiped out
