Grokbase Groups Perl dbd-pg July 2011
David & Greg,

Apologies for the delayed reply here. I wanted a chance to really read through this stuff carefully.
On Jun 28, 2011, at 3:31 PM, [email protected] wrote:

Committed by David Christensen <[email protected]>

Subject: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM

---
TODO.utf8 | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 161 insertions(+), 0 deletions(-)

diff --git a/TODO.utf8 b/TODO.utf8
new file mode 100644
index 0000000..5260bac
--- /dev/null
+++ b/TODO.utf8
@@ -0,0 +1,161 @@
+Summary of design changes from discussions with GSM and DWC re: utf-8 in DBD::Pg
+================================================================================
+
+Behavior of the pg_unicode/pg_utf8_strings connection attribute
+---------------------------------------------------------------
+We will utilize a connect attribute (enabled by default) to enable the
+use of an immediate SET client_encoding. The current name of this is
+"pg_utf8_strings", but DWC prefers something non-encoding specific;
+examples wanted, but "pg_unicode" or "pg_internal" seem best.
pg_decode_strings. Or pg_encode_strings, depending on how you look at it.
+If the "pg_internal" attribute is explicitly provided in the DBI
+connect attributes it will be one of (0, 1), to enable/disable the
+pg_internal behavior explicitly. If not provided, we check the
+initial "server_encoding" and "client_encoding" settings.
+
+The logic for setting "pg_internal" when unspecified is:
+
+ - If "server_encoding" is "SQL_ASCII" set pg_internal to 0.
+
+ - If "client_encoding" <> "server_encoding", or perhaps better yet if
+ the pg_setting("client_encoding") returns a different value than
+ the default version for that setting, then we assume that the
+ client encoding choice is *explicit* and the user will be wanting
+ to get raw octets back from DBI, thus set pg_internal to 0.
I find this description confusing. What is the default value for that setting? I mean, how can one know that?

Assuming one can, I suggest alternate phrasing:

- If "client_encoding" is not set to its default value, DBD::Pg assumes that the choice is explicit, so pg_internal is false.
+ - Otherwise set pg_internal to 1.
But we strongly recommend you set it explicitly to avoid confusion. And really, setting it to 1 is strongly recommended for proper and transparent handling of multibyte characters.
+
+Immediately after the connection initialization completes, we will
+check for the set pg_internal flag; if set, we issue a "SET
+client_encoding TO 'utf-8'" and commit.
Sounds sensible.
+
+
+Proposal for an "encoding" DBD attribute interface
+--------------------------------------------------
+
+DWC suggested a DBD::db attribute handle, suggested to be called
+"encoding" which when set would effectively pass-thru to the
+underlying: "SET client_encoding = $blah" and *disable* the
+pg_internal flag. Specifically, by setting the encoding attribute,
+you are effectively indicating that you want the data from PostgreSQL
+back
I like this *so* much better.
+
+If such a mechanism *was* instituted, we could utilize `pg_encoding =>
+'blah'` as the connection-level attribute and just tie the underlying
+implementation of the pg_internal mechanism to this, by having a
+keyword ('internal') as the special-case encoding, which could be
+enabled/disabled via $dbh->{pg_encoding} = 'internal';
WTF is internal?

Seems to me that with pg_encoding you don't need pg_internal at all. You just have a default value for pg_encoding, which would be:

* If "client_encoding" is not set to its default value, DBD::Pg assumes that the choice is explicit, so use that.
* Else if "server_encoding" is "SQL_ASCII" set pg_encoding to "SQL_ASCII".
* Else use "utf-8".
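A minimal sketch of that default, in plain Perl. The helper name and its arguments are hypothetical, not DBD::Pg API, and it assumes the caller somehow knows what the default client_encoding would have been (which is exactly the open question above):

```perl
use strict;
use warnings;

# Hypothetical helper, not part of DBD::Pg: pick a default for the proposed
# pg_encoding attribute from the settings seen at connect time.
# "client_default" is an assumption: the value client_encoding would have
# had if nobody had set it explicitly.
sub default_pg_encoding {
    my (%arg) = @_;

    # An explicitly chosen client_encoding wins.
    return $arg{client_encoding}
        if lc($arg{client_encoding}) ne lc($arg{client_default});

    # SQL_ASCII servers get no decoding at all.
    return 'SQL_ASCII' if uc($arg{server_encoding}) eq 'SQL_ASCII';

    # Everything else defaults to UTF-8.
    return 'utf-8';
}
```

With client_encoding => 'LATIN1' and a default of 'UTF8' this returns 'LATIN1'; with everything left at 'UTF8' it returns 'utf-8'.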
+
+This would allow us to pass-through utf-8 *without* setting the SvUTF8
+flag by setting $dbh->{pg_encoding} = 'utf-8'.
+1. And the fewer of these options the better, IMHO.
+Behavior changes if pg_internal is set
+--------------------------------------
Or if pg_encoding eq 'utf-8'.
+There will be two distinct changes that need to take place,
+specifically input and output.
+
+When processing the result sets returned by the server, if pg_internal
+is set, we can either fiat that the "client_encoding" is set to UTF-8
+as it was originally when we switched it on connection, or verify that
+the libpq's result set charset/encoding is equal to UTF-8. I believe
+this is available as an int, which could be cached when we do the
+original "SET client_encoding" and/or initial setup tests, which
+should prevent accidental corruption.
Or just strongly recommend that if you want to change it, set pg_encoding instead of executing SET CLIENT_ENCODING.
+ - if we decide to go this route and detect the charset change, we can
+ issue a notice/warning from DBD::Pg that the client_encoding has
+ changed and then turn off the pg_internal flag.
But only if pg_internal was not explicitly set by the user, right?
+ - if everything checks out, we use the usual dequote_* methods and
+ set the SvUTF8 flag on either text-based bytes, or set only on the
+ ASCII datums.
+
+ - a possible option to benchmark would be to directly use the
+ "utf8::upgrade" method from the perl internals (or some Sv-creation
+ method based on (char*)) to take advantage of any perl-specific
+ enhancements already in place. This may be just as fast since perl
+ already needs to copy the (char*) contents into the SV, and may
+ already have fast-tracked code-paths for this type of operation,
+ since we know the data will be valid UTF8.
+
+When processing data coming *in* from the user i.e., (SV*) we consider
+the following:
+
+ - if pg_internal is 0, pass through the normal methods unabashed.
+
+ - if pg_internal is 1 and incoming SV's UTF8 flag is 1, we
+ do nothing; the underlying (char*) will already be in utf-8 data.
Maybe. utf8 ne UTF-8, quite.
+ - if pg_internal is 1 and incoming SV's UTF8 flag is 0, we need
+ special consideration for hi-bit characters; since we've
+ effectively co-opted the expected client_encoding and forced UTF8,
+ we need to treat the raw data as octets. We have a couple choices:
+
+ - treat as latin-1/perl raw. This may be a good default choice,
+ but I'm not 100% convinced; in any case we would need to
+ convert from raw to utf-8 using utf8::upgrade.
I think this is basically what Perl assumes, so it's probably pretty safe. It would also be the reasonable thing to do if pg_encoding is set to something other than utf-8: you assume the user knows what she's doing and passing things in the proper encoding.
+
+ - treat as original client_encoding. This may be the least
+ changed expectation as far as the user is concerned, but
+ requires us to either:
+
+ a) switch client_encoding for query to the original
+ client_encoding, while somehow still retaining the utf-8
+ client encoding for result set retrieval, or,
+
+ b) actually use Encode to transcode from the original
+ client_encoding to UTF8. I think GSM is particularly
+ against bringing Encode into the picture just due to
+ additional complexity issues.
To me, this is just more reason to use pg_encoding and not have pg_internal at all.
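For reference, the two choices for flag-off input discussed above can be sketched in pure Perl. This is only an illustration under stated assumptions: the helper name is hypothetical, the real driver would do this in C, and it uses Encode, which GSM would rather avoid:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not DBD::Pg code: turn a Perl scalar into UTF-8
# octets for a connection whose client_encoding is UTF-8.
# $orig_encoding is optional. Without it, a flag-off string is read as
# Latin-1, which is what Perl itself assumes.
sub to_utf8_octets {
    my ($value, $orig_encoding) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $chars = $value;    # work on a copy

    if (!utf8::is_utf8($chars) && defined $orig_encoding) {
        # Option (b): treat the octets as the original client_encoding.
        $chars = Encode::decode($orig_encoding, $chars, $check);
    }

    # Strict UTF-8 on the way out; croaks on unencodable characters.
    return Encode::encode('UTF-8', $chars, $check);
}
```

Called as to_utf8_octets("caf\xE9"), this yields the five octets "caf\xC3\xA9"; passing an encoding name as the second argument selects the transcoding path instead.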
+
+Implementation considerations/ideas
+-----------------------------------
+
+DWC feels strongly that we should avoid setting the SvUTF8 flag on any
+retrieved/created SV which does not require it;
Why?
+as such, an operation
+that can quickly check whether there are any hi-bit characters in a
+given (char*) would need to be weighed against the possible
+inconvenience of *always* setting the SvUTF8 flag on eligible strings,
+regardless of whether it is full ASCII.
Yeah, needs benchmarking. And if it's slow and you still want it, maybe give us a knob to turn it off.
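To make the trade-off concrete: Perl's own utf8::decode already behaves the way DWC describes, turning the flag on only when the octets contain a multi-byte sequence. A small pure-Perl illustration (the helper name is hypothetical; DBD::Pg would do the equivalent in C):

```perl
use strict;
use warnings;

# Hypothetical helper, not DBD::Pg code: decode UTF-8 octets as they might
# arrive from the server. utf8::decode works in place and sets the UTF8
# flag only if the string contains multi-byte sequences.
sub decode_result {
    my ($octets) = @_;
    my $str = $octets;    # work on a copy
    utf8::decode($str);
    return $str;
}

for my $octets ("abc", "caf\xC3\xA9") {
    my $str = decode_result($octets);
    printf "%d chars, flag %s\n",
        length($str), utf8::is_utf8($str) ? 'on' : 'off';
}
# prints:
#   3 chars, flag off
#   4 chars, flag on
```

Either way the resulting strings compare equal to their character-level counterparts, so the question is purely one of speed and of what the flag looks like to code that inspects it.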

Best,

David


  • Greg Sabino Mullane at Jul 14, 2011 at 1:23 pm
    Thanks David W. My replies below.
    We will utilize a connect attribute (enabled by default) to enable the
    use of an immediate SET client_encoding. The current name of this is
    "pg_utf8_strings", but DWC prefers something non-encoding specific;
    examples wanted, but "pg_unicode" or "pg_internal" seem best.
    pg_decode_strings. Or pg_encode_strings, depending on how you look at it.
    Yeah, I'm opposed to "pg_internal". I'm okay with the others, but I
    don't feel anyone has stumbled upon a name that "feels" right yet.
    I'll use 'pg_unicode' for the rest of this email.
    +If the "pg_internal" attribute is explicitly provided in the DBI
    +connect attributes it will be one of (0, 1), to enable/disable the
    +pg_internal behavior explicitly. If not provided, we check the
    +initial "server_encoding" and "client_encoding" settings.
    +
    +The logic for setting "pg_internal" when unspecified is:
    +
    + - If "server_encoding" is "SQL_ASCII" set pg_internal to 0.
    +
    + - If "client_encoding" <> "server_encoding", or perhaps better yet if
    + the pg_setting("client_encoding") returns a different value than
    + the default version for that setting, then we assume that the
    + client encoding choice is *explicit* and the user will be wanting
    + to get raw octets back from DBI, thus set pg_internal to 0.
    I find this description confusing. What is the default value for that setting?
    I mean, how can one know that?
    There is no default: it's computed on the fly at connection time, based
    on the server_encoding and the client_encoding. As the client_encoding
    defaults to the server_encoding, the only way it can be different is
    in the rare case that someone has set it inside of postgresql.conf. In
    which case, we respect that and don't do any transformations at all.
    But we strongly recommend you set it explicitly to avoid confusion. And
    really, setting it to 1 is strongly recommended for proper and transparent
    handling of multibyte characters.
    Yes, or some wording along the lines of "this is an expert knob, and you really
    ought to leave it alone unless you really know what you are doing".
    +DWC suggested a DBD::db attribute handle, suggested to be called
    +"encoding" which when set would effectively pass-thru to the
    +underlying: "SET client_encoding = $blah" and *disable* the
    +pg_internal flag. Specifically, by setting the encoding attribute,
    +you are effectively indicating that you want the data from PostgreSQL
    +back
    I like this *so* much better.
    Better than? This is in addition to the above, to be clear. This is
    basically a shortcut for someone setting pg_unicode false and issuing
    a "SET client_encoding = 'foo'". I'm still on the fence about making
    such a shortcut into a formal call. The advantage is that it removes
    the case where someone sets client_encoding manually but forgets to
    switch pg_unicode off.
    +If such a mechanism *was* instituted, we could utilize `pg_encoding =>
    +'blah'` as the connection-level attribute and just tie the underlying
    +implementation of the pg_internal mechanism to this, by having a
    +keyword ('internal') as the special-case encoding, which could be
    +enabled/disabled via $dbh->{pg_encoding} = 'internal';
    WTF is internal?
    I'm not sure what David C is saying above, to be honest.
    Seems to me that with pg_encoding you don't need pg_internal at all. You
    just have a default value for pg_encoding, which would be:

    * If "client_encoding" is not set to its default value, DBD::Pg assumes that
    the choice is explicit, so use that.
    * Else if "server_encoding" is "SQL_ASCII" set pg_encoding to "SQL_ASCII".
    * Else use "utf-8".
    We still need a flag to know if we are unicoding or not. We cannot tell just
    from a stored client_encoding.
    +Behavior changes if pg_internal is set
    +--------------------------------------
    Or if pg_encoding eq 'utf-8'.
    No: what if someone changes the encoding later? In that case, we do *not*
    want to unicodalize (yep, making up words left and right here) the strings
    coming back from the database.
    +When processing the result sets returned by the server, if pg_internal
    +is set, we can either fiat that the "client_encoding" is set to UTF-8
    +as it was originally when we switched it on connection, or verify that
    +the libpq's result set charset/encoding is equal to UTF-8. I believe
    +this is available as an int, which could be cached when we do the
    +original "SET client_encoding" and/or initial setup tests, which
    +should prevent accidental corruption.

    Or just strongly recommend that if you want to change it, set pg_encoding
    instead of executing SET CLIENT_ENCODING.
    Yeah: I'm not keen on checking the client_encoding every single time we
    get a resultset back from the server, no matter how cheap the result.
    As David W implies, people should use the encoding interface or suffer
    the consequences.
    + - if pg_internal is 1 and incoming SV's UTF8 flag is 1, we
    + do nothing; the underlying (char*) will already be in utf-8 data.
    Maybe. utf8 ne UTF-8, quite.
    Right, but it is the best we can do.
    + - treat as latin-1/perl raw. This may be a good default choice,
    + but I'm not 100% convinced; in any case we would need to
    + convert from raw to utf-8 using utf8::upgrade.
    I think this is basically what Perl assumes, so it's probably pretty
    safe. It would also be the reasonable thing to do if pg_encoding
    is set to something other than utf-8: you assume the user knows what
    she's doing and passing things in the proper encoding.
    Agree with the first, but not with the second: once the user sets pg_encoding,
    we stop messing with their data, both incoming and outgoing, in the expectation
    that they have entered expert mode and want to handle things themselves.
    Or at the very least, we have separate flags for incoming and outgoing tweaking.
    + a) switch client_encoding for query to the original
    + client_encoding, while somehow still retaining the utf-8
    + client encoding for result set retrieval, or,
    I can't see this one working out.
    +DWC feels strongly that we should avoid setting the SvUTF8 flag on any
    +retrieved/created SV which does not require it;
    GSM feels just as strongly we should set it on everything.

    - --
    Greg Sabino Mullane [email protected]
    End Point Corporation http://www.endpoint.com/
    PGP Key: 0x14964AC8 201107140921
    http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
  • David E. Wheeler at Jul 14, 2011 at 4:24 pm

    On Jul 14, 2011, at 6:23 AM, Greg Sabino Mullane wrote:

    I find this description confusing. What is the default value for that setting?
    I mean, how can one know that?
    There is no default: it's computed on the fly at connection time, based
    on the server_encoding and the client_encoding.
    Yeah, that's what I meant. It's difficult to comprehend how it calculates a value if you don't specify one.
    As the client_encoding
    defaults to the server_encoding, the only way it can be different is
    in the rare case that someone has set it inside of postgresql.conf. In
    which case, we respect that and don't do any transformations at all.
    There is also the PGCLIENTENCODING environment variable. http://www.postgresql.org/docs/9.0/static/multibyte.html#AEN30737
    But we strongly recommend you set it explicitly to avoid confusion. And
    really, setting it to 1 is strongly recommended for proper and transparent
    handling of multibyte characters.
    Yes, or some wording along the lines of "this is an expert knob, and you really
    ought to leave it alone unless you really know what you are doing".
    Maybe. I'm not convinced, because if you don't set it yourself, the thing it decides to do may or may not be what you expect, and it would be hard to figure out why.
    +DWC suggested a DBD::db attribute handle, suggested to be called
    +"encoding" which when set would effectively pass-thru to the
    +underlying: "SET client_encoding = $blah" and *disable* the
    +pg_internal flag. Specifically, by setting the encoding attribute,
    +you are effectively indicating that you want the data from PostgreSQL
    +back
    I like this *so* much better.
    Better than? This is in addition to the above, to be clear. This is
    basically a shortcut for someone setting pg_unicode false and issuing
    a "SET client_encoding = 'foo'".
    Unless I set it to "utf8", in which case pg_unicode would be true and client_encoding would be set to "UTF-8". Right?
    I'm still on the fence about making
    such a shortcut into a formal call. The advantage is that it removes
    the case where someone sets client_encoding manually but forgets to
    switch pg_unicode off.
    From the user's perspective, I think it makes much more sense. It says, "Here is what I want the encoding to be," which is easier to understand than "Should we or should we not convert the incoming data to Perl's internal form." Most people won't know WTF that means.
    Seems to me that with pg_encoding you don't need pg_internal at all. You
    just have a default value for pg_encoding, which would be:

    * If "client_encoding" is not set to its default value, DBD::Pg assumes that
    the choice is explicit, so use that.
    * Else if "server_encoding" is "SQL_ASCII" set pg_encoding to "SQL_ASCII".
    * Else use "utf-8".
    We still need a flag to know if we are unicoding or not. We cannot tell just
    from a stored client_encoding.
    Why not? That's what pg_unicode was figuring out on its own if you didn't set it.
    +Behavior changes if pg_internal is set
    +--------------------------------------
    Or if pg_encoding eq 'utf-8'.
    No: what if someone changes the encoding later? In that case, we do *not*
    want to unicodalize (yep, making up words left and right here) the strings
    coming back from the database.
    Yes we do, unless that encoding is SQL_ASCII. If, however, someone does *not* want the data decoded (or encoded when sending to the database), then yes, I can see where we would then need pg_unicode. But I think that pg_unicode should have a default value based on the setting of pg_encoding, and if pg_encoding is not set, it should respect the client encoding setting.
    Yeah: I'm not keen on checking the client_encoding every single time we
    get a resultset back from the server, no matter how cheap the result.
    As David W implies, people should use the encoding interface or suffer
    the consequences.
    Word, yo.
    + - if pg_internal is 1 and incoming SV's UTF8 flag is 1, we
    + do nothing; the underlying (char*) will already be in utf-8 data.
    Maybe. utf8 ne UTF-8, quite.
    Right, but it is the best we can do.
    Well, no, it's not. We can encode it with Perl's API for encoding strings. Internally it might do nothing, but we should use that API if it's there.
    + - treat as latin-1/perl raw. This may be a good default choice,
    + but I'm not 100% convinced; in any case we would need to
    + convert from raw to utf-8 using utf8::upgrade.
    I think this is basically what Perl assumes, so it's probably pretty
    safe. It would also be the reasonable thing to do if pg_encoding
    is set to something other than utf-8: you assume the user knows what
    she's doing and passing things in the proper encoding.
    Agree with the first, but not with the second: once the user sets pg_encoding,
    we stop messing with their data, both incoming and outgoing, in the expectation
    that they have entered expert mode and want to handle things themselves.
    I disagree. I think the value of pg_encoding should be respected and things encoded and decoded appropriately (unless it's SQL_ASCII or pg_unicode is off).
    Or at the very least, we have separate flags for incoming and outgoing tweaking.
    Oy. Let's not go there yet.
    + a) switch client_encoding for query to the original
    + client_encoding, while somehow still retaining the utf-8
    + client encoding for result set retrieval, or,
    I can't see this one working out.
    +DWC feels strongly that we should avoid setting the SvUTF8 flag on any
    +retrieved/created SV which does not require it;
    GSM feels just as strongly we should set it on everything.
    I agree.

    Best,

    David
  • Greg Sabino Mullane at Jul 17, 2011 at 6:11 pm

    I find this description confusing. What is the default value for that setting?
    I mean, how can one know that?
    There is no default: it's computed on the fly at connection time, based
    on the server_encoding and the client_encoding.
    Yeah, that's what I meant. It's difficult to comprehend
    how it calculates a value if you don't specify one.
    Well, for most people it won't matter: DBD::Pg will simply do the
    right thing. Which for 99% of people will be to set client_encoding
    to UTF-8, which is really the only sensible option (excluding
    SQL_ASCII people, of course).
    There is also the PGCLIENTENCODING environment variable.
    Ah, true dat.
    Yes, or some wording along the lines of "this is an expert knob, and you really
    ought to leave it alone unless you really know what you are doing".
    Maybe. I'm not convinced, because if you don't set it yourself, the thing
    it decides to do may or may not be what you expect, and it would be hard
    to figure out why.
    Well, it will set it to UTF-8, unless there is a really good reason not to.
    And the only exceptions are SQL_ASCII and if they went out of their way to
    set the client encoding themselves, in which case it would be rude of us
    to change it back on them. :)
    Better than? This is in addition to the above, to be clear. This is
    basically a shortcut for someone setting pg_unicode false and issuing
    a "SET client_encoding = 'foo'".
    Unless I set it to "utf8", in which case pg_unicode would be true and
    client_encoding would be set to "UTF-8". Right?
    Right. Although in most cases that will be a no-op as those will already
    be set that way. Although a weak case could be argued that setting it
    to UTF-8 via the interface should turn pg_unicode *off*, to be consistent.
    But I think that's all the more reason for a separate knob, and one of the
    reasons I'm only lukewarm to the whole $h->{encoding} thing.
    I'm still on the fence about making
    such a shortcut into a formal call. The advantage is that it removes
    the case where someone sets client_encoding manually but forgets to
    switch pg_unicode off.
    From the user's perspective, I think it makes much more sense. It says,
    "Here is what I want the encoding to be," which is easier to understand
    than "Should we or should we not convert the incoming data to Perl's
    internal form." Most people won't know WTF that means.
    Yeah, that's true. On the other hand, even the encoding setting is meant
    as sort of an expert knob.
    We still need a flag to know if we are unicoding or not. We cannot tell just
    from a stored client_encoding.
    Why not? That's what pg_unicode was figuring out on its own if you didn't set it.
    Yes, but once we call $h->{encoding}, we need to track both the encoding and
    the fact that we are decoding or not. Which could be either way. Which raises
    a point: if we need a way to get things back to "normal" after the user
    sets $h->{encoding} to something weird, presumably they would then call
    $h->{encoding} = UTF-8. So perhaps that answers the above: we turn pg_unicode
    *on* in that case. But it still means that there is no way for someone to
    want a UTF-8 client_encoding but do NOT want us to decode things. Sigh.

    (some more of the same arguments trimmed from your reply)
    Maybe. utf8 ne UTF-8, quite.
    Right, but it is the best we can do.
    Well, no, it's not. We can encode it with Perl's API for encoding
    strings. Internally it might do nothing, but we should use that
    API if it's there.
    I meant that the only thing we can do with the internal strings
    is flip the utf8 bit on or off: we have no other knobs for
    other encodings.
    Agree with the first, but not with the second: once the user sets pg_encoding,
    we stop messing with their data, both incoming and outgoing, in the expectation
    that they have entered expert mode and want to handle things themselves.
    I disagree. I think the value of pg_encoding should be respected and things
    encoded and decoded appropriately (unless it's SQL_ASCII or pg_unicode is off).
    Or at the very least, we have separate flags for incoming and outgoing tweaking.
    Oy. Let's not go there yet.
    How about now? :) The problem is that people have existing scripts that we don't
    want to fail, and are trying to shove who-knows-what into the database, so we
    definitely want to clean up their mess as it comes in, but give them the option
    not to mess with it in case that is what they need. I think that should be a separate
    knob from the stuff coming back from the database. To put another way, I'm happy
    linking the two together for most things but providing an expert knob just in case
    they need it that can de-couple them.

    I'm trying to make this as bulletproof as possible so that we break as few existing
    scripts as possible on the first release, and allow as much fine-tuning as needed
    from the get-go, since we cannot know what will really break or the strange combinations
    people will want until this is released in the wild.
    +DWC feels strongly that we should avoid setting the SvUTF8 flag on any
    +retrieved/created SV which does not require it;
    GSM feels just as strongly we should set it on everything.
    I agree.
    Ball's in your court, David C. :)

    - --
    Greg Sabino Mullane [email protected]
    End Point Corporation http://www.endpoint.com/
    PGP Key: 0x14964AC8 201107171409
    http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
  • David E. Wheeler at Jul 18, 2011 at 1:16 am

    On Jul 17, 2011, at 11:11 AM, Greg Sabino Mullane wrote:

    Well, it will set it to UTF-8, unless there is a really good reason not to.
    And the only exceptions are SQL_ASCII and if they went out of their way to
    set the client encoding themselves, in which case it would be rude of us
    to change it back on them. :)
    Okay, put that way I understand it. I think that should be the introductory paragraph, followed by a bulleted list explaining the situations in which it would be off.
    Better than? This is in addition to the above, to be clear. This is
    basically a shortcut for someone setting pg_unicode false and issuing
    a "SET client_encoding = 'foo'".
    Unless I set it to "utf8", in which case pg_unicode would be true and
    client_encoding would be set to "UTF-8". Right?
    Right. Although in most cases that will be a no-op as those will already
    be set that way. Although a weak case could be argued that setting it
    to UTF-8 via the interface should turn pg_unicode *off*, to be consistent.
    But I think that's all the more reason for a separate knob, and one of the
    reasons I'm only lukewarm to the whole $h->{encoding} thing.
    I think that setting pg_encoding should always turn pg_unicode *on*.
    From the user's perspective, I think it makes much more sense. It says,
    "Here is what I want the encoding to be," which is easier to understand
    than "Should we or should we not convert the incoming data to Perl's
    internal form." Most people won't know WTF that means.
    Yeah, that's true. On the other hand, even the encoding setting is meant
    as sort of an expert knob.
    Maybe. I think a lot of existing installations may find they need to turn it off, unless they had been using pg_enable_utf8 before.
    We still need a flag to know if we are unicoding or not. We cannot tell just
    from a stored client_encoding.
    Why not? That's what pg_unicode was figuring out on its own if you didn't set it.
    Yes, but once we call $h->{encoding}, we need to track both the encoding and
    the fact that we are decoding or not. Which could be either way. Which raises
    a point: if we need a way to get things back to "normal" after the user
    sets $h->{encoding} to something weird, presumably they would then call
    $h->{encoding} = UTF-8. So perhaps that answers the above: we turn pg_unicode
    *on* in that case. But it still means that there is no way for someone to
    want a UTF-8 client_encoding but do NOT want us to decode things. Sigh.
    I think that setting pg_encoding should turn on pg_unicode, unless it's set to :raw or something. Then someone could always explicitly set both to make it do what they mean.
    (some more of the same arguments trimmed from your reply)
    Yeah, sorry. :-)
    Or at the very least, we have separate flags for incoming and outgoing tweaking.
    Oy. Let's not go there yet.
    How about now? :) The problem is that people have existing scripts that we don't
    want to fail, and are trying to shove who-knows-what into the database, so we
    definitely want to clean up their mess as it comes in, but give them the option
    not to mess with it in case that is what they need. I think that should be a separate
    knob from the stuff coming back from the database. To put another way, I'm happy
    linking the two together for most things but providing an expert knob just in case
    they need it that can de-couple them.
    Oh I agree, I just think it's worth putting off until this other stuff gets sorted out.
    I'm trying to make this as bulletproof as possible so that we break as few existing
    scripts as possible on the first release, and allow as much fine-tuning as needed
    from the get-go, since we cannot know what will really break or the strange combinations
    people will want until this is released in the wild.
    The truth is, unless we pay attention to what pg_enable_utf8 was set to in such scripts -- and if it was set -- then suddenly having stuff be encoded and decoded when it wasn't before may surprise some folks. It *shouldn't*, but it will be different than what it was doing before.

    Have you asked Tim Bunce about any of this stuff? I know he has thought about adding encoding knobs to the DBI core, but I don't know how far along he got in thinking about a design.

    Best,

    David
  • Greg Sabino Mullane at Jul 21, 2011 at 11:01 pm

    Okay, put that way I understand it. I think that should be the introductory
    paragraph, followed by a bulleted list explaining the situations in which
    it would be off.
    +1. I'm going to try and find some time to make a dev version with most
    of what we are talking about here soonish.
    I think that setting pg_encoding should always turn pg_unicode *on*.
    Hm...no, I think it should always be off. If someone really wants a different
    encoding, they probably are used to it coming back "raw". David C,
    I think we talked about this?
    Yeah, that's true. On the other hand, even the encoding setting is meant
    as sort of an expert knob.
    Maybe. I think a lot of existing installations may find they need to
    turn it off, unless they had been using pg_enable_utf8 before.
    Yep: no way to know until we release. David and I were thinking that the
    other direction (data going to database) is probably more likely to
    break things.
    I think that setting pg_encoding should turn on pg_unicode, unless it's
    set to :raw or something. Then someone could always explicitly set both
    to make it do what they mean.
    Yep, more knobs, more knobs! ;)
    (some more of the same arguments trimmed from your reply)
    Yeah, sorry. :-)
    No, I meant trimmed more of the stuff you said that bolstered my arguments,
    so no need to include it. Unless we want to really pile it on for David C.
    Oh I agree, I just think it's worth putting off until this other stuff
    gets sorted out.
    Nah, the more stuff we can fix out of the gate the better.
    The truth is, unless we pay attention to what pg_enable_utf8 was set
    to in such scripts -- and if it was set -- then suddenly having stuff
    be encoded and decoded when it wasn't before may surprise some folks.
    It *shouldn't*, but it will be different than what it was doing before.
    Yep. That's why this is a major release - we should not, and cannot,
    make everyone happy. Some people's scripts will break. Most (all?) will
    be able to twist some new knobs and get things working again.
    Have you asked Tim Bunce about any of this stuff? I know he has
    thought about adding encoding knobs to the DBI core, but I don't
    know how far along he got in thinking about a design.
    Good idea: I have not. Will try to do so. Or anyone else that wants to
    raise this on dbi-dev....


    --
    Greg Sabino Mullane [email protected]
    End Point Corporation http://www.endpoint.com/
    PGP Key: 0x14964AC8 201107211900
    http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
  • David E. Wheeler at Jul 21, 2011 at 11:16 pm

    On Jul 21, 2011, at 4:01 PM, Greg Sabino Mullane wrote:

    I think that setting pg_encoding should always turn pg_unicode *on*.
    Hm...no, I think it should always be off. If someone really wants a different
    encoding, they probably are used to it coming back "raw". David C,
    I think we talked about this?
    I disagree. It's me telling DBD::Pg what encoding the database uses, but I definitely want that converted to Perl's internal form. I *only* want raw if I explicitly ask for raw (or if there's no choice, such as when I set the encoding to ":raw" or something).

    I think of it being kind of like the `encoding` pragma, in which I declare the encoding of my source code. Perl sees that and converts it to its internal form.
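
A minimal sketch of what "converted to Perl's internal form" means in practice, using only the core Encode module. `to_perl_string` is a hypothetical name for illustration, and treating ":raw" as "leave the octets alone" is an assumption based on the wording above, not DBD::Pg behavior.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn octets as received from the server into a
# Perl character string, given the declared encoding name.
sub to_perl_string {
    my ($octets, $encoding) = @_;
    return $octets if $encoding eq ':raw';   # explicit opt-out: bytes stay bytes
    return decode($encoding, $octets);
}

my $latin1_bytes = "caf\xE9";        # "café" encoded as ISO-8859-1 (4 octets)
my $utf8_bytes   = "caf\xC3\xA9";    # "café" encoded as UTF-8 (5 octets)

# Once decoded, the source encoding no longer matters: both values are
# the same 4-character string.
my $from_latin1 = to_perl_string($latin1_bytes, 'iso-8859-1');
my $from_utf8   = to_perl_string($utf8_bytes,   'UTF-8');
```

The point of the sketch is that the declared encoding only describes the bytes on the wire; the application sees the same characters regardless.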
    Maybe. I think a lot of existing installations may find they need to
    turn it off, unless they had been using pg_enable_utf8 before.
    Yep: no way to know until we release. David and I were thinking that the
    other direction (data going to database) is probably more likely to
    break things.
    I wonder if, as an interim measure, existing code that sets pg_enable_utf8 should still do something, like set pg_encoding to "utf-8" and turn pg_unicode on.
    Oh I agree, I just think it's worth putting off until this other stuff
    gets sorted out.
    Nah, the more stuff we can fix out of the gate the better.
    Okay.
    Have you asked Tim Bunce about any of this stuff? I know he has
    thought about adding encoding knobs to the DBI core, but I don't
    know how far along he got in thinking about a design.
    Good idea: I have not. Will try to do so. Or anyone else that wants to
    raise this on dbi-dev....
    Yes, a must, IMHO. More cooks! ;-P

    Best,

    David
  • David Christensen at Jul 21, 2011 at 11:23 pm

    On Jul 21, 2011, at 6:16 PM, David E. Wheeler wrote:
    On Jul 21, 2011, at 4:01 PM, Greg Sabino Mullane wrote:

    I think that setting pg_encoding should always turn pg_unicode *on*.
    Hm...no, I think it should always be off. If someone really wants a different
    encoding, they probably are used to it coming back "raw". David C,
    I think we talked about this?
    I disagree. It's me telling DBD::Pg what encoding the database uses, but I definitely want that converted to Perl's internal form. I *only* want raw if I explicitly ask for raw (or if there's no choice, such as when I set the encoding to ":raw" or something).
    As a developer, why would you care if this information is available from the database itself? If you are caring about the encoding at all, you would be dealing with bytes/octets. Perl does not store Unicode characters in any format besides UTF-8, so you're not changing "internal" characteristics; what DBD::Pg uses to talk to your database shouldn't matter.
    I think of it being kind of like the `encoding` pragma, in which I declare the encoding of my source code. Perl sees that and converts it to its internal form.
    The only time this would be useful would be if your database is set to something inscrutable (aka SQL_ASCII); if your end result is meant to be internal perl, you have no business providing an encoding.
    Maybe. I think a lot of existing installations may find they need to
    turn it off, unless they had been using pg_enable_utf8 before.
    Yep: no way to know until we release. David and I were thinking that the
    other direction (data going to database) is probably more likely to
    break things.
    I wonder if, as an interim measure, existing code that sets pg_enable_utf8 should still do something, like set pg_encoding to "utf-8" and turn pg_unicode on.
    Yeah, I'd had that thought, along with spitting the deprecation warning.
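
The interim measure discussed here could look roughly like the following. This is a sketch under stated assumptions, not DBD::Pg code: `upgrade_legacy_attrs` is a hypothetical name, the warning text is made up, and leaving the new attributes untouched when pg_enable_utf8 is false is a guess, since the thread does not say what should happen in that case.

```perl
use strict;
use warnings;

# Hypothetical shim, not DBD::Pg's actual code: translate the legacy
# pg_enable_utf8 attribute into the proposed pg_encoding/pg_unicode pair.
sub upgrade_legacy_attrs {
    my ($attr) = @_;
    my %new = %$attr;    # work on a copy; the caller's hash is not modified

    if (exists $new{pg_enable_utf8}) {
        my $legacy = delete $new{pg_enable_utf8};
        warn "pg_enable_utf8 is deprecated; use pg_unicode instead\n";

        if ($legacy) {
            # Only fill in values the caller did not set explicitly.
            $new{pg_encoding} = 'utf-8' unless defined $new{pg_encoding};
            $new{pg_unicode}  = 1       unless defined $new{pg_unicode};
        }
    }
    return \%new;
}
```

Called as `upgrade_legacy_attrs({ pg_enable_utf8 => 1 })`, it returns a hash reference with pg_encoding set to "utf-8" and pg_unicode set to 1, and emits one warning.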
    Oh I agree, I just think it's worth putting off until this other stuff
    gets sorted out.
    Nah, the more stuff we can fix out of the gate the better.
    Okay.
    +1.
    Have you asked Tim Bunce about any of this stuff? I know he has
    thought about adding encoding knobs to the DBI core, but I don't
    know how far along he got in thinking about a design.
    Good idea: I have not. Will try to do so. Or anyone else that wants to
    raise this on dbi-dev....
    Yes, a must, IMHO. More cooks! ;-P

    Yeah, it'd be nice to know what at least some proposed interfaces/APIs are so we don't need to support a whole other place setting for years to come.

    Regards,

    David
    --
    David Christensen
    End Point Corporation
    [email protected]
  • David E. Wheeler at Jul 21, 2011 at 11:54 pm

    On Jul 21, 2011, at 4:23 PM, David Christensen wrote:

    I disagree. It's me telling DBD::Pg what encoding the database uses, but I definitely want that converted to Perl's internal form. I *only* want raw if I explicitly ask for raw (or if there's no choice, such as when I set the encoding to ":raw" or something).
    As a developer, why would you care if this information is available from the database itself? If you are caring about the encoding at all, you would be dealing with bytes/octets. Perl does not store Unicode characters in any format besides UTF-8, so you're not changing "internal" characteristics; what DBD::Pg uses to talk to your database shouldn't matter.
    Because otherwise what's the point? I could just turn pg_unicode off.
    I think of it being kind of like the `encoding` pragma, in which I declare the encoding of my source code. Perl sees that and converts it to its internal form.
    The only time this would be useful would be if your database is set to something inscrutable (aka SQL_ASCII); if your end result is meant to be internal perl, you have no business providing an encoding.
    You are convincing me now that pg_encoding may not be useful at all, then.
    I wonder if, as an interim measure, existing code that sets pg_enable_utf8 should still do something, like set pg_encoding to "utf-8" and turn pg_unicode on.
    Yeah, I'd had that thought, along with spitting the deprecation warning.
    Right, I think that'd be the least painful thing for users.
    Yeah, it'd be nice to know what at least some proposed interfaces/APIs are so we don't need to support a whole other place setting for years to come.
    +1

    Best,

    David
  • Dhudes at Jul 22, 2011 at 12:06 am
    I point out that Pg itself supports Perl for stored procedures in PL/Perl.
    What if I have a Perl program which wants to store a Perl procedure to the database? I have not done such a thing myself, as I use stored procedures very seldom, but I could see interesting possibilities.
  • David E. Wheeler at Jul 22, 2011 at 12:19 am

    On Jul 21, 2011, at 5:06 PM, [email protected] wrote:

    I point out that Pg itself supports Perl for stored procedures in PL/Perl.
    What if I have a Perl program which wants to store a Perl procedure to the database? I have not done such a thing myself, as I use stored procedures very seldom, but I could see interesting possibilities.
    This issue was fixed in 9.0 in PL/Perl. This is the relevant release note:

    • Verify that PL/Perl return values are valid in the server encoding (Andrew Dunstan)

    What this means is that it's up to you to return data from Perl in the server encoding, IIRC. If the server encoding is UTF-8, then IIRC things should just work.
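
As a rough client-side illustration of "valid in the server encoding", the check below asks the core Encode module whether a byte string is well-formed in a named encoding. It is not how the server performs its verification; `is_valid_in_encoding` is a hypothetical helper, and using "UTF-8" as the server encoding is just the case mentioned above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical check: are these octets well-formed in the given encoding?
sub is_valid_in_encoding {
    my ($octets, $encoding) = @_;
    my $copy = $octets;    # decode() may modify its input when a CHECK flag is set
    # FB_CROAK makes decode() die on malformed input; eval turns that into 0.
    return eval { Encode::decode($encoding, $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}
```

With this sketch, the UTF-8 bytes "caf\xC3\xA9" pass for 'UTF-8', while the Latin-1 bytes "caf\xE9" do not.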

    Would be nice to see this spelled out in the docs somewhere, though. Think I'll bug Andrew about that.

    Best,

    David

Discussion Overview
group: dbd-pg
categories: perl
posted: Jul 8, '11 at 5:39a
active: Jul 22, '11 at 12:19a
posts: 11
users: 4
website: perl.org
