Apologies for the delayed reply here. I wanted a chance to really read through this stuff carefully.
On Jun 28, 2011, at 3:31 PM, [email protected] wrote:
Committed by David Christensen <[email protected]>
Subject: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM
---
TODO.utf8 | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 161 insertions(+), 0 deletions(-)
diff --git a/TODO.utf8 b/TODO.utf8
new file mode 100644
index 0000000..5260bac
--- /dev/null
+++ b/TODO.utf8
@@ -0,0 +1,161 @@
+Summary of design changes from discussions with GSM and DWC re: utf-8 in DBD::Pg
+================================================================================
+
+Behavior of the pg_unicode/pg_utf8_strings connection attribute
+---------------------------------------------------------------
+We will utilize a connect attribute (enabled by default) to enable the
+use of an immediate SET client_encoding. The current name of this is
+"pg_utf8_strings", but DWC prefers something non-encoding specific;
+examples wanted, but "pg_unicode" or "pg_internal" seem best.
pg_decode_strings. Or pg_encode_strings, depending on how you look at it.
+If the "pg_internal" attribute is explicitly provided in the DBI
+connect attributes it will be one of (0, 1), to enable/disable the
+pg_internal behavior explicitly. If not provided, we check the
+initial "server_encoding" and "client_encoding" settings.
+
+The logic for setting "pg_internal" when unspecified is:
+
+ - If "server_encoding" is "SQL_ASCII" set pg_internal to 0.
+
+ - If "client_encoding" <> "server_encoding", or perhaps better yet if
+ the pg_setting("client_encoding") returns a different value than
+ the default version for that setting, then we assuming that the
+ client encoding choice is *explicit* and the user will be wanting
+ to get raw octets back from DBI, thus set pg_internal to 0.
I find this description confusing. What is the default value for that setting? I mean, how can one know that?Committed by David Christensen <[email protected]>
Subject: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM
---
TODO.utf8 | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 161 insertions(+), 0 deletions(-)
diff --git a/TODO.utf8 b/TODO.utf8
new file mode 100644
index 0000000..5260bac
--- /dev/null
+++ b/TODO.utf8
@@ -0,0 +1,161 @@
+Summary of design changes from discussions with GSM and DWC re: utf-8 in DBD::Pg
+================================================================================
+
+Behavior of the pg_unicode/pg_utf8_strings connection attribute
+---------------------------------------------------------------
+We will utilize a connect attribute (enabled by default) to enable the
+use of an immediate SET client_encoding. The current name of this is
+"pg_utf8_strings", but DWC prefers something non-encoding specific;
+examples wanted, but "pg_unicode" or "pg_internal" seem best.
pg_decode_strings. Or pg_encode_strings, depending on how you look at it.
+If the "pg_internal" attribute is explicitly provided in the DBI
+connect attributes it will be one of (0, 1), to enable/disable the
+pg_internal behavior explicitly. If not provided, we check the
+initial "server_encoding" and "client_encoding" settings.
+
+The logic for setting "pg_internal" when unspecified is:
+
+ - If "server_encoding" is "SQL_ASCII" set pg_internal to 0.
+
+ - If "client_encoding" <> "server_encoding", or perhaps better yet if
+ the pg_setting("client_encoding") returns a different value than
+ the default version for that setting, then we assuming that the
+ client encoding choice is *explicit* and the user will be wanting
+ to get raw octets back from DBI, thus set pg_internal to 0.
Assuming one can, I suggest alternate phrasing:
- If "client_encoding" is not set to its default value, DBD::Pg assumes that the choice is explicit, so pg_internal is false.
+ - Otherwise set pg_internal to 1.
But we strongly recommend you set it explicitly to avoid confusion. And really, setting it to 1 is strongly recommended for proper and transparent handling of multibyte characters.
+
+Immediately after the connection initialization completes, we will
+check for the set pg_internal flag; if set, we issue a "SET
+client_encoding TO 'utf-8'" and commit.
Sounds sensible.
+
+
+Proposal for an "encoding" DBD attribute interface
+--------------------------------------------------
+
+DWC suggested a DBD::db attribute handle, suggested to be called
+"encoding" which when set would effectively pass-thru to the
+underlying: "SET client_encoding = $blah" and *disable* the
+pg_internal flag. Specifically, by setting the encoding attribute,
+you are effectively indicating that you want the data from PostgreSQL
+back
I like this *so* much better.
+
+If such a mechanism *was* instituted, we could utilize `pg_encoding =>
+'blah'` as the connection-level attribute and just tie the underlying
+implementation of the pg_internal mechanism to this, by having a
+keyword ('internal') as the special-case encoding, which could be
+enabled/disabled via $dbh->{pg_encoding} = 'internal';
WTF is internal?But we strongly recommend you set it explicitly to avoid confusion. And really, setting it to 1 is strongly recommended for proper and transparent handling of multibyte characters.
+
+Immediately after the connection initialization completes, we will
+check for the set pg_internal flag; if set, we issue a "SET
+client_encoding TO 'utf-8'" and commit.
Sounds sensible.
+
+
+Proposal for an "encoding" DBD attribute interface
+--------------------------------------------------
+
+DWC suggested a DBD::db attribute handle, suggested to be called
+"encoding" which when set would effectively pass-thru to the
+underlying: "SET client_encoding = $blah" and *disable* the
+pg_internal flag. Specifically, by setting the encoding attribute,
+you are effectively indicating that you want the data from PostgreSQL
+back
I like this *so* much better.
+
+If such a mechanism *was* instituted, we could utilize `pg_encoding =>
+'blah'` as the connection-level attribute and just tie the underlying
+implementation of the pg_internal mechanism to this, by having a
+keyword ('internal') as the special-case encoding, which could be
+enabled/disabled via $dbh->{pg_encoding} = 'internal';
Seems to me that with pg_encoding you don't need pg_internal at all. You just have a default value for pg_encoding, which would be:
* If "client_encoding" is not set to its default value, DBD::Pg assumes that the choice is explicit, so use that.
* Else if "server_encoding" is "SQL_ASCII" set pg_encoding to "SQL_ASCII".
* Else use "utf-8".
+
+This would allow us to pass-through utf-8 *without* setting the SvUTF8
+flag by setting $dbh->{pg_encoding} = 'utf-8'.
+1. And the fewer of these options the better, IMHO.
+Behavior changes if pg_internal is set
+--------------------------------------
Or if pg_encoding eq 'utf-8'.
+There will be two distinct changes that need to take place,
+specifically input and output.
+
+When processing the result sets returned by the server, if pg_internal
+is set, we can either fiat that the "client_encoding" is set to UTF-8
+as it was originally when we switched it on connection, or verify that
+the libpq's result set charset/encoding is equal to UTF-8. I believe
+this is available as an int, which could be cached when we do the
+original "SET client_encoding" and/or initial setup tests, which
+should prevent accidental corruption.
Or just strongly recommend that if you want to change it, set pg_encoding instead of executing SET CLIENT_ENCODING.
+ - if we decide to go this route and detect the charset change, we can
+ issue a notice/warning from DBD::Pg that the client_encoding has
+ changed and then turn off the pg_internal flag.
But only if pg_internal was not explicitly set by the user, right?
+ - if everything checks out, we use the usual dequote_* methods and
+ set the SvUTF8 flag on either text-based bytes, or set only on the
+ ASCII datums.
+
+ - a possible option to benchmark would be to directly use the
+ "utf8::upgrade" method from the perl internals (or some Sv-creation
+ method based on (char*)) to take advantage of any perl-specific
+ enhancements already in place. This may be just as fast since perl
+ already needs to copy the (char*) contents into the SV, and may
+ already have fast-tracked code-paths for this type of operation,
+ since we know the data will be valid UTF8.
+
+When processing data coming *in* from the user i.e., (SV*) we consider
+the following:
+
+ - if pg_internal is 0, pass through the normal methods unabashed.
+
+ - if pg_internal is 1 and incoming SV's UTF8 flag is 1, we
+ do nothing; the underlying (char*) will already be in utf-8 data.
Maybe. utf8 ne UTF-8, quite.
+ - if pg_internal is 1 and incoming SV's UTF8 flag is 0, we need
+ special consideration for hi-bit characters; since we've
+ effectively co-opted the expected client_encoding and forced UTF8,
+ we need to treat the raw data as octets. We have a couple choices:
+
+ - treat as latin-1/perl raw. This may be a good default choice,
+ but I'm not 100% convinced; in any case we would need to
+ convert from raw to utf-8 using utf8::upgrade.
I think this is basically what Perl assumes, so it's probably pretty safe. It would also be the reasonable thing to do if pg_encoding is set to something other than utf-8: you assume the user knows what she's doing and passing things in the proper encoding.
+
+ - treat as original client_encoding. This may be the least
+ changed expectation as far as the user is concerned, but
+ requires us to either:
+
+ a) switch client_encoding for query to the original
+ client_encoding, while somehow still retaining the utf-8
+ client encoding for result set retrieval, or,
+
+ b) actually use Encode to transcode from the original
+ client_encoding to UTF8. I think GSM is particularly
+ against bringing Encode into the picture just due to
+ additional complexity issues.
To me, this is just more reason to use pg_encoding and not have pg_internal at all.
+
+Implementation considerations/ideas
+-----------------------------------
+
+DWC feels strongly that we should avoid setting the SvUTF8 flag on any
+retrieved/created SV which does not require it; Why?
as such, an operation
+that can quickly check whether there are any hi-bit characters in a
+given (char*) would need to be weighed against the possible
+inconvenience of *always* setting the SvUTF8 flag on eligible strings,
+regardless of whether it is full ASCII.
Yeah, needs benchmarking. And if it's slow and you still want it, maybe give us a knob to turn it off.+This would allow us to pass-through utf-8 *without* setting the SvUTF8
+flag by setting $dbh->{pg_encoding} = 'utf-8'.
+1. And the fewer of these options the better, IMHO.
+Behavior changes if pg_internal is set
+--------------------------------------
Or if pg_encoding eq 'utf-8'.
+There will be two distinct changes that need to take place,
+specifically input and output.
+
+When processing the result sets returned by the server, if pg_internal
+is set, we can either fiat that the "client_encoding" is set to UTF-8
+as it was originally when we switched it on connection, or verify that
+the libpq's result set charset/encoding is equal to UTF-8. I believe
+this is available as an int, which could be cached when we do the
+original "SET client_encoding" and/or initial setup tests, which
+should prevent accidental corruption.
Or just strongly recommend that if you want to change it, set pg_encoding instead of executing SET CLIENT_ENCODING.
+ - if we decide to go this route and detect the charset change, we can
+ issue a notice/warning from DBD::Pg that the client_encoding has
+ changed and then turn off the pg_internal flag.
But only if pg_internal was not explicitly set by the user, right?
+ - if everything checks out, we use the usual dequote_* methods and
+ set the SvUTF8 flag on either text-based bytes, or set only on the
+ ASCII datums.
+
+ - a possible option to benchmark would be to directly use the
+ "utf8::upgrade" method from the perl internals (or some Sv-creation
+ method based on (char*)) to take advantage of any perl-specific
+ enhancements already in place. This may be just as fast since perl
+ already needs to copy the (char*) contents into the SV, and may
+ already have fast-tracked code-paths for this type of operation,
+ since we know the data will be valid UTF8.
+
+When processing data coming *in* from the user i.e., (SV*) we consider
+the following:
+
+ - if pg_internal is 0, pass through the normal methods unabashed.
+
+ - if pg_internal is 1 and incoming SV's UTF8 flag is 1, we
+ do nothing; the underlying (char*) will already be in utf-8 data.
Maybe. utf8 ne UTF-8, quite.
+ - if pg_internal is 1 and incoming SV's UTF8 flag is 0, we need
+ special consideration for hi-bit characters; since we've
+ effectively co-opted the expected client_encoding and forced UTF8,
+ we need to treat the raw data as octets. We have a couple choices:
+
+ - treat as latin-1/perl raw. This may be a good default choice,
+ but I'm not 100% convinced; in any case we would need to
+ convert from raw to utf-8 using utf8::upgrade.
I think this is basically what Perl assumes, so it's probably pretty safe. It would also be the reasonable thing to do if pg_encoding is set to something other than utf-8: you assume the user knows what she's doing and passing things in the proper encoding.
+
+ - treat as original client_encoding. This may be the least
+ changed expectation as far as the user is concerned, but
+ requires us to either:
+
+ a) switch client_encoding for query to the original
+ client_encoding, while somehow still retaining the utf-8
+ client encoding for result set retrieval, or,
+
+ b) actually use Encode to transcode from the original
+ client_encoding to UTF8. I think GSM is particularly
+ against bringing Encode into the picture just due to
+ additional complexity issues.
To me, this is just more reason to use pg_encoding and not have pg_internal at all.
+
+Implementation considerations/ideas
+-----------------------------------
+
+DWC feels strongly that we should avoid setting the SvUTF8 flag on any
+retrieved/created SV which does not require it; Why?
as such, an operation
+that can quickly check whether there are any hi-bit characters in a
+given (char*) would need to be weighed against the possible
+inconvenience of *always* setting the SvUTF8 flag on eligible strings,
+regardless of whether it is full ASCII.
Best,
David