I'm investigating the possibility of developing a utility function for
our C++ client library, libpqxx, that produces array literals that can
be used in prepared statements. This problem appears to be a bit of a
tar pit, so I'm hoping that someone can help me out. My goal is to
produce a template function that accepts arbitrarily nested standard
library containers, that contain at the most nested level
constants/literals of some type that can be fed into a stream, such as
an int or a std::string.

I'm aware that I cannot assume that types are delimited by a single
quote, even for built-in types. I thought that I would put the onus on
the client to specify the correct delimiter, by checking pg_type
themselves if necessary, but default to ',' . Is this a reasonable
approach?

Escaping/quoting individual elements seems tricky. I have produced a
generic and superficially well behaved implementation by using double
quotes for constants. However, I have now opened the door to malicious
parties injecting multiple array elements where only one is allowed,
or causing malformed array literal errors by simply including a double
quote of their own. It's not clear where the responsibility should
rest for escaping constants/ensuring that constants don't contain
double quotes. Can someone suggest a better approach? I can't very
well use single quotes, because they are escaped/doubled up when we
pass the array literal to something similar to PQexecPrepared(), and
they shouldn't be - strings end up looking like this: "'has errant
single quotes on either side'".

Since Postgres only supports encodings that are ASCII supersets, I
don't believe that I have to consider encoding - only my clients do.

Can someone please point me in the direction of an established client
library/driver where all corner cases are covered, or at least enough
of them to produce a net gain in usefulness? There may well be
additional subtleties that have not occurred to me.

--
Regards,
Peter Geoghegan

Search Discussions

  • Greg Stark at Feb 23, 2011 at 4:37 am

    On Wed, Feb 23, 2011 at 4:16 AM, Peter Geoghegan wrote:
    Since Postgres only supports encodings that are ASCII supersets, I
    don't believe that I have to consider encoding - only my clients do.
    This is only true for server encodings. In a client library I think
    you lose on this and do have to deal with it. I'm not sure what client
    encodings we do support that aren't ascii-supersets though, it's
    possible none of them generate quote characters this way.

    I'm a bit surprised libpqxx isn't using binary mode internally though.
    This would at least avoid the problems with encoding. However I'm not
    sure things like the array binary format are really stable and
    portable enough to really use from a client library. Some datatypes
    might be dependent on the server ABI (floats -- I'm looking at you) so
    that might make it difficult or impossible.

    --
    greg
  • Tatsuo Ishii at Feb 23, 2011 at 6:38 am

    This is only true for server encodings. In a client library I think
    you lose on this and do have to deal with it. I'm not sure what client
    encodings we do support that aren't ascii-supersets though, it's
    possible none of them generate quote characters this way.
    We have a clear definition what encodings are for client
    only(mb/pg_wchar.h):

    /* followings are for client encoding only */
    PG_SJIS, /* Shift JIS (Winindows-932) */
    PG_BIG5, /* Big5 (Windows-950) */
    PG_GBK, /* GBK (Windows-936) */
    PG_UHC, /* UHC (Windows-949) */
    PG_GB18030, /* GB18030 */
    PG_JOHAB, /* EUC for Korean JOHAB */
    PG_SHIFT_JIS_2004, /* Shift-JIS-2004 */
    --
    Tatsuo Ishii
    SRA OSS, Inc. Japan
    English: http://www.sraoss.co.jp/index_en.php
    Japanese: http://www.sraoss.co.jp
  • Peter Geoghegan at Feb 23, 2011 at 3:09 pm

    On 23 February 2011 04:36, Greg Stark wrote:
    This is only true for server encodings. In a client library I think
    you lose on this and do have to deal with it. I'm not sure what client
    encodings we do support that aren't ascii-supersets though, it's
    possible none of them generate quote characters this way.
    I'm pretty sure all of the client encodings Tatsuo mentions are ASCII
    supersets. The absence of by far the most popular non-ASCII superset
    encoding, UTF-16, as a client encoding indicated that to me. It isn't
    byte oriented, and Postgres is.
    I'm a bit surprised libpqxx isn't using binary mode internally though.
    This would at least avoid the problems with encoding. However I'm not
    sure things like the array binary format are really stable and
    portable enough to really use from a client library. Some datatypes
    might be dependent on the server ABI (floats -- I'm looking at you) so
    that might make it difficult or impossible.
    Yes, that question is addressed here:
    http://pqxx.org/development/libpqxx/wiki/BinaryTransfers .

    --
    Regards,
    Peter Geoghegan
  • Andrew Dunstan at Feb 23, 2011 at 3:16 pm

    On 02/23/2011 10:09 AM, Peter Geoghegan wrote:
    On 23 February 2011 04:36, Greg Starkwrote:
    This is only true for server encodings. In a client library I think
    you lose on this and do have to deal with it. I'm not sure what client
    encodings we do support that aren't ascii-supersets though, it's
    possible none of them generate quote characters this way.
    I'm pretty sure all of the client encodings Tatsuo mentions are ASCII
    supersets. The absence of by far the most popular non-ASCII superset
    encoding, UTF-16, as a client encoding indicated that to me. It isn't
    byte oriented, and Postgres is.

    They are not. It's precisely because they are not that they are not
    allowed as server encodings.


    cheers

    andrew
  • Heikki Linnakangas at Feb 23, 2011 at 3:22 pm

    On 23.02.2011 17:16, Andrew Dunstan wrote:
    On 02/23/2011 10:09 AM, Peter Geoghegan wrote:
    On 23 February 2011 04:36, Greg Starkwrote:
    This is only true for server encodings. In a client library I think
    you lose on this and do have to deal with it. I'm not sure what client
    encodings we do support that aren't ascii-supersets though, it's
    possible none of them generate quote characters this way.
    I'm pretty sure all of the client encodings Tatsuo mentions are ASCII
    supersets. The absence of by far the most popular non-ASCII superset
    encoding, UTF-16, as a client encoding indicated that to me. It isn't
    byte oriented, and Postgres is.
    They are not. It's precisely because they are not that they are not
    allowed as server encodings.
    To be precise, they are all ASCII supersets in the sense that a valid
    7-bit ASCII string is valid and means the same thing in all of the
    client-only encodings as well. The difference between supported
    server-encodings and those that are only supported as client_encoding is
    whether *all* bytes in a multi-byte character have the high bit set. All
    server-encodings have that property, and we rely on it in the backend.
    In the supported client-only encodings, the *first* byte of a multi-byte
    character is guaranteed to have the high bit set, but the subsequent
    bytes are not.

    Even that more loose property isn't true for UTF-16, which is why we
    don't support it even as a client-only encoding.

    --
    Heikki Linnakangas
    EnterpriseDB http://www.enterprisedb.com
  • Tom Lane at Feb 23, 2011 at 3:32 pm

    Heikki Linnakangas writes:
    On 23.02.2011 17:16, Andrew Dunstan wrote:
    On 02/23/2011 10:09 AM, Peter Geoghegan wrote:
    I'm pretty sure all of the client encodings Tatsuo mentions are ASCII
    supersets.
    They are not. It's precisely because they are not that they are not
    allowed as server encodings.
    In the supported client-only encodings, the *first* byte of a multi-byte
    character is guaranteed to have the high bit set, but the subsequent
    bytes are not.
    And to be even more specific: the problem with the client-only encodings
    is that some of them allow the byte values corresponding to "\" and
    other escaping-critical characters to appear as non-first bytes in a
    multibyte character. This is nasty because you have to be aware of the
    encoding to do escaping correctly and not break the data. And as soon
    as the server and client don't agree on what the encoding is, you have
    the potential for SQL-injection security holes, not just confused data.

    regards, tom lane
  • Andrew Dunstan at Feb 23, 2011 at 3:36 pm

    On 02/23/2011 10:22 AM, Heikki Linnakangas wrote:
    On 23.02.2011 17:16, Andrew Dunstan wrote:
    On 02/23/2011 10:09 AM, Peter Geoghegan wrote:
    On 23 February 2011 04:36, Greg Starkwrote:
    This is only true for server encodings. In a client library I think
    you lose on this and do have to deal with it. I'm not sure what client
    encodings we do support that aren't ascii-supersets though, it's
    possible none of them generate quote characters this way.
    I'm pretty sure all of the client encodings Tatsuo mentions are ASCII
    supersets. The absence of by far the most popular non-ASCII superset
    encoding, UTF-16, as a client encoding indicated that to me. It isn't
    byte oriented, and Postgres is.
    They are not. It's precisely because they are not that they are not
    allowed as server encodings.
    To be precise, they are all ASCII supersets in the sense that a valid
    7-bit ASCII string is valid and means the same thing in all of the
    client-only encodings as well. The difference between supported
    server-encodings and those that are only supported as client_encoding
    is whether *all* bytes in a multi-byte character have the high bit
    set. All server-encodings have that property, and we rely on it in the
    backend. In the supported client-only encodings, the *first* byte of a
    multi-byte character is guaranteed to have the high bit set, but the
    subsequent bytes are not.
    Yes, that's a better explanation.
    Even that more loose property isn't true for UTF-16, which is why we
    don't support it even as a client-only encoding.
    The fact that UTF-16 uses nul bytes would make it particularly hard to
    handle.

    There might be value in having a UTF-16 aware version of libpq that
    would translate strings into UTF-8 on the way to the server and to
    UTF-16 on the way back to the client.

    cheers

    andrew
  • Greg Stark at Feb 23, 2011 at 3:26 pm

    On Wed, Feb 23, 2011 at 3:16 PM, Andrew Dunstan wrote:
    On 02/23/2011 10:09 AM, Peter Geoghegan wrote:

    On 23 February 2011 04:36, Greg Starkwrote:
    This is only true for server encodings. In a client library I think
    you lose on this and do have to deal with it. I'm not sure what client
    encodings we do support that aren't ascii-supersets though, it's
    possible none of them generate quote characters this way.
    I'm pretty sure all of the client encodings Tatsuo mentions are ASCII
    supersets. The absence of by far the most popular non-ASCII superset
    encoding, UTF-16, as a client encoding indicated that to me. It isn't
    byte oriented, and Postgres is.

    They are not. It's precisely because they are not that they are not allowed
    as server encodings.
    Well just as an example, BIG5 is a 16-bit encoding where the first
    byte always has the high bit set. The second byte can't be a ' or "
    because those aren't in the allowable range for the second byte. So it
    might be safe anyways. However \ is in the allowable range so I'm not
    sure.

    In the case of BIG5 ascii characters are included inline so any byte
    with the high bit *not* set that isn't the second byte of a two-byte
    sequence is assumed to be ascii. So an ascii parser would work fine
    modulo the problem above with backslashes.

    But this is just a special case. Wikipedia implies it's also true for
    shift-JIS but there's no guarantee it would work for other client
    encodings.

    --
    greg
  • Merlin Moncure at Feb 23, 2011 at 3:34 pm

    On Tue, Feb 22, 2011 at 10:16 PM, Peter Geoghegan wrote:
    I'm investigating the possibility of developing a utility function for
    our C++ client library, libpqxx, that produces array literals that can
    be used in prepared statements. This problem appears to be a bit of a
    tar pit, so I'm hoping that someone can help me out. My goal is to
    produce a template function that accepts arbitrarily nested standard
    library containers, that contain at the most nested level
    constants/literals of some type that can be fed into a stream, such as
    an int or a std::string.

    I'm aware that I cannot assume that types are delimited by a single
    quote, even for built-in types. I thought that I would put the onus on
    the client to specify the correct delimiter, by checking pg_type
    themselves if necessary, but default to ',' . Is this a reasonable
    approach?

    Escaping/quoting individual elements seems tricky. I have produced a
    generic and superficially well behaved implementation by using double
    quotes for constants. However, I have now opened the door to malicious
    parties injecting multiple array elements where only one is allowed,
    or causing malformed array literal errors by simply including a double
    quote of their own. It's not clear where the responsibility should
    rest for escaping constants/ensuring that constants don't contain
    double quotes. Can someone suggest a better approach? I can't very
    well use single quotes, because they are escaped/doubled up when we
    pass the array literal to something similar to PQexecPrepared(), and
    they shouldn't be - strings end up looking like this: "'has errant
    single quotes on either side'".
    You can send nested arrays safely. You just have to be very formal
    about escaping *everything* both as you get it and as it goes into the
    container. This is what postgres does on the backend as it sends
    arrays out the door in text. It might be instructive to see what the
    server does in terms of escaping. Note that the way this works it's
    not impossible to see 128+ consecutive backslashes when dealing with
    arrays of composites.
    Since Postgres only supports encodings that are ASCII supersets, I
    don't believe that I have to consider encoding - only my clients do.

    Can someone please point me in the direction of an established client
    library/driver where all corner cases are covered, or at least enough
    of them to produce a net gain in usefulness? There may well be
    additional subtleties that have not occurred to me.
    yes: libpqtypes. it manages everything in binary. i've been thinking
    for a while that libpqtypes could be wrapped with variadic templates
    or other c++ trickery. Because libpqtypes does everything in binary,
    it completely sidesteps all the escaping nastiness.

    merlin
  • Kenneth Marshall at Feb 23, 2011 at 3:40 pm

    On Wed, Feb 23, 2011 at 09:34:06AM -0600, Merlin Moncure wrote:
    On Tue, Feb 22, 2011 at 10:16 PM, Peter Geoghegan
    wrote:
    I'm investigating the possibility of developing a utility function for
    our C++ client library, libpqxx, that produces array literals that can
    be used in prepared statements. This problem appears to be a bit of a
    tar pit, so I'm hoping that someone can help me out. My goal is to
    produce a template function that accepts arbitrarily nested standard
    library containers, that contain at the most nested level
    constants/literals of some type that can be fed into a stream, such as
    an int or a std::string.

    I'm aware that I cannot assume that types are delimited by a single
    quote, even for built-in types. I thought that I would put the onus on
    the client to specify the correct delimiter, by checking pg_type
    themselves if necessary, but default to ',' . Is this a reasonable
    approach?

    Escaping/quoting individual elements seems tricky. I have produced a
    generic and superficially well behaved implementation by using double
    quotes for constants. However, I have now opened the door to malicious
    parties injecting multiple array elements where only one is allowed,
    or causing malformed array literal errors by simply including a double
    quote of their own. It's not clear where the responsibility should
    rest for escaping constants/ensuring that constants don't contain
    double quotes. Can someone suggest a better approach? I can't very
    well use single quotes, because they are escaped/doubled up when we
    pass the array literal to something similar to PQexecPrepared(), and
    they shouldn't be - strings end up looking like this: "'has errant
    single quotes on either side'".
    You can send nested arrays safely. You just have to be very formal
    about escaping *everything* both as you get it and as it goes into the
    container. This is what postgres does on the backend as it sends
    arrays out the door in text. It might be instructive to see what the
    server does in terms of escaping. Note that the way this works it's
    not impossible to see 128+ consecutive backslashes when dealing with
    arrays of composites.
    Since Postgres only supports encodings that are ASCII supersets, I
    don't believe that I have to consider encoding - only my clients do.

    Can someone please point me in the direction of an established client
    library/driver where all corner cases are covered, or at least enough
    of them to produce a net gain in usefulness? There may well be
    additional subtleties that have not occurred to me.
    yes: libpqtypes. it manages everything in binary. i've been thinking
    for a while that libpqtypes could be wrapped with variadic templates
    or other c++ trickery. Because libpqtypes does everything in binary,
    it completely sidesteps all the escaping nastiness.

    merlin
    Avoiding the escaping by using binary parameter transmission is
    the best method. Shameless plug: libpqtypes is great!
    I hope that it can be eventually included in the core distribution.
    It is not uncommon to get "It's an add-on package???" and avoidance
    of pieces outside of the standard dist regardless of its value.

    Regards,
    Ken
  • Andrew Dunstan at Feb 23, 2011 at 5:19 pm

    On 02/23/2011 10:40 AM, Kenneth Marshall wrote:
    Can someone please point me in the direction of an established client
    library/driver where all corner cases are covered, or at least enough
    of them to produce a net gain in usefulness? There may well be
    additional subtleties that have not occurred to me.
    yes: libpqtypes. it manages everything in binary. i've been thinking
    for a while that libpqtypes could be wrapped with variadic templates
    or other c++ trickery. Because libpqtypes does everything in binary,
    it completely sidesteps all the escaping nastiness.
    Avoiding the escaping by using binary parameter transmission is
    the best method. Shameless plug: libpqtypes is great!
    I hope that it can be eventually included in the core distribution.
    It is not uncommon to get "It's an add-on package???" and avoidance
    of pieces outside of the standard dist regardless of its value.

    Binary mode had serious limitations, such as portability.

    We do need some support in libpq for constructing and deconstructing
    arrays (and probably for composites too, although that will be harder, I
    suspect).

    cheers

    andrew
  • Andrew Chernow at Feb 23, 2011 at 8:06 pm
    Binary mode had serious limitations, such as portability.
    What are the other limitations?

    As far as portability is concerned, we are using it on many different
    operating systems and architectures without issue. Even our most recent
    bump to 9.0.1 and 9.0.3 was flawless in regard to libpq/libpqtypes.
    We do need some support in libpq for constructing and deconstructing
    arrays (and probably for composites too, although that will be harder, I
    suspect).
    [sigh...]

    --
    Andrew Chernow
    eSilo, LLC
    global backup
    http://www.esilo.com/
  • Andrew Dunstan at Feb 23, 2011 at 8:50 pm

    On 02/23/2011 02:21 PM, Andrew Chernow wrote:
    Binary mode had serious limitations, such as portability.
    What are the other limitations?

    As far as portability is concerned, we are using it on many different
    operating systems and architectures without issue. Even our most
    recent bump to 9.0.1 and 9.0.3 was flawless in regard to
    libpq/libpqtypes.

    It's probably fine if you can control both ends. But there is no
    guarantee of portability, nor does it seem likely to me there ever will
    be, so I don't find your assertion terribly useful. The fact that it
    hasn't broken for you doesn't mean it can't or won't be.

    The other downside I see is that binary protocols are often a lot harder
    to debug, but maybe that's just me.

    cheers

    andrew
  • Andrew Chernow at Feb 23, 2011 at 9:19 pm


    It's probably fine if you can control both ends. But there is no
    guarantee of portability, nor does it seem likely to me there ever will
    be, so I don't find your assertion terribly useful. The fact that it
    hasn't broken for you doesn't mean it can't or won't be.
    All true. If you change the protocol, libpqtypes needs to be adjusted.
    I think that is a very fair statement. It already toggles on server
    version around a few changes in the past ... like the money data type.
    So far, since 8.1, the number of changes to the binary protocol has put
    me to sleep :)
    The other downside I see is that binary protocols are often a lot harder
    to debug, but maybe that's just me.
    Also very true. However, libpqtypes addresses this by abstracting the
    end user from the binary transformation or preparation. Instead, users
    are presented with a printf/scanf style interface. PQexecf(conn,
    "select %int4 + %int4", 4, 4) is pretty far removed from the underlying
    byte swapping, parallel array setup for PQexecParams and other
    nastiness. But yes, the maintainer of the library must deal with
    protocol changes and provide backward compatibility.

    --
    Andrew Chernow
    eSilo, LLC
    global backup
    http://www.esilo.com/
  • Peter Geoghegan at Feb 23, 2011 at 8:06 pm

    On 23 February 2011 15:34, Merlin Moncure wrote:
    You can send nested arrays safely.  You just have to be very formal
    about escaping *everything* both as you get it and as it goes into the
    container.  This is what postgres does on the backend as it sends
    arrays out the door in text.  It might be instructive to see what the
    server does in terms of escaping.  Note that the way this works it's
    not impossible to see 128+ consecutive backslashes when dealing with
    arrays of composites.
    Sounds tedious.
    yes: libpqtypes.  it manages everything in binary.  i've been thinking
    for a while that libpqtypes could be wrapped with variadic templates
    or other c++ trickery.  Because libpqtypes does everything in binary,
    it completely sidesteps all the escaping nastiness.
    The fact that libpqtypes does everything in binary mode is
    interesting, but doesn't really help me.

    Variadic template support is still quite patchy, and I don't think
    that it is of particular use here. My proof-of-concept implementation
    uses recursive template instantiation and type traits, and just uses
    C++98 features. I've attached it for your information. I might be able
    to use partial template specialisation to support regular arrays too.
    That hasn't been a priority, because C++ generally discourages their
    use, and because it's trickier. Arrays don't "know their own size",
    and I want to provide a uniform, simple interface. On the other hand,
    I've seen interesting things done with template specialisation on
    static integral values, such as the size of arrays on the stack, so
    perhaps it's possible to support arrays while having a uniform
    interface.

    To be clear: I don't want to take responsibility for correctly
    escaping the array literal. The user has a responsibility to use a
    prepared statement/explicit escaping to do that, just as they do with
    a regular text value, for example. There is no additional threat of a
    traditional SQL injection attack, because we cannot break out of the
    array literal itself. However, within the array literal, it is
    currently possible to break out of a constant/value literal using a
    double quote, to perhaps inject additional values (more than
    intended), or to cause malformed array literal errors. Sure, I could
    write my own function to escape the constant which is wary of double
    quotes, but that would have many of the same challenges as writing a
    general purpose drop-in replacement for PQescapeStringConn(). It might
    be just as misguided.

    --
    Regards,
    Peter Geoghegan
  • Andrew Chernow at Feb 23, 2011 at 8:34 pm

    On 2/23/2011 3:06 PM, Peter Geoghegan wrote:
    On 23 February 2011 15:34, Merlin Moncurewrote:
    You can send nested arrays safely. You just have to be very formal
    about escaping *everything* both as you get it and as it goes into the
    container. This is what postgres does on the backend as it sends
    arrays out the door in text. It might be instructive to see what the
    server does in terms of escaping. Note that the way this works it's
    not impossible to see 128+ consecutive backslashes when dealing with
    arrays of composites.
    Sounds tedious.
    It is tedious, which is one reason why libpqtypes went binary. There
    are some compelling performance reasons as well that affect both client
    and server.

    libpqtypes was originally developed to serve a very particular need and
    wasn't aiming to be general purpose. That came about along the way
    trying to solve the problem. Personally, PQexec is dead to me as well
    as text results from a C/C++ app. I see no advantage over libpqtypes in
    that context.

    Unless I am missing your ultimate goal, you'd probably get what you want
    by wrapping libpqtypes.

    --
    Andrew Chernow
    eSilo, LLC
    global backup
    http://www.esilo.com/
  • Kenneth Marshall at Feb 23, 2011 at 9:02 pm

    On Wed, Feb 23, 2011 at 03:34:45PM -0500, Andrew Chernow wrote:
    On 2/23/2011 3:06 PM, Peter Geoghegan wrote:
    On 23 February 2011 15:34, Merlin Moncurewrote:
    You can send nested arrays safely. You just have to be very formal
    about escaping *everything* both as you get it and as it goes into the
    container. This is what postgres does on the backend as it sends
    arrays out the door in text. It might be instructive to see what the
    server does in terms of escaping. Note that the way this works it's
    not impossible to see 128+ consecutive backslashes when dealing with
    arrays of composites.
    Sounds tedious.
    It is tedious, which is one reason why libpqtypes went binary. There are
    some compelling performance reasons as well that affect both client and
    server.

    libpqtypes was originally developed to serve a very particular need and
    wasn't aiming to be general purpose. That came about along the way trying
    to solve the problem. Personally, PQexec is dead to me as well as text
    results from a C/C++ app. I see no advantage over libpqtypes in that
    context.

    Unless I am missing your ultimate goal, you'd probably get what you want by
    wrapping libpqtypes.
    The performance is one of the big reasons to use binary parameters.
    Converting/packing/transmitting/unpacking/converting use a lot of
    CPU resources on both the server and the client in addition to
    the larger communication resources needed by the text-based methods.

    Ken

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-hackers @
categoriespostgresql
postedFeb 23, '11 at 4:17a
activeFeb 23, '11 at 9:19p
posts18
users9
websitepostgresql.org...
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase