Following up on my previous patch [0], here is a fairly complete
implementation of this feature. The general description and
implementation outline of the previous message still apply. This patch
contains documentation and regression tests, which can serve as further
explanations.

As this patch touches pretty much everything in the system, there are
probably countless bugs and bogosities, some of which I have marked with
FIXME, TODO, etc. But all the functionality is basically there, so it's
time someone else gives this a serious examination.

Note: As previously, regression tests only work with "make check
MULTIBYTE=UTF8" and the feature overall only works on Linux/glibc.

[0]
http://archives.postgresql.org/message-id/1279045531.32647.14.camel@vanquo.pezone.net

  • Itagaki Takahiro at Sep 22, 2010 at 10:44 am

    On Thu, Sep 16, 2010 at 5:46 AM, Peter Eisentraut wrote:
    Following up on my previous patch [0], here is a fairly complete
    implementation of this feature.  The general description and
    implementation outline of the previous message still apply.  This patch
    contains documentation and regression tests, which can serve as further
    explanations.
    I tested the patch on a database with encoding=UTF8 and locale C.
    I have a couple of questions and comments.

    * CREATE TABLE (LIKE table_with_collation) doesn't inherit collations.
    We need to copy collations by default, or add INCLUDING COLLATE option.

    * upper() doesn't work if a column has a collation.
    It still works if a column doesn't have a collation.
    postgres=# \d tbl
            Table "public.tbl"
     Column | Type |     Modifiers
    --------+------+--------------------
     c      | text | collate C
     ja     | text | collate ja_JP.utf8

    postgres=# SELECT name, setting FROM pg_settings WHERE name IN
    ('lc_ctype', 'lc_collate');
        name    | setting
    ------------+---------
     lc_collate | C
     lc_ctype   | C
    (2 rows)

    postgres=# SELECT upper(c) FROM tbl;
    ERROR: invalid multibyte character for locale
    HINT: The server's LC_CTYPE locale is probably incompatible with the
    database encoding.
    postgres=# SELECT upper(ja) FROM tbl;
    ERROR: invalid multibyte character for locale
    HINT: The server's LC_CTYPE locale is probably incompatible with the
    database encoding.

    * Comparison of strings with different collations is forbidden,
    but assignment is allowed, right?

    postgres=# SELECT * FROM tbl WHERE c = ja;
    ERROR: collation mismatch between implicit collations "C" and "ja_JP.utf8"
    LINE 1: SELECT * FROM tbl WHERE c = ja;
    ^
    HINT: You can override the collation by applying the COLLATE clause
    to one or both expressions.
    postgres=# INSERT INTO tbl(c, ja) SELECT ja, c FROM tbl;
    INSERT 0 6

    * psql \d needs a separator between collate and not null modifiers.
    postgres=# ALTER TABLE tbl ALTER COLUMN c SET NOT NULL;
    ALTER TABLE
    postgres=# \d tbl
            Table "public.tbl"
     Column | Type |     Modifiers
    --------+------+--------------------
     c      | text | collate Cnot null   <= HERE
     ja     | text | collate ja_JP.utf8

    the feature overall only works on Linux/glibc.
    We could support it also on MSVC.
    http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
    http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l

    --
    Itagaki Takahiro
  • Peter Eisentraut at Sep 22, 2010 at 1:29 pm

    On ons, 2010-09-22 at 19:44 +0900, Itagaki Takahiro wrote:
    * CREATE TABLE (LIKE table_with_collation) doesn't inherit collations.
    We need to copy collations by default, or add INCLUDING COLLATE option.
    OK, should be easy to fix.
    * upper() doesn't work if a column has a collation.
    It still works if a column doesn't have a collation.
    I think what you are observing is the result of mixing C and non-C
    locales. Of course that should also be fixed, but it doesn't have much
    to do with what upper() does. Note that there is a regression test case
    for lower(), which works mostly the same way.
    * Comparison of strings with different collations is forbidden,
    but assignment is allowed, right?
    Correct.
    * psql \d needs a separator between collate and not null modifiers.
    OK.
    We could support it also on MSVC.
    http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
    http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l
    Great.
  • Itagaki Takahiro at Sep 24, 2010 at 12:32 am

    On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut wrote:
    We could support it also on MSVC.
    http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
    http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l
    Great.
    If we support both glibc and msvc, how do we handle CREATE TABLE DDLs
    in pg_dump? Since collation names depend on platforms, a backup dumped
    on UNIX cannot be reloaded on Windows. We might need to normalize
    locale names to generate a portable dump.

    --
    Itagaki Takahiro
  • Peter Eisentraut at Sep 24, 2010 at 5:58 am

    On fre, 2010-09-24 at 09:32 +0900, Itagaki Takahiro wrote:
    On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut wrote:
    We could support it also on MSVC.
    http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
    http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l
    Great.
    If we support both glibc and msvc, how do we handle CREATE TABLE DDLs
    in pg_dump? Since collation names depend on platforms, a backup dumped
    on UNIX cannot be reloaded on Windows. We might need to normalize
    locale names to generate a portable dump.
    It is not necessary that the SQL collation names are the same as the OS
    locale names. That is just a convenient way to initialize it. If you
    need to transport dumps, you can create your own SQL collation entry at
    the target that locally fits what you are trying to do (or perhaps
    rename the collation at the source).

    I don't think there is a universally applicable way to "normalize"
    locale names, because we want to support user-defined OS locales.
  • Robert Haas at Oct 14, 2010 at 10:32 am

    On Fri, Sep 24, 2010 at 1:57 AM, Peter Eisentraut wrote:
    On fre, 2010-09-24 at 09:32 +0900, Itagaki Takahiro wrote:
    On Wed, Sep 22, 2010 at 10:29 PM, Peter Eisentraut wrote:
    We could support it also on MSVC.
    http://msdn.microsoft.com/en-us/library/a7cwbx4t(v=VS.90).aspx -- _strcoll_l
    http://msdn.microsoft.com/en-us/library/45119yx3(v=VS.90).aspx -- _towupper_l
    Great.
    If we support both glibc and msvc, how do we handle CREATE TABLE DDLs
    in pg_dump? Since collation names depend on platforms, a backup dumped
    on UNIX cannot be reloaded on Windows. We might need to normalize
    locale names to generate a portable dump.
    It is not necessary that the SQL collation names are the same as the OS
    locale names.  That is just a convenient way to initialize it.  If you
    need to transport dumps, you can create your own SQL collation entry at
    the target that locally fits what you are trying to do (or perhaps
    rename the collation at the source).

    I don't think there is a universally applicable way to "normalize"
    locale names, because we want to support user-defined OS locales.
    What's the status of this patch?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Peter Eisentraut at Oct 14, 2010 at 4:54 pm

    On ons, 2010-10-13 at 19:15 -0400, Robert Haas wrote:
    What's the status of this patch?
    I would appreciate some more review of the basic architecture.
  • Robert Haas at Oct 15, 2010 at 2:54 am

    On Thu, Oct 14, 2010 at 12:53 PM, Peter Eisentraut wrote:
    On ons, 2010-10-13 at 19:15 -0400, Robert Haas wrote:
    What's the status of this patch?
    I would appreciate some more review of the basic architecture.
    <reads patch>

    Wow, what a patch. On the whole, I think this looks pretty good. Of
    course, the obvious thing to be dismayed about is how many parts of
    the system this touches. To some extent, that's probably inevitable
    and maybe not that bad, but I wonder if there is some preparatory
    refactoring that could be done to trim it down a bit. I notice, for
    example, that a lot of places that looked at <asc/desc, nulls
    first/last> now look at <asc/desc, nulls first/last, collation>. In
    particular, all the pathkey stuff is like this. And similarly places
    that used to care about <type, typmod> now have to care about <type,
    typmod, collation>. There might be ways to restructure some of this
    code so that these things can be changed without having to touch quite
    so many places. If we're extending these lists from two items to
    three, do we need to worry about what happens when they grow to four
    or five or six? I particularly think this is an issue for the type
    information. We are still finding bugs where typmod isn't carried
    through properly; this kind of thing is only going to make it much
    worse. We need to encapsulate it in some future-proof way.

    It seems you've falsified the header comment in
    pathkeys_useful_for_merging(), although I guess it's already false
    because it doesn't seem to have been updated for the NULLS ASC/DESC
    stuff, and the interior comment in right_merge_direction() also needs
    adjusting. But this might be more than a documentation problem,
    because the choice of merge direction really *is* arbitrary in the
    case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's
    actually true for collation. If collation affects the definition of
    equality then it certainly isn't true.

    It looks like you've defined collations as objects that exist within
    particular namespaces, but there's no CREATE COLLATION statement, so I
    don't see what purpose this serves. I suppose we could leave that to
    be added later, but is there actually a use case for having collations
    in individual schemas, or should we treat them more like we do casts -
    i.e. as database-global objects?

    Why does the executor ever need to see collate clauses?

    In the department of minor nits, the use of the word "respectively" in
    the CREATE INDEX documentation doesn't make sense to me. The message
    about "has a collation conflict" is not self-explanatory.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Peter Eisentraut at Oct 21, 2010 at 6:44 pm

    On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
    and maybe not that bad, but I wonder if there is some preparatory
    refactoring that could be done to trim it down a bit. I notice, for
    example, that a lot of places that looked at <asc/desc, nulls
    first/last> now look at <asc/desc, nulls first/last, collation>. In
    particular, all the pathkey stuff is like this. And similarly places
    that used to care about <type, typmod> now have to care about <type,
    typmod, collation>. There might be ways to restructure some of this
    code so that these things can be changed without having to touch quite
    so many places.
    Yeah, I think that's what I'll try to do next.

    We already have TypeName as a structure that contains type and typmod
    (and collation, in my patch). We could make that a primnode instead of
    a parsenode, and use it in more places, or we could make a new leaner
    structure that only contains the numeric info.

    We could then, for example, change things like this:

    typedef struct Var
    {
        Expr        xpr;
        ...
        Oid         vartype;
        int32       vartypmod;
        ...
    }

    into this

    typedef struct Var
    {
        Expr        xpr;
        ...
        TypeName/TypeFoo vartype;
        ...
    }

    This would save boatloads of duplicate code.
    It looks like you've defined collations as objects that exist within
    particular namespaces, but there's no CREATE COLLATION statement, so I
    don't see what purpose this serves. I suppose we could leave that to
    be added later, but is there actually a use case for having collations
    in individual schemas, or should we treat them more like we do casts -
    i.e. as database-global objects?
    The SQL standard defines it that way, and there should be a CREATE
    COLLATION statement later. Application-specific collation sequences
    might not be unreasonable in the future.
    Why does the executor ever need to see collate clauses?
    Hmm, maybe not. I think it did in an earlier working draft.
  • Robert Haas at Oct 21, 2010 at 7:06 pm

    On Thu, Oct 21, 2010 at 2:44 PM, Peter Eisentraut wrote:
    On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
    and maybe not that bad, but I wonder if there is some preparatory
    refactoring that could be done to trim it down a bit.  I notice, for
    example, that a lot of places that looked at <asc/desc, nulls
    first/last> now look at <asc/desc, nulls first/last, collation>.  In
    particular, all the pathkey stuff is like this.  And similarly places
    that used to care about <type, typmod> now have to care about <type,
    typmod, collation>.  There might be ways to restructure some of this
    code so that these things can be changed without having to touch quite
    so many places.
    Yeah, I think that's what I'll try to do next.

    We already have TypeName as a structure that contains type and typmod
    (and collation, in my patch).  We could make that a primnode instead of
    a parsenode, and use it in more places, or we could make a new leaner
    structure that only contains the numeric info.

    We could then, for example, change things like this:

    typedef struct Var
    {
    Expr        xpr;
    ...
    Oid         vartype;
    int32       vartypmod;
    ...
    }

    into this

    typedef struct Var
    {
    Expr        xpr;
    ...
    TypeName/TypeFoo vartype;
    ...
    }

    This would save boatloads of duplicate code.
    I think that the idea of having a node that represents "a type in all
    its glory" is a very good one. I'm somewhat inclined not to reuse
    TypeName, because I think we'll end up wanting to use this in places
    where "names" and "location" are not available. In fact, judging by
    some of the logic in LookupTypeNames(), we have some cases like that
    already, which might be worth trying to clean up.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Oct 21, 2010 at 8:29 pm

    Peter Eisentraut writes:
    We already have TypeName as a structure that contains type and typmod
    (and collation, in my patch). We could make that a primnode instead of
    a parsenode, and use it in more places, or we could make a new leaner
    structure that only contains the numeric info.
    TypeName per se is completely inappropriate for use beyond the first
    stage of parsing, because it requires catalog lookups to make any sense
    of. I think the post-parsing representation should still start with a
    type OID. I can agree with replacing typmod with a struct, though.

    regards, tom lane
  • Robert Haas at Oct 21, 2010 at 8:31 pm

    On Thu, Oct 21, 2010 at 4:28 PM, Tom Lane wrote:
    Peter Eisentraut <peter_e@gmx.net> writes:
    We already have TypeName as a structure that contains type and typmod
    (and collation, in my patch).  We could make that a primnode instead of
    a parsenode, and use it in more places, or we could make a new leaner
    structure that only contains the numeric info.
    TypeName per se is completely inappropriate for use beyond the first
    stage of parsing, because it requires catalog lookups to make any sense
    of.  I think the post-parsing representation should still start with a
    type OID.  I can agree with replacing typmod with a struct, though.
    I think we should have both the type OID and the typmod in the struct.
    Carrying the type OID separately from the typmod has caused us enough
    heartache already. No?

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Tom Lane at Oct 21, 2010 at 8:39 pm

    Robert Haas writes:
    On Thu, Oct 21, 2010 at 4:28 PM, Tom Lane wrote:
    TypeName per se is completely inappropriate for use beyond the first
    stage of parsing, because it requires catalog lookups to make any sense
    of.  I think the post-parsing representation should still start with a
    type OID.  I can agree with replacing typmod with a struct, though.
    I think we should have both the type OID and the typmod in the struct.
    Carrying the type OID separately from the typmod has caused us enough
    heartache already. No?
    I think that that would probably involve a whole lot more notational
    busywork than just replacing typmod with something more complicated.
    However, we're talking about breaking vast amounts of code in either
    case, so maybe making it even vaster isn't a real consideration.

    regards, tom lane
  • Josh Berkus at Oct 21, 2010 at 8:44 pm

    I think that that would probably involve a whole lot more notational
    busywork than just replacing typmod with something more complicated.
    However, we're talking about breaking vast amounts of code in either
    case, so maybe making it even vaster isn't a real consideration.
    Gods, yes. Please let's not extend typmod any further without an overhaul.

    --
    -- Josh Berkus
    PostgreSQL Experts Inc.
    http://www.pgexperts.com
  • Peter Eisentraut at Nov 24, 2010 at 8:38 pm

    On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
    It seems you've falsified the header comment in
    pathkeys_useful_for_merging(), although I guess it's already false
    because it doesn't seem to have been updated for the NULLS ASC/DESC
    stuff, and the interior comment in right_merge_direction() also needs
    adjusting. But this might be more than a documentation problem,
    because the choice of merge direction really *is* arbitrary in the
    case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's
    actually true for collation. If collation affects the definition of
    equality then it certainly isn't true.
    I did check that again and didn't arrive at the conclusion that the
    comments would need updating either with respect to this patch or some
    previous change. Could you check again and possibly provide a
    suggestion?
  • Robert Haas at Nov 25, 2010 at 3:57 am

    On Wed, Nov 24, 2010 at 3:37 PM, Peter Eisentraut wrote:
    On tor, 2010-10-14 at 22:54 -0400, Robert Haas wrote:
    It seems you've falsified the header comment in
    pathkeys_useful_for_merging(), although I guess it's already false
    because it doesn't seem to have been updated for the NULLS ASC/DESC
    stuff, and the interior comment in right_merge_direction() also needs
    adjusting.  But this might be more than a documentation problem,
    because the choice of merge direction really *is* arbitrary in the
    case of ASC/DESC and NULLS FIRST/LAST, but I'm not sure whether that's
    actually true for collation.  If collation affects the definition of
    equality then it certainly isn't true.
    I did check that again and didn't arrive at the conclusion that the
    comments would need updating either with respect to this patch or some
    previous change.  Could you check again and possibly provide a
    suggestion?
    I think that you are right and that my previous comment was erroneous.
    Sorry for the noise.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Peter Eisentraut at Nov 24, 2010 at 8:42 pm

    On ons, 2010-09-22 at 19:44 +0900, Itagaki Takahiro wrote:
    * CREATE TABLE (LIKE table_with_collation) doesn't inherit collations.
    This was fixed in the CF2010-11 patch.
    * psql \d needs a separator between collate and not null modifiers.
    And this as well.
  • Pavel Stehule at Sep 23, 2010 at 8:13 am
    Hello

    I am playing with your patch now. I found a few issues:

    1. It doesn't work with SQL 92 rules for the sortby list. I could
    understand it if explicit COLLATE didn't work, but the implicit
    collation doesn't work either:

    CREATE TABLE foo(a text, b text COLLATE "cs_CZ.UTF8")

    SELECT * FROM foo ORDER BY 1 -- produce wrong order

    2. Why is the default encoding for a collation static? For Czech there
    are cs_CZ (latin2) and cs_CZ.iso88592, so any user with UTF8 has to
    write the encoding explicitly. But the more widely used and preferred
    encoding now is UTF8. I think cs_CZ in a UTF8 database should mean
    cs_CZ.UTF8.

    3. postgres=# select to_char(current_date,'tmday') collate "cs_CZ.utf8";
    to_char
    ──────────
    thursday -- bad result
    (1 row)

    4. Is there a TODO list somewhere for the collation implementation?

    5.

    postgres=# create table xy(a text, b text collate "cs_CZ");
    ERROR: collation "cs_CZ" for current database encoding "UTF8" does not exist

    Could there be some more friendly message or hint, like "you cannot
    use a different encoding"? This collation is in the pg_collation table.

    --
    The patch applied cleanly and works very well. Thank you.

    Regards

    Pavel Stehule


    2010/9/15 Peter Eisentraut <peter_e@gmx.net>:
    Following up on my previous patch [0], here is a fairly complete
    implementation of this feature.  The general description and
    implementation outline of the previous message still apply.  This patch
    contains documentation and regression tests, which can serve as further
    explanations.

    As this patch touches pretty much everything in the system, there are
    probably countless bugs and bogosities, some of which I have marked with
    FIXME, TODO, etc.  But all the functionality is basically there, so it's
    time someone else gives this a serious examination.

    Note: As previously, regression tests only work with "make check
    MULTIBYTE=UTF8" and the feature overall only works on Linux/glibc.

    [0]
    http://archives.postgresql.org/message-id/1279045531.32647.14.camel@vanquo.pezone.net


    --
    Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
    To make changes to your subscription:
    http://www.postgresql.org/mailpref/pgsql-hackers
  • Itagaki Takahiro at Sep 23, 2010 at 8:29 am

    On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule wrote:
    3. postgres=# select to_char(current_date,'tmday') collate "cs_CZ.utf8";
    to_char
    ──────────
    thursday -- bad result
    (1 row)
    COLLATE means "collation" rather than "locale", no?
    5.
    postgres=# create table xy(a text, b text collate "cs_CZ");
    ERROR:  collation "cs_CZ" for current database encoding "UTF8" does not exist
    Could there be some more friendly message or hint?
    I hope Postgres automatically detects the omitted encoding
    because it knows the database encoding is UTF8.

    --
    Itagaki Takahiro
  • Pavel Stehule at Sep 23, 2010 at 8:48 am

    2010/9/23 Itagaki Takahiro <itagaki.takahiro@gmail.com>:
    On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule wrote:
    3. postgres=# select to_char(current_date,'tmday') collate "cs_CZ.utf8";
    to_char
    ──────────
    thursday -- bad result
    (1 row)
    COLLATE means "collation" rather than "locale", no?
    ok.
    5.
    postgres=# create table xy(a text, b text collate "cs_CZ");
    ERROR:  collation "cs_CZ" for current database encoding "UTF8" does not exist
    Could there be some more friendly message or hint?
    I hope Postgres automatically detects the omitted encoding
    because it knows the database encoding is UTF8.
    I understand what this issue means, but I think it needs some detail or hint.

    Regards

    Pavel
    --
    Itagaki Takahiro
  • Peter Eisentraut at Sep 23, 2010 at 9:10 am

    On tor, 2010-09-23 at 17:29 +0900, Itagaki Takahiro wrote:
    On Thu, Sep 23, 2010 at 5:12 PM, Pavel Stehule wrote:
    5.
    postgres=# create table xy(a text, b text collate "cs_CZ");
    ERROR: collation "cs_CZ" for current database encoding "UTF8" does not exist
    Could there be some more friendly message or hint?
    I hope Postgres automatically detects the omitted encoding
    because it knows the database encoding is UTF8.
    I would rather not build too many expectations into this yet. The
    collation names are chosen by the user, the locale names are from the
    operating system. There is not necessarily a correspondence. The best
    fix is probably what I described earlier, populate the pg_collation
    table with the ".utf8" suffix stripped.
  • Peter Eisentraut at Sep 23, 2010 at 9:03 am

    On tor, 2010-09-23 at 10:12 +0200, Pavel Stehule wrote:
    1. It doesn't work with SQL 92 rules for the sortby list. I could
    understand it if explicit COLLATE didn't work, but the implicit
    collation doesn't work either:

    CREATE TABLE foo(a text, b text COLLATE "cs_CZ.UTF8")

    SELECT * FROM foo ORDER BY 1 -- produce wrong order
    I can't reproduce that. Please provide more details.
    2. Why is the default encoding for a collation static? For Czech there
    are cs_CZ (latin2) and cs_CZ.iso88592, so any user with UTF8 has to
    write the encoding explicitly. But the more widely used and preferred
    encoding now is UTF8. I think cs_CZ in a UTF8 database should mean
    cs_CZ.UTF8.
    That's tweakable. One idea I had is to strip the ".utf8" suffix from
    locale names when populating the pg_collation catalog, or create both
    versions. I agree that the current way is a bit cumbersome.
    3. postgres=# select to_char(current_date,'tmday') collate "cs_CZ.utf8";
    to_char
    ──────────
    thursday -- bad result
    (1 row)
    As was already pointed out, collation only covers lc_collate and
    lc_ctype. (It could cover other things, for example an application to
    the money type was briefly discussed, but that's outside the current
    mandate.)

    As a point of order, what you wrote above attaches a collation to the
    result of the function call. To get the collation to apply to the
    function call itself, you have to put the collate clause on one of the
    arguments, e.g.,

    select to_char(current_date,'tmday' collate "cs_CZ.utf8");
    4. Is there a TODO list somewhere for the collation implementation?
    At the moment it's mostly in the source code. I have a list of notes
    locally that I can clean up and put in the wiki once we agree on the
    general direction.
    5.

    postgres=# create table xy(a text, b text collate "cs_CZ");
    ERROR: collation "cs_CZ" for current database encoding "UTF8" does not exist

    Could there be some more friendly message or hint, like "you cannot
    use a different encoding"? This collation is in the pg_collation table.
    That can surely be polished.
  • Pavel Stehule at Sep 23, 2010 at 9:56 am

    2010/9/23 Peter Eisentraut <peter_e@gmx.net>:
    On tor, 2010-09-23 at 10:12 +0200, Pavel Stehule wrote:
    1. It doesn't work with SQL 92 rules for the sortby list. I could
    understand it if explicit COLLATE didn't work, but the implicit
    collation doesn't work either:

    CREATE TABLE foo(a text, b text COLLATE "cs_CZ.UTF8")

    SELECT * FROM foo ORDER BY 1 -- produce wrong order
    I can't reproduce that.  Please provide more details.
    sorry, it is ok - I was confused
    2. Why is the default encoding for a collation static? For Czech there
    are cs_CZ (latin2) and cs_CZ.iso88592, so any user with UTF8 has to
    write the encoding explicitly. But the more widely used and preferred
    encoding now is UTF8. I think cs_CZ in a UTF8 database should mean
    cs_CZ.UTF8.
    That's tweakable.  One idea I had is to strip the ".utf8" suffix from
    locale names when populating the pg_collation catalog, or create both
    versions.  I agree that the current way is a bit cumbersome.
    Yes, nowadays almost all databases are in UTF8.
    3. postgres=# select to_char(current_date,'tmday') collate "cs_CZ.utf8";
    to_char
    ──────────
    thursday -- bad result
    (1 row)
    As was already pointed out, collation only covers lc_collate and
    lc_ctype.  (It could cover other things, for example an application to
    the money type was briefly discussed, but that's outside the current
    mandate.)
    ok
    As a point of order, what you wrote above attaches a collation to the
    result of the function call.  To get the collation to apply to the
    function call itself, you have to put the collate clause on one of the
    arguments, e.g.,

    select to_char(current_date,'tmday' collate "cs_CZ.utf8");
    I was thinking collations could be used for this purpose too. I see
    some impacts: this syntax changes a stable function to immutable, and
    that cannot be simple to solve.
    4. Is there a TODO list somewhere for the collation implementation?
    At the moment it's mostly in the source code.  I have a list of notes
    locally that I can clean up and put in the wiki once we agree on the
    general direction.
    5.

    postgres=# create table xy(a text, b text collate "cs_CZ");
    ERROR:  collation "cs_CZ" for current database encoding "UTF8" does not exist

    Could there be some more friendly message or hint, like "you cannot
    use a different encoding"? This collation is in the pg_collation table.
    That can surely be polished.
    Regards

    Pavel Stehule
  • Peter Eisentraut at Sep 24, 2010 at 6:01 am

    On tor, 2010-09-23 at 11:55 +0200, Pavel Stehule wrote:
    select to_char(current_date,'tmday' collate "cs_CZ.utf8");
    I was thinking collations could be used for this purpose too. I see
    some impacts: this syntax changes a stable function to immutable, and
    that cannot be simple to solve.
    I don't understand how you come to that conclusion.
  • Pavel Stehule at Sep 24, 2010 at 6:40 am

    2010/9/24 Peter Eisentraut <peter_e@gmx.net>:
    On tor, 2010-09-23 at 11:55 +0200, Pavel Stehule wrote:
    select to_char(current_date,'tmday' collate "cs_CZ.utf8");
    I was thinking collations could be used for this purpose too. I see
    some impacts: this syntax changes a stable function to immutable, and
    that cannot be simple to solve.
    I don't understand how you come to that conclusion.
    Sorry, I was wrong; it makes sense for the date output function.
    to_char is immutable everywhere.

    Pavel
  • Pavel Stehule at Sep 26, 2010 at 12:16 pm
    Hello Peter

    Is there any reason why you prohibit different encodings within one
    database? People actually expect per-column collation to bring the
    possibility of storing two or more different encodings in one database.
    Without that possibility, only UTF8 is practical to work with, and for
    other encodings only the pairs (national locale + C). From my
    perspective as a Czech programmer, a very typical situation and request
    is mixing latin2 and latin1. I can live with the limit, but it is a
    very hard limit and should be documented.

    Regards

    Pavel

    2010/9/15 Peter Eisentraut <peter_e@gmx.net>:
    Following up on my previous patch [0], here is a fairly complete
    implementation of this feature.  The general description and
    implementation outline of the previous message still apply.  This patch
    contains documentation and regression tests, which can serve as further
    explanations.

    As this patch touches pretty much everything in the system, there are
    probably countless bugs and bogosities, some of which I have marked with
    FIXME, TODO, etc.  But all the functionality is basically there, so it's
    time someone else gives this a serious examination.

    Note: As previously, regression tests only work with "make check
    MULTIBYTE=UTF8" and the feature overall only works on Linux/glibc.

    [0]
    http://archives.postgresql.org/message-id/1279045531.32647.14.camel@vanquo.pezone.net


  • Greg Stark at Sep 26, 2010 at 1:37 pm

    On Sun, Sep 26, 2010 at 1:15 PM, Pavel Stehule wrote:
    Is there any reason why you prohibit different encodings within one
    database? People actually expect per-column collation to bring the
    possibility of storing two or more different encodings in one database.
    These are two completely separate problems that only look related. The
    main difference is that while collation is a property of the
    comparison or sort you're performing, encoding is actually a property
    of the string itself. It doesn't make sense to specify a different
    encoding than what the string actually contains.

    You could actually do what you want now by using bytea columns and
    convert_to/convert_from and it wouldn't be much easier if the support
    were built into text since you would still have to keep track of the
    encoding it's in and the encoding you want. We could have an
    encoded_text data type which includes both the encoding and the string
    and which any comparison function automatically handles conversion
    based on the encoding of the collation requested -- but I wouldn't
    want that to be the default text datatype. It would impose a lot of
    overhead on the basic text operations and magnify the effects of
    choosing the wrong collation.

    --
    greg
  • Andrew Dunstan at Sep 26, 2010 at 1:54 pm

    On 09/26/2010 09:37 AM, Greg Stark wrote:
    We could have an
    encoded_text data type which includes both the encoding and the string
    and which any comparison function automatically handles conversion
    based on the encoding of the collation requested -- but I wouldn't
    want that to be the default text datatype. It would impose a lot of
    overhead on the basic text operations and magnify the effects of
    choosing the wrong collation.
    Yeah, but it would be a nice gadget to have.

    cheers

    andrew

Discussion Overview
group: pgsql-hackers
categories: postgresql
posted: Sep 15, '10 at 8:46p
active: Nov 25, '10 at 3:57a
posts: 28
users: 8
website: postgresql.org...
irc: #postgresql
