Hi,

We are a software publisher searching for a new DBMS for our software. We have more than one hundred installed servers, running Mac OS and a Primebase database.

We have spend some time evaluating PostgreSQL and we can't get correct outputs with the ORDER BY command.
LC_COLLATE and LC_CTYPE are set to fr_FR.UTF-8.

It seems there is a known problem with the collating order of text including diacritics with the UTF8 encodings on BSD systems.

Does anyone know a workaround ?

Thanks for your answers,

Martin

Search Discussions

  • Maximilian Tyrtania at Jan 13, 2010 at 10:21 am

    Am 12.01.2010 um 12:36 schrieb Martin Flahault:

    We have spend some time evaluating PostgreSQL and we can't get correct outputs with the ORDER BY command.
    LC_COLLATE and LC_CTYPE are set to fr_FR.UTF-8.

    It seems there is a known problem with the collating order of text including diacritics with the UTF8 encodings on BSD systems.

    Does anyone know a workaround ?
    The best i've seen so far is:

    CREATE OR REPLACE FUNCTION f_getorderbyfriendlyversion(texttoconvert text)
    RETURNS text AS
    $BODY$
    select translate(upper($1),'ÄÖÜ','AOU')--add french diacritical characters here
    $BODY$
    LANGUAGE 'sql' IMMUTABLE STRICT
    COST 100;
    ALTER FUNCTION f_getorderbyfriendlyversion(text) OWNER TO postgres;

    CREATE INDEX idx_mytable_myfield_orderbyfriendly
    ON mytable
    USING btree
    (f_getorderbyfriendlyversion(myfield::text));

    Select * from mytable order by f_getorderbyfriendlyversion(myfield);

    Not an ideal solution, but it seems to perform quite well.

    Best wishes from Berlin,

    Maximilian Tyrtania
  • Craig Ringer at Jan 13, 2010 at 11:06 am

    On 12/01/2010 7:36 PM, Martin Flahault wrote:
    Hi,

    We are a software publisher searching for a new DBMS for our software.
    We have more than one hundred installed servers, running Mac OS and a
    Primebase database.

    We have spend some time evaluating PostgreSQL and we can't get correct
    outputs with the ORDER BY command.
    Can you provide a sample? Include a table of sample values, an example
    query, its output, and what you'd expect to get instead? And why?

    --
    Craig Ringer
  • Martin Flahault at Jan 13, 2010 at 3:15 pm
    Here is an exemple :

    postgres=# create database newbase;
    CREATE DATABASE
    postgres=# \c newbase;
    psql (8.4.2)
    You are now connected to database "newbase".
    newbase=# create table t1 (contenu text);
    CREATE TABLE
    newbase=# insert into t1 values ('a'), ('e'), ('à'), ('é'), ('A'), ('E');
    INSERT 0 6

    newbase=# select * from t1 order by contenu;
    contenu
    ---------
    A
    E
    a
    e
    à
    é
    (6 rows)

    newbase=# select * from t1 order by upper(contenu);
    contenu
    ---------
    a
    A
    e
    E
    à
    é
    (6 rows)


    Here is the encoding informations :

    newbase=# \encoding
    UTF8
    newbase=# show lc_collate;
    lc_collate
    ------------
    fr_FR
    (1 row)

    newbase=# show lc_ctype;
    lc_ctype
    ----------
    fr_FR
    (1 row)


    As with others DBMS (MySQL for example), diacritics should be ignored when determining the sort order. Here is the expected output:
    a
    à
    A
    e
    é
    E


    It seems there is a problem with the collating order on BSD systems with diacritics using UTF8.
    If you put this text :
    a
    A
    à
    é
    e
    E

    in a UTF8 text file and use the "sort" command on it, you will have the same wrong output as with PostgreSQL :
    A
    E
    a
    e
    à
    é

    Hope this will help,

    Martin
  • Martijn van Oosterhout at Jan 13, 2010 at 10:02 pm
    On Wed, Jan 13, 2010 at 04:15:06PM +0100, Martin Flahault wrote:

    [postgres]
    newbase=# select * from t1 order by contenu;
    contenu
    ---------
    A
    E
    a
    e
    Postgresql outputs whatever the C library does on the underlying
    system. The quality of this varies wildly.
    à
    As with others DBMS (MySQL for example), diacritics should be ignored when determining the sort order. Here is the expected output:
    MySQL implements the unicode collation algorithm, which means it
    essentially does what you want.
    It seems there is a problem with the collating order on BSD systems with diacritics using UTF8.
    Last I checked, BSD did not support useful sorting on UTF-8 at all, so
    it's not surprised it doesn't work.
    in a UTF8 text file and use the "sort" command on it, you will have the same wrong output as with PostgreSQL :
    Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
    for programs that would like true unicode collation, but there is
    little chance that postgresql will ever use this.

    Hope this helps,
    --
    Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
    Please line up in a tree and maintain the heap invariant while
    boarding. Thank you for flying nlogn airlines.
  • Craig Ringer at Jan 14, 2010 at 5:32 am

    Martijn van Oosterhout wrote:

    in a UTF8 text file and use the "sort" command on it, you will have the same wrong output as with PostgreSQL :
    Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
    for programs that would like true unicode collation, but there is
    little chance that postgresql will ever use this.
    Out of interest: Why not?

    Using ICU would permit Pg to be independent of libc's collation rules,
    finally permitting things like specifying a specific collation for a
    textual sort. It'd make mixing data from different locales in a database
    a lot easier (read: possible to do correctly).

    Is this just a matter of "nobody cares enough to produce a solid, tested
    patch with equivalent performance that doesn't turn people who try to
    review it green with disgust" ... or are there specific reasons why
    using something like ICU instead of libc's locale support is not
    appropriate for Pg?

    --
    Craig Ringer
  • Tom Lane at Jan 14, 2010 at 6:02 am

    Craig Ringer writes:
    Martijn van Oosterhout wrote:
    Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
    for programs that would like true unicode collation, but there is
    little chance that postgresql will ever use this.
    Out of interest: Why not?
    There's plenty of discussion in the archives about it, but basically
    ICU would represent a pretty enormous dependency and would lock us in
    to having no other backend encoding but UTF8.

    The state of OS X's POSIX-spec locale support is pretty pitiful, but on
    the whole I'd say if you need better UTF8 locale support you could use
    another OS.

    regards, tom lane
  • Craig Ringer at Jan 14, 2010 at 6:15 am

    Tom Lane wrote:
    Craig Ringer <craig@postnewspapers.com.au> writes:
    Martijn van Oosterhout wrote:
    Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
    for programs that would like true unicode collation, but there is
    little chance that postgresql will ever use this.
    Out of interest: Why not?
    There's plenty of discussion in the archives about it, but basically
    ICU would represent a pretty enormous dependency and would lock us in
    to having no other backend encoding but UTF8.
    Thanks. You're right - I should've just STFA ;-) so I appreciate the
    brief explanation.
    The state of OS X's POSIX-spec locale support is pretty pitiful, but on
    the whole I'd say if you need better UTF8 locale support you could use
    another OS.
    That's my personal opinion too ... I have the "pleasure" of
    administrating an OS X Server and six Mac Pro clients at work, and have
    become well acquainted with the exciting variety of bugs, undocumented
    "features", and bizarre quirks of that particular OS. POSIX locale
    issues are the least of its issues.

    Alas, people will want to run Pg on it anyway, especially when bundling
    with an app. It'd be nice if it could be made to work smoothly ... but I
    certainly don't care enough to try! Like you, I favour using an OS that
    follows the specs it claims to support instead.

    Perhaps someone who wants to use Mac OS X and Pg for their product will
    come forward with some compat wrapper functions for the localizable
    libc/posix functions, so Pg can just be built against the wrapper and
    the rest of us need not care about OS X's bugs.

    --
    Craig Ringer
  • Scott Marlowe at Jan 14, 2010 at 6:21 am

    On Wed, Jan 13, 2010 at 11:15 PM, Craig Ringer wrote:
    Perhaps someone who wants to use Mac OS X and Pg for their product will
    come forward with some compat wrapper functions for the localizable
    libc/posix functions, so Pg can just be built against the wrapper and
    the rest of us need not care about OS X's bugs.
    I know this sounds crazy, but couldn't Apple be bothered to fix their OS? :)
  • Craig Ringer at Jan 14, 2010 at 6:26 am

    Scott Marlowe wrote:
    On Wed, Jan 13, 2010 at 11:15 PM, Craig Ringer
    wrote:
    Perhaps someone who wants to use Mac OS X and Pg for their product will
    come forward with some compat wrapper functions for the localizable
    libc/posix functions, so Pg can just be built against the wrapper and
    the rest of us need not care about OS X's bugs.
    I know this sounds crazy, but couldn't Apple be bothered to fix their OS? :)
    I see you don't use Macs much :-P

    ( Goes back to researching a way to make network automounts on login
    work and properly support search in 10.6, since the only way that used
    to work in 10.4 is broken by 10.6 )

    --
    Craig Ringer
  • Tom Lane at Jan 14, 2010 at 6:25 am

    Craig Ringer writes:
    Tom Lane wrote:
    The state of OS X's POSIX-spec locale support is pretty pitiful, but on
    the whole I'd say if you need better UTF8 locale support you could use
    another OS.
    Alas, people will want to run Pg on it anyway, especially when bundling
    with an app. It'd be nice if it could be made to work smoothly ... but I
    certainly don't care enough to try! Like you, I favour using an OS that
    follows the specs it claims to support instead.
    For the record, I *like* OS X; I'm typing this on a Macbook Pro. But
    it doesn't do everything, and one of the things it doesn't do well is
    POSIX-spec locale support.

    As you now know from having looked at the archives, there've been many
    discussions of changing PG to not rely on the platform's locale
    support. But the bang-for-buck ratio of any such change doesn't seem
    very attractive --- we have a lot of higher priority things to spend
    our finite development manpower on.

    regards, tom lane
  • Greg Stark at Jan 16, 2010 at 9:11 pm

    On Thu, Jan 14, 2010 at 6:15 AM, Craig Ringer wrote:
    Out of interest: Why not?
    There's plenty of discussion in the archives about it, but basically
    ICU would represent a pretty enormous dependency and would lock us in
    to having no other backend encoding but UTF8.
    Thanks. You're right - I should've just STFA ;-) so I appreciate the
    brief explanation.
    There's also the question of whether being inconsistent with the rest
    of the system is really the right thing at all. If a programmer does a
    sort in the database and then writes application code using the same
    collation on the same system which depends on the data being sorted
    it's nice that that works. Or if an admin presorts the data using sort
    before doing a data load he might reasonable expect the table to be
    effectively clustered.

    Switching to ICU means trading our current inconsistency from platform
    to platform for a different inconsistency which would be better in
    some cases and worse in others.


    --
    greg
  • Martijn van Oosterhout at Jan 18, 2010 at 7:10 am

    On Sat, Jan 16, 2010 at 09:10:53PM +0000, Greg Stark wrote:
    Switching to ICU means trading our current inconsistency from platform
    to platform for a different inconsistency which would be better in
    some cases and worse in others.
    Or, you can have the cake and eat it too. That is, aim for the end goal
    and let people choose what library they want to use for sorting (that
    is, extend the meaning of the locale identifier). Patches for this
    should be in the archives somewhere. As I recall the reason this was
    rejected is that *BSD lack the capability of handling multiple
    collation algorithms at all at the libc level (that is, if you don't
    just tell people to use ICU in that case).

    Mac OS X doesn't have great POSIX locale support but at least they
    implemented strcoll_l.

    Have a nice day,
    --
    Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
    Please line up in a tree and maintain the heap invariant while
    boarding. Thank you for flying nlogn airlines.
  • Martin Flahault at Jan 18, 2010 at 3:34 pm
    Very interesting discussion indeed.

    It seems that "Postgresql:The world's most advanced open source database" can not work properly on "Mac OS X: the world's most advanced operating system" and FreeBSD.

    Don't you think postgresql.org should remove from their download page the links to FreeBSD and Mac OS X binary packages?

    Martin Flahault



    Le 18 janv. 2010 à 08:10, Martijn van Oosterhout a écrit :
    On Sat, Jan 16, 2010 at 09:10:53PM +0000, Greg Stark wrote:
    Switching to ICU means trading our current inconsistency from platform
    to platform for a different inconsistency which would be better in
    some cases and worse in others.
    Or, you can have the cake and eat it too. That is, aim for the end goal
    and let people choose what library they want to use for sorting (that
    is, extend the meaning of the locale identifier). Patches for this
    should be in the archives somewhere. As I recall the reason this was
    rejected is that *BSD lack the capability of handling multiple
    collation algorithms at all at the libc level (that is, if you don't
    just tell people to use ICU in that case).

    Mac OS X doesn't have great POSIX locale support but at least they
    implemented strcoll_l.

    Have a nice day,
    --
    Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
    Please line up in a tree and maintain the heap invariant while
    boarding. Thank you for flying nlogn airlines.
  • Craig Ringer at Jan 14, 2010 at 2:42 am

    On 13/01/2010 11:15 PM, Martin Flahault wrote:

    It seems there is a problem with the collating order on BSD systems with
    diacritics using UTF8.
    If you put this text :
    a
    A
    à
    é
    e
    E

    in a UTF8 text file and use the "sort" command on it, you will have the
    same wrong output as with PostgreSQL :
    A
    E
    a
    e
    à
    é
    First: PostgreSQL expects the OS to behave correctly and sort according
    to the locale. It relies on the C library for this. If the C library
    doesn't do it right, PostgreSQL won't do it right either. So you need to
    get Mac OS X to do the right thing.

    Your results match what I get on a Linux system without a properly
    generated fr_FR.UTF-8 locale. Libc falls back on the "C" locale, which
    sorts that way.

    If I generate the fr_FR.UTF-8 locale and run the sort (on the file "x"),
    I get the desired result:

    LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x
    a
    A
    à
    e
    E
    é

    I don't know Mac OS X well, but this is making me wonder if maybe you're
    just missing the required information for the locale, so libc is falling
    back on the "C" locale.

    (Of course, being Mac OS X there are probably at least three out of date
    or simply false "man" pages describing the behaviour, none of which
    reflect the reality of a magic config key buried somewhere in NetInfo,
    for which the documentation is also completely out of date. Bitter? Me?
    Yeah, I admin a bunch of OS X machines on a business network.)

    Hmm... a quick test suggests that Mac OS X (testing on 10.4) at least
    *thinks* it supports the fr_FR.UTF-8 locale:

    osx104$ LANG=xxx LC_ALL=xxx locale
    LANG="xxx"
    LC_COLLATE="C"
    LC_CTYPE="C"
    LC_MESSAGES="C"
    LC_MONETARY="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_ALL="C"

    osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 locale
    LANG="fr_FR.UTF-8"
    LC_COLLATE="fr_FR.UTF-8"
    LC_CTYPE="fr_FR.UTF-8"
    LC_MESSAGES="fr_FR.UTF-8"
    LC_MONETARY="fr_FR.UTF-8"
    LC_NUMERIC="fr_FR.UTF-8"
    LC_TIME="fr_FR.UTF-8"
    LC_ALL="fr_FR.UTF-8"

    osx104$ locale -a | grep fr_FR
    fr_FR
    fr_FR.ISO8859-1
    fr_FR.ISO8859-15
    fr_FR.UTF-8

    ... yet it clearly doesn't:

    osx104$ LANG=C LC_ALL=C sort x
    A
    E
    a
    e
    à
    é
    osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x
    A
    E
    a
    e
    à
    é
    osx104$ LANG=fr_FR.ISO8859-1 LC_ALL=fr_FR.ISO8859-1 sort x
    A
    E
    a
    e
    à
    é

    Mac OS X seems to keep its locale config in /usr/share/locale . Looking
    there, there are clearly LC_COLLATE files for fr_FR.UTF-8 . However,
    they're identical to those for en_US.UTF-8:

    osx104$ cd /usr/share/locale
    osx104$ diff fr_FR.UTF-8/LC_COLLATE en_US.UTF-8/LC_COLLATE

    ... so your OS's localized collation support is broken/missing, at least
    if the same is true for more modern versions of OS X.

    --
    Craig Ringer

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-general @
categoriespostgresql
postedJan 12, '10 at 11:46a
activeJan 18, '10 at 3:34p
posts15
users7
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2022 Grokbase