Full_Name: Jeffrey Sullivan
Version: 2.10
OS: Mac
Submission from: (NULL) (130.154.0.250)


Sort produces different results when sorting strings with non-alphanumeric
characters, depending on the operating system:

RHEL 5.2, R 2.10.0
-------------
v <- c("1","<0",">3","2")
Sys.setlocale("LC_COLLATE","en_US.UTF-8")
[1] "en_US.UTF-8"
sort(v)
[1] "<0" "1" "2" ">3"

Mac OS X 10.5.8, R 2.10.1
-------------------
v <- c("1","<0",">3","2")
Sys.setlocale("LC_COLLATE","en_US.UTF-8")
[1] "en_US.UTF-8"
sort(v)
[1] "<0" ">3" "1" "2"
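
When the same order is needed on every platform, one option is to bypass locale collation altogether. The sketch below is not part of the original report; it assumes R 3.3.0 or later, where sort() accepts method = "radix" and collates character vectors as in the C locale.

```r
v <- c("1", "<0", ">3", "2")

# method = "radix" compares strings byte by byte (C-locale collation),
# regardless of the current LC_COLLATE setting.
portable <- sort(v, method = "radix")
print(portable)
# order: "1" "2" "<0" ">3"  (digits precede "<" and ">" in ASCII)
```

Calling Sys.setlocale("LC_COLLATE", "C") before sorting has a similar effect for the whole session, at the cost of losing language-aware ordering for letters.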


  • Prof Brian Ripley at Dec 22, 2009 at 12:18 pm
    As the help says

    The sort order for character vectors will depend on the collating
    sequence of the locale in use: see 'Comparison'.

    and that ref says

    Collation of
    non-letters (spaces, punctuation signs, hyphens, fractions and so
    on) is even more problematic.

    That different OSes use the same name for a locale does not make them
    the same locale.

    Note that R can be compiled to use ICU, which provides a
    well-considered collation suite. R on Mac OS X uses ICU, as does a
    Linux build if it is available -- so I would say that it is RHEL that
    is out of line here (it makes little sense to have < and > far apart
    in the collation sequence).

    Why did you report a documented difference as a bug?
    On Mon, 21 Dec 2009, jeffreys at rand.org wrote:


    ______________________________________________
    R-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-devel
    --
    Brian D. Ripley, ripley at stats.ox.ac.uk
    Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
    University of Oxford, Tel: +44 1865 272861 (self)
    1 South Parks Road, +44 1865 272866 (PA)
    Oxford OX1 3TG, UK Fax: +44 1865 272595
  • Peter Dalgaard at Dec 22, 2009 at 12:59 pm

    Prof Brian Ripley wrote:
    > That different OSes use the same name for a locale does not make them
    > the same locale.
    >
    > Note that R can be compiled to use ICU, which provides a well-considered
    > collation suite. R on Mac OS X uses ICU, as does a Linux build if it is
    > available -- so I would say that it is RHEL that is out of line here (it
    > makes little sense to have < and > far apart in the collation sequence).

    That's not it:
    v <- c("1","<0","<3","2")
    sort(v)
    [1] "<0" "1" "2" "<3"

    The point is rather that "special characters" are ignored during collation.

    Apparently, this comes from /usr/share/i18n/locales/iso14651_t1_common
    on Fedora; I wouldn't know how faithful to the ISO standard that is.

    --
    O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
    c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
    (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
    ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
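
The "special characters are ignored" reading can be checked against both examples in the thread without relying on any particular platform. The sketch below is only an emulation under that assumption: it strips everything except ASCII letters and digits to build a comparison key and orders by that key. sort_ignoring_punct is a hypothetical helper, not something glibc or R provides, and order(method = "radix") assumes R 3.3.0 or later.

```r
# Hypothetical helper: order strings by a key from which all characters
# other than ASCII letters and digits have been removed, emulating a
# collation that ignores punctuation at the first comparison level.
sort_ignoring_punct <- function(x) {
  key <- gsub("[^0-9A-Za-z]", "", x)
  # method = "radix" compares the keys bytewise (C locale) and keeps
  # ties in their original order, so the result is platform-independent.
  x[order(key, method = "radix")]
}

res1 <- sort_ignoring_punct(c("1", "<0", ">3", "2"))  # "<0" "1" "2" ">3"
res2 <- sort_ignoring_punct(c("1", "<0", "<3", "2"))  # "<0" "1" "2" "<3"
print(res1)
print(res2)
```

Both results match the orderings reported for the Linux systems above, which is consistent with the explanation given, though it does not show how glibc actually implements it.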
  • Prof Brian Ripley at Dec 22, 2009 at 1:37 pm

    On Tue, 22 Dec 2009, Peter Dalgaard wrote:

    > Prof Brian Ripley wrote:
    >> That different OSes use the same name for a locale does not make them the
    >> same locale.
    >>
    >> Note that R can be compiled to use ICU, which provides a well-considered
    >> collation suite. R on Mac OS X uses ICU, as does a Linux build if it is
    >> available -- so I would say that it is RHEL that is out of line here (it
    >> makes little sense to have < and > far apart in the collation sequence).
    >
    > That's not it:
    > v <- c("1","<0","<3","2")
    > sort(v)
    > [1] "<0" "1" "2" "<3"
    >
    > The point is rather that "special characters" are ignored during collation.

    Sometimes ....

    > Apparently, this comes from /usr/share/i18n/locales/iso14651_t1_common on
    > Fedora; I wouldn't know how faithful to the ISO standard that is.

    ISO 14651 is a version of the Unicode Collation Algorithm
    (http://www.unicode.org/reports/tr10/) which ICU uses. So other
    people have implemented the same set of rules to give different
    results -- which is quite possible given the number of non-prescribed
    choices that need to be made.

    We've seen too many anomalies from glibc to trust it: which is why ICU
    is used if available.
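
Whether a given R build was compiled with ICU can be checked from within R. This is a minimal sketch assuming a reasonably recent R in which capabilities() reports an "ICU" entry; on versions without that entry the check simply yields FALSE.

```r
# TRUE if this build of R can use ICU for collation.
# unname() and isTRUE() also cope with a missing "ICU" entry.
has_icu <- isTRUE(unname(capabilities("ICU")))

# ICU is not used while collation is set to the C or POSIX locale,
# so the current LC_COLLATE value matters as well.
collate <- Sys.getlocale("LC_COLLATE")

cat("ICU available:", has_icu, "\n")
cat("LC_COLLATE:   ", collate, "\n")
```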

  • Jeffrey M Sullivan at Dec 22, 2009 at 7:55 pm

    On Dec 22, 2009, at 4:18 AM, Prof Brian Ripley wrote:

    > Why did you report a documented difference as a bug?

    Because it wasn't clear to me from the documentation what sort of
    "problematic" behaviors were covered as documented differences vs
    unexpected behavior. Other OSS projects I have been involved with have
    a "when in doubt, file a bug" policy. If that isn't the case with R, I
    won't do so in the future.

    Thank you for the pointer towards ICU. RHEL has some of the ICU
    libraries, but the icuSetCollate function returns a warning that R was
    not built with them. Including a reference to this function in the
    "See Also" for Comparison would make this info a little easier to find.

    Thanks for your time,
    Jeff
    --
    Jeffrey Sullivan
    Senior Project Associate
    RAND Corporation

    Work : (310) 393-0411 x6883
    Fax : (310) 260-8147
    SIPR : jeffreys at sm.rand.pentagon.smil.mil
    JWICS: sullivanj at la.ic.gov
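
The icuSetCollate function mentioned in the last message can be guarded so that it is only called when the build supports it. The following is a rough sketch, assuming capabilities("ICU") is available and that "en_US" is the desired collation locale; the resulting order is not shown because it depends on the build and the ICU version.

```r
v <- c("1", "<0", ">3", "2")

# Only ask for ICU collation rules when this build of R supports ICU;
# otherwise sort() uses the platform's own collation.
use_icu <- isTRUE(unname(capabilities("ICU")))
if (use_icu) {
  # Has no effect while LC_COLLATE is "C" or "POSIX".
  icuSetCollate(locale = "en_US")
}

sorted <- sort(v)
print(sorted)
```

Guarding the call avoids the "not built with ICU" warning described above and makes the dependence on the build explicit.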

