Hi everyone!

I think I stumbled over a bug in the latest R 2.9.2 patched for OS X:
R version 2.9.2 Patched (2009-09-24 r49861)
i386-apple-darwin9.8.0

When I try to sort latin1-encoded character vectors, R sometimes
crashes with a segmentation fault. I'm running OS X 10.5.8 and have
observed this behaviour both with the i386 and x86_64 builds, in the
R.app GUI as well as on the command line.

Here's a minimal example that reliably triggers the crash on my machine:

=====
print(sessionInfo())

words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
str(words)

print(table(Encoding(words)))
Encoding(words) <- "latin1" # this is the correct encoding!
print(table(Encoding(words)))

N <- 1000
words <- rep(words, length.out=N)

print(N)
for (i in 1:N) {
  x <- words[1:i]
  # the following line will crash for some i, depending on the particular
  # strings in <words> and the subset selected for <x> above
  order(x)
}
=====

The output I get from this code is appended at the end of the mail.
Note that R incorrectly declares the latin1 strings in <words> to have
UTF-8 encoding (this seems wrong to me because the \x escapes insert
raw bytes into the string). The crash only occurs if the correct
"latin1" encoding (or "unknown") is explicitly specified. Otherwise
the string handling code appears to ignore everything after the first
invalid multibyte character.

I haven't been able to trigger the bug without some kind of loop. The
crash always occurs at the same iteration, but this changes depending
on the contents of <words> and the specific subset selected in each
loop iteration. Also note that the 64-bit version of R gives a
different error message. If I omit the unrelated statement
"print(N)", the 64-bit version segfaults and the 32-bit version just
hangs with high CPU load. All this suggests to me that there must be
some insidious memory corruption or stack/range overflow in the
internal ordering code.

Can other people reproduce this problem on different platforms and
possibly with different versions of R?


BTW, I ran into the crash when trying to read.delim() a file in latin1
encoding, using either encoding="latin1" or fileEncoding="latin1", and
then converting it back and forth between a character vector and a
factor. I still don't understand what's going on there. The
behaviour of read.delim() seems to depend very much on my locale
settings when running R, which is rather unpleasant. Is there a way
to find out how strings are stored internally (i.e. getting the exact
byte representation) and whether R believes them to be in UTF-8 or
latin1 encoding?


Best regards,
Stefan Evert

[ stefan.evert at uos.de | http://purl.org/stefan.evert ]





Output of sample code on my machine:
> print(sessionInfo())
R version 2.9.2 Patched (2009-09-24 r49861)
i386-apple-darwin9.8.0

locale:
en_GB/en_GB/C/C/en_GB/en_GB

attached base packages:
[1] stats graphics grDevices utils datasets methods base
> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
> str(words)
chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
> print(table(Encoding(words)))
unknown   UTF-8
      2       5
> Encoding(words) <- "latin1" # this is the correct encoding!
> print(table(Encoding(words)))
 latin1 unknown
      5       2
> N <- 1000
> words <- rep(words, length.out=N)
>
> print(N)
[1] 1000
> for (i in 1:N) {
+ x <- words[1:i]
+ # the following line will crash for some i, depending on the particular
+ # strings in <words> and the subset selected for <x> above
+ order(x)
+ }

*** caught bus error ***
address 0x86, cause 'non-existent physical address'

Traceback:
1: order(x)
aborting ...
Bus error

64-bit version:

> print(sessionInfo())
R version 2.9.2 Patched (2009-09-24 r49861)
x86_64-apple-darwin9.8.0

locale:
en_GB/en_GB/C/C/en_GB/en_GB

attached base packages:
[1] stats graphics grDevices utils datasets methods base
> words <- c("aa", "ab", "a\xfc", "a\xe4", "b\xe4", "b\xfc", "\xe4\xfc")
> str(words)
chr [1:7] "aa" "ab" "a\xfc" "a\xe4" "b\xe4" "b\xfc" ...
> print(table(Encoding(words)))
unknown   UTF-8
      2       5
> Encoding(words) <- "latin1" # this is the correct encoding!
> print(table(Encoding(words)))
 latin1 unknown
      5       2
> N <- 1000
> words <- rep(words, length.out=N)
>
> print(N)
[1] 1000
> for (i in 1:N) {
+ x <- words[1:i]
+ # the following line will crash for some i, depending on the particular
+ # strings in <words> and the subset selected for <x> above
+ order(x)
+ }
Error in order(x) : 'translateCharUTF8' must be called on a CHARSXP
Execution halted


  • Simon Urbanek at Sep 30, 2009 at 2:55 pm
    Stefan,
    On Sep 30, 2009, at 5:11 , Stefan Evert wrote:

    > The output I get from this code is appended at the end of the mail.
    > Note that R incorrectly declares the latin1 strings in <words> to
    > have UTF-8 encoding (this seems wrong to me because the \x escapes
    > insert raw bytes into the string).

    It is correct, because you're in a UTF-8 locale (see l10n_info()) so
    all strings are UTF-8 by default - you're just manually creating a
    string that is not valid in UTF-8.
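
    One way to check both halves of this point is to ask R about the
    current locale and to test whether a byte sequence is valid UTF-8.
    This is only a sketch: validUTF8() requires R 3.3.0 or later (it is
    not available in the R 2.9.2 discussed in this thread), and the
    variable names are made up for illustration.

    ```r
    # Ask R about the current locale; the "UTF-8" element is TRUE in a
    # UTF-8 locale.
    info <- l10n_info()
    in_utf8_locale <- isTRUE(info[["UTF-8"]])

    # "a" followed by a lone byte 0xfc (u-umlaut in latin1). Built from raw
    # bytes so the example does not depend on how \x escapes are parsed.
    x <- rawToChar(as.raw(c(0x61, 0xfc)))

    # A lone 0xfc byte is not a valid UTF-8 sequence.
    bytes_are_valid_utf8 <- validUTF8(x)

    print(in_utf8_locale)
    print(bytes_are_valid_utf8)
    ```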

    > The crash only occurs if the correct "latin1" encoding (or
    > "unknown") is explicitly specified. Otherwise the string handling
    > code appears to ignore everything after the first invalid multibyte
    > character.
    >
    > I haven't been able to trigger the bug without some kind of loop.
    > The crash always occurs at the same iteration, but this changes
    > depending on the contents of <words> and the specific subset
    > selected in each loop iteration. Also note that the 64-bit version
    > of R gives a different error message. If I omit the unrelated
    > statement "print(N)", the 64-bit version segfaults and the 32-bit
    > version just hangs with high CPU load. All this suggests to me that
    > there must be some insidious memory corruption or stack/range
    > overflow in the internal ordering code.

    Yup:

    Program received signal EXC_BAD_ACCESS, Could not access memory.
    Reason: 13 at address: 0x0000000000000000
    0x0000000100167e0d in R_gc_internal (size_needed=1) at ../../../../
    R-2.9-branch/src/main/memory.c:1327
    1327 PROCESS_NODES();
    (gdb) bt
    #0 0x0000000100167e0d in R_gc_internal (size_needed=1) at ../../../../
    R-2.9-branch/src/main/memory.c:1327
    #1 0x000000010016a2bf in Rf_allocVector (type=607, length=0)
    at ../../../../R-2.9-branch/src/main/memory.c:1991
    #2 0x000000010016aa65 in R_alloc (nelem=<value temporarily
    unavailable, due to optimizations>, eltsize=<value temporarily
    unavailable, due to optimizations>) at ../../../../R-2.9-branch/src/
    main/memory.c:1669
    #3 0x000000010020f316 in Rf_translateCharUTF8 (x=<value temporarily
    unavailable, due to optimizations>) at ../../../../R-2.9-branch/src/
    main/sysutils.c:858
    #4 0x0000000100216140 in Rf_Scollate (a=0x1023c1518, b=0x0)
    at ../../../../R-2.9-branch/src/main/util.c:1691
    #5 0x00000001001f894e in orderVector1 (indx=<value temporarily
    unavailable, due to optimizations>, n=<value temporarily unavailable,
    due to optimizations>, key=0x11b024c00, nalast=TRUE, decreasing=FALSE,
    rho=0x1020a4778) at ../../../../R-2.9-branch/src/main/sort.c:846
    #6 0x00000001001f9605 in orderVector [inlined] () at ../../../../
    R-2.9-branch/src/main/sort.c:888
    #7 do_order (call=<value temporarily unavailable, due to
    optimizations>, op=<value temporarily unavailable, due to
    optimizations>, args=0x11843fc38, rho=<value temporarily unavailable,
    due to optimizations>) at ../../../../R-2.9-branch/src/main/sort.c:891

    Note that b=0x0 in the call to Rf_Scollate -- seems like some array
    overflow in the sorting code... will need some more investigation ...


    In the meantime I can offer you a work-around -- working with
    non-native strings (latin1 in your case) is very expensive because
    they get converted into the native locale all the time, so you want
    to run

    words <- iconv(words, "latin1", "")

    and then proceed - it's faster and doesn't crash ;).
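
    As a self-contained illustration of this work-around, the sketch
    below re-encodes a small latin1 vector once and then sorts it. Two
    things here are assumptions rather than part of the advice above:
    the target encoding is given explicitly as "UTF-8" (instead of "",
    the native encoding) so that the result does not depend on the
    locale, and the variable names are invented for the example.

    ```r
    # Build latin1 strings from raw bytes (0xfc = u-umlaut, 0xe4 = a-umlaut).
    latin1_words <- c("aa", "ab",
                      rawToChar(as.raw(c(0x61, 0xfc))),
                      rawToChar(as.raw(c(0x61, 0xe4))))
    Encoding(latin1_words) <- "latin1"

    # Re-encode once, up front. to = "UTF-8" is an assumption made here
    # for portability; the work-around above uses to = "".
    utf8_words <- iconv(latin1_words, from = "latin1", to = "UTF-8")

    # Order the re-encoded strings; the resulting order depends on the
    # collation rules of the current locale.
    sorted_words <- utf8_words[order(utf8_words)]
    print(Encoding(utf8_words))
    ```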

    > Can other people reproduce this problem on different platforms and
    > possibly with different versions of R?
    >
    > BTW, I ran into the crash when trying to read.delim() a file in
    > latin1 encoding, using either encoding="latin1" or
    > fileEncoding="latin1", and then converting it back and forth between
    > a character vector and a factor. I still don't understand what's
    > going on there. The behaviour of read.delim() seems to depend very
    > much on my locale settings when running R, which is rather unpleasant.

    ?? The whole point of a locale is that it declares how you are going
    to interact with the system. Handling of strings is entirely different
    depending on the encoding used by the locale - and that is the point
    of locales. When you are dealing with text (e.g. as files) you must
    always take the encoding into account and by default they are assumed
    to be in the same encoding as your locale - you really wouldn't want R
    to suddenly read all files as let's say eucJP even though your locale
    is UTF-8 ...

    > Is there a way to find out how strings are stored internally (i.e.
    > getting the exact byte representation) and whether R believes them
    > to be in UTF-8 or latin1 encoding?

    charToRaw() will show you the raw bytes, and with Encoding() you
    declare how you want the string to be interpreted (supported values
    are UTF-8, latin1 and unknown). If the encoding is known, R will
    convert it where needed. Normally R uses the native encoding of the
    locale you're running in. If you are dealing with files from other
    locales, you have to tell R accordingly - in most cases it's better to
    re-encode the strings (?iconv) than to work with the foreign encoding.
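
    A minimal sketch of that inspection, assuming a reasonably recent R
    (enc2utf8() is not available in R 2.9.2) and using made-up variable
    names:

    ```r
    # "a" followed by the single latin1 byte for u-umlaut (0xfc).
    x <- rawToChar(as.raw(c(0x61, 0xfc)))
    Encoding(x) <- "latin1"   # declare how the bytes should be interpreted

    print(charToRaw(x))       # the stored bytes: 61 fc
    print(Encoding(x))        # "latin1"

    # enc2utf8() re-encodes according to the declared encoding.
    y <- enc2utf8(x)
    print(charToRaw(y))       # 61 c3 bc
    print(Encoding(y))        # "UTF-8"
    ```

    Setting Encoding() changes only the declaration, not the stored
    bytes; the bytes change only when the string is re-encoded.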

    Cheers,
    Simon



    ______________________________________________
    R-devel at r-project.org mailing list
    https://stat.ethz.ch/mailman/listinfo/r-devel
  • Prof Brian Ripley at Oct 5, 2009 at 2:16 pm
    This was a missing PROTECT() in do_order.

    But I'll echo what Simon Urbanek said: don't do that but rather use
    the documented ways to re-encode the file as you read it. (Latin-1
    used to be needed for collation on Mac OS X as C-level collation in
    UTF-8 was completely broken -- but we have worked around that.)

    We provided fileEncoding= in read.table for those who failed to RTFM
    and thought encoding= was to set the file encoding, but it seems that
    encodings are simply too hard a concept for some R users.
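
    To illustrate the fileEncoding= argument mentioned here, the sketch
    below writes a tiny latin1 file and reads it back. It assumes an R
    session running in a UTF-8 locale; the file contents and variable
    names are invented for the example.

    ```r
    # Write a tiny latin1-encoded file: header "word", one row "a<0xfc>".
    path <- tempfile(fileext = ".txt")
    writeBin(c(charToRaw("word\n"), as.raw(c(0x61, 0xfc, 0x0a))), path)

    # fileEncoding= declares the encoding of the file; R re-encodes the
    # contents to the native encoding of the session while reading.
    d <- read.delim(path, fileEncoding = "latin1", stringsAsFactors = FALSE)

    print(nrow(d))
    print(Encoding(d$word))
    unlink(path)
    ```

    In a UTF-8 session the value read this way should consist of the
    bytes 61 c3 bc, which charToRaw(d$word[1]) can confirm.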
    --
    Brian D. Ripley, ripley at stats.ox.ac.uk
    Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
    University of Oxford, Tel: +44 1865 272861 (self)
    1 South Parks Road, +44 1865 272866 (PA)
    Oxford OX1 3TG, UK Fax: +44 1865 272595
