Hi all,

I'm moving a database from PG 7.2.4 to 8.2.6. I have already run
iconv on the dump file like so:

iconv -c -f UTF-8 -t UTF-8 -o out.dmp in.dmp

But I'm still getting this error when loading the data into the new
database:

ERROR: invalid byte sequence for encoding "UTF8": 0xeda7a1
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
CONTEXT: COPY article, line 2

FWIW this is the second database I've moved this way and for the
first one, iconv fixed all the byte sequence errors. No such luck
this time.

The 7.2.4 database has encoding UNICODE, and the 8.2.6 one is in UTF-8.

To make matters even more fun, the data is in Traditional Chinese
characters, which I don't read, so there seems to be no way for me to
identify the problem bits. I've loaded the dump file into a hex
editor and searched for the value that's reported as the problem but
it's not in the file.
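
For readers who can't eyeball the data either, here is a minimal Python sketch that scans raw dump bytes for every three-byte sequence of the kind in the error message (0xED followed by a byte in 0xA0..0xBF), rather than for one specific value. The function name is made up for illustration. The line numbers it reports are lines of the dump file, not the per-table line numbers that COPY reports.

```python
import re

# Bytes that look like a UTF-8 encoding of a surrogate code point
# (U+D800..U+DFFF): lead byte 0xED, then 0xA0..0xBF, then 0x80..0xBF.
# In well-formed UTF-8, 0xED is only ever followed by 0x80..0x9F.
SURROGATE_BYTES = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def find_surrogate_sequences(data: bytes):
    """Return (file_line_number, byte_offset, hex_string) for each hit."""
    hits = []
    for match in SURROGATE_BYTES.finditer(data):
        # Count newlines before the hit to get a 1-based file line number.
        line_number = data.count(b"\n", 0, match.start()) + 1
        hits.append((line_number, match.start(), match.group().hex()))
    return hits


# Small in-memory example; for a real dump, read the file in binary mode.
sample = b"abc\nx\xed\xa7\xa1y\n"
for line_number, offset, hex_bytes in find_surrogate_sequences(sample):
    print(f"line {line_number}, offset {offset}: 0x{hex_bytes}")
```

To scan an actual dump, pass it the result of `open(path, "rb").read()`; the file must be opened in binary mode so that no decoding happens first.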

Is there anything I can do to fix this?

Thanks in advance,

janine

  • Tom Lane at Jan 17, 2008 at 11:39 pm

    Janine Sisk writes:
    > But I'm still getting this error when loading the data into the new
    > database:
    > ERROR: invalid byte sequence for encoding "UTF8": 0xeda7a1

    The reason PG doesn't like this sequence is that it corresponds to
    a Unicode "surrogate pair" code point, which is not supposed to
    ever appear in UTF-8 representation --- surrogate pairs are a kluge for
    UTF-16 to deal with Unicode code points of more than 16 bits. See

    http://en.wikipedia.org/wiki/UTF-16

    I think you need a version of iconv that knows how to fold surrogate
    pairs into proper UTF-8 form. It might also be that the data is
    outright broken --- if this sequence isn't followed by another
    surrogate-pair sequence then it isn't valid Unicode by anybody's
    interpretation.

    7.2.x unfortunately didn't check Unicode data carefully, and would
    have let this data pass without comment ...
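
As an illustration of the "folding" described above, here is a hedged Python sketch (not an iconv feature, and the function name is made up). It assumes Python 3.4 or later, where the `surrogatepass` error handler is available for the UTF-8 and UTF-16 codecs. A properly paired high and low surrogate is merged into one code point; an unpaired surrogate makes the strict UTF-16 decode fail, which matches the "outright broken" case.

```python
def fold_surrogate_pairs(data: bytes) -> bytes:
    """Re-encode surrogate pairs found in UTF-8-like bytes as real UTF-8.

    Raises UnicodeDecodeError if a surrogate is not part of a valid pair.
    """
    # Step 1: accept the surrogate byte sequences instead of rejecting them.
    text = data.decode("utf-8", errors="surrogatepass")
    # Step 2: write the surrogates out as plain UTF-16 code units.
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    # Step 3: a strict UTF-16 decode merges valid pairs and rejects lone ones.
    return utf16.decode("utf-16-le").encode("utf-8")


# U+1F600 written as two surrogates (D83D, DE00), each encoded in three bytes.
paired = b"\xed\xa0\xbd\xed\xb8\x80"
print(fold_surrogate_pairs(paired).hex())
```

Running this on data where the surrogate stands alone, such as `b"\xed\xa7\xa1x"`, raises `UnicodeDecodeError`, so it also works as a check for whether the data can be repaired this way at all.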

    regards, tom lane
  • Albe Laurenz at Jan 18, 2008 at 8:00 am

    Tom Lane wrote:
    >> But I'm still getting this error when loading the data into the new
    >> database:
    >> ERROR: invalid byte sequence for encoding "UTF8": 0xeda7a1
    > The reason PG doesn't like this sequence is that it corresponds to
    > a Unicode "surrogate pair" code point, which is not supposed to
    > ever appear in UTF-8 representation --- surrogate pairs are a kluge for
    > UTF-16 to deal with Unicode code points of more than 16 bits.

    0xEDA7A1 (UTF-8) corresponds to Unicode code point 0xD9E1, which,
    when interpreted as a high surrogate and followed by a low surrogate,
    would correspond to the UTF-16 encoding of a code point
    between 0x88400 and 0x887FF (depending on the value of the low surrogate).

    These code points do not correspond to any valid character.
    So - unless there is a flaw in my reasoning - there's something
    fishy with these data anyway.
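
The arithmetic above can be checked with a few lines of Python. This is only a worked example of the standard three-byte UTF-8 layout and the UTF-16 surrogate formula; the function names are made up for illustration.

```python
def decode_three_byte(seq: bytes) -> int:
    """Extract the payload bits of a three-byte UTF-8-style sequence."""
    b0, b1, b2 = seq
    # 1110xxxx 10xxxxxx 10xxxxxx -> 4 + 6 + 6 payload bits
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def pair_range(high_surrogate: int):
    """Code points reachable by pairing this high surrogate with any low one."""
    base = 0x10000 + ((high_surrogate - 0xD800) << 10)
    return base, base + 0x3FF  # the low surrogate contributes 10 bits


code_point = decode_three_byte(bytes.fromhex("eda7a1"))
first, last = pair_range(code_point)
print(hex(code_point), hex(first), hex(last))  # 0xd9e1 0x88400 0x887ff
```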

    Janine, could you give us a hex dump of that line from the copy statement?

    Yours,
    Laurenz Albe
  • Janine Sisk at Jan 18, 2008 at 6:09 pm

    On Jan 18, 2008, at 12:00 AM, Albe Laurenz wrote:

    > 0xEDA7A1 (UTF-8) corresponds to Unicode code point 0xD9E1, which,
    > when interpreted as a high surrogate and followed by a low surrogate,
    > would correspond to the UTF-16 encoding of a code point
    > between 0x88400 and 0x887FF (depending on the value of the low
    > surrogate).
    >
    > These code points do not correspond to any valid character.
    > So - unless there is a flaw in my reasoning - there's something
    > fishy with these data anyway.
    >
    > Janine, could you give us a hex dump of that line from the copy
    > statement?

    Certainly. Do you want to see it as it came from the old database,
    or after I ran it through iconv? Although iconv wasn't able to solve
    this problem it did fix others in other tables; unfortunately I have
    no way of knowing if it also mangled some data at the same time.

    The version of iconv I have does know about UTF16 so I tried using
    that as the "from" encoding instead of UTF8, but the result had new
    errors in places where the original data was good, so that was
    obviously a step backwards.

    BTW, in case it matters I found out I misidentified the version of PG
    this data came from - it's actually 7.3.6.

    thanks,

    janine
  • Albe Laurenz at Jan 21, 2008 at 8:15 am

    Janine Sisk wrote:
    >> 0xEDA7A1 (UTF-8) corresponds to Unicode code point 0xD9E1, which,
    >> when interpreted as a high surrogate and followed by a low surrogate,
    >> would correspond to the UTF-16 encoding of a code point
    >> between 0x88400 and 0x887FF (depending on the value of the low surrogate).
    >>
    >> These code points do not correspond to any valid character.
    >> So - unless there is a flaw in my reasoning - there's something
    >> fishy with these data anyway.
    >>
    >> Janine, could you give us a hex dump of that line from the copy statement?
    > Certainly. Do you want to see it as it came from the old database,
    > or after I ran it through iconv? Although iconv wasn't able to solve
    > this problem it did fix others in other tables; unfortunately I have
    > no way of knowing if it also mangled some data at the same time.

    Both; but the "before" dump is of course more likely to give a clue.
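
One way to produce such a hex dump is sketched below in Python. It rests on an assumption that should be checked against the real file: that the dump is in plain-text format, with a line starting `COPY article ...` followed by one data line per row. The helper names are made up for illustration.

```python
def copy_data_line(dump: bytes, table: bytes, n: int) -> bytes:
    """Return the n-th data line (1-based) after the COPY line for `table`.

    Assumes a plain-text dump where the COPY statement sits on one line.
    """
    lines = dump.split(b"\n")
    for index, line in enumerate(lines):
        words = line.split()
        # The table name may or may not be double-quoted in the dump.
        if len(words) > 1 and words[0] == b"COPY" and words[1].strip(b'"') == table:
            return lines[index + n]
    raise ValueError("no COPY statement found for table")


def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex bytes and printable ASCII, one row per chunk."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08x}  {hex_part:<{width * 3}} {ascii_part}")
    return "\n".join(rows)


# Tiny made-up dump; a real one would be read with open(path, "rb").read().
sample = b"COPY article (id, title) FROM stdin;\n1\tfoo\n2\tb\xed\xa7\xa1\n\\.\n"
print(hex_dump(copy_data_line(sample, b"article", 2)))
```

Reading the file in binary mode matters here, since decoding it as text first would either fail on the bad bytes or silently replace them.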

    Yours,
    Laurenz Albe
