FAQ
Hello Gophers,


In my little Go program I fetch a UTF-16 LE XML file from a Windows server.
I found a few hints on the internet on how to handle the parsing with Go:

import (
	"encoding/xml"

	"golang.org/x/net/html/charset"
)

decoder := xml.NewDecoder(reader)
decoder.CharsetReader = charset.NewReaderLabel
err = decoder.Decode(&parsed)
This was posted on Stack Overflow by user moraes
<http://stackoverflow.com/users/125967/moraes>. But I still don't
understand the basic procedure for handling UTF-16 files.

1. Do I have to convert the UTF-16 encoded file to UTF-8 before decoding it
with the above commands, or will NewDecoder handle this?

2. How do I handle the line in the XML file that declares the encoding? Do I
have to manually replace it with <?xml version="1.0" encoding="UTF-8"?>




Thanks for your input.


Tobias

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


  • Giulio Iotti at Sep 1, 2015 at 6:13 pm

    On Tuesday, September 1, 2015 at 4:01:48 PM UTC+3, Tobias S. wrote:
    1. Do I have to convert the UTF-16 encoded file to UTF-8 before decoding
    it with the above commands, or will NewDecoder handle this?

    The conversion has to happen in CharsetReader. In the example you pasted,
    charset.NewReaderLabel handles the conversion to UTF-8 for you. Internally,
    encoding/xml only handles UTF-8.

    2. How do I handle the line in the XML file that declares the encoding?
    Do I have to manually replace it with <?xml version="1.0" encoding="UTF-8"?>

    I think this should be the real encoding; in your case UTF-16. The value
    found in the <?xml?> declaration is passed to CharsetReader as the first
    argument. The second argument is the reader itself.

    CharsetReader must return a Reader that yields the UTF-8 encoded contents,
    or an error. This is exactly (and not incidentally) what NewReaderLabel
    does :)
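To make the CharsetReader contract concrete, here is a minimal sketch that uses only the standard library. It is an illustration, not the charset package's implementation: the names charsetReader, latin1ToUTF8, doc and decode are made up for this example, and it handles ISO-8859-1 rather than UTF-16 because Latin-1 keeps the XML declaration readable as ASCII.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// latin1ToUTF8 converts ISO-8859-1 bytes to UTF-8. Every Latin-1 byte maps
// to the Unicode code point with the same numeric value.
func latin1ToUTF8(r io.Reader) (io.Reader, error) {
	raw, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	var sb strings.Builder
	for _, b := range raw {
		sb.WriteRune(rune(b))
	}
	return strings.NewReader(sb.String()), nil
}

// charsetReader has the signature encoding/xml expects: label is the value
// of the encoding attribute in the XML declaration, input is the raw stream.
func charsetReader(label string, input io.Reader) (io.Reader, error) {
	switch strings.ToLower(label) {
	case "iso-8859-1", "latin1":
		return latin1ToUTF8(input)
	}
	return nil, fmt.Errorf("unsupported charset %q", label)
}

// doc is a hypothetical target type for this example.
type doc struct {
	Inner string `xml:"Inner"`
}

// decode parses data, routing non-UTF-8 declarations through charsetReader.
func decode(data []byte) (string, error) {
	var d doc
	dec := xml.NewDecoder(bytes.NewReader(data))
	dec.CharsetReader = charsetReader
	if err := dec.Decode(&d); err != nil {
		return "", err
	}
	return d.Inner, nil
}

func main() {
	// 0xE9 is "é" in Latin-1 and is not valid UTF-8 on its own.
	data := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" +
		"<Outer><Inner>caf\xe9</Inner></Outer>")
	got, err := decode(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

The decoder only calls CharsetReader after it has read the declaration, which is why this approach works for ASCII-compatible encodings but runs into trouble with UTF-16 input.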

    --
    Giulio Iotti

  • Tobias S. at Sep 3, 2015 at 2:04 pm
    Thanks for your clarification. My problem now seems to be the BOM. The code
    looks like this:

    b, _ := ioutil.ReadAll(xmlFile)
    text := strings.NewReader(string(b))
    decoder := xml.NewDecoder(text)
    decoder.CharsetReader = charset.NewReaderLabel


    When I print out the text variable I get:

    &{??<?xml version="1.0" encoding="UTF-16"?>


    At the start of the file. The two leading question marks are probably the
    BOM (byte order mark). I get the error message:

    XML syntax error on line 1: invalid UTF-8

    from the decoder....







  • Andrey mirtchovski at Sep 3, 2015 at 3:36 pm
    The charset package should be able to handle a BOM (because the unicode
    transforms it uses do). Two things to try: see what DetermineEncoding
    says about your text, and then add your test file to the
    sniffTestCases inside the charset package's charset_test.go.

    If the latter sounds like too much, just print the first line of your
    "text" variable formatted with %q and let's see exactly what those bytes
    are.
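A small sketch of that inspection step, using only the standard library. The helper names describeBOM and quoteHead are made up for this example; charset.DetermineEncoding from golang.org/x/net/html/charset would be the package's own way to sniff the encoding and is not used here.

```go
package main

import (
	"bytes"
	"fmt"
)

// describeBOM reports which byte order mark, if any, starts data.
func describeBOM(data []byte) string {
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "UTF-8"
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "UTF-16LE"
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "UTF-16BE"
	}
	return "none"
}

// quoteHead formats the first n bytes of data with %q so that non-printable
// bytes show up as escape sequences instead of question marks.
func quoteHead(data []byte, n int) string {
	if len(data) < n {
		n = len(data)
	}
	return fmt.Sprintf("%q", data[:n])
}

func main() {
	// The first bytes of a UTF-16LE file that starts with "<?x".
	data := []byte{0xFF, 0xFE, '<', 0x00, '?', 0x00, 'x', 0x00}
	fmt.Println(quoteHead(data, 8))
	fmt.Println(describeBOM(data))
}
```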

  • Tobias S. at Sep 4, 2015 at 2:23 pm
    Hi Andrey,

    here is the printout with %q of the first few characters:

    &{"\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00\"\x001\x00.\


    I will try your suggestions later on and let you know what's going on.



  • Andrey mirtchovski at Sep 4, 2015 at 7:34 pm
    The problem lies in encoding/xml's design: in order to use the
    charset reader, the XML library needs to examine the first line of
    the XML file (where the encoding is specified). Unfortunately, that
    first line already contains invalid UTF-8, and encoding/xml fails
    before it even figures out which encoding to pass to our charset
    reader.

    To solve this we can cheat and pass the input through the charset
    reader before it goes to encoding/xml. Unfortunately, in that case the
    XML library will find a UTF-16 encoding specified in the XML
    declaration and will complain that it has no charset reader to convert
    that. We solve this by supplying a dummy charset reader.

    This is inelegant but I don't see another solution, at least not a
    quick one. There is talk of redesigning xml, so if you file a bug
    there may be a chance to get this fixed somehow in the library.

    I've attached the test program and the test utf-16le encoded file I
    created from your initial input. Sample run below:

    $ hexdump -C bomtest.txt
    00000000 ff fe 3c 00 3f 00 78 00 6d 00 6c 00 20 00 76 00 |..<.?.x.m.l. .v.|
    00000010 65 00 72 00 73 00 69 00 6f 00 6e 00 3d 00 22 00 |e.r.s.i.o.n.=.".|
    00000020 31 00 2e 00 30 00 22 00 20 00 65 00 6e 00 63 00 |1...0.". .e.n.c.|
    00000030 6f 00 64 00 69 00 6e 00 67 00 3d 00 22 00 55 00 |o.d.i.n.g.=.".U.|
    00000040 54 00 46 00 2d 00 31 00 36 00 22 00 3f 00 3e 00 |T.F.-.1.6.".?.>.|
    00000050 20 00 20 00 0d 00 0a 00 3c 00 4f 00 75 00 74 00 | . .....<.O.u.t.|
    00000060 65 00 72 00 3e 00 3c 00 49 00 6e 00 6e 00 65 00 |e.r.>.<.I.n.n.e.|
    00000070 72 00 3e 00 74 00 65 00 73 00 74 00 3c 00 2f 00 |r.>.t.e.s.t.<./.|
    00000080 49 00 6e 00 6e 00 65 00 72 00 3e 00 3c 00 2f 00 |I.n.n.e.r.>.<./.|
    00000090 4f 00 75 00 74 00 65 00 72 00 3e 00 0d 00 0a 00 |O.u.t.e.r.>.....|
    000000a0 3c 00 2f 00 78 00 6d 00 6c 00 3e 00 0d 00 0a 00 |<./.x.m.l.>.....|
    000000b0
    $ go run t.go
    test
    $
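The attached t.go is not part of this archive, so the following is only a sketch of the workaround described above, not the original program. It assumes the whole file fits in memory and swaps in the standard library's unicode/utf16 for the conversion instead of the charset package. The names utf16ToUTF8, passThrough, outer, parseUTF16 and encodeUTF16LE are made up for this example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"unicode/utf16"
)

// utf16ToUTF8 converts UTF-16 input to UTF-8. A leading BOM selects the byte
// order and is dropped; without a BOM, little-endian is assumed.
func utf16ToUTF8(data []byte) ([]byte, error) {
	if len(data)%2 != 0 {
		return nil, errors.New("odd number of bytes in UTF-16 input")
	}
	var order binary.ByteOrder = binary.LittleEndian
	switch {
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		data = data[2:]
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		order = binary.BigEndian
		data = data[2:]
	}
	units := make([]uint16, len(data)/2)
	for i := range units {
		units[i] = order.Uint16(data[2*i:])
	}
	return []byte(string(utf16.Decode(units))), nil
}

// passThrough is the "dummy" charset reader: the input is already UTF-8, so
// the encoding label from the XML declaration is ignored.
func passThrough(label string, input io.Reader) (io.Reader, error) {
	return input, nil
}

// outer mirrors the <Outer><Inner>...</Inner></Outer> test document.
type outer struct {
	Inner string `xml:"Inner"`
}

// parseUTF16 converts first, then lets encoding/xml parse the UTF-8 result.
func parseUTF16(data []byte) (string, error) {
	utf8Data, err := utf16ToUTF8(data)
	if err != nil {
		return "", err
	}
	var o outer
	dec := xml.NewDecoder(bytes.NewReader(utf8Data))
	dec.CharsetReader = passThrough
	if err := dec.Decode(&o); err != nil {
		return "", err
	}
	return o.Inner, nil
}

// encodeUTF16LE builds BOM-prefixed UTF-16LE test input from a Go string.
func encodeUTF16LE(s string) []byte {
	out := []byte{0xFF, 0xFE}
	for _, u := range utf16.Encode([]rune(s)) {
		out = append(out, byte(u), byte(u>>8))
	}
	return out
}

func main() {
	in := encodeUTF16LE(`<?xml version="1.0" encoding="UTF-16"?>` +
		`<Outer><Inner>test</Inner></Outer>`)
	got, err := parseUTF16(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

For large inputs a streaming conversion would be preferable. One option, if adding the dependency is acceptable, is to wrap the file with transform.NewReader and the decoder from unicode.UTF16(unicode.LittleEndian, unicode.UseBOM) in golang.org/x/text.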

  • Tobias S. at Sep 7, 2015 at 9:59 am
    Thank you very much for your workaround code, it works!

    Also many thanks to all who posted in order to solve the problem. You are a
    special community, always willing to help.



  • Konstantin Khomoutov at Sep 3, 2015 at 3:39 pm

    On Thu, 3 Sep 2015 07:04:20 -0700 (PDT) "Tobias S." wrote:


    Thanks for your clarification. My problem now seems to be the BOM.
    The code looks like this:

    b, _ := ioutil.ReadAll(xmlFile)
    text := strings.NewReader(string(b))
    decoder := xml.NewDecoder(text)
    decoder.CharsetReader = charset.NewReaderLabel
    Overengineered. os.File already implements io.Reader,
    so just do

    decoder := xml.NewDecoder(xmlFile)
    decoder.CharsetReader = charset.NewReaderLabel
    When I print out the text variable I get:

    &{??<?xml version="1.0" encoding="UTF-16"?>

    At the start of the file. The two leading question marks are probably
    the BOM marks. I get the error message:

    XML syntax error on line 1: invalid UTF-8

    from the decoder....
    OK, so I'd then employ buffering and its ability to "peek" at the data,
    literally, and discard it, if needed:
    http://play.golang.org/p/zGrNnYRkPF

  • Nigel Tao at Sep 4, 2015 at 12:49 am
    Adding Andy and Marcel for their thoughts re how a UTF-16 charset
    reader from golang.org/x/net/html/charset should handle BOMs.

  • Andy Balholm at Sep 4, 2015 at 3:30 pm
    As I understand the WHATWG spec, the BOMs are to be left in the decoded output, to be ignored by the tokenizer. But I can’t imagine what problem it would cause for the decoder to remove them.

