I'm parsing a 2GB XML file that contains the illegal entity &#8; for some
reason. I don't have access to the source data and the file is generated
weekly so I need to be lax about those entities (replace them with a space
or nothing).

Here's an example of the error: http://play.golang.org/p/PQNusIo_Ix
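
For readers who cannot open the playground link, here is a minimal sketch of how the error can be reproduced with encoding/xml. The element name, the item type and the parse helper are made up for illustration and are not taken from the linked snippet.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// item is a hypothetical one-element document used only for this demo.
type item struct {
	Text string `xml:",chardata"`
}

// parse decodes a single <item> element and returns its character data.
func parse(input string) (string, error) {
	var it item
	err := xml.Unmarshal([]byte(input), &it)
	return it.Text, err
}

func main() {
	// &#8; refers to U+0008 (backspace), which XML 1.0 does not allow.
	_, err := parse("<item>bad&#8;char</item>")
	fmt.Println("with &#8;:", err)

	// &#9; refers to a tab, which is allowed, so this one decodes.
	text, err := parse("<item>ok&#9;char</item>")
	fmt.Printf("with &#9;: %q, err=%v\n", text, err)
}
```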

First, should it cause an error?
Second, do you have any tips on how to remove those entities on the fly?

I was thinking of writing an XmlCharReader that implements io.Reader that
filters the entities but might need some tips on how to do that.

Ideas?


  • Kamil Kisiel at Sep 6, 2012 at 8:52 pm
    I encountered the same problem in a project I was recently working on. I
    solved it by implementing a Reader that filters out invalid UTF-8
    characters from the stream.

    The code for the reader is
    here: https://github.com/kisielk/gorge/blob/master/util/util.go

  • Hannson at Sep 6, 2012 at 9:11 pm
    It's probably not the exact same problem; I still get the same result when
    using your code. The XML parser decodes the entity into 0x08 after the
    input has been read. I tried a similar solution myself before I figured
    out that the problem was an XML entity, not an illegal byte.

    See: http://play.golang.org/p/0I7mBae3K7

    I tried adding [#8 = ""] to the Entity map in xml.Decoder, but nothing
    changed, so I see no other option than to write a filter that searches
    for and replaces those illegal entities.
  • Kamil Kisiel at Sep 6, 2012 at 10:03 pm
    Ah sorry, I misunderstood the problem. I see what's happening now: the
    decoder is seeing it as a character entity, but one that's outside of
    the valid range: http://www.xml.com/axml/testaxml.htm (section 2.2 -
    Characters). You'd have to either modify the decoder to ignore these
    instead of returning an error, or else filter them out somehow before
    decoding.
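
The valid range from section 2.2 can be written as a small predicate. This is a sketch of the Char production from the XML 1.0 spec; the function name isXMLChar is made up here and is not part of the exported encoding/xml API.

```go
package main

import "fmt"

// isXMLChar reports whether r is allowed by the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
func isXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

func main() {
	// U+0008 is the character that &#8; refers to.
	for _, r := range []rune{0x08, 0x09, 0x20, 0xFFFE, 0x1F600} {
		fmt.Printf("U+%04X legal: %v\n", r, isXMLChar(r))
	}
}
```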
  • Hannson at Sep 6, 2012 at 11:12 pm
    Yeah I think I'll rip out the entity code from xml.Decoder and use it in a
    filter. I can't modify the decoder because I might share the code later and
    I'd rather not have to hack every release of Go to work for this particular
    file.

    For now I'll just remove the entity from the file and see what happens.
  • Jan Mercl at Sep 7, 2012 at 9:38 am

    On Sep 6, 2012 10:39 PM, "hannson" wrote:
    Ideas?
    sed?

    -j
  • Mike Samuel at Sep 7, 2012 at 6:17 pm

    It's possible using sed, but not trivial to get correct for arbitrary
    markup. Consider that the "&#8;" sequence in

    <![CDATA[ --> &#8;]]> <![CDATA[ ... ]]>

    should not be fixed, but the one in

    <!-- <![CDATA[ -->&#8; <![CDATA[ ... ]]>

    should be fixed.

    To handle it in the general case, you have to write a SAX parser in sed and
    that still won't handle illegal codepoints introduced via external entity
    inclusion.

    That said, well-formed but numerically invalid character references inside
    CDATA sections are probably rare, and CDATA section boundary tokens inside
    comments are probably rarer still.
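
To make the two cases above concrete, here is a rough sketch in Go (not sed) of a filter that copies comments and CDATA sections through untouched and only drops illegal numeric references found outside them. The function names are invented, it works on a whole string instead of a stream, and it does not handle processing instructions, DOCTYPE declarations or external entities, so treat it as an illustration of the idea and not as a complete solution.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// charRef matches a numeric character reference at the start of a string.
var charRef = regexp.MustCompile(`^&#(x[0-9a-fA-F]+|[0-9]+);`)

// legalRef reports whether the reference body ("8" or "x1F") names a
// character allowed by the XML 1.0 Char production.
func legalRef(body string) bool {
	base := 10
	if strings.HasPrefix(body, "x") {
		base, body = 16, body[1:]
	}
	v, err := strconv.ParseUint(body, base, 32)
	if err != nil {
		return false
	}
	return v == 0x9 || v == 0xA || v == 0xD ||
		(v >= 0x20 && v <= 0xD7FF) ||
		(v >= 0xE000 && v <= 0xFFFD) ||
		(v >= 0x10000 && v <= 0x10FFFF)
}

// skipPast returns the index just past the first occurrence of end at or
// after position i, or len(s) if the construct is never closed.
func skipPast(s string, i int, end string) int {
	if j := strings.Index(s[i:], end); j >= 0 {
		return i + j + len(end)
	}
	return len(s)
}

// stripIllegalRefs removes illegal numeric character references that
// appear outside comments and CDATA sections.
func stripIllegalRefs(s string) string {
	var out strings.Builder
	for i := 0; i < len(s); {
		rest := s[i:]
		var j int
		switch {
		case strings.HasPrefix(rest, "<!--"):
			j = skipPast(s, i+len("<!--"), "-->")
		case strings.HasPrefix(rest, "<![CDATA["):
			j = skipPast(s, i+len("<![CDATA["), "]]>")
		case rest[0] == '&':
			if m := charRef.FindStringSubmatch(rest); m != nil && !legalRef(m[1]) {
				i += len(m[0]) // drop the whole reference
				continue
			}
			j = i + 1
		default:
			j = i + 1
		}
		out.WriteString(s[i:j])
		i = j
	}
	return out.String()
}

func main() {
	fmt.Println(stripIllegalRefs("<![CDATA[ --> &#8;]]> <![CDATA[ ... ]]>"))
	fmt.Println(stripIllegalRefs("<!-- <![CDATA[ -->&#8; <![CDATA[ ... ]]>"))
}
```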

Discussion Overview
group: golang-nuts
categories: go
posted: Sep 6, '12 at 8:52p
active: Sep 7, '12 at 6:17p
posts: 7
users: 4
website: golang.org
