The tough parts of HTML5 are that it's constantly changing and that (as
you've mentioned) other specs like JavaScript and MathML are tied into
it. If it became a core part of Go's stdlib, that would be nice, but I
don't know if it could satisfy contemporary and future users. This is
another tough question that ties into #5 and others — depending on how far
you go, you could be implementing most of a browser, aside from the
rendering. So I think sticking to the HTML side of things would be
simplest and most stable.

One thing I can say with confidence relates to #10: I think xpath, or
something like it, is the most convenient way of dealing with this style of
markup.

On Friday, February 1, 2013 10:03:22 PM UTC-5, Nigel Tao wrote:

(Design discussions would normally be sent to
golan...@googlegroups.com (BCC'ed), but I am trawling wide for
feedback.)

The exp/html package in tip provides a spec-compliant HTML5 parser. As
Go 1.1 is approaching, this package will likely either be promoted to
html, or move to the go.net sub-repository. If the former, this will
require freezing its API against incompatible changes, as per
http://golang.org/doc/go1compat.html. It is unlikely that exp/html
will gain additional features before Go 1.1 is released, but ideally
the API that we freeze will still allow adding compatible features in
the future.

If you have had any problems with the API, any feature requests, or
comments in general, then now is the time to speak up. Below is a list
of known concerns.

0. Should Node be a struct or an interface?

1. There aren't enough hooks to support <script> tags, including ones
that call document.write. On the other hand, we do not want to mandate
a particular JS implementation.

2. It is not proven that the Node type can support the DOM API.

3. Even without scripting, it is not proven that the Node type can
support rendering: it is not obvious how to attach style and layout
information. On the other hand, we do not want to mandate a particular
style and layout implementation.

4. The parser assumes that the input is UTF-8. It is possible that
this is perfectly reasonable and the io.Reader given to it can be
responsible for auto-detecting the encoding and converting to UTF-8,
but it has not yet been proven. For example, there may be subtle
interaction with document.write.

5. The parser doesn't return the parse tree until it is complete. A
renderer may want to render a partially downloaded page if the network
is slow. It may also want to start the fetch of an <img>'s or
<script>'s src before parsing is complete. Do we want to support
incremental rendering, or does the complexity outweigh the benefit?
Should the API be that the caller pushes bytes to a parser multiple
times, instead of, or in addition to, giving a parser an io.Reader
once?

6. The Node struct type has a Namespace string field for SVG or MathML
elements. These are rare, and could also be folded into the existing
Data string field. Eliminating the Namespace field might save a little
bit of memory.

7. The exp/html/atom list of atoms (and their hashes) needs to be
finalized. Relatedly, should an element Node provide API to look up an
attribute by atom (e.g. atom.Href, atom.Id)?

8. Is Tokenizer.Raw worth keeping? Does anyone use it? Its presence
may constrain future refactoring and optimization of the tokenizer.

9. A Parser reaches into a Tokenizer to set a tokenizer's internal
state based on parser state. For example, how "<![CDATA[foo]]>" is
tokenized depends on whether or not we are in "foreign content" such
as SVG or MathML. Similarly, 'raw text' tokens are allowed for a
<title> inside regular HTML, but not for a <title> inside SVG inside
HTML. Ideally, a Tokenizer should not need to expose its state and
tokenization of an io.Reader is the same regardless of whether a
parser is driving that tokenizer, but that may turn out to be
impossible given the complexities of the HTML5 spec.

10. Should there be additional API to ease walking the Node tree? If
so, what should it look like? (See the sketch after this list for what
walking looks like today.)

11. A radical option is to remove the existing support for parsing
foreign content: SVG and MathML. It would mean losing 100% compliance
with the HTML5 specification, but it would also significantly simplify
the implementation (e.g. see issues 6 and 9 above, and the fact that
element tags are case-insensitive for HTML in general, but
case-sensitive for SVG inside HTML). Ideally, we would retain the
option to re-introduce SVG and MathML support in a future version, if
the benefits were re-assessed to outweigh the costs.
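
For concreteness on point 10, here is a minimal sketch of what walking
looks like today (an illustration, not settled API): a recursive
closure over the Node struct's fields, printing every anchor's href.
It assumes doc comes from html.Parse, and that html and atom are the
exp/html and exp/html/atom packages.

    var walk func(n *html.Node)
    walk = func(n *html.Node) {
        if n.Type == html.ElementNode && n.DataAtom == atom.A {
            for _, a := range n.Attr {
                if a.Key == "href" {
                    fmt.Println(a.Val) // found an <a href="...">
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }
    walk(doc)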


  • Andy Balholm at Feb 2, 2013 at 3:58 am
    At this point, the package is well-suited to "static" uses: web scraping,
    building trees and rendering them as HTML, transforming content, etc. The
    question we face is: is that all we want it to be used for?

    If we are content to have a package for static HTML manipulation, we will
    probably be OK if we freeze the API for Go 1.1. But if we want the package
    to keep growing into something you could build a browser around, we will
    need to move it to go.net so that it can continue to develop without being
    tied to a limited API.

    For me personally, an API freeze would be beneficial, because I use it for
    static HTML manipulation, and I would just as soon avoid having to update
    my code all of the time. But for the project as a whole, it might be better
    to keep moving, so that we don't end up needing to have two HTML libraries.

  • Jimmy frasche at Feb 2, 2013 at 7:42 am
    Admittedly I haven't had a chance to use exp/html yet but I've been meaning to.

    I don't know if it's worth it to make the library so capable that you
    could use it in a web browser's codebase. That's a noble goal and all,
    but the vast majority of people are going to be using it for scraping
    and checking and manipulating. A simple, bulletproof parser is going
    to serve the vast majority of users just as well as a
    precision-engineered jet engine. In theory, I'd love a pure Go version
    of something like phantomjs that I can import, but I'd take the html
    API as it is now, and I don't know how useful something in between
    those two is.

    My only real concern with the API as it is today may be a misreading
    of a comment on Render. The last paragraph: (I rewrote the html tags
    to [tag] to avoid confusing anybody's e-mail client)
    Programmatically constructed trees are typically also 'well-formed',
    but it is possible to construct a tree that looks innocuous but, when
    rendered and re-parsed, results in a different tree. A simple example
    is that a solitary text node would become a tree containing [html],
    [head] and [body] elements. Another example is that the programmatic
    equivalent of "a[head]b[/head]c" becomes "[html][head][head/][body]abc[/body][/html]".
    Does this imply (I am very sorry I haven't tested this) that if I
    Render a tree from ParseFragment I get a complete HTML document
    spat out? If so the API is seriously lacking a RenderFragment
    function. It's unfortunately not uncommon in CMS-y applications to
    store a document fragment produced by some awful WYSIWYG editor in the
    DB. Without the ability to render a fragment of html it can't be used
    for any automatic stripping of non-whitelisted tags or any of a myriad
    of like tasks. That would be really unfortunate and would limit its
    usefulness to me significantly. I hope I'm just misreading the docs.

    As for the points, I can't speak to all of them, since as I have said
    I haven't actually used the API yet, but

    0 - struct seems fine to me. Maybe having typed nodes instead of
    typed tags would be cleaner in some ways, but it seems like a lot of
    bother for not much gain.

    5 - I'm sure there are plenty of people who would benefit from a
    streaming API aside from browser implementers. If that means go.net
    instead of the stdlib, so be it, unless it's possible to add a new API
    to the package later without breaking compatibility. The only use I
    would have for such a thing is making semantically correct "teasers",
    and even then I'd probably just parse the whole document fragment,
    cache the results, and move on to more interesting things.

    7 - Do you mean something like the DOM getElementById? It could be
    useful, but stuff like that is really inadequate compared to a full
    query language. See my comment on point 10.

    8 - Reading that method's documentation makes me nervous. I don't see
    the use, but I do see potential danger. That looks like something that
    shouldn't be exported. If there is a good use for it that others have
    found, I'll stick to just personally avoiding it, however.

    10 - The two main contenders would be XPath and a Sizzle-esque CSS
    selector parser plus an equivalent of querySelectorAll. Both would be
    great (*cough* querySelectorAll *cough*). It would be easier to
    implement them in the html package, but it shouldn't be too difficult
    to have them be 3rd-party libs, and if they get good enough they could
    always be merged in for a future release.

    11 - I'd rather have a parser that I know won't break when given some
    obscure document fragment. If it's valid, I want to be able to process
    it safely. I do not envy you having to work with the W3C's "everything
    has to touch everything else" specs. My hat's off to you. Thanks for
    all the great work so far.

  • Andy Balholm at Feb 2, 2013 at 4:31 pm

    On Friday, February 1, 2013 11:41:38 PM UTC-8, soapboxcicero wrote:
    Does this imply (I am very sorry I haven't tested this) that if I
    Render a tree from ParseFragment I get a complete HTML document
    spat out? If so the API is seriously lacking a RenderFragment
    function. [...] I hope I'm just misreading the docs.
    As you hoped, you are misreading the docs. If you render a single
    html.Node, you get just that node. The following program prints "<img/>":

    package main

    import (
        "exp/html"
        "exp/html/atom"
        "os"
    )

    func main() {
        n := &html.Node{
            Data:     "img",
            DataAtom: atom.Img,
            Type:     html.ElementNode,
        }
        html.Render(os.Stdout, n)
    }

    The two main contenders would be XPath and a Sizzle-esque CSS
    selector parser plus an equivalent of querySelectorAll. [...]
    Do you mean something like code.google.com/p/cascadia?
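
    For reference, a small sketch of that style (assuming cascadia's
    MustCompile/MatchAll API, with doc being a tree from html.Parse):

        // Find every anchor that has an href attribute.
        sel := cascadia.MustCompile(`a[href]`)
        for _, n := range sel.MatchAll(doc) {
            fmt.Println(n.Data) // n is an *html.Node
        }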

  • Yunge at Feb 2, 2013 at 6:12 am
    Like Andy, I'm using exp/html just for web scraping for now.

    But Go could become a foundation for a browser or an OS, or for the
    "world" (maybe it is too early to say).

  • Yunge at Feb 2, 2013 at 6:49 am
    I suggest the Go team talk to the Chromium/Chrome OS team; just a suggestion.

  • Rodrigo Moraes at Feb 2, 2013 at 1:43 pm
    I see that most of your concerns are related to the DOM/tree aspects.
    I've been using the package for a couple of weeks to extract content
    from pages, and I've mostly used only the tokenizer. I found the API
    quite good and the implementation solid. I've parsed some really,
    really ugly and messed-up HTML. I've only used Parse()/Render() to
    preprocess and "normalize" the HTML (that is, remove unclosed tags)
    before parsing.

    So here are two observations.

    - Sorry if this sounds silly. Is it a big performance gain to have
    Tokenizer.Next() return a TokenType instead of simply a Token? I
    believe that's why it is the way it is, and if so it may be
    reasonable. I just thought that returning a Token would be simpler and
    more convenient. The only complication in the API is the "Next {Raw}
    [ Token | Text | TagName {TagAttr} ]" part, and even that is easy to
    get.

    - Maybe the Tokenizer should belong to a sub-package? It really seems
    that there are two separate things here: the pull parser (aka the
    Tokenizer) and the DOM-ish API. The former is the base for the latter
    (or for new implementations of the latter!), so maybe it should belong
    to its own fundamental package.

    That's it for now. Great work, Nigel.

    -- rodrigo

  • Nigel Tao at Feb 3, 2013 at 1:12 am

    On Sun, Feb 3, 2013 at 12:43 AM, Rodrigo Moraes wrote:
    - Sorry if this sounds silly. Is it a big performance gain to have
    Tokenizer.Next() return a TokenType instead of simply a Token?
    The performance impact is significant. Token.Data is a string, so
    having Tokenizer.Next return a Token would require []byte-to-string
    conversions on every step. This creates a lot more garbage, which is
    unnecessary if, for example, all you want to do is scrape the <a> tags
    and ignore everything else.

    $ go test -test.bench='Low|High' exp/html
    PASS
    BenchmarkLowLevelTokenizer     2000    964264 ns/op   81.06 MB/s     5066 B/op     25 allocs/op
    BenchmarkHighLevelTokenizer    1000   1563499 ns/op   49.99 MB/s   103414 B/op   3221 allocs/op
    ok      exp/html        3.849s
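
    To make that concrete, here is a sketch (typical usage, not canonical
    code; r is any io.Reader, and the bytes and fmt imports are assumed)
    that prints href attributes without constructing Tokens. TagName and
    TagAttr return []byte slices into the tokenizer's buffer, so nothing
    is converted to a string:

        z := html.NewTokenizer(r)
        hrefKey := []byte("href")
        for {
            tt := z.Next()
            if tt == html.ErrorToken {
                break // io.EOF at end of input; otherwise see z.Err()
            }
            if tt != html.StartTagToken {
                continue
            }
            name, hasAttr := z.TagName() // []byte, no garbage
            if atom.Lookup(name) != atom.A || !hasAttr {
                continue
            }
            for {
                key, val, more := z.TagAttr()
                if bytes.Equal(key, hrefKey) {
                    fmt.Printf("%s\n", val)
                }
                if !more {
                    break
                }
            }
        }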

    - Maybe the Tokenizer should belong to a sub-package?
    Maybe, but point 9 in my OP describes some of the layering violations
    that the parser commits to influence tokenizer state. Also, supporting
    document.write from script will surely affect the tokenizer. I'm not
    sure we can consider its API finalized.

    That's it for now. Great work, Nigel.
    Kudos should also go to Andy Balholm, who has written a significant
    chunk of exp/html.

  • John Nagle at Feb 3, 2013 at 2:48 am

    On 2/2/2013 5:12 PM, Nigel Tao wrote:
    On Sun, Feb 3, 2013 at 12:43 AM, Rodrigo Moraes wrote:
    - Maybe the Tokenizer should belong to a sub-package?
    Maybe, but point 9 in my OP describes some of the layering violations
    that the parser commits to influence tokenizer state.
    Painfully true. You can't reliably tokenize HTML without
    information from the parsing level.

    Also painfully true: the awful cases in HTML parsing aren't rare.

    The good news is that the HTML5 spec, painful though it
    is, covers all this stuff. There's no longer a need to have separate
    "Netscape/Mozilla" and "Internet Explorer" parsing modes.

    John Nagle

  • Patrick Mylund Nielsen at Feb 3, 2013 at 2:50 am
    We'll see.

  • Rodrigo Moraes at Feb 3, 2013 at 2:23 pm

    On Feb 2, 11:12 pm, Nigel Tao wrote:
    The performance impact is significant. Token.Data is a string, so
    having Tokenizer.Next return a Token would require []byte-to-string
    conversions on every step. This creates a lot more garbage, which is
    unnecessary if, for example, all you want to do is scrape the <a> tags
    and ignore everything else.
    Token could provide access to strings using methods only. Still, it
    would allocate struct data unnecessarily, so this is just a thought.
    The Next() API is not a big deal.
    Maybe, but point 9 in my OP describes some of the layering violations
    that the parser commits to influence tokenizer state.
    Fair enough. I guess making a couple of fields public or adding some
    hooks would help there, but then you compromise the API in the long
    run.

    -- rodrigo

  • John Nagle at Feb 2, 2013 at 5:38 pm

    On 2/1/2013 7:03 PM, Nigel Tao wrote:
    4. The parser assumes that the input is UTF-8. It is possible that
    this is perfectly reasonable and the io.Reader given to it can be
    responsible for auto-detecting the encoding and converting to UTF-8,
    but it has not yet been proven. For example, there may be subtle
    interaction with document.write.
    The HTML5 spec allows multiple character encodings, and there
    is a defined procedure for this:

    http://www.w3.org/html/wg/drafts/html/master/syntax.html#determining-the-character-encoding

    Yes, it's a huge pain, and sometimes the parser has to start
    over from the beginning of the document after encountering a
    "charset" element. Every browser does it. If you want to
    process real-world HTML, you have to do it. (I have a web
    crawler running in Python, and I have to deal with this.)

    A parser which does not do this is not "standards-compliant".

    John Nagle


  • Andy Balholm at Feb 2, 2013 at 9:19 pm

    On Saturday, February 2, 2013 9:37:03 AM UTC-8, John Nagle wrote:

    A parser which does not do this is not "standards-compliant".
    The html package itself does not implement the spec for character-set
    detection, because there is no support in the standard library for
    non-UTF-8 charsets. It would definitely be more convenient to have the
    charset detection built in, but this is how I do it:

    // Needed imports (as used below): bytes, mime, strings, unicode/utf8,
    // exp/html, and code.google.com/p/cascadia.

    var metaCharsetSelector = cascadia.MustCompile(
        `meta[charset], meta[http-equiv="Content-Type"]`)

    // findCharset returns the character encoding to be used to interpret the
    // page's content.
    func findCharset(declaredContentType string, content []byte) (charset string) {
        defer func() {
            if ce := compatibilityEncodings[charset]; ce != "" {
                charset = ce
            }
        }()

        cs := charsetFromContentType(declaredContentType)
        if cs != "" {
            return cs
        }

        if len(content) > 1024 {
            content = content[:1024]
        }

        // Check for a byte-order mark.
        if len(content) >= 2 {
            if content[0] == 0xfe && content[1] == 0xff {
                return "utf-16be"
            }
            if content[0] == 0xff && content[1] == 0xfe {
                return "utf-16le"
            }
        }
        if len(content) >= 3 && content[0] == 0xef && content[1] == 0xbb && content[2] == 0xbf {
            return "utf-8"
        }

        if strings.Contains(declaredContentType, "html") || declaredContentType == "" {
            // Look for a <meta> tag giving the encoding.
            tree, err := html.Parse(bytes.NewBuffer(content))
            if err == nil {
                for _, n := range metaCharsetSelector.MatchAll(tree) {
                    a := make(map[string]string)
                    for _, attr := range n.Attr {
                        a[attr.Key] = attr.Val
                    }
                    if charsetAttr := a["charset"]; charsetAttr != "" {
                        return strings.ToLower(charsetAttr)
                    }
                    if strings.EqualFold(a["http-equiv"], "Content-Type") {
                        cs = charsetFromContentType(a["content"])
                        if cs != "" {
                            return cs
                        }
                    }
                }
            }
        }

        // Try to detect UTF-8. First eliminate any partial rune that may
        // be split by the 1024-byte boundary.
        for i := len(content) - 1; i >= 0 && i > len(content)-4; i-- {
            b := content[i]
            if b < 128 {
                break
            }
            if utf8.RuneStart(b) {
                content = content[:i]
                break
            }
        }
        if utf8.Valid(content) {
            return "utf-8"
        }

        return "windows-1252"
    }

    func charsetFromContentType(t string) string {
        t = strings.ToLower(t)
        _, params, _ := mime.ParseMediaType(t)
        return params["charset"]
    }

    // compatibilityEncodings contains character sets that should be
    // misinterpreted for compatibility. The encodings that are commented
    // out are not yet implemented by the Mahonia library.
    var compatibilityEncodings = map[string]string{
        // "euc-kr":         "windows-949",
        // "euc-jp":         "cp51932",
        "gb2312":     "gbk",
        "gb_2312-80": "gbk",
        // "iso-2022-jp":    "cp50220",
        "iso-8859-1":  "windows-1252",
        "iso-8859-9":  "windows-1254",
        "iso-8859-11": "windows-874",
        // "ks_c_5601-1987": "windows-949",
        // "shift_jis":      "windows-31j",
        "tis-620":  "windows-874",
        "us-ascii": "windows-1252",
    }
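
    For context, a hedged usage sketch (the URL is a placeholder, and the
    net/http, io/ioutil, log, and fmt imports are assumed):

        resp, err := http.Get("http://example.com/")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        body, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            log.Fatal(err)
        }
        // Decide how to decode the body before handing it to the parser.
        fmt.Println(findCharset(resp.Header.Get("Content-Type"), body))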

  • Nigel Tao at Feb 3, 2013 at 1:27 am

    On Sun, Feb 3, 2013 at 9:11 AM, John Nagle wrote:
    Until you try parsing large amounts of real-world HTML, it's
    hard to appreciate just how awful some of what's out there is.
    I am not saying that we can ignore non-UTF-8 encodings. I think we all
    agree that a large amount of real-world HTML is like that.

    What I am saying is that it may be feasible for this to be done by an
    io.Reader implementation instead of by html.Parser or html.Tokenizer
    per se, and still be spec-compliant. For example, bufio.Reader wraps
    another io.Reader so that not every package needs to do its own
    buffering. Andy Balholm's code snippet is a step towards proof by
    example that a similar approach to encoding conversion is feasible. If
    you know that the input HTML is UTF-8, then you don't need to pay any
    cost; otherwise you can wrap the input in an autodetect.Reader (or
    whatever the hypothetical package would be). As I said in the OP, how
    this plays with document.write remains an open question.
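
    To sketch the shape of that hypothetical package (findCharset is
    Andy's function above; newDecodingReader is a made-up name for a
    charset-to-UTF-8 decoding reader, not a real API):

        // autodetectReader peeks at the first 1KB, picks an encoding, and
        // returns an io.Reader that yields UTF-8.
        func autodetectReader(r io.Reader, contentType string) (io.Reader, error) {
            buf := make([]byte, 1024)
            n, err := io.ReadFull(r, buf)
            if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
                return nil, err
            }
            buf = buf[:n]
            cs := findCharset(contentType, buf)
            // Stitch the peeked bytes back onto the rest of the stream.
            full := io.MultiReader(bytes.NewReader(buf), r)
            if cs == "utf-8" {
                return full, nil // no conversion cost for UTF-8 input
            }
            return newDecodingReader(full, cs), nil // hypothetical decoder
        }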

  • Kees Varekamp at Feb 3, 2013 at 12:08 am
    +1 vote for "Just using it to read static html - loving it the way it is."

    Except perhaps also +1 vote for some sort of built-in XPath-like query
    mechanism.

    Kees

  • Dave Cheney at Feb 3, 2013 at 12:15 am
    If it hasn't already been suggested, I think some time in the go.net
    subrepo would help gain confidence that the API is correct and
    complete.
  • Patrick Mylund Nielsen at Feb 3, 2013 at 12:19 am
    Yeah, I agree with this. I use exp/html in several (useful) production
    applications, and I love how clean it is, but I can't say for sure that
    everything is perfect. HTML5 itself is becoming so complicated that the
    browsers don't even agree on how to implement it. I also don't feel like
    there is a lot of pressure to get it into the standard library ASAP.

