Hi All,

I created a native HTML parser based on libhubbub, the parser library used by
the NetSurf browser project. Quite a few HTML pages didn't parse correctly
with tautologistics's HTML parser, so I thought it might be easier to pull in
a parser from an existing web browser. I considered using WebKit and Firefox,
but those browsers had too many external dependencies. The parser can operate
in blocking or non-blocking mode, and it accepts streamed (chunked) data. The
wonderful jsdom library uses tautologistics/node-htmlparser by default, but
one can choose this parser as the overriding default. The readme shows an
example of how this is done.

Github:
https://github.com/deanmao/node-hubbub

To install:
npm install hubbub
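
The streamed (chunked) mode mentioned above can be pictured with a toy scanner. This is only a minimal sketch in plain JavaScript, not hubbub's API or implementation: the class `ChunkedTagScanner` and the helpers `scanWhole` and `scanChunks` are hypothetical names made up for illustration. It splits input into tags and text and nothing more (no attributes, comments, or error recovery), which is enough to show that a streaming parser has to carry unfinished input over from one chunk to the next.

```javascript
// Toy illustration only: NOT hubbub's API or implementation.
// All names here are hypothetical.
class ChunkedTagScanner {
  constructor() {
    this.pending = ''; // unfinished input carried over from the previous chunk
    this.tokens = [];  // tokens recognised so far
  }

  // Feed one chunk and record every token that is complete so far.
  write(chunk) {
    const input = this.pending + chunk;
    let pos = 0;
    while (pos < input.length) {
      if (input[pos] === '<') {
        const close = input.indexOf('>', pos);
        if (close === -1) break; // tag is split across chunks, wait for more
        this.tokens.push({ type: 'tag', value: input.slice(pos + 1, close) });
        pos = close + 1;
      } else {
        const open = input.indexOf('<', pos);
        if (open === -1) break; // text may continue in the next chunk
        this.tokens.push({ type: 'text', value: input.slice(pos, open) });
        pos = open;
      }
    }
    this.pending = input.slice(pos);
  }

  // Signal end of input and flush whatever is left as text.
  end() {
    if (this.pending.length > 0) {
      this.tokens.push({ type: 'text', value: this.pending });
      this.pending = '';
    }
    return this.tokens;
  }
}

// Whole-document style: everything in one call.
function scanWhole(html) {
  const scanner = new ChunkedTagScanner();
  scanner.write(html);
  return scanner.end();
}

// Streamed style: the same document in arbitrary pieces.
function scanChunks(chunks) {
  const scanner = new ChunkedTagScanner();
  for (const chunk of chunks) scanner.write(chunk);
  return scanner.end();
}
```

Feeding the pieces `'<p>Hel'`, `'lo <'`, `'b>wor</b'`, `'>ld</p>'` through `scanChunks` gives the same seven tokens as passing `'<p>Hello <b>wor</b>ld</p>'` to `scanWhole`, even though two of the chunk boundaries fall inside a tag.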

--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to nodejs@googlegroups.com
To unsubscribe from this group, send email to
nodejs+unsubscribe@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en?hl=en


  • Matt at Oct 26, 2012 at 3:06 pm
    Nice. BTW in the docs you say: "npm install jsdom" where I think you mean
    node-hubbub.

    Matt.
    On Fri, Oct 26, 2012 at 6:07 AM, Dean Mao wrote:

  • Dean Mao at Oct 26, 2012 at 9:46 pm
    ah thanks, must have been a brain fart.

  • Domenic Denicola at Oct 26, 2012 at 3:57 pm
    Very nice. As maintainer of jsdom, I've been looking for a replacement
    default HTML parser that could solve many of the parsing issues we've
    encountered. I'll put you on the shortlist. Thanks for announcing.
  • Jérémy Lal at Oct 26, 2012 at 3:58 pm
    What's wrong with the htmlparser2 module used by cheerio?
  • Dean Mao at Oct 26, 2012 at 6:52 pm
    There may not be anything wrong with the existing parsers. It took me
    several months to discover certain problematic HTML pages with
    tautologistics's parser. I just figured an HTML parser from an actual web
    browser would likely be more compliant than the parsers written as
    library implementations. libhubbub is used in actual web browsers, so it
    has already been tested on some pretty ugly HTML pages. Since libhubbub is
    written in C, I can use libuv's thread pool to schedule workers that do
    the parsing asynchronously, which will probably improve performance on
    multi-core machines.

    However, I didn't want people to have to learn a new API, so it was a
    matter of wrapping libhubbub in a library that is API-compatible with
    node-htmlparser. If you use it like tautologistics's parser, it will
    operate as a blocking call, since the original API does not support
    non-blocking semantics. I have yet to add documentation on how to use it
    in non-blocking mode.

  • Simon at Oct 29, 2012 at 3:48 am
    Domenic,

    I'd be curious to know which parsers you are considering, and whether you
    have some tests or HTML examples that trip up the existing parser.

  • Domenic Denicola at Oct 29, 2012 at 5:52 am
    Simon, as to your last question, here's a start:

    https://github.com/tmpvar/jsdom/issues?labels=parsing&page=1&state=open
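
Dean Mao's second Oct 26 reply contrasts blocking use through the node-htmlparser-style API with a non-blocking mode that hands parsing to libuv's thread pool. The following is a rough sketch of that contrast under loud assumptions, not hubbub's code: `countTagsBlocking` and `countTagsNonBlocking` are hypothetical names, a simple tag count stands in for real parsing, and Node's `worker_threads` module (which did not exist when this thread was written) stands in for native work scheduled on libuv's thread pool.

```javascript
const { Worker } = require('worker_threads');

// Hypothetical stand-in for a blocking parse: counts anything shaped like a tag.
// The caller waits until the whole input has been processed.
function countTagsBlocking(html) {
  return (html.match(/<[^>]+>/g) || []).length;
}

// Non-blocking variant: the same work runs on another thread and the result
// arrives later through a Promise, so the event loop stays free meanwhile.
// worker_threads is an assumption used for illustration; the thread describes
// native code on libuv's thread pool instead.
function countTagsNonBlocking(html) {
  return new Promise((resolve, reject) => {
    // The worker script reuses the source text of countTagsBlocking.
    const source = `
      const { parentPort, workerData } = require('worker_threads');
      ${countTagsBlocking.toString()}
      parentPort.postMessage(countTagsBlocking(workerData));
    `;
    const worker = new Worker(source, { eval: true, workerData: html });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}
```

Both functions return the same count for the same input. The difference is only in when the caller gets it: `countTagsBlocking(html)` returns the number directly, while `countTagsNonBlocking(html).then(...)` delivers it once the worker has finished.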

Discussion Overview
group: nodejs @
categories: nodejs
posted: Oct 26, '12 at 2:03p
active: Oct 29, '12 at 5:52a
posts: 8
users: 5
website: nodejs.org
irc: #node.js
