Dean Mao, Oct 26, 2012 at 6:52 pm
There may not be anything wrong with existing parsers. It took me several
months before I discovered certain problematic html pages
with tautologistics's parser. I just figured an html parser from an actual
web browser would likely be more compliant than the parsers written as
library implementations. libhubbub is used in actual web browsers, so it's
tested on some pretty ugly html pages already. Since libhubbub is written
in C, I can make use of libuv's thread pool to schedule workers to do the
parsing asynchronously, which should improve performance on
multi-core machines.
However, I didn't want people to have to learn a new API, so it was a matter
of wrapping libhubbub in a library that is API-compatible with
node-htmlparser. If you use it like tautologistics's parser, it will operate
as a blocking call, since the original API does not support non-blocking
semantics. I have yet to add documentation on how to use it in non-blocking
mode.
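
To make the blocking vs. non-blocking distinction concrete, here is a minimal sketch of the two call shapes. It is not this library's API: toyParse, parseBlocking and parseNonBlocking are hypothetical names, and the "parser" is just a regex that collects opening tag names. setImmediate only defers the work on the main thread; a native addon would instead hand the parse to libuv's thread pool and call back when it finishes.

```javascript
// Hypothetical stand-in for a real HTML parser: collects opening tag names.
function toyParse(html) {
  const names = [];
  const re = /<([a-zA-Z][a-zA-Z0-9]*)/g;
  let match;
  while ((match = re.exec(html)) !== null) {
    names.push(match[1].toLowerCase());
  }
  return names;
}

// Blocking shape: the result is ready by the time the call returns.
function parseBlocking(html) {
  return toyParse(html);
}

// Non-blocking shape: returns immediately and delivers the result later
// through a Node-style (err, result) callback.
// Assumption: setImmediate is used here only to defer the call; it does
// not move the work off the main thread the way a thread pool would.
function parseNonBlocking(html, callback) {
  setImmediate(() => {
    let result;
    try {
      result = toyParse(html);
    } catch (err) {
      callback(err);
      return;
    }
    callback(null, result);
  });
}

// Usage
console.log('blocking:', parseBlocking('<div><p>hi</p></div>'));
parseNonBlocking('<div><p>hi</p></div>', (err, names) => {
  if (err) throw err;
  console.log('non-blocking:', names);
});
```

With the blocking shape the caller waits for the whole parse. With the callback shape the caller can keep serving other requests while the parse is pending, which is where a thread pool would pay off.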
On Fri, Oct 26, 2012 at 8:58 AM, Jérémy Lal wrote:
What's wrong with the htmlparser2 module used by cheerio?
On 26/10/2012 17:57, Domenic Denicola wrote:
Very nice. As maintainer of jsdom, I've been looking for a replacement
default HTML parser that could solve many of the parsing issues we've
encountered. I'll put you on the shortlist. Thanks for announcing.
On Friday, October 26, 2012 6:07:48 AM UTC-4, Dean Mao wrote:
Hi All,
I created a native html parser based on libhubbub, a parser library
used by the netsurf browser project. There were quite a few html pages
that didn't parse correctly on tautologistics's html parser so I thought it
might be easier pulling in a parser from an existing web browser. I
considered using webkit & firefox, but those browsers had too many external
dependencies. The parser can operate in blocking or non-blocking mode, and
it accepts streamed (chunked) data. The wonderful jsdom library uses
tautologistics/node-htmlparser by default, but one can choose this parser
as the overriding default. The readme shows an example of how this is done.
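
As a rough illustration of what accepting chunked data involves, here is a small sketch of a scanner that is fed input piece by piece and copes with a tag split across two chunks. ChunkedTagScanner and its write/end methods are hypothetical names made up for this example, not the parser's actual interface, and the scanning is deliberately naive (it does not handle a '>' inside an attribute value, comments, or scripts).

```javascript
// Hypothetical chunk-fed scanner: collects the raw contents of each <...> tag.
class ChunkedTagScanner {
  constructor() {
    this.buffer = ''; // unprocessed input carried over between chunks
    this.tags = [];   // tag contents seen so far, e.g. 'div', '/div'
  }

  // Feed the next chunk; complete tags are consumed, a partial tag is kept.
  write(chunk) {
    this.buffer += chunk;
    let start;
    while ((start = this.buffer.indexOf('<')) !== -1) {
      const end = this.buffer.indexOf('>', start);
      if (end === -1) {
        // Tag is not closed yet: keep it and wait for the next chunk.
        this.buffer = this.buffer.slice(start);
        return;
      }
      this.tags.push(this.buffer.slice(start + 1, end));
      this.buffer = this.buffer.slice(end + 1);
    }
    // Only plain text remains, so nothing needs to be carried over.
    this.buffer = '';
  }

  // Signal end of input and return everything collected.
  end() {
    return this.tags;
  }
}

// Usage: the same document delivered in three arbitrary pieces.
const scanner = new ChunkedTagScanner();
scanner.write('<di');
scanner.write('v><p');
scanner.write('>hi</p></div>');
console.log(scanner.end());
```

The point is only that the scanner keeps state between calls, so the result does not depend on where the chunk boundaries fall.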
--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to nodejs@googlegroups.com
To unsubscribe from this group, send email to nodejs+unsubscribe@googlegroups.com
For more options, visit this group at http://groups.google.com/group/nodejs?hl=en