seberino at spawar.navy.mil wrote:
> I'm trying to extract some data from an XHTML Transitional web page.
>
> What is best way to do this?

May I suggest html5lib [1]? It is based on the parsing section of the
WHATWG "HTML5" spec [2], which is in turn based on the behavior of major
web browsers, so it should parse more or less* any invalid markup you
throw at it. Despite the name, html5lib works with any (X)HTML
document. Out of the box it can produce a minidom tree, an ElementTree,
or a "simpletree", a lightweight DOM-like tree specific to html5lib.
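
Here is a rough sketch of what the extraction could look like with the
ElementTree output. It assumes a recent html5lib release, where
html5lib.parse() accepts a treebuilder argument; the 0.2 release
discussed here may expose this differently. The helper names
(parse_markup, local_name, extract_links) and the sample document are
made up for illustration, and the standard-library fallback only copes
with well-formed XHTML.

```python
import xml.etree.ElementTree as ET

try:
    import html5lib  # third-party; assumed to be installed
except ImportError:
    html5lib = None


def parse_markup(markup):
    """Return an ElementTree root element for an (X)HTML string."""
    if html5lib is not None:
        # treebuilder="etree" asks html5lib for an ElementTree-style tree.
        return html5lib.parse(markup, treebuilder="etree")
    # Fallback: the stdlib parser, which raises on markup that is not
    # well-formed XML.
    return ET.fromstring(markup)


def local_name(element):
    """Drop the '{namespace}' prefix that XHTML element tags carry."""
    return element.tag.rsplit("}", 1)[-1]


def extract_links(markup):
    """Collect (href, link text) pairs for every <a href=...> element."""
    root = parse_markup(markup)
    links = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        if local_name(element) == "a" and element.get("href"):
            text = "".join(element.itertext()).strip()
            links.append((element.get("href"), text))
    return links


# Hypothetical sample document, not taken from the original question.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Demo</title></head>
<body><p><a href="http://example.org/">Example</a></p></body>
</html>"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE):
        print(href, text)
```

Once you have an ElementTree, the extraction itself is ordinary
ElementTree code, so you can swap the link-collecting loop for whatever
data you are actually after.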

If you are happy to pull from SVN, I recommend that version; it has a
few bug fixes over the 0.2 release, as well as improved features,
including better error reporting and detection of encoding from <meta>
elements (the next release is imminent).

[1] http://code.google.com/p/html5lib/
[2] http://whatwg.org/specs/web-apps/current-work/#parsing

* There might be a problem if, e.g., the document uses a character
encoding that Python does not support; otherwise it should parse anything.

group: python-list
posted: Mar 2, '07 at 11:32p
active: Mar 3, '07 at 1:46a


