Michael Spalinski wrote:
: I would like to write a Python script that would read an HTML document and
: extract table contents from it. Eg. each table could be a list of tuples
: with data from the rows. I thought htmllib would provide the basic tools
: for this, but I can't find any example that would be of use.
: So - does anyone have a Python snippet that looks for tables and gets at
: the data?
It shouldn't be too hard to make a subclass of the htmllib.HTMLParser
class that scans for TABLE, TR and TD (and maybe TH) tags.
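
from htmllib import HTMLParser
from formatter import NullFormatter

class TableParser(HTMLParser):
    def __init__(self, formatter=None):
        HTMLParser.__init__(self, formatter or NullFormatter())
        self.tablelist = []
        self.current_table = None
        self.table_stack = None # for nested tables
    def start_table(self, attributes):
        if self.current_table is not None:
            self.table_stack = self.current_table, self.table_stack
        self.current_table = []
        self.tablelist.append(self.current_table)
    def end_table(self):
        self.current_table, self.table_stack = self.table_stack or (None, None)
    def start_tr(self, attributes):
        self.current_table.append([])       # guess: start a new row
    def start_td(self, attributes):
        self.current_table[-1].append('')   # guess: start a new cell
    def handle_data(self, data):
        if self.current_table and self.current_table[-1]:
            self.current_table[-1][-1] = self.current_table[-1][-1] + data
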
The result is in self.tablelist (a list of tables, since you can have
more than one table in a document).
I haven't really tested this, so it might need a little more work,
but I think you get the idea. You need to read the module docs for
sgmllib and htmllib (http://www.python.org/doc/current/lib/).
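
For anyone reading this on Python 3, where htmllib and sgmllib no longer
exist, here is a rough sketch of the same idea built on html.parser.HTMLParser
instead. It is a swapped-in technique, not the htmllib code above. The names
TableParser, open_tables, cell_flags and extract_tables are made up for this
example, and it assumes reasonably well-formed markup with explicit TD/TH tags.
Nested tables end up as separate entries in the result, and their text is not
copied into the enclosing cell.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tablelist = []    # every table seen, in the order it was opened
        self.open_tables = []  # tables not yet closed, innermost last
        self.cell_flags = []   # saved in_cell state of each enclosing table
        self.in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            table = []
            self.tablelist.append(table)
            self.open_tables.append(table)
            self.cell_flags.append(self.in_cell)  # remember the outer state
            self.in_cell = False
        elif not self.open_tables:
            return  # ignore row/cell tags outside any table
        elif tag == "tr":
            self.open_tables[-1].append([])
            self.in_cell = False
        elif tag in ("td", "th"):
            if not self.open_tables[-1]:
                self.open_tables[-1].append([])  # cell without <tr>: new row
            self.open_tables[-1][-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.open_tables.pop()
            self.in_cell = self.cell_flags.pop()  # back to the outer state
        elif tag in ("td", "th", "tr"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            # text may arrive in several chunks, so append to the open cell
            self.open_tables[-1][-1][-1] += data


def extract_tables(html):
    """Return a list of tables, each a list of tuples of stripped cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return [[tuple(cell.strip() for cell in row) for row in table]
            for table in parser.tablelist]


sample = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
</table>
<table><tr><td>solo</td></tr></table>
"""
tables = extract_tables(sample)
```

With the sample above, tables should hold two entries, the first with two
row tuples and the second with one, which is the "list of tuples" shape the
question asked for.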