I would like to write a Python script that would read an HTML document and
extract table contents from it. Eg. each table could be a list of tuples
with data from the rows. I thought htmllib would provide the basic tools
for this, but I can't find any example that would be of use.

So - does anyone have a Python snippet that looks for tables and gets at
the data?

M.


  • Phil Hunt at May 23, 1999 at 7:58 pm
    In article <87u2t4apxm.fsf at schwinger.harvard.edu>
    mspal at sangria.harvard.edu "Michael Spalinski" writes:
    > I would like to write a Python script that would read an HTML document and
    > extract table contents from it. Eg. each table could be a list of tuples
    > with data from the rows. I thought htmllib would provide the basic tools
    > for this, but I can't find any example that would be of use.
    >
    > So - does anyone have a Python snippet that looks for tables and gets at
    > the data?

    I'm not aware of anything that does this, but it shouldn't be
    particularly hard to write one. Get it to look for <table> tags,
    and then within these, look for <tr>, <th> and <td>, & use
    the contents of these to build up a List containing a List containing
    table items.

    --
    Phil Hunt....philh at vision25.demon.co.uk
  • Michael P. Reilly at May 24, 1999 at 4:09 pm
    Michael Spalinski wrote:

    : I would like to write a Python script that would read an HTML document and
    : extract table contents from it. Eg. each table could be a list of tuples
    : with data from the rows. I thought htmllib would provide the basic tools
    : for this, but I can't find any example that would be of use.

    : So - does anyone have a Python snippet that looks for tables and gets at
    : the data?

    It shouldn't be too hard to make a subclass of the htmllib.HTMLParser
    class that scans for TABLE, TR and TD (and maybe TH) tags.

    from formatter import NullFormatter
    from htmllib import HTMLParser

    class TableExtractor(HTMLParser):
        def __init__(self, formatter=None):
            # htmllib calls methods on its formatter, so default to a no-op one
            HTMLParser.__init__(self, formatter or NullFormatter())
            self.tablelist = []
            self.current_table = None
            self.table_stack = None # for nested tables
        def start_table(self, attributes):
            if self.current_table is not None:
                self.table_stack = self.current_table, self.table_stack
            self.current_table = []
        def end_table(self):
            self.tablelist.append(self.current_table)
            if self.table_stack:
                self.current_table, self.table_stack = self.table_stack
            else:
                self.current_table = None # back outside any table
        def start_tr(self, attributes):
            self.current_table.append([])
        def end_tr(self):
            pass
        def start_td(self, attributes):
            self.current_table[-1].append([])
        def end_td(self):
            pass
        def handle_data(self, data):
            # only collect text that falls inside a cell
            if self.current_table and self.current_table[-1]:
                self.current_table[-1][-1].append(data)

    The result is in self.tablelist (as a list of tables, since you can have
    more than one table in a document).

    I haven't really tested this, so it might need a little more work,
    but I think you get the idea. You need to read the module docs for
    sgmllib and htmllib (http://www.python.org/doc/current/lib/).

    -Arcege
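
For readers on Python 3, where htmllib is no longer available, the same idea can be sketched with the standard html.parser module. This is a minimal sketch, not the code from the post above: the attribute names (tables, _open, _cell) and the sample markup are made up for illustration, and th cells are collected alongside td cells.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each table as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, in the order their </table> is seen
        self._open = []    # stack of tables still being parsed (handles nesting)
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Remember the enclosing cell (if any) so it can resume afterwards.
            self._open.append({"rows": [], "outer_cell": self._cell})
            self._cell = None
        elif self._open and tag == "tr":
            self._close_cell()  # tolerate a missing </td>
            self._open[-1]["rows"].append([])
        elif self._open and tag in ("td", "th"):
            self._close_cell()
            if not self._open[-1]["rows"]:
                self._open[-1]["rows"].append([])  # tolerate a missing <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if not self._open:
            return
        if tag in ("td", "th", "tr"):
            self._close_cell()
        elif tag == "table":
            self._close_cell()
            table = self._open.pop()
            self.tables.append(table["rows"])
            self._cell = table["outer_cell"]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        # Store the text of the open cell in the last row of the innermost table.
        if self._cell is not None:
            self._open[-1]["rows"][-1].append("".join(self._cell).strip())
            self._cell = None


sample = ("<table><tr><th>Name</th><th>Qty</th></tr>"
          "<tr><td>apple</td><td>3</td></tr></table>")
parser = TableExtractor()
parser.feed(sample)
parser.close()
# One list of tuples per table, as asked for in the original question.
tables = [[tuple(row) for row in table] for table in parser.tables]
```

As in the htmllib version, a nested table is appended when its closing tag is seen, so inner tables appear in the result before the table that contains them.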
  • Tom Bryan at May 24, 1999 at 4:44 pm

    "Michael P. Reilly" wrote:
    > Michael Spalinski wrote:
    >
    > : I would like to write a Python script that would read an HTML document and
    > : extract table contents from it. Eg. each table could be a list of tuples
    > : with data from the rows. I thought htmllib would provide the basic tools
    > : for this, but I can't find any example that would be of use.
    >
    > : So - does anyone have a Python snippet that looks for tables and gets at
    > : the data?
    >
    > It shouldn't be too hard to make a subclass of the htmllib.HTMLParser
    > class that scans for TABLE, TR and TD (and maybe TH) tags.

    Depending on what he wants to do, this may or may not be a good idea.
    I've found HTMLParser to be a rather slow solution for parsing many
    files for a small subset of tags. It does a lot of extra work to
    process all of the tags. (If I remember correctly, it calls a method
    at every tag, even if that method doesn't really do anything.)

    If he just wants to process an HTML file now and then, it probably
    doesn't matter. If he's extracting all of the tables from hundreds
    of HTML documents on a site, he probably will notice the speed problem.
    In the second case, I'd probably just write something that looks
    for the first TABLE tag in the file and grabs everything up to the
    first /TABLE tag. The re module would do this nicely. Splitting the
    rows and columns out of the table might be a pain, but I *imagine* that
    it would still be faster than an HTMLParser solution. Of course,
    imaginations are tricky things. :)

    I've thought about subclassing HTMLParser so that it could be used
    to process just a few tags (like table-related tags) quickly. Has
    anyone else done such a thing?

    --
    tbryan at zarlut.utexas.edu
    Remove the z from this address to reply.
    Stop spam! http://spam.abuse.net/spam/
  • Tom Bryan at May 24, 1999 at 5:52 pm
    Tom Bryan wrote:
    >
    > In the second case, I'd probably just write something that looks
    > for the first TABLE tag in the file and grabs everything up to the
    > first /TABLE tag.

    which only works if there are no nested tables, of course

    --
    tbryan at zarlut.utexas.edu
    Remove the z from this address to reply.
    Stop spam! http://spam.abuse.net/spam/
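
The "first TABLE tag up to the first /TABLE tag" idea, and the nested-table caveat from the follow-up, can be illustrated with a short sketch using the re module. The names TABLE_RE and first_table and the two sample strings are hypothetical, chosen only for this example.

```python
import re

# Non-greedy match from the first <table ...> to the first </table>.
TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table>", re.IGNORECASE | re.DOTALL)


def first_table(html):
    """Return the markup inside the first table, or None if there is none."""
    match = TABLE_RE.search(html)
    return match.group(1) if match else None


flat = "<p>x</p><TABLE border=1><TR><TD>a</TD></TR></TABLE>"
nested = ("<table><tr><td><table><tr><td>in</td></tr></table></td>"
          "<td>out</td></tr></table>")

flat_body = first_table(flat)      # the whole body of the only table
nested_body = first_table(nested)  # stops at the inner </table>, so it is cut short
```

With the nested input, the match ends at the inner closing tag, so the outer table's remaining cells (here the one containing "out") are lost. That is the limitation noted in the follow-up.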
  • Jeffrey Kunce at May 24, 1999 at 5:22 pm

    > I would like to write a Python script that would read an HTML
    > document and extract table contents from it. ...

    Take a look at htmlTableParse.py at http://starship.python.net/~jjkunce/

    It may work for you, or at least give you some ideas.

    --Jeff
  • Michael Spalinski at May 24, 1999 at 1:42 pm

    "Jeffrey" == Jeffrey Kunce <kuncej at mail.conservation.state.mo.us> writes:
    >> I would like to write a Python script that would read an HTML
    >> document and extract table contents from it. ...
    Jeffrey> Take a look at htmlTableParse.py at
    Jeffrey> http://starship.python.net/~jjkunce/

    Jeffrey> It may work for you, or at least give you some ideas.

    This is exactly what I was looking for. Not only is it a working example,
    but it does pretty much what I wanted.

    Many thanks!

    M.
  • M.A.Miller at May 24, 1999 at 5:43 pm

    "Michael" == Michael Spalinski <mspal at sangria.harvard.edu> writes:
    > So - does anyone have a Python snippet that looks for
    > tables and gets at the data?

    There is a table parser and a usage example at
    http://www.npl.uiuc.edu/~miller/python/#HTMLTools . Tables are
    stored in a list, so multiple tables in a single document can be
    handled - but only if they are sequential. I don't make any
    attempt to handle nested tables.

    Mike
  • Magnus L. Hetland at May 24, 1999 at 7:50 pm

    Michael Spalinski <mspal at sangria.harvard.edu> writes:

    > I would like to write a Python script that would read an HTML document and
    > extract table contents from it. Eg. each table could be a list of tuples
    > with data from the rows. I thought htmllib would provide the basic tools
    > for this, but I can't find any example that would be of use.
    >
    > So - does anyone have a Python snippet that looks for tables and gets at
    > the data?

    I know there have been several responses -- but as a compulsive
    minimalist, I just couldn't resist trying to make a small solution...

    ------ start table parser ------

    from re import compile, findall, I, S

    flags = I | S
    tpat = compile("<table[^>]*>.*?</table>", flags)
    rpat = compile("<tr[^>]*>.*?</tr>", flags)
    dpat = compile("<td[^>]*>(.*?)</td>", flags)

    data = open("data.html").read()
    result = []

    for table in findall(tpat, data):
        result.append([])                        # one list per table
        for row in findall(rpat, table):
            result[-1].append([])
            for cell in findall(dpat, row):
                result[-1][-1].append(cell)
            result[-1][-1] = tuple(result[-1][-1])   # freeze the row as a tuple

    ------- stop table parser -------
    M.
    --

    Magnus
    Lie
    Hetland http://arcadia.laiv.org <arcadia at laiv.org>

Discussion Overview
group: python-list @
categories: python
posted: May 23, '99 at 1:14p
active: May 24, '99 at 7:50p
posts: 9
users: 7
website: python.org
