I am trying to parse an HTML page and modify only the URLs within tags,
e.g. inside IMG, A, SCRIPT, FRAME tags etc.

I have built one using HTMLParser.HTMLParser, and it works fine... on
good HTML. Having done a Google search, it looks like HTMLParser
choking on dodgy HTML is a common theme.

I would have difficulty using regular expressions, as I want to
modify local (relative) URLs as well as absolute ones.

It would be nice to just override the error handling of HTMLParser -
but short of digging in the source code it's not a documented
technique :-)

Anyone got any suggestions? This is to go on a server as a CGI, and
I don't have shell access or anything like that, so I'd like to avoid
installing mxTidy. Does anyone know an HTML parsing library that will
let me write out most of the page unmodified and just modify the
contents of some of the tags?

Regards,

Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html


  • Robert Brewer at Jul 7, 2004 at 2:47 pm

    Fuzzyman wrote:
    > I am trying to parse an HTML page and only modify URLs within tags -
    > e.g. inside IMG, A, SCRIPT, FRAME tags etc...
    >
    > I have built one that works fine using the HTMLParser.HTMLParser and
    > it works fine.... on good HTML. Having done a google it looks like
    > parsing dodgy HTML and having HTMLParser choke is a common theme.

    Haven't used it, but Beautiful Soup sounds like it fits the bill:

    http://www.crummy.com/software/BeautifulSoup/


    FuManChu
  • Fuzzyman at Jul 7, 2004 at 8:24 pm
    "Robert Brewer" <fumanchu at amor.org> wrote in message news:<mailman.69.1089211879.5135.python-list at python.org>...
    > Fuzzyman wrote:
    > > I am trying to parse an HTML page and only modify URLs within tags -
    > > e.g. inside IMG, A, SCRIPT, FRAME tags etc...
    > >
    > > I have built one that works fine using the HTMLParser.HTMLParser and
    > > it works fine.... on good HTML. Having done a google it looks like
    > > parsing dodgy HTML and having HTMLParser choke is a common theme.
    >
    > Haven't used it, but Beautiful Soup sounds like it fits the bill:
    >
    > http://www.crummy.com/software/BeautifulSoup/

    It talks about 'walking the parse tree'... which is a bit more magic
    than I want... I just want to modify URLs in tags... which means I
    mainly want to extract the HTML unchanged and also modify a few tags -
    HTMLParser is quite good at this - but dies *horribly* at bad HTML... I
    may have to try Beautiful Soup though :-)

    Regards,



    Fuzzy

    http://www.voidspace.org.uk/atlantibots/pythonutils.html

  • John J. Lee at Jul 7, 2004 at 10:48 pm

    michael at foord.net (Fuzzyman) writes:

    > "Robert Brewer" <fumanchu at amor.org> wrote in message news:<mailman.69.1089211879.5135.python-list at python.org>...
    > > Fuzzyman wrote:
    > > > I am trying to parse an HTML page and only modify URLs within tags -
    > > > e.g. inside IMG, A, SCRIPT, FRAME tags etc...
    > > >
    > > > I have built one that works fine using the HTMLParser.HTMLParser and
    > > > it works fine.... on good HTML. Having done a google it looks like
    > > > parsing dodgy HTML and having HTMLParser choke is a common theme.

    Use sgmllib instead (or htmllib, which adds a few bits and bobs on top
    of sgmllib). sgmllib.SGMLParser (and htmllib.HTMLParser) is more
    robust than HTMLParser.HTMLParser. OTOH, HTMLParser.HTMLParser is
    more suitable for XHTML.

    I remember that sorting out the precise differences between the two
    libraries (htmllib and HTMLParser) was mildly painful and confusing,
    so you might find it useful to look at ClientForm as an example,
    because it can use both htmllib and HTMLParser modules.

    > > Haven't used it, but Beautiful Soup sounds like it fits the bill:
    > >
    > > http://www.crummy.com/software/BeautifulSoup/
    >
    > It talks about 'walking the parse tree'... which is a bit more magic
    > than I want... I just want to modify URLs in tags... which means I
    > mainly want to extract the HTML unchanged and also modify a few tags -
    > HTMLParser is quite good at this - but dies *horribly* at bad HTML... I
    > may have to try beautiful soup though :-)

    In general, Murphy has more shots at anything that both parses *and*
    builds a tree, so sticking to just a parser (e.g. sgmllib) is
    advantageous in that respect. However, microdom is a tree-building
    library that claims to be relatively tolerant of bad HTML.
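
To illustrate the parser-only approach, here is a minimal sketch in modern Python 3. It is not the sgmllib code suggested above (Python 3 no longer ships sgmllib or htmllib); it swaps in the standard library's html.parser.HTMLParser, the renamed successor of the HTMLParser module, which in current releases is meant to cope with malformed markup. The names URLRewriter, URL_ATTRS and rewrite_html are made up for this example, and the tag list is only an illustrative subset.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative subset of (tag, attribute) pairs that carry URLs.
URL_ATTRS = {("a", "href"), ("img", "src"), ("script", "src"), ("frame", "src")}


class URLRewriter(HTMLParser):
    """Echo markup as it is parsed, rewriting only URL-bearing attributes."""

    def __init__(self, rewrite_url):
        # convert_charrefs=False: entities are passed to the handlers
        # below instead of being decoded into the text.
        super().__init__(convert_charrefs=False)
        self.rewrite_url = rewrite_url
        self.out = []

    def _start(self, tag, attrs, close=">"):
        if not any((tag, name) in URL_ATTRS for name, _ in attrs):
            # No URL attribute: emit the tag exactly as written in the source.
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                parts.append(name)
                continue
            if (tag, name) in URL_ATTRS:
                value = self.rewrite_url(value)
            # The parser hands over unescaped values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + close)

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, close=" />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)  # note: the name comes back lowercased

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def rewrite_html(source, rewrite_url):
    """Return source with every URL attribute passed through rewrite_url."""
    parser = URLRewriter(rewrite_url)
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Calling rewrite_html(page_source, some_function) returns the page as a string. Tags without URL attributes are copied through verbatim, so unbalanced markup such as a stray <span> is passed along rather than repaired. Only tags that are rewritten get their attribute quoting normalised.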


    John
  • Richard at Jul 7, 2004 at 11:04 pm

    michael at foord.net (Fuzzyman) writes:
    > "Robert Brewer" <fumanchu at amor.org> wrote in message
    > news:<mailman.69.1089211879.5135.python-list at python.org>...
    > > Haven't used it, but Beautiful Soup sounds like it fits the bill:
    > >
    > > http://www.crummy.com/software/BeautifulSoup/
    >
    > It talks about 'walking the parse tree'... which is a bit more magic
    > than I want... I just want to modify URLs in tags... which means I
    > mainly want to extract the HTML unchanged and also modify a few tags -
    > HTMLParser is quite good at this - but dies *horribly* at bad HTML... I
    > may have to try beautiful soup though :-)

    From the BeautifulSoup page:
    "You can modify a Tag or NavigableText in place. Printing it out as a
    string will print the new markup text."

    And really, it handles *any* HTML, no matter how crappy - I'm using it to
    deal with pages that have random <span> and </span> in them with no
    matching end / start tags. Eugh.

    Once you've written rewrite_url(), this will do the job on the BeautifulSoup
    side:

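    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup()
    soup.feed(source_html)
    # Rewrite the URL-bearing attribute of each tag type of interest.
    for tag_name, attr in (('img', 'src'), ('a', 'href')):
        for tag in soup(tag_name):
            if tag.get(attr):
                tag[attr] = rewrite_url(tag[attr])
    print soup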
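
The thread leaves rewrite_url() to the reader. As a rough idea of what it might look like for a CGI proxy that has to handle relative as well as absolute URLs, here is a minimal sketch in modern Python 3. PROXY_SCRIPT, PAGE_URL and the "url" query parameter are placeholders invented for this example, not details from the thread.

```python
from urllib.parse import quote, urljoin

# Both values are hypothetical placeholders for illustration.
PROXY_SCRIPT = "/cgi-bin/proxy.py"
PAGE_URL = "http://example.com/docs/index.html"


def rewrite_url(url, page_url=PAGE_URL, proxy=PROXY_SCRIPT):
    """Resolve url against the page it came from, then route it via the proxy."""
    # Leave in-page anchors and non-fetchable schemes alone.
    if url.startswith("#") or url.lower().startswith(("mailto:", "javascript:")):
        return url
    # urljoin turns relative references into absolute URLs and leaves
    # already-absolute URLs unchanged.
    absolute = urljoin(page_url, url)
    # Percent-encode everything so the target survives as one query value.
    return "%s?url=%s" % (proxy, quote(absolute, safe=""))
```

Resolving against the page URL first means relative and absolute links go through the same code path, which is the part that is awkward to do with regular expressions alone.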


    Richard
  • Fuzzyman at Jul 8, 2004 at 7:16 am
    Brilliant Richard.
    I did hack together a version that worked inside the Tag class of
    BeautifulSoup - but your suggestion is much more elegant. I've already
    written rewrite_url - twice now :-) Should work fine........

    Thanks

    Fuzzy

    http://www.voidspace.org.uk/atlantibots/pythonutils.html
  • Fuzzyman at Jul 8, 2004 at 8:52 am
    Haha - just switched to BS and so far it works like a dream...
    building a CGI proxy for escaping restricted/censored internet
    environments...

    Thanks for the help.

    Regards,

    Fuzzy

    http://www.voidspace.org.uk/atlantibots/pythonutils.html

Discussion Overview
group: python-list
category: python
posted: Jul 7, '04 at 10:35a
active: Jul 8, '04 at 8:52a
posts: 7
users: 4
website: python.org
