Dear All,

I have hopefully a very simple problem. I wish to parse an html page and
extract everything between the <body> tags.

E.g.
<head>
<body>
<b>afsdf</b>
</body>
</head>

Would give
<body>
<b>afsdf</b>
</body>

I've been playing about with htmllib with no success. Any suggestions?

Thanks

Colin


  • Batista, Facundo at Jul 8, 2004 at 4:12 pm
    [C Gillespie]

    #- I've been playing about with htmllib with no success. Any
    #- suggestions?

    Go for DOM.

    This article is good:

    http://www-106.ibm.com/developerworks/linux/library/l-pxml.html

    . Facundo
  • C Gillespie at Jul 8, 2004 at 4:17 pm
    I tried DOM, but it doesn't like my html :(
  • Batista, Facundo at Jul 8, 2004 at 4:24 pm
    Well, that article actually covers more tools than just DOM (xmllib, for
    example), and it includes a lot of references.

    Read it.

    . Facundo

  • William Park at Jul 8, 2004 at 7:22 pm

    C Gillespie wrote:
    > I have hopefully a very simple problem. I wish to parse an html page and
    > extract everything between the <body> tags.

    1. Take a look at
       http://freshmeat.net/projects/bashdiff/
       and if you want to give it a try, I'll give you some pointers.
       Essentially,
           x=()
           array -p '<body>' -q '</body>' x "..."

    2. In Python, read the whole thing as a string. Delete everything before
       '<body>' and everything after '</body>'.

    3. Use your editor. :-)

    --
    William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
    Toronto, Ontario, Canada
  • Leif K-Brooks at Jul 8, 2004 at 7:37 pm

    C Gillespie wrote:
    > I have hopefully a very simple problem. I wish to parse an html page and
    > extract everything between the <body> tags.

    People are actually suggesting using DOM for this?! A simple approach is
    much better:

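    def get_body(html):
        # '<body' (no closing '>') also matches a body tag with attributes.
        body_start = html.find('<body')
        # 7 == len('</body>'), so the slice includes the closing tag.
        body_end = html.find('</body>', body_start) + 7
        return html[body_start:body_end]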
  • Richard Brodie at Jul 9, 2004 at 10:32 am

    >> I have hopefully a very simple problem. I wish to parse an html page and
    >> extract everything between the <body> tags.
    >
    > People are actually suggesting using DOM for this?! A simple approach is
    > much better:

    "For every complex problem, there is a solution that is simple ... and wrong"

    Yes, it will work, some of the time. However, it doesn't handle the following
    properly (there are probably others):

    1. Comments.
    2. CDATA sections.
    3. White space.
    4. Mixed or upper case.

    The advantage of using a proper parser is that it caters for these sorts of
    things, and you only have to get it right once. OTOH, these advantages are
    largely negated if you can't be sure your input HTML is valid. What works best
    for you depends on what you are using it for.
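
For readers on a current Python, here is a minimal sketch of the "proper parser" route, assuming Python 3's standard html.parser module (htmllib, mentioned earlier in the thread, is not available in Python 3). The names BodyExtractor and get_body are illustrative and not from the thread. The parser reports tag names in lowercase and treats comments as separate events, so mixed-case tags and a "<body>" that only appears inside a comment should not confuse it.

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Rebuild the source text from the <body ...> tag through </body>."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; as their own
        # events, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":  # tag names arrive lowercased
            self.in_body = True
        if self.in_body:
            # get_starttag_text() returns the tag exactly as written.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are written back once, as-is.
        if self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.in_body:
            self.parts.append("</%s>" % tag)
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)

    def handle_comment(self, data):
        if self.in_body:
            self.parts.append("<!--%s-->" % data)

    def handle_entityref(self, name):
        if self.in_body:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.in_body:
            self.parts.append("&#%s;" % name)


def get_body(html):
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Because the output is rebuilt from parser events, end tags come back in lowercase even if the source used upper case. Whether that matters depends, as noted above, on what the result is used for.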
  • Lee Harr at Jul 8, 2004 at 8:50 pm

    On 2004-07-08, C Gillespie wrote:
    > Dear All,
    >
    > I have hopefully a very simple problem. I wish to parse an html page and
    > extract everything between the <body> tags.

    I have not used it yet, but I hear that Beautiful Soup works well:

    http://www.crummy.com/software/BeautifulSoup/
  • Wes weston at Jul 8, 2004 at 11:18 pm

    C Gillespie wrote:
    > I have hopefully a very simple problem. I wish to parse an html page and
    > extract everything between the <body> tags.

    #--------------------------------------------------------------------------
    def TokenizeHTML(s):
        # Return a list containing two types of tokens:
        #   1. html tokens starting with '<' and ending with '>'
        #   2. strings between '>' and '<'
        # Note: an unterminated tag at the end of the input is dropped.
        state = 0
        htmlStr = ""
        text = ""      # renamed from 'str' to avoid shadowing the builtin
        tokens = []    # renamed from 'list' to avoid shadowing the builtin
        for ch in s:
            if state == 0:  # initial state; detection state
                if ch == '<':
                    state = 1
                    htmlStr += ch
                else:
                    state = 2
                    text += ch
            elif state == 1:  # html state; in a <> pair
                htmlStr += ch
                if ch == '>':
                    state = 0
                    tokens.append(htmlStr)
                    htmlStr = ""
            elif state == 2:  # non html state; not in a <> pair
                if ch == '<':
                    state = 1
                    tokens.append(text)
                    text = ""
                    htmlStr = "<"
                else:
                    text += ch
        if len(text) > 0:
            tokens.append(text)
        return tokens
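
The tokenizer above only splits the page; the body still has to be picked out of the token list. A rough sketch of that last step follows. To stay self-contained it uses a one-line regex tokenizer (tokenize_html) that should behave like the state machine above on well-formed input; that stand-in, and the helper name body_tokens, are assumptions for illustration and not part of the original post.

```python
import re


def tokenize_html(s):
    # Stand-in tokenizer: tags ('<...>') and the text runs between them,
    # returned in document order.
    return re.findall(r"<[^>]*>|[^<]+", s)


def body_tokens(tokens):
    """Return the tokens from the opening body tag through </body>."""
    start = end = None
    for i, tok in enumerate(tokens):
        low = tok.lower()  # tolerate <BODY>, <Body>, ...
        if start is None and re.match(r"<body[\s>]", low):
            start = i
        elif start is not None and re.match(r"</body\s*>", low):
            end = i
            break
    if start is None or end is None:
        return []  # no complete body element found
    return tokens[start:end + 1]


page = "<head>\n<body>\n<b>afsdf</b>\n</body>\n</head>"
body = "".join(body_tokens(tokenize_html(page)))
```

Joining the selected tokens gives back the body exactly as written in the source, since no token is altered along the way.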
  • C Gillespie at Jul 9, 2004 at 12:27 pm
    Dear All,

    Thanks for all the suggestions, much appreciated.

    Colin
  • Thomas Guettler at Jul 9, 2004 at 1:02 pm

    On Thu, 08 Jul 2004 17:04:24 +0100, C Gillespie wrote:
    > I have hopefully a very simple problem. I wish to parse an html page and
    > extract everything between the <body> tags.

    HTML can be broken in many ways. If you want a solution that can read
    most of the HTML on the web, you can run the page through tidy and have
    it produce XML as output.

    XML is much easier to handle with SAX/DOM.

    Regards,
    Thomas
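
A small sketch of the second half of that suggestion, assuming the page has already been converted to well-formed XML (for instance by tidy; that step is not shown here). It uses the standard library's xml.dom.minidom; the function name get_body_xml is made up for illustration.

```python
from xml.dom import minidom


def get_body_xml(xhtml):
    # parseString raises an ExpatError on input that is not well-formed XML,
    # which is why the clean-up step has to come first.
    doc = minidom.parseString(xhtml)
    bodies = doc.getElementsByTagName("body")
    # Serialize the first <body> element, including its tags.
    return bodies[0].toxml() if bodies else ""


page = "<html><head><title>t</title></head><body>\n<b>afsdf</b>\n</body></html>"
body = get_body_xml(page)
```

Unlike the string-slicing approach, this fails loudly on markup the parser cannot read, which matches the earlier report in this thread that DOM "doesn't like" raw HTML.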
  • Istvan Albert at Jul 9, 2004 at 1:34 pm
    You could use pyparsing too:

    http://pyparsing.sourceforge.net/

    i.

Discussion Overview
group: python-list @
category: python
posted: Jul 8, '04 at 4:04p
active: Jul 9, '04 at 1:34p
posts: 12
users: 9
website: python.org
