Hi guys...

I've written a subclass of SGMLParser
to handle the contents of a web page,
but I'm not able to handle the <BR> tag.

Can someone help me?

S.G.A S.p.A.
Nucleo Sistemi Informativi
Luca Calderano

  • Behrang Dadsetan at Jul 31, 2003 at 9:54 pm

    I do not know SGMLParser, but HTML is not SGML nor any subset of it. It is
    some ill language which one rarely even finds "pure" (written the way
    the spec says it MUST be).

    I believe SGML does not like non-closing tags. BR is one of the many
    non-closing tags in HTML (also look at IMG or HR).

    Depending on what you are doing, you should maybe use XHTML as input
    if you can (HTML that is well-formed XML, XML being a subset of SGML), or
    you should probably look for a completely different parser "technology".
    Maybe HTMLParser will help you a little more.

    Do not forget that HTML downloaded at random from the Internet is often
    broken. You might rather want to use tidylib (which corrects broken HTML
    code into XHTML) and an XHTML/SGML parser or a DOM.

    Hope it helps, even though the effort I took to check my statements was
    small :)

    Regards,
    Ben.
  • John Roth at Jul 31, 2003 at 10:06 pm
    "Behrang Dadsetan" <ben at dadsetan.com> wrote in message
    news:bgc3ad$o57$1 at online.de...
    You're basically correct, though. You can't parse HTML with either
    an SGML or an XML parser. You also can't parse it reliably if it has
    Javascript embedded that generates HTML.

    John Roth
  • Carl Banks at Jul 31, 2003 at 11:17 pm

    I'm very familiar with sgmllib. Can you describe the problem you're
    having?

    If you are using <br> as a self-closing tag as is done in XHTML (i.e.,
    <br/>), then this cannot be done with sgmllib. You should use
    HTMLParser instead--only you might have to make some minor changes to
    your code.
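
A minimal sketch of the difference between the two spellings, assuming Python 3, where the HTMLParser module mentioned above lives on as `html.parser` (sgmllib itself is no longer in the standard library). The class name `BreakCounter` and the helper `count_breaks` are made up for illustration. The parser reports `<br>` through `handle_starttag` and the XHTML-style `<br/>` through `handle_startendtag`, so overriding both lets a subclass tell them apart:

```python
from html.parser import HTMLParser


class BreakCounter(HTMLParser):
    """Hypothetical example class: counts <br> and <br/> separately."""

    def __init__(self):
        super().__init__()
        self.plain_breaks = 0         # written as <br> or <BR>
        self.self_closing_breaks = 0  # written as <br/> or <br />

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <BR> is seen as 'br'.
        if tag == 'br':
            self.plain_breaks += 1

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty elements such as <br/>.
        if tag == 'br':
            self.self_closing_breaks += 1


def count_breaks(html):
    """Return (plain, self_closing) counts of BR tags in the given markup."""
    parser = BreakCounter()
    parser.feed(html)
    parser.close()
    return parser.plain_breaks, parser.self_closing_breaks
```

If `handle_startendtag` is left alone, its default implementation calls `handle_starttag` followed by `handle_endtag`, so a subclass that only overrides `handle_starttag` sees both spellings.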


    --
    CARL BANKS
  • Steven Taschuk at Aug 1, 2003 at 3:01 am
    Quoth Behrang Dadsetan:
    [...]
    > I believe SGML does not like non-closing tags. BR is one of the many
    > non-closing tags in HTML (also look at IMG or HR)

    SGML has facilities for allowing closing tags to be omitted. At
    <http://www.w3.org/TR/html4/sgml/dtd.html>, for example, we see

    <!ELEMENT BR - O EMPTY -- forced line break -->

    That "- O" means the start tag is mandatory but the end tag may be
    omitted. If it is omitted, SGML parsers infer where it belongs
    (which in this case is immediately after the start tag).
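
As a rough illustration of what a parser reports when the end tag is omitted, here is a sketch using Python 3's `html.parser` (an assumption: it is a tag-level tokenizer, not a validating SGML parser, and it does not read the DTD, so it infers nothing). The names `EventRecorder` and `record_events` are invented for this example. For `<BR>` the parser reports a start tag and never a matching end tag:

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical example class: records parser callbacks in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


def record_events(html):
    """Feed the markup to the parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


events = record_events("<p>one<BR>two</p>")
# The list contains ('start', 'br') but no ('end', 'br').
```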

    --
    Steven Taschuk
    staschuk at telusplanet.net
  • Luca Calderano at Aug 5, 2003 at 4:43 pm
    I got it using SGMLParser!

    ...
    def unknown_starttag(self, tag, attrs):
        if tag == 'br':
            self.data.append('\n')
    ...

    Thanks all!

    S.G.A S.p.A.
    Nucleo Sistemi Informativi
    Luca Calderano
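
For readers on Python 3, where sgmllib is not available, roughly the same idea can be sketched with `html.parser.HTMLParser`. This is an illustration under assumptions, not the code from the post: the class name `BreakAwareTextExtractor` and the helper `extract_text` are hypothetical, and `handle_starttag` stands in for `unknown_starttag`. Only the `self.data` list and the newline-on-`br` rule come from the snippet above.

```python
from html.parser import HTMLParser


class BreakAwareTextExtractor(HTMLParser):
    """Hypothetical example class: collects text, turning BR into newlines."""

    def __init__(self):
        super().__init__()
        self.data = []  # pieces of text, as in the snippet above

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <BR> and <br> both match.
        if tag == 'br':
            self.data.append('\n')

    def handle_data(self, text):
        self.data.append(text)


def extract_text(html):
    """Return the text content of the markup with BR tags as newlines."""
    parser = BreakAwareTextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.data)
```

Calling `extract_text("first<BR>second")` gives the two words separated by a newline.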



    -----Original Message-----
    From: python-list-admin at python.org
    [mailto:python-list-admin at python.org] On behalf of Luca Calderano
    Sent: Thursday, 31 July 2003 15:11
    To: Python Mailing List (E-mail)
    Subject: handle <BR> tags

Discussion Overview
group: python-list @
categories: python
posted: Jul 31, '03 at 1:10p
active: Aug 5, '03 at 4:43p
posts: 6
users: 5
website: python.org
