  • Robert Brewer at Jul 25, 2004 at 8:21 pm

    Robert Oschler wrote:
    > Is there a module/function to remove all the HTML entities
    > from an HTML document (e.g. - &nbsp, &amp, &apos, etc.)?

    Grab cleanhtml.py from the bottom of
    http://www.aminus.org/rbre/python/index.html -- you should be able to
    quickly rewrite the Plaintext class and just limit it to replacing (or
    removing) entities--at least the regex is already written for you.

    HTH!


    Robert Brewer
    MIS
    Amor Ministries
    fumanchu at amor.org
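
A minimal sketch of the "just remove the entities" approach mentioned above, written for Python 3. This is not the regex from cleanhtml.py, which is not shown in this thread; `ENTITY_RE` and `strip_entities` are hypothetical names chosen for illustration, and the pattern assumes entities are terminated with a semicolon.

```python
import re

# Matches named (&amp;), decimal (&#169;) and hex (&#xA9;) references.
# Assumption: every entity ends with ';'. Bare forms like "&nbsp" are not matched.
ENTITY_RE = re.compile(r'&#?\w+;')

def strip_entities(s, replacement=''):
    """Remove every entity in s, or replace each one with `replacement`."""
    return ENTITY_RE.sub(replacement, s)

print(strip_entities('a&nbsp;b'))         # entities removed
print(strip_entities('a&nbsp;b', ' '))    # entities replaced by a space
```

Text such as "AT&T rocks; ok" is left alone, because the pattern does not allow whitespace between the ampersand and the semicolon.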
  • Christopher T King at Jul 25, 2004 at 9:30 pm

    On Sun, 25 Jul 2004, Robert Oschler wrote:

    > Is there a module/function to remove all the HTML entities from an HTML
    > document (e.g. - &nbsp, &amp, &apos, etc.)?

    htmllib has this capability, but if you're not doing any other HTML
    parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
    does nicely:

    import re
    import htmlentitydefs

    def convertentity(m):
        if m.group(1)=='#':
            try:
                return chr(int(m.group(2)))
            except ValueError:
                return '&#%s;' % m.group(2)
        try:
            return htmlentitydefs.entitydefs[m.group(2)]
        except KeyError:
            return '&%s;' % m.group(2)

    def converthtml(s):
        return re.sub(r'&(#?)(.+?);',convert,s)

    converthtml('Some &lt;html&gt; string.') # --> 'Some <html> string.'

    Unknown or invalid entities are left in &xxx; format, and Unicode
    entities (numeric values that chr() rejects) are left in &#nnn; format.
    If you want a Unicode string to be returned (and Unicode entities
    interpreted), replace 'chr' with 'unichr', and
    'htmlentitydefs.entitydefs[m.group(2)]' with
    'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.

    Hope this helps.
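
For readers on Python 3, the same idea might be sketched as follows. This is an adaptation, not the code posted above: it assumes Python 3, where `htmlentitydefs` became `html.entities` and `chr` already returns Unicode, so it corresponds to the Unicode variant described in the post. The names `ENTITY_RE`, `convert_entity`, and `convert_html` are hypothetical, and the stricter pattern with hex support is an addition.

```python
import re
from html.entities import name2codepoint

# Decimal (&#65;), hex (&#x41;) or named (&lt;) references.
ENTITY_RE = re.compile(
    r'&(?:#([0-9]+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));'
)

def convert_entity(m):
    """Return the character for one matched entity, or the entity unchanged."""
    dec, hexa, name = m.groups()
    try:
        if dec is not None:
            return chr(int(dec))
        if hexa is not None:
            return chr(int(hexa, 16))
        return chr(name2codepoint[name])
    except (KeyError, ValueError, OverflowError):
        # Unknown name or out-of-range number: leave the text as it was.
        return m.group(0)

def convert_html(s):
    # re.sub calls convert_entity once per match.
    return ENTITY_RE.sub(convert_entity, s)

print(convert_html('Some &lt;html&gt; string.'))
```

If no custom handling of unknown entities is needed, the standard library's `html.unescape` (Python 3.4 and later) performs the conversion directly.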
  • Robert Oschler at Jul 26, 2004 at 10:08 pm
    Chris,

    I believe the line that reads:

    def converthtml(s):
        return re.sub(r'&(#?)(.+?);',convert,s)

    Should read:

    def converthtml(s):
        return re.sub(r'&(#?)(.+?);',convertentity,s)

    Once I made that change it worked like a charm. I'm showing the correction
    for future Usenet searchers.

    So you can pass a function to re.sub() as the replacement pattern? Very
    cool, I didn't know that. I think you could spend a year just learning
    regular expressions and still miss something.


    Thanks,
    Robert.
  • Christopher T King at Jul 27, 2004 at 1:23 pm

    On Mon, 26 Jul 2004, Robert Oschler wrote:

    > I believe the line that reads:
    >
    > def converthtml(s):
    >     return re.sub(r'&(#?)(.+?);',convert,s)
    >
    > Should read:
    >
    > def converthtml(s):
    >     return re.sub(r'&(#?)(.+?);',convertentity,s)

    Oops, you're right, mea culpa :)

    > So you can pass a function to re.sub() as the replacement pattern? Very
    > cool, I didn't know that. I think you could spend a year just learning
    > regular expressions and still miss something.

    That feature is only mentioned briefly in the online docs, and not at all
    in sre.sub's docstring. Surprising, since it's indeed a very useful
    feature.
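
A small illustration of the feature being discussed, as a sketch in Python 3: when the replacement argument to `re.sub` is a function, it is called with the match object for each match and its return value is used as the replacement text. The function name `double` and the sample string are made up for this example.

```python
import re

def double(m):
    # m is the match object for one occurrence; the returned string replaces it.
    return str(int(m.group(0)) * 2)

result = re.sub(r'\d+', double, 'buy 3 apples and 12 pears')
print(result)  # buy 6 apples and 24 pears
```

A plain replacement string can only rearrange matched groups; a function can compute the replacement, which is what makes the dictionary lookup in convertentity possible.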
  • Robert Oschler at Jul 27, 2004 at 5:40 pm
    "Christopher T King" <squirrel at WPI.EDU> wrote in message
    news:Pine.LNX.4.44.0407270916570.3374-100000 at ccc1.wpi.edu...
    > That feature is only mentioned briefly in the online docs, and not at all
    > in sre.sub's docstring. Surprising, since it's indeed a very useful
    > feature.

    Chris,

    Speaking of learning cool things by osmosis, do you know of a well commented
    source of Python code, perhaps an Open Source project, that I could study to
    learn more interesting techniques like the regexp tip you shared? I find
    that studying other people's code is the best way to avoid getting in a
    programming rut.

    Thanks.

    --
    Robert
  • Christopher T King at Jul 29, 2004 at 4:13 pm

    On Tue, 27 Jul 2004, Robert Oschler wrote:

    > Speaking of learning cool things by osmosis, do you know of a well commented
    > source of Python code, perhaps an Open Source project, that I could study to
    > learn more interesting techniques like the regexp tip you shared? I find
    > that studying other people's code is the best way to avoid getting in a
    > programming rut.

    I seem to recall reading about that re.sub trick in something linked from
    Pythonware's Daily Python URL (http://www.pythonware.com/daily/). There
    are often links there to interesting and useful code snippets from
    ActiveState's Python Cookbook and other sources; I'd say start there if
    you want to find neat tricks you can do with Python.

    I'm not sure of any particularly "well commented" Python projects though
    (I've never really looked into that), but you'll probably find some
    interesting small projects in the Vaults of Parnassus
    (http://www.vex.net/parnassus/).
  • Robert Oschler at Jul 30, 2004 at 10:33 pm
    "Christopher T King" <squirrel at WPI.EDU> wrote in message
    news:Pine.LNX.4.44.0407281552260.29764-100000 at ccc3.wpi.edu...
    > I seem to recall reading about that re.sub trick in something linked from
    > Pythonware's Daily Python URL (http://www.pythonware.com/daily/). There
    > are often links there to interesting and useful code snippets from
    > ActiveState's Python Cookbook and other sources; I'd say start there if
    > you want to find neat tricks you can do with Python.
    >
    > I'm not sure of any particularly "well commented" Python projects though
    > (I've never really looked into that), but you'll probably find some
    > interesting small projects in the Vaults of Parnassus
    > (http://www.vex.net/parnassus/).

    Thanks Chris and thanks for all your other help.

    With your Python skill you should work for Google. Too bad you don't, you'd
    be a wealthy man soon (Google IPO). Wish I did. :)

    --
    Robert
  • Christopher T King at Jul 31, 2004 at 2:03 am

    On Fri, 30 Jul 2004, Robert Oschler wrote:

    > With your Python skill you should work for Google. Too bad you don't,
    > you'd be a wealthy man soon (Google IPO). Wish I did. :)

    Thanks for the compliment. :) To work at Google is my dream job, and I'm
    sure that of many others on this list, too (makes me wonder if any Google
    employees read this list...).
  • Michael Scarlett at Jul 26, 2004 at 3:47 am
    "Robert Oschler" <no_replies at fake_email_address.invalid> wrote in message news:<X9UMc.12838$QO.3354 at bignews5.bellsouth.net>...
    > Is there a module/function to remove all the HTML entities from an HTML
    > document (e.g. - &nbsp, &amp, &apos, etc.)?
    >
    > If not I'll just write one myself but I figured I'd save myself some time.
    >
    > Thanks,

    Check out Mark Pilgrim's site: http://diveintopython.org/html_processing/index.html

Discussion Overview
group: python-list @
category: python
posted: Jul 25, '04 at 7:50p
active: Jul 31, '04 at 2:03a
posts: 10
users: 4
website: python.org
