I'm working on a program that removes tags from an HTML document,
leaving just the content, and I want to keep it simple. I've finished
the part that removes simple tags, but I also want all CSS and JS to
be removed. What re pattern could I use to do that?

I've tried
'<script[\S\s]*/script>'
but that didn't work properly. My knowledge of Python is fairly basic,
so I'm still learning re.
What pattern would work?

  • Ina at Dec 15, 2006 at 4:45 pm

    I use re.compile("<script.*?</script>", re.DOTALL)
    for scripts. I strip these out first, since my tag-stripping re
    would otherwise remove the script tags themselves and leave the
    code between them. Hope this helps.
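
The difference between the greedy pattern in the question and the non-greedy one in the reply can be sketched as follows. The sample HTML string and all variable names are made up for illustration, and the combined script/style pattern is an assumed extension to cover the CSS part of the question, not something taken from the thread.

```python
import re

# Hypothetical sample page: two script blocks and one style block.
html = (
    "<p>one</p><script>a()</script>"
    "<p>two</p><style>p {color: red}</style>"
    "<p>three</p><script>b()</script><p>four</p>"
)

# Greedy pattern from the question: [\S\s]* runs on to the LAST "/script>",
# so everything between the first and last script block is removed too.
greedy = re.compile(r"<script[\S\s]*/script>")
greedy_result = greedy.sub("", html)

# Non-greedy pattern from the reply: .*? stops at the FIRST "</script>".
lazy = re.compile(r"<script.*?</script>", re.DOTALL)
lazy_result = lazy.sub("", html)

# Assumed extension: handle script and style in one pattern, ignoring case,
# with a backreference so the closing tag matches the opening one.
both = re.compile(r"<(script|style)\b.*?</\1>", re.DOTALL | re.IGNORECASE)
both_result = both.sub("", html)

print(greedy_result)
print(lazy_result)
print(both_result)
```

Even the combined pattern only handles tidy markup, so it is best treated as a convenience for trusted input rather than a sanitizer.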
  • Tim Chase at Dec 15, 2006 at 4:52 pm

    That pattern won't catch variations such as

    <script
    >
    doEvil()
    </script
    >

    which is valid HTML/XHTML, since whitespace (including newlines) is
    allowed before the closing > of a tag.

    For invalid HTML that an attacker might still try, one might find
    something like

    <scrip<script>hah</script>t>doEvil()</script>

    which, once your pattern has removed the inner match, leaves the
    valid but unwanted

    <script>doEvil()</script>
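
The splice described here is easy to reproduce. The sketch below is only an illustration: `strip_until_stable` is a hypothetical helper, and the `spaced` input is an assumed example of a script tag with a newline before each closing `>`.

```python
import re

script_re = re.compile(r"<script.*?</script>", re.DOTALL)

# One pass removes only the inner block, which splices the two halves of
# the outer tag back together into a working script element.
nested = "<scrip<script>hah</script>t>doEvil()</script>"
once = script_re.sub("", nested)


def strip_until_stable(pattern, text):
    """Hypothetical helper: reapply the pattern until nothing changes."""
    while True:
        stripped = pattern.sub("", text)
        if stripped == text:
            return stripped
        text = stripped


# Repeating the substitution closes this particular hole...
stable = strip_until_stable(script_re, nested)

# ...but not the whitespace one: the literal "</script>" never appears,
# so the pattern does not match at all and the input passes through.
spaced = "<script\n>doEvil()</script\n>"
untouched = script_re.sub("", spaced)

print(repr(once))
print(repr(stable))
print(repr(untouched))
```

Each patch to the pattern tends to leave another variant open, which is the motivation for parsing the markup instead.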

    I'd propose that it's better to use something such as
    BeautifulSoup that actually parses the HTML, then walk the parsed
    document, emitting only the tags on your whitelist and skipping
    any tag that isn't on it.

    -tkc
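
A rough sketch of the whitelist idea follows. To stay self-contained it uses the standard library's `html.parser.HTMLParser` as a stand-in for BeautifulSoup, so it is not the BeautifulSoup API. The `ALLOWED_TAGS` set, the `WhitelistFilter` class, and the `clean` function are hypothetical names chosen for illustration. Dropping all attributes is a simplifying assumption.

```python
import html
from html.parser import HTMLParser

# Hypothetical whitelist; adjust to the tags you actually want to keep.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}
# Elements whose text content should be dropped along with the tags.
DROP_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags and text outside script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Escape on the way out so text can never turn into a tag.
            self.out.append(html.escape(data))


def clean(markup):
    parser = WhitelistFilter()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


sample = ('<p>Hello <b>world</b></p><script>doEvil()</script>'
          '<style>p{}</style><a href="x">link</a>')
print(clean(sample))
print(clean("<scrip<script>hah</script>t>doEvil()</script>"))
```

Because only whitelisted tags are written and all text is escaped, the output cannot contain a script tag no matter how the parser interprets malformed input.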
  • I80and at Dec 15, 2006 at 4:56 pm

Discussion Overview
group: python-list @
categories: python
posted: Dec 15, '06 at 2:47p
active: Dec 15, '06 at 4:56p
posts: 4
users: 3
website: python.org
