FAQ
Hello Folks:

I want to globally change the following:

<a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">

into: <a href="pages/contacts.htm"><font color="#269BD5">

You'll notice that the part to match is http://www.mysite.org/?page= but
I also need to add ".htm" to the end of "contacts" so that it becomes
"contacts.htm". That part of the URL is variable, so how can I use a
combination of Python and/or a regular expression to replace the matched
prefix and also add ".htm" to the end of the variable part?

Here are a few dummy URLs for example so you can see the pattern and
the variable too.

<a href="http://www.mysite.org/?page=newsletter"><font color="#269BD5">

change to: <a href="pages/newsletter.htm"><font color="#269BD5">

<a href="http://www.mysite.org/?page=faq">

change to: <a href="pages/faq.htm">

So, again, the script needs to strip the absolute URL prefix from all
of these links and replace the PHP "?page=" query with "pages/" plus the
variable page name (e.g. contacts) plus ".htm".

Is there a combination of Python code and/or regex that can do this?
Any help would be greatly appreciated!

Kevin

  • Peter Otten at Apr 30, 2010 at 8:27 pm

    Don't know if the following will in practice be more reliable than a simple
    regex, but here goes:

    import sys
    import urlparse
    from BeautifulSoup import BeautifulSoup as BS

    if __name__ == "__main__":
        html = open(sys.argv[1]).read()
        bs = BS(html)
        for a in bs("a"):
            href = a["href"]
            url = urlparse.urlparse(href)
            if url.netloc == "www.mysite.org":
                qs = urlparse.parse_qs(url.query)
                a["href"] = "pages/" + qs[u"page"][0] + ".htm"
        print
        print bs

    Peter
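
The URL-rewriting step of the script above can be tried on its own with nothing but the standard library. What follows is only a minimal Python 3 sketch: it assumes `urllib.parse` in place of the Python 2 `urlparse` module, leaves the HTML parsing out entirely, and the helper name `rewrite_href` is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def rewrite_href(href, host="www.mysite.org"):
    """Return 'pages/<page>.htm' for links to `host` that carry ?page=...,
    and return every other href unchanged (hypothetical helper)."""
    url = urlparse(href)
    if url.netloc != host:
        return href  # link to some other site: leave it alone
    pages = parse_qs(url.query).get("page")
    if not pages:
        return href  # no ?page= parameter: nothing to rewrite
    return "pages/" + pages[0] + ".htm"


if __name__ == "__main__":
    for href in ("http://www.mysite.org/?page=contacts",
                 "http://www.mysite.org/?page=faq",
                 "http://www.example.com/?page=faq"):
        print(href, "->", rewrite_href(href))
```

Unlike the original loop, this version skips links that have no "page" parameter instead of raising a KeyError, which may or may not be what you want.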
  • Tim Chase at Apr 30, 2010 at 8:55 pm

    On 04/30/2010 02:54 PM, KevinUT wrote:
    I want to globally change the following: <a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">

    into: <a href="pages/contacts.htm"><font color="#269BD5">

    Normally I'd just do this with sed on a *nix-like OS:

    find . -iname '*.html' -exec sed -i.BAK \
      's@href="http://www.mysite.org/?page=\([^"]*\)@href="pages/\1.htm@g' \
      {} \;

    This finds all the HTML files (*.html) under the current
    directory ('.') calling sed on each one. Sed then does the
    substitution you describe, changing

    href="http://www.mysite.org/?page=<whatever>

    into

    href="pages/<whatever>.htm

    saving the original file as a .BAK copy and then overwriting the
    original with the modified results. If you don't want the backup,
    GNU sed accepts plain "-i" in place of "-i.BAK" (dropping the
    option entirely would make sed print to stdout instead of editing
    the files). Alternatively, assuming you don't have any
    pre-existing .BAK files, you can delete the backups after you've
    vetted the results:

    find . -name '*.BAK' -exec rm {} \;

    Yes, one could hack up something in Python, perhaps adding some
    real HTML-parsing brains to it, but for the most part, that
    one-liner should do what you need. Unless you're stuck on Win32
    with no Cygwin-like toolkit.

    -tkc
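
For anyone without sed, roughly the same find-and-substitute pass can be sketched in Python. This is an illustration under several assumptions rather than a drop-in replacement: the files are assumed to be UTF-8, the `*.html` match is case-sensitive on most systems (unlike `-iname`), a `.BAK` copy is written only for files that actually change, and the names `LINK` and `rewrite_tree` are invented here.

```python
import re
import shutil
from pathlib import Path

# Same substitution as the sed expression, in Python regex syntax.
LINK = re.compile(r'href="http://www\.mysite\.org/\?page=([^"]*)"')


def rewrite_tree(root):
    """Rewrite matching links in every *.html file under `root`.

    Each changed file is first copied to <name>.BAK; the list of
    changed paths is returned (hypothetical helper).
    """
    changed = []
    # Collect the paths first so new .BAK files cannot disturb the walk.
    for path in sorted(Path(root).rglob("*.html")):
        text = path.read_text(encoding="utf-8")  # assumption: UTF-8 files
        new_text = LINK.sub(r'href="pages/\1.htm"', text)
        if new_text != text:
            shutil.copy2(path, path.with_name(path.name + ".BAK"))
            path.write_text(new_text, encoding="utf-8")
            changed.append(path)
    return changed


if __name__ == "__main__":
    for path in rewrite_tree("."):
        print("rewrote", path)
```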
  • Novocastrian_Nomad at May 1, 2010 at 10:57 pm
    A single-line regex solution would be:

    re.sub(r'http://www\.mysite\.org/\?page=([^"]+)', r'pages/\1.htm', html)
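
The one-liner can be checked against the sample links from the original question. This is a minimal sketch in which the dots in the host name are escaped so that they match only literal dots; the names `PATTERN` and `fix_links` are chosen here for illustration.

```python
import re

# Capture everything after "?page=" up to the closing quote of the href.
PATTERN = re.compile(r'http://www\.mysite\.org/\?page=([^"]+)')


def fix_links(html):
    """Replace absolute ?page=<name> links with pages/<name>.htm."""
    return PATTERN.sub(r'pages/\1.htm', html)


if __name__ == "__main__":
    sample = ('<a href="http://www.mysite.org/?page=newsletter">'
              '<font color="#269BD5">\n'
              '<a href="http://www.mysite.org/?page=faq">')
    print(fix_links(sample))
```

Because the pattern stops at the first double quote, it relies on every href being double-quoted, as in the examples given.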

Discussion Overview
group: python-list @
categories: python
posted: Apr 30, '10 at 7:54p
active: May 1, '10 at 10:57p
posts: 4
users: 4
website: python.org
