I am reading "Python for Dummies" and found the following example of a
web crawler that I thought was interesting. The first time I keyed in
the program and executed it, I didn't understand it well enough to
debug it, so I just skipped it. A few days later I realized that it
failed after a few seconds, and I wanted to know whether it was a
shortcoming of Python, a typo on my part, or just an inherent
problem with the script, so I retyped it and started trying to figure
out what went wrong.

Please keep in mind that I am very new to coding, so I have tried RTFM
without much success. I have a basic understanding of what the
application is doing, but I want to understand WHY it is doing it, or
what the rationale is for doing it, not necessarily how it does it.
In any case, here is the gist of the app.

1 - a new spider is created
2 - it takes a single argument, which is a web address (http://www.google.com)
3 - the spider pulls a copy of the page source
4 - the spider parses it for links, and if a link is on the same
domain and has not already been parsed, then it appends the link to the
list of pages to be parsed

Being new I have a couple of questions that I am hoping someone can
answer with some degree of detail.

----------------------------------------------------------
f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
parser = htmllib.HTMLParser(f)
parser.feed(html)
parser.close()
return parser.anchorlist
----------------------------------------------------------

I get the idea that we're allocating some memory that looks like a
file so formatter.dumbwriter can manipulate it. The results are
passed to formatter.abstractformatter which does something else to the
HTML code. The results are then passed to "f" which is then passed to
htmllib.HTMLParser so it can parse the html for links. I guess I
don't understand with any great detail as to why this is happening.
I know someone is going to say that I should RTFM so here is the gist
of the documentation:

formatter.DumbWriter = "This class is suitable for reflowing a
sequence of paragraphs."
formatter.AbstractFormatter = "The standard formatter. This
implementation has demonstrated wide applicability to many writers,
and may be used directly in most circumstances. It has been used to
implement a full-featured World Wide Web browser." <-- huh?

So.. What is dumbwriter and abstractformatter doing with this HTML and
why does it need to be done before parser.feed() gets a hold of it?

The last question is.. I can't find any documentation to explain
where the "anchorlist" attribute came from? Here is the only
reference to this attribute that I can find anywhere in the Python
documentation.

----------------------
anchor_bgn( href, name, type)
This method is called at the start of an anchor region. The
arguments correspond to the attributes of the <A> tag with the same
names. The default implementation maintains a list of hyperlinks
(defined by the HREF attribute for <A> tags) within the document. The
list of hyperlinks is available as the data attribute anchorlist.
----------------------

So .. How does an average developer figure out that parser returns a
list of hyperlinks in an attribute called anchorlist? Is this
something that you just "figure out" or is there some book I should be
reading that documents all of the attributes for a particular
method? It just seems a bit obscure and certainly not something I
would have figured out on my own. Does this make me a poor developer
who should find another hobby? I just need to know if there is
something wrong with me or if this is a reasonable question to ask.

The last question I have is about debugging. The spider is capable
of parsing links until it reaches:

"html = get_page(http://www.google.com/jobs/fortune)" which returns
the contents of a pdf document, assigns the pdf contents to html which
is later passed to parser.feed(html) which crashes.

I'm smart enough to know that whenever you take in some input, you
should do some basic type checking to make sure that whatever you are
trying to manipulate (especially if it originates from outside of your
application) won't cause your application to crash. If you're
expecting an ASCII character, then make sure you're not getting an
object or a string of text.

How would an experienced Python developer check the contents of "html"
to make sure it's not something other than a blob of HTML code? I
should note an obvious catch-22: how do I check the HTML in such
a way that the check itself doesn't possibly crash the app? I thought
about:

try:
    parser.feed(html)
except parser.HTMLParseError:
    parser.close()


.... but I'm not sure if that is right or not? The app still crashes
so obviously I'm doing something wrong.


Here is the full app for your review.

Thank you for any help you can provide! I greatly appreciate it!


#!/usr/bin/python

#these modules do most of the work
import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO

def log_stdout(msg):
    """Print msg to the screen."""
    print msg

def get_page(url, log):
    """Retrieve URL and return contents, log errors."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    body = page.read()
    page.close()
    return body

def find_links(html):
    """return a list of links in HTML"""
    #We're using the parser just to get the hrefs
    f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist

class Spider:
    """
    The heart of this program, finds all links within a web site.

    run() contains the main loop.
    process_page() retrieves each page and finds the links.
    """

    def __init__(self, startURL, log=None):
        #this method sets initial values
        self.URLs = set()           #create a set
        self.URLs.add(startURL)     #add the start url to the set
        self.include = startURL
        self._links_to_process = [startURL]
        if log is None:
            #use log_stdout function if no log provided
            self.log = log_stdout
        else:
            self.log = log

    def run(self):
        #process list of URLs one at a time
        while self._links_to_process:
            url = self._links_to_process.pop()
            self.log("Retrieving: " + url)
            self.process_page(url)

    def url_in_site(self, link):
        #checks whether the link starts with the base URL
        return link.startswith(self.include)

    def process_page(self, url):
        #retrieves page and finds links in it
        html = get_page(url, self.log)
        for link in find_links(html):
            #handle relative links
            link = urlparse.urljoin(url, link)
            self.log("Checking: " + link)
            #make sure this is a new URL within current site
            if link not in self.URLs and self.url_in_site(link):
                self.URLs.add(link)
                self._links_to_process.append(link)

if __name__ == '__main__':
    #this code runs when script is started from command line
    startURL = sys.argv[1]
    spider = Spider(startURL)
    spider.run()
    for URL in sorted(spider.URLs):
        print URL


  • John J. Lee at Aug 20, 2007 at 7:38 pm
    "dogatemycomputer at gmail.com" <dogatemycomputer at gmail.com> writes:
    [...]
    ----------------------------------------------------------
    f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist
    ----------------------------------------------------------

    I get the idea that we're allocating some memory that looks like a
    file so formatter.dumbwriter can manipulate it.
    Don't worry too much about memory. The "StringIO()" probably only
    really allocates the memory needed for the "bookkeeping" that StringIO
    does for its own internal purposes, not the memory needed to actually
    store the HTML. Later, when you use the object, Python will
    dynamically (== at run time) allocate the necessary memory for the
    HTML, when the .write() method is called on the StringIO instance.
    Python handles the memory allocation for you -- though of course the
    code you write affects how much memory gets used.

    Note:

    - The StringIO is where the *output* HTML goes.

    - The formatter.DumbWriter likely doesn't do anything with the
    StringIO() at the time it's passed (it hasn't even seen your HTML
    yet, so how could it?). Instead, it just squirrels away the
    StringIO() for later use.
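
    If it helps, here is a tiny sketch of StringIO on its own, quite
    apart from the formatter machinery:

    from cStringIO import StringIO

    buf = StringIO()       # an in-memory "file"; nearly empty until written to
    buf.write("hello ")    # memory for the text is allocated here, at write time
    buf.write("world")
    print buf.getvalue()   # prints: hello world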
    The results are
    passed to formatter.abstractformatter which does something else to the
    HTML code.
    Again, nothing much happens right away on the "f = ..." line. The
    formatter.AbstractFormatter just keeps the writer so it can use it
    to write out formatted text later on.

    The results are then passed to "f" which is then passed to
    The results are not "passed" to f. Instead, the results are given a
    name, "f". You can give a single object as many names as you like.

    htmllib.HTMLParser so it can parse the html for links. I guess I
    htmllib.HTMLParser wants the formatter so it can format output
    (e.g. you might want to write out the same page with some of the links
    removed). It doesn't need the formatter to parse the HTML.
    HTMLParser itself is responsible for the parsing -- as the name
    implies.
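
    If you're curious, here's a minimal sketch (same modules as the
    book's code) showing that division of labour: the parser fills in
    anchorlist, and whatever the DumbWriter produces simply lands in the
    StringIO:

    import htmllib, formatter
    from cStringIO import StringIO

    out = StringIO()
    writer = formatter.DumbWriter(out)      # receives the formatted output
    f = formatter.AbstractFormatter(writer)
    parser = htmllib.HTMLParser(f)
    parser.feed('<p>See <a href="http://example.com/">this</a> page.</p>')
    parser.close()

    print parser.anchorlist   # ['http://example.com/'] -- the parsed links
    print out.getvalue()      # the reflowed plain-text rendering of the page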

    don't understand with any great detail as to why this is happening.
    I know someone is going to say that I should RTFM so here is the gist
    of the documentation:

    formatter.DumbWriter = "This class is suitable for reflowing a
    sequence of paragraphs."
    formatter.AbstractFormatter = "The standard formatter. This
    implementation has demonstrated wide applicability to many writers,
    and may be used directly in most circumstances. It has been used to
    implement a full-featured World Wide Web browser." <-- huh?
    The web browser in question was called "Grail". Grail has been
    resting for some time now. By today's standards, "full-featured" is a
    bit of a stretch.

    But I wouldn't worry too much about what they're trying to say there
    yet (it has to do with the way the formatter.AbstractFormatter class
    is structured, not what it actually does "out of the box").

    So.. What is dumbwriter and abstractformatter doing with this HTML and
    why does it need to be done before parser.feed() gets a hold of it?
    The "heavy lifting" only really actually starts happening when you
    call parser.feed(). Before that, you're just setting the stage.

    The last question is.. I can't find any documentation to explain
    where the "anchorlist" attribute came from? Here is the only
    reference to this attribute that I can find anywhere in the Python
    documentation.

    ----------------------
    anchor_bgn( href, name, type)
    This method is called at the start of an anchor region. The
    arguments correspond to the attributes of the <A> tag with the same
    names. The default implementation maintains a list of hyperlinks
    (defined by the HREF attribute for <A> tags) within the document. The
    list of hyperlinks is available as the data attribute anchorlist.
    ----------------------
    That is indeed the (only) documentation for .anchorlist . What more
    were you expecting to see?

    So .. How does an average developer figure out that parser returns a
    list of hyperlinks in an attribute called anchorlist? Is this
    They keep the Library Reference under their pillow :-)

    And strictly it doesn't *return* a list of links. And that's
    certainly not HTMLParser's main function in life. It merely makes
    such a list available as a convenience. In fact, many people instead
    use module sgmllib, which provides no such convenience, but otherwise
    does the same parsing work as module htmllib.
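
    For example, a rough sgmllib version of find_links() might look
    something like this (an untested sketch, not something from the book):

    import sgmllib

    class LinkParser(sgmllib.SGMLParser):
        """Collect href attributes of <a> tags, much like anchorlist."""
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.links = []

        def start_a(self, attrs):
            # attrs is a list of (name, value) pairs for the <a> tag
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def find_links(html):
        parser = LinkParser()
        parser.feed(html)
        parser.close()
        return parser.links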

    something that you just "figure out" or is there some book I should be
    reading that documents all of the attributes for a particular
    method? It just seems a bit obscure and certainly not something I
    would have figured out on my own. Does this make me a poor developer
    who should find another hobby? I just need to know if there is
    something wrong with me or if this is a reasonable question to ask.
    But you *did* figure it out. How else is it that you come to be
    explaining it to us?

    Keep in mind that *nobody* knows all of the standard library. I've
    been writing Python code full time for years, and I often bump into
    whole standard library modules whose existence I'd forgotten about, or
    was never really aware of in the first place. The more you know about
    what it can do, the more convenience you'll get out of it, is all.

    The last question I have is about debugging. The spider is capable
    of parsing links until it reaches:

    "html = get_page(http://www.google.com/jobs/fortune)" which returns
    the contents of a pdf document, assigns the pdf contents to html which
    is later passed to parser.feed(html) which crashes. [...]
    How would an experienced Python developer check the contents of "html"
    to make sure it's not something other than a blob of HTML code? I
    should note an obvious catch-22: how do I check the HTML in such
    a way that the check itself doesn't possibly crash the app? I thought
    about:

    try:
        parser.feed(html)
    except parser.HTMLParseError:
        parser.close()


    .... but I'm not sure if that is right or not? The app still crashes
    so obviously I'm doing something wrong.
    That kind of idea is often the best way. In this case, though, you
    probably want to do an up-front check by looking at the HTTP
    Content-Type header (Google for it), something like this:

    response = urllib2.urlopen(url)
    html = response.read()
    if response.info()["Content-Type"] == "text/html":
        parse(html)
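
    (One wrinkle: servers often send something like
    "text/html; charset=utf-8", so an exact string comparison can reject
    perfectly good pages. A slightly more forgiving sketch:)

    response = urllib2.urlopen(url)
    html = response.read()
    content_type = response.info().getheader("Content-Type", "")
    # keep only the media type itself, dropping any "; charset=..." part
    if content_type.split(";")[0].strip().lower() == "text/html":
        parse(html)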


    John
  • Gabriel Genellina at Aug 20, 2007 at 8:18 pm

    On 20 Aug, 15:44, "dogatemycompu... at gmail.com" wrote:

    ----------------------------------------------------------
    f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist
    ----------------------------------------------------------
    The htmllib.HTMLParser class is hard to use. I would replace those
    lines with:

    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.anchorlist = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.anchorlist.append(href)

    parser = MyHTMLParser()
    parser.feed(htmltext)
    print parser.anchorlist

    The anchorlist attribute, defined by myself here, is a list containing
    all href attributes found in the page.
    See <http://docs.python.org/lib/module-HTMLParser.html>
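
    In the original script, find_links() could then be rewritten more or
    less like this (a sketch, untested):

    def find_links(html):
        """Return a list of href values found in the HTML."""
        parser = MyHTMLParser()
        parser.feed(html)
        parser.close()
        return parser.anchorlist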
    I get the idea that we're allocating some memory that looks like a
    file so formatter.dumbwriter can manipulate it. The results are
    passed to formatter.abstractformatter which does something else to the
    HTML code. The results are then passed to "f" which is then passed to
    htmllib.HTMLParser so it can parse the html for links. I guess I
    don't understand with any great detail as to why this is happening.
    I know someone is going to say that I should RTFM so here is the gist
    of the documentation:
    Don't even try to understand it - it's a mess. Use the HTMLParser
    module instead.
    The last question is.. I can't find any documentation to explain
    where the "anchorlist" attribute came from? Here is the only
    reference to this attribute that I can find anywhere in the Python
    documentation.
    And that's all you will find.
    So .. How does an average developer figure out that parser returns a
    list of hyperlinks in an attribute called anchorlist? Is this
    Usually, those attributes are hyperlinked and you can find them in the
    documentation index. Not for this one :(
    something that you just "figure out" or is there some book I should be
    reading that documents all of the attributes for a particular
    method? It just seems a bit obscure and certainly not something I
    would have figured out on my own. Does this make me a poor developer
    who should find another hobby? I just need to know if there is
    something wrong with me or if this is a reasonable question to ask.
    It's a very reasonable question. The attribute should be documented
    properly. But the class itself is a bit old; I never use it
    anymore.
    The last question I have is about debugging. The spider is capable
    of parsing links until it reaches:

    "html = get_page(http://www.google.com/jobs/fortune)" which returns
    the contents of a pdf document, assigns the pdf contents to html which
    is later passed to parser.feed(html) which crashes.
    You can verify the Content-Type header before processing. Quoting the
    get_page method:
    def get_page(url, log):
        """Retrieve URL and return contents, log errors."""
        try:
            page = urllib2.urlopen(url)
        except urllib2.URLError:
            log("Error retrieving: " + url)
            return ''
        body = page.read()
        page.close()
        return body
    From <http://docs.python.org/lib/module-urllib2.html>, the urlopen
    method returns a file-like object, which has an additional info()
    method holding the response headers. You can get the Content-Type
    using page.info().gettype(), which should be text/html or text/xhtml.
    For any other type, just return '' as you do for any error.
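
    Something along these lines (an untested sketch):

    def get_page(url, log):
        """Retrieve URL and return contents, log errors; skip non-HTML."""
        try:
            page = urllib2.urlopen(url)
        except urllib2.URLError:
            log("Error retrieving: " + url)
            return ''
        # check the media type before reading the body
        if page.info().gettype() not in ("text/html", "text/xhtml"):
            log("Skipping non-HTML content: " + url)
            page.close()
            return ''
        body = page.read()
        page.close()
        return body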

    --
    Gabriel Genellina
  • Dogatemycomputer at Aug 20, 2007 at 8:48 pm
    Those responses were both very helpful. John's additional type
    checking is straightforward and easy to implement. I will also
    rewrite the application a second time using the class Gabriel
    offered. Both of these suggestions will help me gain some insight into
    how Python works.

    "Don't even try to understand it - it's a mess. Use the HTMLParser
    module instead."

    I personally think the application itself "feels" more complicated
    than it needs to be, but it's possible that is just my inexperience. I'm
    going to do some reading about the HTMLParser module. I'm sure I
    could make this spider a bit more functional in the process.

    Thank you again for all of your help!!
  • Stefan Behnel at Aug 21, 2007 at 6:44 am

    dogatemycomputer at gmail.com wrote:
    I personally think the application itself "feels" more complicated
    than it needs to be, but it's possible that is just my inexperience. I'm
    going to do some reading about the HTMLParser module. I'm sure I
    could make this spider a bit more functional in the process.
    That's because you are using the standard library to parse HTML. While
    HTMLParser can do what you want it to, it's rather hard to use, especially for
    new users.

    If you want to give lxml.html a try, a web spider would be something like this:

    import lxml.html as H

    def crawl(url, page_dict, depth=2, link_type="a"):
        if depth < 0:
            return            # stop once the requested depth is exhausted
        html = H.parse(url).getroot()
        html.make_links_absolute()

        page_dict[url] = (link_type, html)

        for element, attribute_type, href in html.iterlinks():
            if href not in page_dict:
                if element.tag in ("a", "img"): # ignore other link types
                    crawl(href, page_dict, depth-1, element.tag)

    page_dict = {}
    crawl("http://www.google.com", page_dict, 2)

    # and now do something with the pages in page_dict.
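
    If it helps, you could then walk the dictionary, e.g. (just a sketch):

    # list the link type and URL of every page that was crawled
    for url, (link_type, tree) in page_dict.items():
        print link_type, url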

    lxml can actually do a lot more for you, just look through the docs to get an
    idea. You can find lxml here:

    http://codespeak.net/lxml

    lxml.html is not yet released, though. Its first release (as part of lxml 2.0)
    is expected around the end of August. You can find some docs here:

    http://codespeak.net/lxml/dev

    and you can (easily) install it from Subversion sources:

    http://codespeak.net/svn/lxml/trunk

    Have fun,
    Stefan
  • John J. Lee at Aug 21, 2007 at 9:36 pm
    Gabriel Genellina <gagsl-py2 at yahoo.com.ar> writes:
    [...]
    Don't even try to understand it - it's a mess. Use the HTMLParser
    module instead.
    [...]

    Module sgmllib (and therefore module htmllib also) is more tolerant of
    bad HTML than module HTMLParser.


    John
  • Gabriel Genellina at Aug 22, 2007 at 3:30 am

    On 21 Aug, 18:36, j... at pobox.com (John J. Lee) wrote:
    Gabriel Genellina <gagsl-... at yahoo.com.ar> writes:

    [...]> Don't even try to understand it - it's a mess. Use the HTMLParser
    module instead.
    [...]

    Module sgmllib (and therefore module htmllib also) is more tolerant of
    bad HTML than module HTMLParser.
    I had the impression it was the opposite; anyway, neither of them can
    handle really bad html.
    I just don't *like* htmllib.HTMLParser - but that's only a matter of
    taste.

    --
    Gabriel Genellina
  • Stefan Behnel at Aug 22, 2007 at 6:47 am

    Gabriel Genellina wrote:
    On 21 Aug, 18:36, j... at pobox.com (John J. Lee) wrote:
    Gabriel Genellina <gagsl-... at yahoo.com.ar> writes:

    [...]> Don't even try to understand it - it's a mess. Use the HTMLParser
    module instead.
    [...]

    Module sgmllib (and therefore module htmllib also) is more tolerant of
    bad HTML than module HTMLParser.
    I had the impression it was the opposite; anyway, neither of them can
    handle really bad html.
    I just don't *like* htmllib.HTMLParser - but that's only a matter of
    taste.
    lxml.html handles bad HTML and it's a powerful tool that is very easy to use.
    And if one day you have to deal with really, *really* broken tag soup, it also
    comes with BeautifulSoup parser integration.
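
    Roughly like this (a sketch; the soupparser module ships with the
    upcoming lxml.html, so the exact import is an assumption, and
    BeautifulSoup needs to be installed):

    from lxml.html import soupparser

    # BeautifulSoup is used under the hood to build an lxml tree from tag soup
    root = soupparser.fromstring("<p>Broken <b>tag soup")
    print root.tag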

    Stefan
