Hi all,

I'm learning web scraping with Python from the following link:
http://www.packtpub.com/article/web-scraping-with-python

To work with it, mechanize needs to be installed. I installed it using:

sudo apt-get install python-mechanize

As given in the tutorial, I tried the code below:

import mechanize
BASE_URL = "http://www.packtpub.com/article-network"
br = mechanize.Browser()
data = br.open(BASE_URL).get_data()

I received the following error:

File "webscrap.py", line 4, in <module>
data = br.open(BASE_URL).get_data()
File "/usr/lib/python2.6/dist-packages/mechanize/_mechanize.py", line 209,
in open
return self._mech_open(url, data, timeout=timeout)
File "/usr/lib/python2.6/dist-packages/mechanize/_mechanize.py", line 261,
in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request
disallowed by robots.txt


Any ideas welcome.


  • Chris Rebert at Oct 15, 2009 at 8:00 am

    On Thu, Oct 15, 2009 at 12:39 AM, Raji Seetharaman wrote:
    [quoted message snipped]

    mechanize._response.httperror_seek_wrapper: HTTP Error 403: request
    disallowed by robots.txt

    Apparently that website's tutorial and robots.txt are not in sync.
    robots.txt is part of the Robot Exclusion Standard
    (http://en.wikipedia.org/wiki/Robots_exclusion_standard) and is the
    standard way websites specify which webpages should and should not be
    accessed programmatically. In this case, that site's robots.txt is
    forbidding access to the webpage in question from autonomous programs.
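
    For reference, a disallow rule in a site's robots.txt looks like
    this (illustrative only, not the actual packtpub.com file):

    User-agent: *
    Disallow: /article-network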

    There's probably a way to tell mechanize to ignore robots.txt though,
    given the standard is not enforced server-side; programs just follow
    it voluntarily.
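
    For example, a minimal sketch (untested; set_handle_robots() is the
    mechanize switch that controls robots.txt handling, and the
    User-Agent string below is just an illustration):

    import mechanize

    BASE_URL = "http://www.packtpub.com/article-network"

    br = mechanize.Browser()
    # Don't fetch or obey robots.txt.
    br.set_handle_robots(False)
    # Some sites also reject mechanize's default User-Agent, so send a
    # browser-like one.
    br.addheaders = [("User-Agent", "Mozilla/5.0 (X11; Linux i686)")]

    data = br.open(BASE_URL).get_data()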

    Cheers,
    Chris
