FAQ
I have a working urlopen routine which opens
a URL, parses it for <a> tags, and prints out
the links in the page. On some sites, Wikipedia for
instance, I get:

HTTP error 403, Forbidden.

What is the difference between accessing the site through a web browser
and opening/reading the URL with Python's urllib2.urlopen?


  • Chris Rebert at Feb 28, 2011 at 6:19 am

    On Sun, Feb 27, 2011 at 9:38 PM, monkeys paw wrote:
    I have a working urlopen routine which opens
    a URL, parses it for <a> tags, and prints out
    the links in the page. On some sites, Wikipedia for
    instance, I get:

    HTTP error 403, Forbidden.

    What is the difference between accessing the site through a web browser
    and opening/reading the URL with Python's urllib2.urlopen?
    The User-Agent header (http://en.wikipedia.org/wiki/User_agent).
    "By default, the URLopener class sends a User-Agent header of
    urllib/VVV, where VVV is the urllib version number."
    (from http://docs.python.org/library/urllib.html)

    Some sites block obvious non-search-engine bots based on their HTTP
    User-Agent header value.

    You can override the urllib default:
    http://docs.python.org/library/urllib.html#urllib.URLopener.version

    Sidenote: Wikipedia has a proper API for programmatic browsing, which
    is likely why it's blocking your program.

    Cheers,
    Chris
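
    [Editor's note: a minimal sketch of the override Chris describes. The
    thread uses Python 2's urllib2; here the same idea is shown with
    Python 3's urllib.request, and the agent string is an illustrative
    placeholder, not a recommended value.]

    ```python
    import urllib.request  # Python 3 home of the urllib2 machinery in this thread

    def fetch(url, user_agent="link-lister/0.1"):
        """Open a URL with an explicit User-Agent instead of the default
        'Python-urllib/x.y' that some sites answer with 403 Forbidden."""
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        return urllib.request.urlopen(req)

    # The header is attached to the Request object before it is sent
    # (urllib stores header names capitalized, e.g. 'User-agent'):
    req = urllib.request.Request("https://example.com",
                                 headers={"User-Agent": "link-lister/0.1"})
    ```

    In Python 2 the equivalent is passing a headers dict to
    urllib2.Request, or setting URLopener.version as the linked docs show.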
  • Steven D'Aprano at Feb 28, 2011 at 10:31 am

    On Sun, 27 Feb 2011 22:19:18 -0800, Chris Rebert wrote:
    On Sun, Feb 27, 2011 at 9:38 PM, monkeys paw wrote:
    I have a working urlopen routine which opens a URL, parses it for <a>
    tags, and prints out the links in the page. On some sites, Wikipedia
    for instance, I get:

    HTTP error 403, Forbidden.

    What is the difference between accessing the site through a web browser
    and opening/reading the URL with Python's urllib2.urlopen?
    [...]
    Sidenote: Wikipedia has a proper API for programmatic browsing, which
    is likely why it's blocking your program.
    What he said. Please don't abuse Wikipedia by screen-scraping it.


    --
    Steven
  • Grant Edwards at Feb 28, 2011 at 3:21 pm

    On 2011-02-28, Chris Rebert wrote:
    On Sun, Feb 27, 2011 at 9:38 PM, monkeys paw wrote:
    I have a working urlopen routine which opens
    a URL, parses it for <a> tags, and prints out
    the links in the page. On some sites, Wikipedia for
    instance, I get:

    HTTP error 403, Forbidden.

    What is the difference between accessing the site through a web browser
    and opening/reading the URL with Python's urllib2.urlopen?
    The User-Agent header (http://en.wikipedia.org/wiki/User_agent).
    Sometimes you also need to set the Referer header (that is the
    spelling the HTTP spec uses) for pages that don't allow
    direct-linking from "outside".

    As somebody else has already said, if the site provides an API that
    they want you to use you should do so rather than hammering their web
    server with a screen-scraper.

    Not only is it a lot less load on the site, it's usually a lot easier.

    --
    Grant Edwards               grant.b.edwards at gmail.com
    Yow! Look DEEP into the OPENINGS!! Do you see any ELVES or EDSELS
    ... or a HIGHBALL?? ...
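
    [Editor's note: a hedged Python 3 sketch combining the Referer trick
    Grant mentions with the User-Agent override from earlier in the
    thread; the URLs and header values are placeholders.]

    ```python
    import urllib.request

    def fetch(url, referer, user_agent="link-lister/0.1"):
        """Request a page while supplying the two headers some sites
        check: User-Agent (who is asking) and Referer (where the link
        came from). Note the spec spells the header 'Referer'."""
        req = urllib.request.Request(url, headers={
            "User-Agent": user_agent,
            "Referer": referer,
        })
        return urllib.request.urlopen(req)

    # The Referer header rides along with the request:
    req = urllib.request.Request("https://example.com/image.png",
                                 headers={"Referer": "https://example.com/gallery"})
    ```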
  • Terry Reedy at Feb 28, 2011 at 5:44 pm

    On 2/28/2011 10:21 AM, Grant Edwards wrote:

    As somebody else has already said, if the site provides an API that
    they want you to use you should do so rather than hammering their web
    server with a screen-scraper.
    Is there any generic method for finding out if a site provides an
    API, and specifically, how do I find Wikipedia's?

    I looked at the Wikipedia articles on APIs and web services and did
    not find any mention of theirs (though there is one for Amazon).

    --
    Terry Jan Reedy
  • Chris Rebert at Feb 28, 2011 at 8:48 pm

    On Mon, Feb 28, 2011 at 9:44 AM, Terry Reedy wrote:
    On 2/28/2011 10:21 AM, Grant Edwards wrote:
    As somebody else has already said, if the site provides an API that
    they want you to use you should do so rather than hammering their web
    server with a screen-scraper.
    Is there any generic method for finding out if a site provides an
    API, and specifically, how do I find Wikipedia's?

    I looked at the Wikipedia articles on APIs and web services and did
    not find any mention of theirs (though there is one for Amazon).
    Technically, it's the API of Wikipedia's underlying wiki software,
    MediaWiki:
    http://www.mediawiki.org/wiki/API

    Cheers,
    Chris
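
    [Editor's note: a small Python 3 sketch against the MediaWiki API
    Chris links to. action=query with list=search is a documented API
    call; the User-Agent string here is a placeholder.]

    ```python
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"  # Wikipedia's MediaWiki endpoint

    def search_url(query, limit=5):
        """Build a MediaWiki full-text search request (action=query,
        list=search), returning results as JSON."""
        params = {
            "action": "query",
            "list": "search",
            "srsearch": query,
            "srlimit": limit,
            "format": "json",
        }
        return API + "?" + urllib.parse.urlencode(params)

    if __name__ == "__main__":
        # Network call; MediaWiki asks clients to send a descriptive User-Agent.
        req = urllib.request.Request(search_url("Python (programming language)"),
                                     headers={"User-Agent": "api-demo/0.1"})
        data = json.load(urllib.request.urlopen(req))
        for hit in data["query"]["search"]:
            print(hit["title"])
    ```

    Unlike screen-scraping, this returns structured data the server is
    designed to hand out cheaply.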

Discussion Overview
group: python-list @ python.org
posted: Feb 28, '11 at 5:38a
active: Feb 28, '11 at 8:48p
posts: 6
users: 5

site design / logo © 2022 Grokbase