|| at Feb 28, 2011 at 3:21 pm
On 2011-02-28, Chris Rebert wrote:
On Sun, Feb 27, 2011 at 9:38 PM, monkeys paw wrote:
I have a working urlopen routine which opens
a url, parses it for <a> tags and prints out
the links in the page. On some sites, Wikipedia for
instance, I get an HTTP 403 Forbidden error.
What is the difference in accessing the site through a web browser
and opening/reading the URL with python urllib2.urlopen?
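The routine itself isn't shown in the thread; a minimal sketch of the kind of link extractor described above might look like this (using Python 3's html.parser here, whereas the original thread would have been on Python 2's HTMLParser and urllib2; the sample HTML is just an illustration):

```python
from html.parser import HTMLParser

# Collect the href of every <a> tag encountered while parsing a page.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Sample input standing in for a fetched page:
page = '<p><a href="/wiki/Python">Python</a> and <a href="/wiki/Perl">Perl</a></p>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/wiki/Python', '/wiki/Perl']
```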
The User-Agent header (http://en.wikipedia.org/wiki/User_agent).
Sometimes you also need to set the Referer header (spelled
that way in HTTP) for pages that don't allow direct linking
from "outside".
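A minimal sketch of setting those headers (the thread is about Python 2's urllib2; the Python 3 equivalent shown here lives in urllib.request, and the User-Agent string and URL are just illustrative):

```python
import urllib.request

# Build a request that identifies itself instead of using the default
# Python user agent, which some sites (like Wikipedia) reject with a 403.
# In Python 2 this would be urllib2.Request / urllib2.urlopen instead.
url = "http://en.wikipedia.org/wiki/Python_(programming_language)"
req = urllib.request.Request(
    url,
    headers={
        # Hypothetical UA string -- identify your tool and a contact address.
        "User-Agent": "my-link-checker/0.1 (contact: you@example.com)",
        # Note the HTTP header really is spelled "Referer".
        "Referer": "http://en.wikipedia.org/",
    },
)

# The headers travel with the request when it is opened:
print(req.get_header("User-agent"))
# urllib.request.urlopen(req) would then perform the actual fetch.
```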
As somebody else has already said, if the site provides an API that
they want you to use, use it rather than hammering their web
server with a screen-scraper.
Not only is it a lot less load on the site, it's usually a lot easier.
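For Wikipedia specifically, that means querying the MediaWiki API instead of scraping rendered pages. A sketch of building such a query (the api.php endpoint and these parameters follow the MediaWiki API; the page title is just an example):

```python
from urllib.parse import urlencode

# Ask the MediaWiki API for a page's links as structured JSON,
# instead of parsing <a> tags out of the rendered HTML.
params = {
    "action": "query",
    "titles": "Python (programming language)",
    "prop": "links",       # return the links on the page
    "pllimit": "50",       # up to 50 links per request
    "format": "json",
}
api_url = "http://en.wikipedia.org/w/api.php?" + urlencode(params)
print(api_url)
# Fetching api_url (with a descriptive User-Agent, as above) returns
# the links directly -- no screen-scraping needed.
```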
Grant Edwards               grant.b.edwards at gmail.com
Yow! Look DEEP into the OPENINGS!! Do you see any ELVES or EDSELS ... or a