In article <cdac0350.0307191527.755df3e1 at posting.google.com>, jeff wrote:

I'm trying to pull tags off a website using Python. I've got a few things
running that have the potential to work, it's just that I can't get them to
because of certain errors.

Basically I don't want to download the images and all the other stuff, just
the HTML, and then work from there. I think it's timing out because it's
trying to download the images as well, which I don't want to do as this
would slow down what I'm trying to achieve. The URL used is
only an example.

A web page is made up of many separate components. When you
"download a webpage" you generally are fetching the HTML code,
and you will not get any images unless you specifically
download those by their own URLs.
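
To make that concrete, here is a minimal sketch in present-day Python 3
(which this thread predates) showing that images appear in the HTML only
as references. The sample markup and the ImageLister class name are made
up for illustration.

```python
from html.parser import HTMLParser


class ImageLister(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Made-up page content; fetching a page gives you only text like this.
sample_html = (
    '<html><body><a href="/r"><img src="/rabbit1.jpg"></a>'
    '<img src="/rabbit2.jpg"></body></html>'
)

lister = ImageLister()
lister.feed(sample_html)
print(lister.sources)
```

Nothing is downloaded here: the image URLs are just strings in the page,
and each one would need its own request to be fetched.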

this is my source


#!/usr/bin/env python
import re
import urllib

file = urllib.urlretrieve("http://images.google.com/images?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=rabbit", "temp1.tmp")
Two things:

Don't use "file" as the name of your variable. As of Python 2.2 it is
the name of the built-in file type (open is now an alias for it), so
your assignment shadows the built-in.

Why save the file and then read it back in?

I might do something like...

text = urllib.urlopen('http://www.example.org')
for line in text.readlines():
    print line
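
For anyone on Python 3, where urlopen lives in urllib.request, the same
idea might look like the sketch below. The fetch_lines name, the
10-second timeout, and the UTF-8 decoding are my assumptions, not
anything from the original script.

```python
import urllib.request


def fetch_lines(url, timeout=10):
    """Return the body at url as a list of text lines, without a temp file."""
    # timeout (in seconds) keeps a dead connection from hanging forever
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()
    # Assumes UTF-8; undecodable bytes are replaced rather than raising.
    return body.decode("utf-8", errors="replace").splitlines()
```

Something like `for line in fetch_lines('http://www.example.org'): print(line)`
would then print the page source line by line.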

# searching the file content line by line:
keyword = re.compile(r"</a>")

for line in text:
    result = keyword.search(line)
    if result:
        print result.group(1), ":", line,
There are no parentheses in your regex, so I do not
think you will ever have a group(1):

>>> import re
>>> keyword = re.compile(r"</a>")
>>> x = 'abc </a> def'
>>> z = keyword.search(x)
>>> z.group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: no such group

keyword = re.compile(r"(</a>)")
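
A short sketch of the difference, in Python 3 syntax (the sample string
and the second pattern are made up for illustration): group(0) is always
the whole match, while group(1) exists only if the pattern contains at
least one parenthesized group.

```python
import re

sample = 'abc <a href="/x">rabbit</a> def'

# No parentheses: only group(0), the whole match, is available.
plain = re.compile(r"</a>")
match = plain.search(sample)
whole = match.group(0)

try:
    match.group(1)
    has_group_one = True
except IndexError:
    # Raised because the pattern defines no group 1.
    has_group_one = False

# With parentheses, group(1) holds whatever the first group matched.
# This pattern is only an illustration: it captures the text between
# an opening <a ...> tag and the closing </a>.
link_text = re.compile(r"<a[^>]*>(.*?)</a>")
captured = link_text.search(sample).group(1)

print(whole, has_group_one, captured)
```

Wrapping just the closing tag in parentheses makes group(1) valid, but
it will only ever contain the literal closing tag, so a pattern that
captures the part you actually want is probably more useful.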

and these are the errors I'm getting:

C:\Python22>python tagyourit.py
Traceback (most recent call last):
  File "tagyourit.py", line 5, in ?
    file = urllib.urlretrieve("http://images.google.com/image
8&oe=UTF-8&q=rabbit" , "temp1.tmp")

Is this newline (between "image" and "8") really there? Maybe
there is a problem with the URL...

  File "C:\PYTHON22\lib\urllib.py", line 80, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, dat
  File "C:\PYTHON22\lib\urllib.py", line 210, in retrieve
    fp = self.open(url, data)
  File "C:\PYTHON22\lib\urllib.py", line 178, in open
    return getattr(self, name)(url)
  File "C:\PYTHON22\lib\urllib.py", line 292, in open_http
  File "C:\PYTHON22\lib\httplib.py", line 695, in endheaders
  File "C:\PYTHON22\lib\httplib.py", line 581, in _send_outpu
  File "C:\PYTHON22\lib\httplib.py", line 548, in send
  File "C:\PYTHON22\lib\httplib.py", line 532, in connect
    raise socket.error, msg

I think maybe you are just not getting any response at
all from your attempt to fetch. Can you fetch any other URL?
Maybe Google is checking User-Agent strings to try to keep
spiders out of its pages?
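
If the User-Agent guess is right, one way to test it is to send an
explicit header and report connection failures instead of letting the
traceback escape. This is only a sketch using Python 3's urllib.request;
the agent string, the function names, and the timeout are placeholders
I made up.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-script)"):
    """Create a request carrying an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def try_fetch(url, timeout=10):
    """Return (True, body_bytes) on success or (False, reason) on failure."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return True, resp.read()
    except OSError as exc:
        # URLError (and its subclass HTTPError) derive from OSError,
        # as do socket timeouts, so this covers the usual failures.
        return False, str(exc)
```

Calling try_fetch on a couple of unrelated URLs first would show whether
the problem is the network in general or this one site.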

Discussion Overview
group: python-list @
posted: Jul 19, '03 at 11:27p
active: Jul 21, '03 at 10:30a

3 users in discussion

Jeff: 2 posts, Lee Harr: 1 post, John J. Lee: 1 post