I am trying to crawl webpages in the CiteSeer domain (a collection of
research papers, mostly in computer science).

I have used the following code snippet.

#####
import urllib

sock = urllib.urlopen("http://citeseer.ist.psu.edu")
webcontent = sock.read().split('\n')
sock.close()
print webcontent
########

Then I get the following error message.


['<!--#set var="TITLE" value="Server error!"', '--><!--#include
virtual="include/top.html" -->', '', ' <!--#if
expr="$REDIRECT_ERROR_NOTES" -->', '', ' The server encountered an
internal error and was ', ' unable to complete your request.', '', '
<!--#include virtual="include/spacer.html" -->', '', ' Error message:', '
<br /><!--#echo encoding="none" var="REDIRECT_ERROR_NOTES" -->', '', '
<!--#else -->', '', ' The server encountered an internal error and was ',
' unable to complete your request. Either the server is', ' overloaded
or there was an error in a CGI script.', '', ' <!--#endif -->', '',
'<!--#include virtual="include/bottom.html" -->', '']

However, the URL is valid, and it works fine if I open it in my web
browser. If I use a different URL (http://www.google.com instead of
http://citeseer.ist.psu.edu), the same code works.

What is wrong?
Could it be that the CiteSeer web server checks the HTTP request, sees
something that it doesn't like, and rejects the request?
What should I do?

Thank you.

Best regards,
Yookyung


  • Charlespina at Dec 30, 2005 at 2:37 am
    I went to the URL you posted, and it looks like that error is the
    content you should be receiving. Try clearing your browser cache; you
    could be loading a cached page.

    Charles
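
To tell a genuine server-side error apart from a request the server dislikes, it can help to look at the HTTP status code and to retry with a browser-like User-Agent header. The following is only a sketch, written for Python 3 (urllib.request) rather than the Python 2 urllib used in the question. The helper names build_request and fetch and the User-Agent string are made up for illustration, and whether the CiteSeer server actually filters on User-Agent is an unverified assumption.

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent string, chosen only for illustration.
EXAMPLE_USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Build a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Return (status_code, lines) for url.

    In Python 3, urlopen raises HTTPError for 4xx/5xx responses, so the
    error page is read from the exception instead of being silently
    returned as if it were normal content.
    """
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            status = response.status
            body = response.read()
    except urllib.error.HTTPError as err:
        status = err.code
        body = err.read()
    return status, body.decode("utf-8", errors="replace").split("\n")


# Example usage (needs network access, so it is left commented out):
# status, lines = fetch("http://citeseer.ist.psu.edu")
# print(status)
# print(lines[:5])
```

If the status is still in the 5xx range with a browser-like User-Agent, the problem is more likely on the server side, which would be consistent with the reply above. In Python 2, urllib.urlopen does not raise an exception for HTTP error responses, which is why the original snippet prints the server's error page as if it were normal content.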


Discussion Overview
group: python-list @
categories: python
posted: Dec 30, '05 at 2:20a
active: Dec 30, '05 at 2:37a
posts: 2
users: 2
website: python.org
