Hi everyone,
I am trying to build my own web crawler for an experiment, and I don't
know how to use the HTTP protocol from Python.

Also, are there any open-source HTML parsing engines
available in Python? That would be great.


  • Dan Stromberg at Jun 29, 2008 at 2:38 am

    On Sat, 28 Jun 2008 19:03:39 -0700, disappearedng wrote:

    > Hi everyone
    > I am trying to build my own web crawler for an experiment and I don't
    > know how to use the HTTP protocol from Python.
    >
    > Also, are there any open-source HTML parsing engines
    > available in Python too? That would be great.

    Check out BeautifulSoup. I don't recall what license it uses, but the
    source is available, and it copes well with HTML that is not
    necessarily beautiful inside.
  • Benjamin at Jun 29, 2008 at 2:40 am
    On Jun 28, 9:03 pm, disappeare... at gmail.com wrote:
    > Hi everyone
    > I am trying to build my own web crawler for an experiment and I don't
    > know how to use the HTTP protocol from Python.

    Look at the httplib module.

    > Also, are there any open-source HTML parsing engines
    > available in Python too? That would be great.
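
A minimal sketch of the httplib route, for anyone reading this later. In Python 3 the module is named http.client. The helper name fetch_page, the User-Agent string, and the host in the usage comment are assumptions made up for illustration; nothing here comes from the thread itself.

```python
import http.client


def fetch_page(host, path="/", port=80, timeout=10):
    """Send one GET request and return (status, headers, body_bytes).

    fetch_page is a hypothetical helper name, not part of http.client.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        # The User-Agent value is an arbitrary placeholder.
        conn.request("GET", path, headers={"User-Agent": "toy-crawler/0.1"})
        response = conn.getresponse()
        body = response.read()  # read everything before closing the connection
        return response.status, dict(response.getheaders()), body
    finally:
        conn.close()


# Usage (placeholder host, needs network access):
#   status, headers, body = fetch_page("example.com")
```

This level of the API leaves redirects, HTTPS, and encodings to the caller, which is why a higher-level module is usually more convenient for a crawler.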
  • Victor Noagbodji at Jun 29, 2008 at 3:22 am

    > Hi everyone

    Hello

    > I am trying to build my own web crawler for an experiment and I don't
    > know how to use the HTTP protocol from Python.

    urllib2: http://docs.python.org/lib/module-urllib2.html

    > Also, are there any open-source HTML parsing engines
    > available in Python too? That would be great.

    BeautifulSoup:
    http://www.crummy.com/software/BeautifulSoup/
    http://www.crummy.com/software/BeautifulSoup/documentation.html

    All the best

    --
    NOAGBODJI Paul Victor
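
A rough sketch of the fetch-then-parse step these two links point at. In Python 3, urllib2 lives on as urllib.request. To stay dependency-free, this uses the standard library's html.parser as a stand-in for BeautifulSoup, so it is not BeautifulSoup's API. The names LinkCollector, extract_links, and fetch_html are made up for illustration, and the UTF-8 fallback is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags as the document is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> found in html."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    # Resolve relative links against the page they were found on.
    return [urljoin(base_url, href) for href in collector.links]


def fetch_html(url, timeout=10):
    """Download url and decode it as text (UTF-8 assumed if undeclared)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A crawler would call fetch_html on a URL and pass the result, together with that URL, to extract_links to find the next pages to visit.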
  • Stefan Behnel at Jun 29, 2008 at 5:26 am

    disappearedng at gmail.com wrote:
    > I am trying to build my own web crawler for an experiment and I don't
    > know how to use the HTTP protocol from Python.
    >
    > Also, are there any open-source HTML parsing engines
    > available in Python too? That would be great.

    Try lxml.html. It parses broken HTML, supports HTTP, is much faster than
    BeautifulSoup and threadable, all of which should be helpful for your crawler.

    http://codespeak.net/lxml/

    Stefan
  • Sebastian "lunar" Wiesner at Jun 29, 2008 at 9:23 am

    Stefan Behnel <stefan_ml at behnel.de>:

    > disappearedng at gmail.com wrote:
    >> I am trying to build my own web crawler for an experiment and I don't
    >> know how to use the HTTP protocol from Python.
    >>
    >> Also, are there any open-source HTML parsing engines
    >> available in Python too? That would be great.
    > Try lxml.html. It parses broken HTML, supports HTTP, is much faster than
    > BeautifulSoup and threadable, all of which should be helpful for your
    > crawler.

    You should also mention its powerful features, like XPath and CSS
    selector support, and its easy API ;)

    --
    Freedom is always the freedom of dissenters.
    (Rosa Luxemburg)
  • Larry Bates at Jun 30, 2008 at 12:52 pm

    disappearedng at gmail.com wrote:
    > Hi everyone
    > I am trying to build my own web crawler for an experiment and I don't
    > know how to use the HTTP protocol from Python.
    >
    > Also, are there any open-source HTML parsing engines
    > available in Python too? That would be great.

    Check out Mechanize. It wraps Beautiful Soup inside of methods that aid in
    website crawling.

    http://pypi.python.org/pypi/mechanize/0.1.7b

    -Larry
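
For a sense of what a crawling helper automates, here is a library-agnostic sketch of a breadth-first crawl loop. It does not use Mechanize's API. The function name crawl, the get_links callback, and the max_pages limit are all assumptions made up for illustration; get_links is expected to download a page and return the absolute URLs it links to, using whichever HTTP and parsing libraries you settle on.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse


def crawl(start_url, get_links, max_pages=50):
    """Breadth-first crawl restricted to start_url's host.

    get_links is a caller-supplied (hypothetical) function that takes a URL
    and returns the absolute URLs linked from that page.
    Returns the successfully fetched URLs in visiting order.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}  # everything queued so far, to avoid loops
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            links = get_links(url)
        except Exception:
            continue  # skip pages that fail to download or parse
        visited.append(url)
        for link in links:
            link = urldefrag(link).url  # drop "#fragment" so a page is not revisited
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the fetcher in as a function keeps the loop easy to try out against an in-memory dictionary of pages before pointing it at a real site. A real crawler should also respect robots.txt and pause between requests.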

Discussion Overview
group: python-list @
category: python
posted: Jun 29, '08 at 2:03a
active: Jun 30, '08 at 12:52p
posts: 7
users: 7
website: python.org
