Hi all,

Here is my first version of my web crawler gocrawl:
https://github.com/PuerkitoBio/gocrawl

It features:

- Full control over the URLs to visit, inspect and query
- Crawl delays applied per host
- Obedience to robots.txt rules
- Concurrent execution using goroutines
- Configurable logging using the built-in Go logger
- Open, customizable design by providing hooks into the execution logic

It's obviously an early release, but I think it has reached a point where it
can do useful work and behaves as expected. There are already a few ways to
customize it: a custom Fetcher, a custom Logger and the LogFlags for the
verbosity level, the URLSelector function to filter links of interest, and
the Visitor function that does the actual work on the content of the
visited page.
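
To give the flavor, a custom URLSelector and Visitor look roughly like
this (simplified sketch; the exact signatures and how they are wired in
are in the README):

package main

import (
    "fmt"
    "net/http"
    "net/url"

    "github.com/PuerkitoBio/goquery"
)

// selectURL filters harvested links: in this example, only unvisited
// URLs on golang.org are worth a visit.
func selectURL(target, origin *url.URL, isVisited bool) bool {
    return !isVisited && target.Host == "golang.org"
}

// visit does the actual work on the page content; returning true tells
// the crawler to harvest the links found in the document.
func visit(res *http.Response, doc *goquery.Document) ([]*url.URL, bool) {
    doc.Find("title").Each(func(i int, s *goquery.Selection) {
        fmt.Println(res.Request.URL.String(), "->", s.Text())
    })
    return nil, true
}

func main() {
    // Assign selectURL and visit to the crawler's options and call Run
    // with the seed URLs (the exact setup is in the README).
}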

I have quite a few ideas for v0.2, namely a way to set crawling priorities
and better customization (my goal is to make it easy to build features on
top of gocrawl, such as a redis-persisted cache library). I'm thinking of
adding hooks (callback funcs) such as the ones below (a rough sketch
follows the list):

- Starting: called on Run() to allow startup customization, such as
reading seeds from a DB. Would return a []string to be used as seeds.
- Visited: called when a URL has been visited.
- Enqueued: called when a URL has been enqueued for a visit. Together
with Visited and Starting, this would allow for persistence/recovery
mechanisms in case of an unexpected exit.
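
Roughly something along these lines; nothing is settled yet, this is just
a sketch of the idea:

package gocrawl

import "net/url"

// Hooks sketches the proposed v0.2 callbacks; the names match the list
// above, the signatures are still up for discussion.
type Hooks struct {
    // Starting is called on Run() and returns the seed URLs, e.g. read
    // from a database.
    Starting func() []string
    // Visited is called once a URL has been visited, with the links
    // harvested from that page.
    Visited func(visited *url.URL, harvested []*url.URL)
    // Enqueued is called when a URL has been enqueued for a visit, which
    // makes persistence/recovery of the work queue possible.
    Enqueued func(enqueued *url.URL)
}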

Suggestions are welcome!

Thanks,
Martin

--

  • Martin Angers at Oct 29, 2012 at 2:01 am
    Hi again,

    I pushed a new version to the master branch:
    https://github.com/PuerkitoBio/gocrawl

    It now supports making a HEAD request prior to the GET request (thanks
    dmitrybond) and, of course, a way to decide whether or not to go on with
    the GET, via an extender method. Full docs are in the README. I still
    have a few things left on the drawing board, mostly a way to re-enqueue
    URLs when a fetch error occurs or in the case of redirections (at the
    moment, as described in issue #3, it follows and visits redirections,
    bypassing the Filter()).
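
    For example, a simplified version of such an extender method could look
    roughly like this (the real name and signature are in the README):

    package main

    import (
        "net/http"
        "strings"
    )

    // requestGet decides, based on the HEAD response, whether the GET is
    // worth doing: here, only HTML documents under ~1MB (an unknown
    // ContentLength of -1 also passes).
    func requestGet(headRes *http.Response) bool {
        ct := headRes.Header.Get("Content-Type")
        return strings.HasPrefix(ct, "text/html") && headRes.ContentLength < 1<<20
    }

    func main() {}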

    Thanks,
    Martin

  • Richard Penman at Oct 30, 2012 at 12:51 am
    Nice!

    Did you find a solution for accessing the original HTML in the Visit hook?

    If SameHostOnly is false, then a crawl may end up spanning thousands of
    domains, with a worker per domain. (A more realistic use case than my Alexa
    one.)
    Do you think limiting the number of concurrent domains would be better
    handled by gocrawl or in the hooks?

  • Martin Angers at Oct 30, 2012 at 1:49 am
    Hi Richard,

    Yes, I re-create the Body from the slice of bytes so that it can be
    accessed again in the Visit method.
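
    Roughly like this (simplified, the actual code differs a bit):

    package sketch

    import (
        "bytes"
        "io/ioutil"
        "net/http"
    )

    // restoreBody reads the body once, then puts an equivalent reader back
    // on the response so the Visit method can read it again.
    func restoreBody(res *http.Response) ([]byte, error) {
        b, err := ioutil.ReadAll(res.Body)
        if err != nil {
            return nil, err
        }
        res.Body.Close()
        res.Body = ioutil.NopCloser(bytes.NewReader(b))
        return b, nil
    }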

    As for the number of workers per domain: with SameHostOnly set to false,
    you may harvest links from thousands of domains, but you will only end up
    with thousands of workers if the Filter() method returns true for every
    one of them, i.e. only if you are actually interested in crawling all
    those domains. I don't know how badly that many goroutines would behave,
    but it would obviously be pretty slow to crawl that many domains on a
    handful of cores! In fact, past a certain point, it would be slow on a
    single machine regardless of the implementation.

    You would certainly want to use distributed crawling, using the hooks to
    save the links to a database and possibly to limit the number of domains
    crawled by a single process. As I mentioned in the README, such a load is
    untested territory and not the target use case this was designed for, but
    I believe the mechanisms are there to build it.
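
    For example, a small helper called from the Filter() hook could cap the
    number of distinct hosts a single process accepts, something like this
    (sketch only, the wiring into Filter() is up to you):

    package sketch

    import (
        "net/url"
        "sync"
    )

    // hostLimiter accepts URLs on at most `limit` distinct hosts; anything
    // beyond that is filtered out and could be saved to a DB for another
    // process to pick up.
    type hostLimiter struct {
        mu    sync.Mutex
        seen  map[string]bool
        limit int
    }

    func newHostLimiter(limit int) *hostLimiter {
        return &hostLimiter{seen: make(map[string]bool), limit: limit}
    }

    // Allow reports whether u's host is already accepted or there is still
    // room for one more.
    func (h *hostLimiter) Allow(u *url.URL) bool {
        h.mu.Lock()
        defer h.mu.Unlock()
        if h.seen[u.Host] {
            return true
        }
        if len(h.seen) >= h.limit {
            return false
        }
        h.seen[u.Host] = true
        return true
    }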

    Martin

  • Martin Angers at Dec 13, 2012 at 1:43 am
    Hi,

    I pushed a major feature to gocrawl (https://github.com/PuerkitoBio/gocrawl),
    fixing a few issues along the way. Prior to this, the Fetch()
    implementation used the default http Client, so it followed all
    redirections (up to 10), bypassing any Filter() implementation and the
    SameHostOnly option along the way. Also, there was no way to arbitrarily
    enqueue other URLs (or re-enqueue a URL following a 500 error so that it
    could be retried, for example).

    Now, the new default Fetch() implementation doesn't follow redirections;
    instead it enqueues the "redirect-to" URL so that it goes through the
    Filter() process. Also, if the Extender instance has an EnqueueChan field
    (by convention), that field is set to a channel that allows pushing new
    URLs to the crawler. Those URLs also have to go through Filter(),
    obviously, and the SameHostOnly option is enforced on every URL.
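
    To make the convention concrete, a custom extender could look roughly
    like this (embedding the default extender; the element type of the
    channel is simplified here, the README has the exact declaration):

    package main

    import "github.com/PuerkitoBio/gocrawl"

    // myExtender embeds the default extender and declares the EnqueueChan
    // field so the crawler can set it when Run() starts.
    type myExtender struct {
        gocrawl.DefaultExtender
        EnqueueChan chan<- interface{}
    }

    // retryLater pushes a URL back to the crawler, e.g. after a 500 error;
    // it still goes through Filter() and the SameHostOnly check.
    func (e *myExtender) retryLater(u string) {
        e.EnqueueChan <- u
    }

    func main() {
        // Pass &myExtender{} to the crawler as usual; the rest is in the
        // README.
    }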

    Of course, it's always possible to implement a custom Fetch() or any other
    Extender function.

    The README is up to date; let me know if anything is unclear (GitHub
    issues are the preferred channel).

    Thanks,
    Martin

  • Martin Angers at Jan 24, 2013 at 12:24 am
    Just a quick note to announce v0.3.1 of gocrawl. It fixes issues 9 and 10 from the GitHub repo (https://github.com/PuerkitoBio/gocrawl). Instead of using the normalized URL for all operations (including the request to the website), it now uses the normalized form only for the filtering process. The fetch is done using the original, non-normalized URL.

    This fixes the relatively common case where normalization "breaks" the URL: e.g. the original URL is golang.org/pkg/, it gets normalized to golang.org/pkg (no trailing slash), the request is made, it receives a redirect to golang.org/pkg/, and the crawler enters an infinite loop.
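
    In other words, the idea is roughly this (sketch only; normalize() stands
    in for the real normalization step):

    package main

    import (
        "net/http"
        "net/url"
        "strings"
    )

    // normalize stands in for the real normalization; stripping the trailing
    // slash is the part that used to break fetches.
    func normalize(u *url.URL) string {
        return strings.TrimSuffix(u.String(), "/")
    }

    var seen = make(map[string]bool)

    // maybeFetch dedups on the normalized form, but issues the request with
    // the URL exactly as it was found.
    func maybeFetch(original *url.URL) error {
        key := normalize(original)
        if seen[key] {
            return nil
        }
        seen[key] = true
        res, err := http.Get(original.String()) // trailing slash left intact
        if err != nil {
            return err
        }
        return res.Body.Close()
    }

    func main() {
        u, _ := url.Parse("http://golang.org/pkg/")
        maybeFetch(u)
    }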

    Also, the HTTP client used by the default Fetch implementation is now exported (public, named HttpClient), so that it can be customized without a full Fetch override and code duplication.
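
    For example, to send all requests through a proxy, the exported client can
    be tweaked directly (sketch; the proxy address is just an illustration):

    package main

    import (
        "net/http"
        "net/url"

        "github.com/PuerkitoBio/gocrawl"
    )

    func main() {
        // Swap only the transport; leaving CheckRedirect alone preserves
        // gocrawl's redirect handling.
        proxy, _ := url.Parse("http://127.0.0.1:8080")
        gocrawl.HttpClient.Transport = &http.Transport{Proxy: http.ProxyURL(proxy)}

        // ...then set up the extender/options and Run() as usual.
    }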

    See the README for all the details, and GitHub issues for bugs!

    Thanks,
    Martin

    --
