Here is the first version of my web crawler, gocrawl. Its main features are:
- Full control over the URLs to visit, inspect and query
- Crawl delays applied per host
- Obedience to robots.txt rules
- Concurrent execution using goroutines
- Configurable logging using the built-in Go logger
- Open, customizable design by providing hooks into the execution logic
It's obviously an early release, but I think it has reached a point where it
can do useful work and behave as expected. There are already a few ways to
customize it: a custom Fetcher, a custom Logger and the LogFlags for the
verbosity level, the URLSelector function to filter the links of interest, and
the Visitor function that does the actual work on the content of the visited
pages.
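To make those customization points concrete, here is a rough sketch of the kind of wiring they enable. The hook names (URLSelector, Visitor, Logger) follow the description above, but the types and signatures are defined locally for illustration only; they are not gocrawl's actual API, so check the repository for the real types.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"os"
)

// Illustrative option hooks, defined locally; gocrawl's real field
// names and signatures may differ.
type Options struct {
	URLSelector func(target, origin *url.URL) bool // filter the links of interest
	Visitor     func(res *http.Response) bool      // do the work on a visited page
	Logger      *log.Logger                        // configurable logging
}

// crawlOnce shows how a crawler loop could consult the hooks for a single URL.
func crawlOnce(opts *Options, rawURL string) {
	u, err := url.Parse(rawURL)
	if err != nil || (opts.URLSelector != nil && !opts.URLSelector(u, nil)) {
		return // link filtered out, skip the fetch
	}
	res, err := http.Get(u.String())
	if err != nil {
		opts.Logger.Printf("fetch error: %v", err)
		return
	}
	defer res.Body.Close()
	if opts.Visitor != nil {
		opts.Visitor(res) // the caller's work on the page content
	}
}

func main() {
	opts := &Options{
		URLSelector: func(target, origin *url.URL) bool {
			return target.Host == "example.com" // only follow links on this host
		},
		Visitor: func(res *http.Response) bool {
			fmt.Println("visited", res.Request.URL, "status", res.StatusCode)
			return true // keep harvesting links from this page
		},
		Logger: log.New(os.Stdout, "gocrawl ", log.LstdFlags),
	}
	crawlOnce(opts, "http://example.com/")
}
```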
I have quite a few ideas for v0.2, notably a way to set crawling priorities
and better customization (my goal would be to make it easy to build
features on top of gocrawl, such as a Redis-persisted cache library).
I'm thinking of adding hooks (callback funcs) such as:
- Starting: called on Run() to allow startup customization, such as
reading seeds from a DB. Would return the strings to be used as seeds.
- Visited: called when a URL has been visited.
- Enqueued: called when a URL has been enqueued for a visit. Together
with Visited and Starting, this would allow for persistence/recovery mechanisms
in case of an unexpected exit (a rough sketch follows the list).
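Here is how I imagine these hooks could fit together for persistence/recovery. The hook names follow the list above, but the signatures and the little store are only my guesses at what v0.2 could look like, not anything implemented yet.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical v0.2 hooks; names follow the list above, signatures are guesses.
type Hooks struct {
	Starting func() []string // return the seeds, e.g. read from a DB
	Enqueued func(u string)  // a URL has been queued for a visit
	Visited  func(u string)  // a URL has actually been visited
}

// pendingStore keeps the set of enqueued-but-not-visited URLs, which is
// exactly the state needed to resume a crawl after an unexpected exit.
type pendingStore struct {
	mu      sync.Mutex
	pending map[string]bool
}

func (s *pendingStore) enqueued(u string) { s.mu.Lock(); s.pending[u] = true; s.mu.Unlock() }
func (s *pendingStore) visited(u string)  { s.mu.Lock(); delete(s.pending, u); s.mu.Unlock() }
func (s *pendingStore) seeds() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]string, 0, len(s.pending))
	for u := range s.pending {
		out = append(out, u)
	}
	return out
}

func main() {
	// Pretend the previous run exited with one URL still pending.
	store := &pendingStore{pending: map[string]bool{"http://example.com/left-over": true}}
	hooks := Hooks{
		Starting: store.seeds,    // resume from whatever was still pending
		Enqueued: store.enqueued, // persist before the visit happens
		Visited:  store.visited,  // mark done once the page was processed
	}
	fmt.Println("seeds on restart:", hooks.Starting())
	hooks.Enqueued("http://example.com/next")
	hooks.Visited("http://example.com/left-over")
	fmt.Println("still pending:", hooks.Starting())
}
```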
Suggestions are welcome!