I'm having trouble getting Nutch-0.9 (recompiled with NUTCH-467
applied) to crawl, and have tried many of the fixes that have been
suggested here on the mailing list. The following is my Nutch output:
crawl started in: crawled-12
rootUrlDir = urls
threads = 10
depth = 3
topN = 20
Injector: crawlDb: crawled-12/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Generator: Selecting best-scoring urls due for fetch.
Generator: segment: crawled-12/segments/20080220133145
Generator: filtering: false
Generator: topN: 20
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawled-12
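(For reference, the invocation that produced the output above was the standard one-step crawl command, of the form:)

```
bin/nutch crawl urls -dir crawled-12 -depth 3 -topN 20 -threads 10
```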
I have done/checked the following:
1. I have a valid http.agent.name string specified in nutch-site.xml;
as a precaution, I also commented out the http.agent.name <property>
section in nutch-default.xml in case the final configuration does not
take hold. I have also verified this against the job.xml retrieved via
the map/reduce web interface on port 50030 of my master node, and the
http.agent.name and http.agent.version strings are both present (and correct).
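The relevant section of my nutch-site.xml looks like this (the agent
value shown here is a placeholder, not my real one):

```xml
<property>
  <name>http.agent.name</name>
  <!-- placeholder value -->
  <value>MyCrawler</value>
  <description>HTTP 'User-Agent' request header sent by the fetcher.</description>
</property>
```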
2. I have configured my crawl-urlfilter.txt in all manner of ways, and
it definitely allows the domains I'm crawling. I have even added
"+." at the end of the file to allow everything, but the crawl
still does not work.
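For example, my crawl-urlfilter.txt is currently essentially the
permissive variant (the domain shown is a placeholder for the ones I
actually crawl):

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept hosts in the target domain (placeholder)
+^http://([a-z0-9]*\.)*example.com/
# accept everything else (added while debugging)
+.
```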
3. My logging level has been set to DEBUG, and then to TRACE, and
still there are no errors or warnings, except for messages that look like:
2008-02-20 07:47:55,247 DEBUG conf.Configuration - java.io.IOException: config()
which doesn't look like an error to me; after looking at the line in
the source where it comes from, it seems to be just an indication that a
Configuration object is being constructed (please correct me if I'm wrong).
4. I have tried hadoop clusters with 1, 2, and 4 slaves.
5. I have tried URL lists with 1, 4, 6, 12, 40, and 46 distinct URLs, in
case there is some minimum number of URLs needed. I seem to
remember reading about such an issue on the mailing list, but I
cannot find the post anymore; if anyone could point me to it,
that would be helpful.
6. I have tried setting "crawl.generate.filter" to true, and false, in
nutch-site.xml; neither works.
7. I have tried setting the number of map and reduce tasks to 10, 1,
and 4; none of these helped.
8. There were no OutOfMemoryErrors whatsoever and system load was not
excessive during the crawl.
9. Results from readdb -stats:
CrawlDb statistics start: crawled-12/crawldb
Statistics for CrawlDb: crawled-12/crawldb
TOTAL urls: 46
retry 0: 46
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 46
CrawlDb statistics: done
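(The stats above were produced with:)

```
bin/nutch readdb crawled-12/crawldb -stats
```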
Any help at all would be much appreciated.