I'm having trouble getting Nutch-0.9 (recompiled with NUTCH-467
applied) to crawl, and have tried many of the fixes that have been
suggested here on the mailing list. The following is my Nutch output:

crawl started in: crawled-12
rootUrlDir = urls
threads = 10
depth = 3
topN = 20
Injector: starting
Injector: crawlDb: crawled-12/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawled-12/segments/20080220133145
Generator: filtering: false
Generator: topN: 20
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawled-12

I have done/checked for the following:

1. I have a valid http.agent.name string specified in nutch-site.xml;
as a precaution, I also commented out the http.agent.name <property>
section in nutch-default.xml in case the override in nutch-site.xml did
not take effect. I have also verified this against the job.xml
retrieved via the map/reduce web interface on port 50030 of my master
node: the http.agent.name and http.agent.version strings are both
present (and not empty).

2. I have configured my crawl-urlfilter.txt in all manner of ways, and
it definitely allows the domains I'm crawling from. I have even added
"+." to allow everything at the end of the file, but still the crawl
does not work.

3. My logging level has been set to DEBUG, and then to TRACE, and
there are still no errors or warnings (except for messages that look
like this:
2008-02-20 07:47:55,247 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.(FSConstants.java:120)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.(DFSClient.java:276)
at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.create(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.ChecksumFileSystem$FSOutputSummer.(ChecksumFileSystem.java:438)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:253)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:84)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:78)
at org.apache.hadoop.fs.ChecksumFileSystem.copyFromLocalFile(ChecksumFileSystem.java:566)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:741)
at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:102)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:822)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:910)

which doesn't look like an error to me: after looking at the line in
the source where it came from, it looks more like an indication that a
configuration is being read. Please correct me if I'm wrong.)

4. I have tried hadoop clusters with 1, 2, and 4 slaves.

5. I have tried URL lists with 1, 4, 6, 12, 40, and 46 distinct URLs,
in case the problem was a minimum number of URLs needed. I seem to
remember reading about such an issue on the mailing list, but I can no
longer find the post. If anyone could point me to it, that would be
helpful.

6. I have tried setting "crawl.generate.filter" to true, and false, in
nutch-site.xml; neither works.

7. I have tried running with 10, 1, and 4 threads for the number of
map and reduce tasks.

8. There were no OutOfMemoryErrors whatsoever and system load was not
excessive during the crawl.

9. Results from readdb -stats:
CrawlDb statistics start: crawled-12/crawldb
Statistics for CrawlDb: crawled-12/crawldb
TOTAL urls: 46
retry 0: 46
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 46
CrawlDb statistics: done
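
For item 2, a minimal sketch of how I understand the regex URL filter to evaluate crawl-urlfilter.txt: rules are tried from top to bottom, the first pattern found in the URL decides (+ accepts, - rejects), and a URL that matches no rule is rejected. If that is right, the position of a catch-all rule matters. This is a Python approximation and not Nutch's Java code; the function names, the rules, and the example.org URLs are all made up for illustration, and Python's re module is not identical to java.util.regex.

```python
import re

# Hypothetical rules for illustration only (not my real filter file).
EXAMPLE_RULES = r"""
# skip image suffixes
-\.(gif|jpg|png)$
# accept one domain
+^http://([a-z0-9]*\.)*example\.org/
# skip everything else
-.
# a catch-all appended after "-." is never reached
+.
"""

def load_rules(text):
    """Parse filter text into (accept, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is dropped

if __name__ == "__main__":
    rules = load_rules(EXAMPLE_RULES)
    for url in ("http://www.example.org/index.html",
                "http://other.example.com/"):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

Running each seed URL through a check like this against the real filter file would show which rule, if any, rejects it.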

Any help at all would be much appreciated.
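
For anyone repeating the check in item 1, this is a rough sketch of how the job.xml comparison could be scripted. It assumes the file uses the usual Hadoop layout of <property> elements with <name> and <value> children inside <configuration>; the sample values and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a job.xml saved from the web interface.
SAMPLE_JOB_XML = """
<configuration>
  <property><name>http.agent.name</name><value>my-test-crawler</value></property>
  <property><name>http.agent.version</name><value></value></property>
  <property><name>crawl.generate.filter</name><value>false</value></property>
</configuration>
"""

def read_properties(xml_text):
    """Collect name -> value pairs from Hadoop-style configuration XML."""
    props = {}
    root = ET.fromstring(xml_text.strip())
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing or empty <value> is stored as an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props

def missing_or_empty(props, required):
    """List the required property names that are absent or blank."""
    return [name for name in required if not props.get(name)]

if __name__ == "__main__":
    props = read_properties(SAMPLE_JOB_XML)
    required = ["http.agent.name", "http.agent.version"]
    print("missing or empty:", missing_or_empty(props, required))
    print("crawl.generate.filter =", props.get("crawl.generate.filter"))
```

An empty list from missing_or_empty would confirm that the agent settings reached the submitted job.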


Jiaqi Tan

Posted Feb 20, '08 at 8:53p