Any suggestions on debugging the generator? My log4j level is already
set to DEBUG, but there are no DEBUG entries, only the final WARN lines:

08/02/20 15:38:09 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
08/02/20 15:38:09 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
08/02/20 15:38:09 WARN crawl.Crawl: No URLs to fetch - check your seed list and URL filters.

I've inserted code at Generator.java:424, which says:

    if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
      LOG.warn("Generator: 0 records selected for fetching, exiting ...");

That is essentially the decision point, so I could see which of the
conditions triggered the "0 records selected" message. The "readers"
object is perfectly fine, but the SequenceFileOutputFormat is reporting
that there are no values (URL scores, I suppose) to be retrieved at all,
which causes the generator to stop.
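
For anyone wanting to repeat this check, here is a minimal sketch of splitting the combined condition so that each case reports separately. It is not the actual Generator code: `RecordReader` is a hypothetical stand-in for the real reader type (whose `next` takes a `FloatWritable`), and `whyEmpty` is a name made up for illustration.

```java
public class GeneratorCheck {

    /** Hypothetical stand-in for the real reader type; only next() matters here. */
    interface RecordReader {
        boolean next(); // true if a record was read
    }

    /**
     * Returns a description of the first condition that would trigger the
     * "0 records selected" warning, or null if the first reader has a record.
     */
    static String whyEmpty(RecordReader[] readers) {
        if (readers == null) {
            return "readers == null";
        }
        if (readers.length == 0) {
            return "readers.length == 0";
        }
        if (!readers[0].next()) {
            return "readers[0].next() returned false";
        }
        return null;
    }

    public static void main(String[] args) {
        RecordReader empty = () -> false;   // reader with no records
        RecordReader nonEmpty = () -> true; // reader with at least one record

        System.out.println(whyEmpty(null));
        System.out.println(whyEmpty(new RecordReader[0]));
        System.out.println(whyEmpty(new RecordReader[] { empty }));
        System.out.println(whyEmpty(new RecordReader[] { nonEmpty }));
    }
}
```

In my case it is the third branch that fires: the array exists and is non-empty, but the first reader has nothing to return.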

On Wed, Feb 20, 2008 at 5:39 PM, John Mendenhall wrote:
$ /exp/sw/nutch-0.9/bin/nutch crawl urls -dir crawled-15 -depth 3
(also tried this with '+*', '+.', didn't work either)
I don't understand how +* would ever work, since * is for
repeating the previous element. But +. should work.
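
A quick way to check the two rules outside of Nutch is sketched below. It assumes the filter rules are java.util.regex patterns applied with a find-style match after the leading + or - is stripped; that is an assumption about the filter, not something confirmed in this thread. The class and method names are made up for illustration, and the URL is a placeholder.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class FilterRuleCheck {

    /** Returns true if the text is a valid java.util.regex pattern. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // A bare "*" has no preceding element to repeat, so it is rejected.
            return false;
        }
    }

    /** Returns true if the pattern is valid and is found somewhere in the URL. */
    static boolean ruleMatches(String regex, String url) {
        return compiles(regex) && Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://example.com/"; // placeholder URL

        // Rule "+*": the regex part is "*", which does not compile.
        System.out.println("'*' compiles: " + compiles("*"));

        // Rule "+.": the regex part is ".", which matches any single character.
        System.out.println("'.' matches:  " + ruleMatches(".", url));
    }
}
```

Under those assumptions "*" is rejected as a pattern while "." is found in any non-empty URL, which is why +. is the rule to test with.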

Everything else looked okay to me. I would start looking
at the logs closely. I would try setting your log4j
properties to INFO or DEBUG level for the generator

The inject is obviously working since your stats shows
the urls in the crawldb as unfetched. So, debug the


john mendenhall
surf utopia
internet services

Discussion Overview
group: nutch-user @
posted: Feb 20, '08 at 8:53p
active: Feb 20, '08 at 11:19p

2 users in discussion

Jiaqi Tan: 4 posts John Mendenhall: 4 posts


