I'm trying to get a Nutch crawl to work, but it keeps stopping at depth 1 even
though there should be more data to fetch. I can download a list of URLs
without any problem using FreeGenerator, but the recursive crawl is not
working for me.
I have crawl-urlfilter.txt set up to accept any URL, and the plugins are
configured to use this filter. The only other Nutch configs I've changed are
the robots settings.
If I inspect the crawldb after a run, I see that it has fetched the 3 seed
pages and refused to fetch anything else:
TOTAL urls: 248
retry 0: 248
min score: 0.0090
avg score: 0.03530645
max score: 2.029
status 1 (db_unfetched): 245
status 2 (db_fetched): 3
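In case it matters, those stats come from the standard readdb command (paths here are from my setup, so adjust the crawldb location to yours):

```shell
# Dump the aggregate statistics for the crawl database
# ("crawl/crawldb" is my local path; substitute your own)
bin/nutch readdb crawl/crawldb -stats
```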
How can I get Nutch to fetch the rest of the URLs?

Thanks in advance for your help.
PS: here's my crawl-urlfilter.txt:
# skip image and other suffixes we can't yet parse
# skip URLs containing certain characters as probable queries, etc.
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# accept hosts in MY.DOMAIN.NAME
# skip everything else
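For reference, the stock crawl-urlfilter.txt that ships with Nutch has a regex rule under each of those comments; my actual rules may differ, but the default file looks roughly like this (reproduced from memory, so treat it as a sketch rather than my exact config):

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.
```

Note that the file is evaluated top to bottom and the first matching rule wins, so the final `-.` rejects anything not explicitly accepted above it.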