Hi
I'm trying to get a Nutch crawl to work, but it keeps stopping at depth 1 even
though there should be more pages to fetch. I can download a list of URLs
without any problem using FreeGenerator; it's the recursive crawl that isn't
working for me.
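For reference, I'm launching it with the one-step crawl command, something
like this (urls/ is my seed directory, crawl/ the output directory; the depth
and topN values here are just examples):

  bin/nutch crawl urls -dir crawl -depth 5 -topN 1000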
I have crawl-urlfilter.txt set up to accept any URL, and the plugins
configured to use this filter:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(crawl|regex)|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>
</property>
The only other Nutch settings I've changed are the robots settings.
If I inspect the crawldb after a run, I see that it has fetched the 3 seed
pages and nothing else:
TOTAL urls: 248
retry 0: 248
min score: 0.0090
avg score: 0.03530645
max score: 2.029
status 1 (db_unfetched): 245
status 2 (db_fetched): 3
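(Those numbers come from readdb, i.e. something like

  bin/nutch readdb crawl/crawldb -stats

with crawl/ being the output directory from the run.)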
How can I get Nutch to fetch the rest of the URLs?
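In case it helps with debugging, my understanding is that the crawl command is
roughly equivalent to looping over the individual steps, so I could also run
one round by hand, something like this (crawl/ and urls/ as above; the parse
step only applies if fetcher.parse is false):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  s=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s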
Thanks in advance for your help,
Barry
PS: here's my crawl-urlfilter.txt:
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
#-.
+.*
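In case it's relevant, I've also been sanity-checking individual URLs against
the active filters with the filter checker, something like this (assuming the
URLFilterChecker class is available in this version; it reads URLs from stdin
and prefixes each with + or - for accepted/rejected):

  echo "http://www.example.org/some/page.html" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined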