I have been trying to run a crawl on a couple of different domains using
Nutch:
bin/nutch crawl urls -dir crawled -depth 3
Every time, I get the response:
Stopping at depth=x - no more URLs to fetch.
Sometimes a page or two at the first level get crawled, but in most other
cases nothing gets crawled at all. I don't know if I have been making a
mistake in my crawl-urlfilter.txt file. Here is how it looks:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*blogspot.com/
(all other sections of the file are left at their default values)
My urllist.txt file contains only one URL:
http://gmailblog.blogspot.com
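In case it helps, the pattern can be tested outside of Nutch. grep -E is
only an approximation of the Java regexes that the regex URL filter uses,
but for a pattern this simple the two should behave the same:

echo "http://gmailblog.blogspot.com" | grep -E '^http://([a-z0-9]*\.)*blogspot.com/'
# no output: the URL is rejected, since the pattern requires a
# trailing slash after blogspot.com
echo "http://gmailblog.blogspot.com/" | grep -E '^http://([a-z0-9]*\.)*blogspot.com/'
# prints the URL: with the trailing slash it passes

If that reasoning is right, the missing trailing slash on my seed URL could
be the culprit, though I am not sure whether Nutch normalizes the URL before
applying the filter.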
The only website where the crawl seems to be working properly is
http://lucene.apache.org
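If the numbers would help, I can also post the crawl db statistics after a
run. My understanding is that the stock readdb tool reports how many URLs
were injected and fetched (assuming the db ends up under crawled/crawldb,
as in my command above):

bin/nutch readdb crawled/crawldb -stats
# the TOTAL urls figure and the per-status counts should show whether
# the seed was injected at all or filtered out before fetching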
Any suggestions are appreciated.