Stopping at depth=0 - no more URLs to fetch

Kvorion at Nov 11, 2009 at 11:45 pm
Hi all,

I have been trying to run a crawl on a couple of different domains using
nutch:

bin/nutch crawl urls -dir crawled -depth 3
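
(Here urls is the directory containing my seed list, -dir crawled is the
output directory, and -depth 3 means three generate/fetch/update rounds. I
believe -topN can also cap how many pages are fetched per round, e.g.
bin/nutch crawl urls -dir crawled -depth 3 -topN 50, though that should
not matter for the problem below.)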

Every time I get the response:
Stopping at depth=x - no more URLs to fetch. Sometimes a page or two at the
first level get crawled; in most other cases, nothing gets crawled. I
don't know whether I have made a mistake in the crawl-urlfilter.txt file.
Here is how it looks for me:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*blogspot.com/

(all other sections in the file have their default values)

My urllist.txt file has only one URL:
http://gmailblog.blogspot.com
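
(One thing I notice on re-reading: my filter pattern ends with a trailing
slash, but the seed URL above has none, and the dots in blogspot.com are
unescaped, so they match any character. I am not sure whether Nutch's URL
normalizer adds the trailing slash before the filter runs; a variant that
escapes the dots and also matches the bare host would be:

+^http://([a-z0-9]*\.)*blogspot\.com(/.*)?$

This is just a guess at a safer rule, not something I have verified.)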

The only website where the crawl seems to be working properly is
http://lucene.apache.org

Any suggestions are appreciated.



  • John Whelan at Nov 12, 2009 at 4:33 am
    Any other rules in your filter that precede that one?
    (+^http://([a-z0-9]*\.)*blogspot.com/)
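
    RegexURLFilter applies rules top to bottom, and the first matching rule
    wins, so an earlier -pattern can reject a URL before your +blogspot rule
    is ever consulted. From memory, the stock crawl-urlfilter.txt looks
    roughly like this (abridged; order matters):

    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    # skip everything else
    -.

    Note that -[?*!@=] sits above the accept rule, so any URL containing a
    ?, =, etc. is dropped no matter what the accept rule says. You can also
    test which way the filters decide by feeding URLs to the filter checker
    on stdin (if I remember the class name right):

    echo "http://gmailblog.blogspot.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

    It should print each URL back prefixed with + (accepted) or - (rejected).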
