Hi All,

I tried setting up a local filesystem crawl with Nutch 0.9 and am running into problems.
Here are the details:



Found 1 items

/user/test/urls <dir>
crawl started in: crawled
rootUrlDir = urls
threads = 10
depth = 3
topN = 5
Injector: starting
Injector: crawlDb: crawled/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawled/segments/20071026235539
Generator: filtering: false
Generator: topN: 5
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawled

urls/seed file:
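(The seed contents did not come through above. For reference, a local-filesystem seed entry is normally a file: URL, one per line; the path below is hypothetical, not the actual seed:)

```
file:///home/test/documents/
```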




# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

## skip file:, ftp:, & mailto: urls

# skip http:, ftp:, & mailto: urls

# skip image and other suffixes we can't yet parse

# skip URLs containing certain characters as probable queries, etc.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops

# accept hosts in
#+^http://([a-z0-9]*\.)*com/

# skip everything else for http
# take everything else for file
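(As an aside, the first-match rule described in the comments above can be sketched in Python. The patterns here are illustrative stand-ins for a file: crawl, not the actual contents of the filter file:)

```python
import re

# Illustrative filter rules: '+' accepts, '-' rejects. As in Nutch's
# regex-urlfilter, the first matching pattern wins, and a URL that
# matches no pattern at all is ignored.
RULES = [
    ("-", re.compile(r"^(http|ftp|mailto):")),      # skip http:, ftp:, mailto:
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$")),  # skip suffixes we can't parse
    ("+", re.compile(r"^file:")),                   # accept file: urls
]

def accepts(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no pattern matched: URL is ignored

# accepts("file:///tmp/doc.txt") -> True
# accepts("http://example.com/") -> False
```

Note the explicit '+' rule for file: URLs: if the filter file has no pattern that matches a file: seed, the seed falls through to the default and is dropped at inject time, which would produce exactly the "0 records selected for fetching" seen above.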




<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>

<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.</description>
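(Given that description, a likely cause of the empty fetch list is that plugin.includes still matches only protocol-http. A sketch of a nutch-site.xml override enabling the file protocol; the value shown assumes the stock 0.9 plugin names and may need adjusting for your setup:)

```
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```

This goes inside the <configuration> element of conf/nutch-site.xml; the URL filter change and this plugin change are likely both needed for a file: crawl.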


Any hints on how to proceed?

Group: common-dev
Posted: Oct 29, '07 at 4:57a
Author: Prem kumar