$ /exp/sw/nutch-0.9/bin/nutch crawl urls -dir crawled-15 -depth 3
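
A quick way to confirm whether this crawl actually fetched anything is to dump the crawldb statistics. This is only a sketch; it assumes the readdb tool in this 0.9 install behaves as in other releases and uses the -dir value from the command above:

$ /exp/sw/nutch-0.9/bin/nutch readdb crawled-15/crawldb -stats
# the fetched count in the stats output should be non-zero if any
# of the seeds were actually retrieved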

URLS:
http://node5:8080/docs/
http://node5:8080/throughmyeyes/
http://node5:8080/docs2/
http://node5:8080/throughmyeyes2/
http://node5:8080/docs3/
http://node5:8080/throughmyeyes3/
http://node5:8080/empty-1.html
http://node5:8080/empty-2.html
http://node5:8080/empty-3.html
http://node5:8080/empty-4.html
http://node5:8080/empty-5.html
http://node5:8080/empty-6.html
http://node5:8080/empty-7.html
http://node5:8080/empty-8.html
http://node5:8080/empty-9.html
http://node5:8080/empty-10.html
http://node5:8080/empty-11.html
http://node5:8080/empty-12.html
http://node5:8080/empty-13.html
http://node5:8080/empty-14.html
http://node5:8080/empty-15.html
http://node5:8080/empty-16.html
http://node5:8080/empty-17.html
http://node5:8080/empty-18.html
http://node5:8080/empty-19.html
http://node5:8080/empty-20.html
http://node5:8080/empty-21.html
http://node5:8080/empty-22.html
http://node5:8080/empty-23.html
http://node5:8080/empty-24.html
http://node5:8080/empty-25.html
http://node5:8080/empty-26.html
http://node5:8080/empty-27.html
http://node5:8080/empty-28.html
http://node5:8080/empty-29.html
http://node5:8080/empty-30.html
http://node5:8080/empty-31.html
http://node5:8080/empty-32.html
http://node5:8080/empty-33.html
http://node5:8080/empty-34.html
http://node5:8080/empty-35.html
http://node5:8080/empty-36.html
http://node5:8080/empty-37.html
http://node5:8080/empty-38.html
http://node5:8080/empty-39.html
http://node5:8080/empty-40.html
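
To rule out the web server itself, the seeds can be probed directly. A minimal check, assuming curl is available on the crawl host:

$ curl -sI http://node5:8080/docs/ | head -1
$ curl -sI http://node5:8080/empty-1.html | head -1
# each command should print an HTTP/1.1 200 status line if the seed is being served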

conf/crawl-urlfilter.txt (comments removed):
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
-[?*!@=]
-.*(/.+?)/.*?\1/.*?\1/
+^http://node[0-9].vanilla7-([a-z0-9]*\.)*
+^http://node[0-9].vanilla7pc600-([a-z0-9]*\.)*
+^http://node[0-9]:8080/
+^http://([a-z0-9]*\.)*apache.org

(I also tried this with '+*' and with '+.'; neither worked.)
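
A rough way to sanity-check the accept rules against the seeds outside Nutch is to run the key '+' pattern through grep. This ignores the '-' rules (none of the seeds above contain the characters in -[?*!@=], so that rule would not reject them anyway), and the seed file name urls/seeds.txt is a guess:

$ grep -Ec '^http://node[0-9]:8080/' urls/seeds.txt
# should print 46, the number of seed URLs listed above, if the pattern matches them all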

nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>NutchCrawler</value>
</property>

<property>
<name>http.agent.version</name>
<value>0.9</value>
</property>

nutch-default.xml:
<property>
<name>http.robots.agents</name>
<value>*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>

<property>
<name>http.robots.403.allow</name>
<value>true</value>
<description>Some servers return HTTP status 403 (Forbidden) if
/robots.txt doesn't exist. This should probably mean that we are
allowed to crawl the site nonetheless. If this is set to false,
then such sites will be treated as forbidden.</description>
</property>

<property>
<name>http.agent.description</name>
<value>ExperimentalCrawler</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>

<property>
<name>http.agent.url</name>
<value>http://lucene.apache.org/nutch/</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>

<property>
<name>http.agent.email</name>
<value>blah@ratherharmless.org</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>

<property>
<name>http.agent.version</name>
<value>Nutch-0.9</value>
</property>

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<!-- also tried this with urlfilter-crawl included; didn't work either -->
</property>

<property>
<name>plugin.excludes</name>
<value></value>
</property>
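
Two quick external checks related to the settings above, assuming curl is available and the standard plugins/ layout of the 0.9 distribution (path taken from the crawl command at the top):

$ curl -sI http://node5:8080/robots.txt | head -1
# a 403 here is covered by http.robots.403.allow=true; a 200 means the file
# exists and its rules apply to the agents listed in http.robots.agents

$ ls /exp/sw/nutch-0.9/plugins | grep -E 'protocol-http|urlfilter-regex|parse-html'
# each plugin named in plugin.includes should appear as a directory here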

regex-urlfilter.txt:
# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.

On Wed, Feb 20, 2008 at 5:20 PM, John Mendenhall wrote:
Any help at all would be much appreciated.
Send us the command you submitted, plus a sample of the
URLs in the URL file, plus your filter. We can start
from there.

JohnM

--
john mendenhall
john@surfutopia.net
surf utopia
internet services
