Hi,

I'm having trouble getting Nutch-0.9 (recompiled with NUTCH-467
applied) to crawl, and have tried many of the fixes that have been
suggested here on the mailing list. The following is my Nutch output:

crawl started in: crawled-12
rootUrlDir = urls
threads = 10
depth = 3
topN = 20
Injector: starting
Injector: crawlDb: crawled-12/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawled-12/segments/20080220133145
Generator: filtering: false
Generator: topN: 20
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawled-12

I have done/checked for the following:

1. I have a valid http.agent.name string specified in nutch-site.xml;
as a precaution, I also commented out the http.agent.name <property>
section in nutch-default.xml in case my override in nutch-site.xml was
not taking effect. I have also verified this against the job.xml
retrieved via the map/reduce web interface on port 50030 of my master
node, and the http.agent.name and http.agent.version strings are both
present (and not empty).

2. I have configured my crawl-urlfilter.txt in all manner of ways, and
it definitely allows the domains I am crawling. I have even added
"+." at the end of the file to allow everything, but the crawl still
does not work.

3. My logging level has been set to DEBUG, and then to TRACE, and
still there are no errors or warnings (except for messages that look
like this:
2008-02-20 07:47:55,247 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(FSConstants.java:120)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:276)
at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.create(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.ChecksumFileSystem$FSOutputSummer.<init>(ChecksumFileSystem.java:438)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:253)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:84)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:78)
at org.apache.hadoop.fs.ChecksumFileSystem.copyFromLocalFile(ChecksumFileSystem.java:566)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:741)
at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:102)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:822)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:910)

which doesn't look like an error to me; after looking at the line in
the source it came from, it seems more like an indication that a
Configuration object is being created, so please correct me if I'm
wrong. A short sketch of this logging idiom follows the list.)

4. I have tried hadoop clusters with 1, 2, and 4 slaves.

5. I have tried URL lists with 1, 4, 6, 12, 40, and 46 distinct URLs,
in case there is a minimum number of URLs required. I seem to remember
reading about such an issue on the mailing list, but I cannot find the
post anymore; if anyone could point me to it, that would be helpful.

6. I have tried setting "crawl.generate.filter" to true, and false, in
nutch-site.xml; neither works.

7. I have tried running with 10, 1, and 4 map and reduce tasks.

8. There were no OutOfMemoryErrors whatsoever and system load was not
excessive during the crawl.

9. Results from readdb -stats:
CrawlDb statistics start: crawled-12/crawldb
Statistics for CrawlDb: crawled-12/crawldb
TOTAL urls: 46
retry 0: 46
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 46
CrawlDb statistics: done
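
The "config()" lines quoted in point 3 appear to be benign: they come
from a logging idiom in which a throwaway exception is created purely
to capture the call site and is then printed at DEBUG level, never
thrown. The following is a minimal, self-contained sketch of that
idiom (an illustration of the pattern, not the actual Hadoop
Configuration source):

import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

public class ConfigTraceDemo {

    // Build an exception only to capture the current stack, print it at
    // "DEBUG" level, and discard it. Nothing is thrown; no error occurred.
    static void logCreationSite() {
        StringWriter trace = new StringWriter();
        new IOException("config()").printStackTrace(new PrintWriter(trace, true));
        System.out.println("DEBUG conf.Configuration - " + trace);
    }

    public static void main(String[] args) {
        logCreationSite();  // output resembles the traces above but is informational
    }
}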

Any help at all would be much appreciated.

Thanks.

Jiaqi Tan

  • John Mendenhall at Feb 20, 2008 at 10:20 pm
    Any help at all would be much appreciated.
    Send the command you submitted, plus a sample of the
    urls in the url file, plus your filter. We can start
    from there.

    JohnM

    --
    john mendenhall
    [email protected]
    surf utopia
    internet services
  • Jiaqi Tan at Feb 20, 2008 at 10:26 pm
    $ /exp/sw/nutch-0.9/bin/nutch crawl urls -dir crawled-15 -depth 3

    URLS:
    http://node5:8080/docs/
    http://node5:8080/throughmyeyes/
    http://node5:8080/docs2/
    http://node5:8080/throughmyeyes2/
    http://node5:8080/docs3/
    http://node5:8080/throughmyeyes3/
    http://node5:8080/empty-1.html
    http://node5:8080/empty-2.html
    http://node5:8080/empty-3.html
    http://node5:8080/empty-4.html
    http://node5:8080/empty-5.html
    http://node5:8080/empty-6.html
    http://node5:8080/empty-7.html
    http://node5:8080/empty-8.html
    http://node5:8080/empty-9.html
    http://node5:8080/empty-10.html
    http://node5:8080/empty-11.html
    http://node5:8080/empty-12.html
    http://node5:8080/empty-13.html
    http://node5:8080/empty-14.html
    http://node5:8080/empty-15.html
    http://node5:8080/empty-16.html
    http://node5:8080/empty-17.html
    http://node5:8080/empty-18.html
    http://node5:8080/empty-19.html
    http://node5:8080/empty-20.html
    http://node5:8080/empty-21.html
    http://node5:8080/empty-22.html
    http://node5:8080/empty-23.html
    http://node5:8080/empty-24.html
    http://node5:8080/empty-25.html
    http://node5:8080/empty-26.html
    http://node5:8080/empty-27.html
    http://node5:8080/empty-28.html
    http://node5:8080/empty-29.html
    http://node5:8080/empty-30.html
    http://node5:8080/empty-31.html
    http://node5:8080/empty-32.html
    http://node5:8080/empty-33.html
    http://node5:8080/empty-34.html
    http://node5:8080/empty-35.html
    http://node5:8080/empty-36.html
    http://node5:8080/empty-37.html
    http://node5:8080/empty-38.html
    http://node5:8080/empty-39.html
    http://node5:8080/empty-40.html

    conf/crawl-urlfilter.txt (comments removed):
    -^(file|ftp|mailto):
    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
    -[?*!@=]
    -.*(/.+?)/.*?\1/.*?\1/
    +^http://node[0-9].vanilla7-([a-z0-9]*\.)*
    +^http://node[0-9].vanilla7pc600-([a-z0-9]*\.)*
    +^http://node[0-9]:8080/
    +^http://([a-z0-9]*\.)*apache.org

    (also tried this with '+*', '+.', didn't work either)

    nutch-site.xml:
    <property>
    <name>http.agent.name</name>
    <value>NutchCrawler</value>
    </property>

    <property>
    <name>http.agent.version</name>
    <value>0.9</value>
    </property>

    nutch-default.xml:
    <property>
    <name>http.robots.agents</name>
    <value>*</value>
    <description>The agent strings we'll look for in robots.txt files,
    comma-separated, in decreasing order of precedence. You should
    put the value of http.agent.name as the first agent name, and keep the
    default * at the end of the list. E.g.: BlurflDev,Blurfl,*
    </description>
    </property>

    <property>
    <name>http.robots.403.allow</name>
    <value>true</value>
    <description>Some servers return HTTP status 403 (Forbidden) if
    /robots.txt doesn't exist. This should probably mean that we are
    allowed to crawl the site nonetheless. If this is set to false,
    then such sites will be treated as forbidden.</description>
    </property>

    <property>
    <name>http.agent.description</name>
    <value>ExperimentalCrawler</value>
    <description>Further description of our bot- this text is used in
    the User-Agent header. It appears in parenthesis after the agent name.
    </description>
    </property>

    <property>
    <name>http.agent.url</name>
    <value>http://lucene.apache.org/nutch/</value>
    <description>A URL to advertise in the User-Agent header. This will
    appear in parenthesis after the agent name. Custom dictates that this
    should be a URL of a page explaining the purpose and behavior of this
    crawler.
    </description>
    </property>

    <property>
    <name>http.agent.email</name>
    <value>[email protected]</value>
    <description>An email address to advertise in the HTTP 'From' request
    header and User-Agent header. A good practice is to mangle this
    address (e.g. 'info at example dot com') to avoid spamming.
    </description>
    </property>

    <property>
    <name>http.agent.version</name>
    <value>Nutch-0.9</value>
    </property>

    <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    (also tried this with urlfilter-crawl included, didn't work either)
    </property>

    <property>
    <name>plugin.excludes</name>
    <value></value>
    </property>

    regex-urlfilter.txt:
    # The default url filter.
    # Better for whole-internet crawling.

    # Each non-comment, non-blank line contains a regular expression
    # prefixed by '+' or '-'. The first matching pattern in the file
    # determines whether a URL is included or ignored. If no pattern
    # matches, the URL is ignored.

    # skip file: ftp: and mailto: urls
    -^(file|ftp|mailto):

    # skip image and other suffixes we can't yet parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/.+?)/.*?\1/.*?\1/

    # accept anything else
    +.
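
    For illustration, here is a minimal sketch of the first-match-wins
    rule evaluation described in the comments above. It is not Nutch's
    urlfilter-regex plugin code, just the same semantics: each rule is
    a '+' or '-' prefix plus a pattern, the first pattern that matches
    decides, and a URL that matches no rule is rejected.

    import java.util.List;
    import java.util.regex.Pattern;

    public class UrlFilterSketch {

        // A rule is an accept/reject flag plus a regular expression.
        record Rule(boolean accept, Pattern pattern) {}

        // The first rule whose pattern is found in the URL decides;
        // if no rule matches, the URL is rejected.
        static boolean accepts(String url, List<Rule> rules) {
            for (Rule rule : rules) {
                if (rule.pattern().matcher(url).find()) {
                    return rule.accept();
                }
            }
            return false;
        }

        public static void main(String[] args) {
            List<Rule> rules = List.of(
                new Rule(false, Pattern.compile("\\.(gif|jpg|png)$")),
                new Rule(false, Pattern.compile("[?*!@=]")),
                new Rule(true, Pattern.compile("^http://node[0-9]:8080/")));
            System.out.println(accepts("http://node5:8080/docs/", rules));    // true
            System.out.println(accepts("http://node5:8080/logo.gif", rules)); // false (image rule wins)
            System.out.println(accepts("http://example.com/", rules));        // false (no rule matches)
        }
    }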

    On Wed, Feb 20, 2008 at 5:20 PM, John Mendenhall wrote:
    Any help at all would be much appreciated.
    Send the command you submitted, plus a sample of the
    urls in the url file, plus your filter. We can start
    from there.

    JohnM

    --
    john mendenhall
    [email protected]
    surf utopia
    internet services
  • John Mendenhall at Feb 20, 2008 at 10:39 pm

    $ /exp/sw/nutch-0.9/bin/nutch crawl urls -dir crawled-15 -depth 3
    (also tried this with '+*', '+.', didn't work either)
    I don't understand how +* would ever work since * is for
    repeating the previous element. But, +. should work.
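
    A quick way to see the difference with plain java.util.regex (the
    urlfilter-regex plugin may differ in details, but the point about
    "*" being a dangling quantifier holds):

    import java.util.regex.Pattern;
    import java.util.regex.PatternSyntaxException;

    public class PlusStarVsPlusDot {
        public static void main(String[] args) {
            // "." matches any single character, so a "+." rule matches every URL.
            System.out.println(Pattern.compile(".")
                .matcher("http://node5:8080/docs/").find());  // true

            // "*" by itself is not a valid pattern: it has nothing to repeat.
            try {
                Pattern.compile("*");
            } catch (PatternSyntaxException e) {
                System.out.println("invalid pattern: " + e.getMessage());
            }
        }
    }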

    Everything else looked okay to me. I would start looking
    at the logs closely. I would try setting your log4j
    properties to INFO or DEBUG level for the generator
    step.

    The inject is obviously working since your stats shows
    the urls in the crawldb as unfetched. So, debug the
    generator.

    JohnM

    --
    john mendenhall
    [email protected]
    surf utopia
    internet services
  • Jiaqi Tan at Feb 20, 2008 at 10:47 pm
    Any suggestions on debugging the generator? My log4j level is
    already at DEBUG, but there are no DEBUG entries except for the
    final WARN that says:

    08/02/20 15:38:09 WARN crawl.Generator: Generator: 0 records selected
    for fetching, exiting ...
    08/02/20 15:38:09 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
    08/02/20 15:38:09 WARN crawl.Crawl: No URLs to fetch - check your seed
    list and URL filters.

    I've inserted code at Generator.java:424, where the check reads:
    if (readers == null || readers.length == 0
        || !readers[0].next(new FloatWritable())) {
      LOG.warn("Generator: 0 records selected for fetching, exiting ...");

    essentially at the decision point, to see which of the conditions
    triggered the "0 records selected" message. The "readers" object is
    perfectly fine, but the SequenceFileOutputFormat is reporting that
    there are no values (URL scores, I suppose) to be retrieved at all,
    which causes the generator to stop.
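
    A hypothetical way to instrument that spot is to split the combined
    condition so the log shows exactly which branch fired. The class and
    method names below are made up for illustration; the Hadoop calls are
    assumed to match the API bundled with Nutch 0.9, so check them
    against your tree:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class GeneratorCheckDebug {

        // Report which part of Generator's "0 records selected" check fails
        // for the sorted output under tempDir.
        static void explain(Configuration conf, Path tempDir) throws IOException {
            SequenceFile.Reader[] readers =
                SequenceFileOutputFormat.getReaders(conf, tempDir);
            if (readers == null) {
                System.out.println("getReaders() returned null");
            } else if (readers.length == 0) {
                System.out.println("no output partitions under " + tempDir);
            } else if (!readers[0].next(new FloatWritable())) {
                System.out.println("first partition is empty: 0 records selected");
            } else {
                System.out.println("records are present; generation should proceed");
            }
        }

        // usage: GeneratorCheckDebug <path-to-generator-temp-dir>
        public static void main(String[] args) throws IOException {
            explain(new Configuration(), new Path(args[0]));
        }
    }
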
    On Wed, Feb 20, 2008 at 5:39 PM, John Mendenhall wrote:
    $ /exp/sw/nutch-0.9/bin/nutch crawl urls -dir crawled-15 -depth 3
    (also tried this with '+*', '+.', didn't work either)
    I don't understand how +* would ever work since * is for
    repeating the previous element. But, +. should work.

    Everything else looked okay to me. I would start looking
    at the logs closely. I would try setting your log4j
    properties to INFO or DEBUG level for the generator
    step.

    The inject is obviously working since your stats shows
    the urls in the crawldb as unfetched. So, debug the
    generator.



    JohnM

    --
    john mendenhall
    [email protected]
    surf utopia
    internet services
  • John Mendenhall at Feb 20, 2008 at 10:58 pm

    08/02/20 15:38:09 WARN crawl.Generator: Generator: 0 records selected
    for fetching, exiting ...
    08/02/20 15:38:09 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
    08/02/20 15:38:09 WARN crawl.Crawl: No URLs to fetch - check your seed
    list and URL filters.

    I've inserted code at Generator.java:424, which says:
    if (readers == null || readers.length == 0 || !readers[0].next(new
    FloatWritable())) {
    LOG.warn("Generator: 0 records selected for fetching, exiting ...");

    essentially at the decision point to see which of the conditions
    triggered the 0 records selected message, and the "readers" object is
    perfectly fine, but the SequenceFileOutputFormat is reporting there
    are no values (I suppose of URL scores) at all to be retrieved,
    causing the generator to stop.
    There is a problem with the Generator. There was a change committed
    after 0.9 was released. I implemented this change and it fixed my
    problem:

    http://www.mail-archive.com/[email protected]/msg01991.html

    JohnM

    --
    john mendenhall
    [email protected]
    surf utopia
    internet services
  • Jiaqi Tan at Feb 20, 2008 at 11:14 pm
    So it should read "2.1.0" in nutch-0.9 then? Since lib-lucene-analyzer
    is version 2.1.0 (rather than 2.2.0 as described in NUTCH-507)?

    Thanks.
    On Wed, Feb 20, 2008 at 5:58 PM, John Mendenhall wrote:
    08/02/20 15:38:09 WARN crawl.Generator: Generator: 0 records selected
    for fetching, exiting ...
    08/02/20 15:38:09 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
    08/02/20 15:38:09 WARN crawl.Crawl: No URLs to fetch - check your seed
    list and URL filters.

    I've inserted code at Generator.java:424, which says:
    if (readers == null || readers.length == 0 || !readers[0].next(new
    FloatWritable())) {
    LOG.warn("Generator: 0 records selected for fetching, exiting ...");

    essentially at the decision point to see which of the conditions
    triggered the 0 records selected message, and the "readers" object is
    perfectly fine, but the SequenceFileOutputFormat is reporting there
    are no values (I suppose of URL scores) at all to be retrieved,
    causing the generator to stop.
    There is a problem with the Generator. There was a change committed
    after 0.9 was released. I implemented this change and it fixed my
    problem:

    http://www.mail-archive.com/[email protected]/msg01991.html



    JohnM

    --
    john mendenhall
    [email protected]
    surf utopia
    internet services
  • John Mendenhall at Feb 20, 2008 at 11:19 pm

    So it should read "2.1.0" in nutch-0.9 then? Since lib-lucene-analyzer
    is version 2.1.0 (rather than 2.2.0 as described in NUTCH-507)?
    There is a problem with the Generator. There was a change committed
    after 0.9 was released. I implemented this change and it fixed my
    problem:

    http://www.mail-archive.com/[email protected]/msg01991.html
    Look at NUTCH-503, not NUTCH-507. I have no experience with NUTCH-507.

    JohnM

    --
    john mendenhall
    [email protected]
    surf utopia
    internet services
