FAQ
Hi Jiaqi Tan & John Mendenhall,

I have encountered the same problem. I have already tried correcting
the log4j bug and applying the change from

http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01991.html

but it still does not work. I am working on a cluster of 4 boxes running RedHat AS4.

I also checked hadoop.log and found nothing more useful.

So I think the problem is in the Generator. I saw someone say it might be caused by bad settings for mapred.map.tasks and mapred.reduce.tasks. I have 4 PCs, and following the explanation of those two properties, I set them to 17 and 7. Is that right? Can someone help me?
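
For reference, a minimal sketch of how those two properties map onto the Hadoop 0.x JobConf API that Nutch 0.9 runs on (the class name TaskCountDemo is made up, and 17/7 are just the values from the question, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class TaskCountDemo {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setNumMapTasks(17);    // same as mapred.map.tasks=17; only a hint,
                                // the actual number of maps follows the input splits
    conf.setNumReduceTasks(7);  // same as mapred.reduce.tasks=7; this one is honored
    System.out.println("mapred.map.tasks = " + conf.get("mapred.map.tasks"));
    System.out.println("mapred.reduce.tasks = " + conf.get("mapred.reduce.tasks"));
  }
}

On a cluster these are normally set once in hadoop-site.xml rather than per job.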

thanks

ivannie
08/02/20 15:38:09 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
08/02/20 15:38:09 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
08/02/20 15:38:09 WARN crawl.Crawl: No URLs to fetch - check your seed list and URL filters.
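
One quick thing to rule out, given that last warning, is the URL filters rejecting every seed. A rough sketch of testing a single URL against the configured filter chain (Nutch 0.9 API; the wrapper class FilterCheck is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.util.NutchConfiguration;

public class FilterCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create(); // reads nutch-default.xml / nutch-site.xml
    URLFilters filters = new URLFilters(conf);        // loads the configured urlfilter plugins
    String result = filters.filter(args[0]);          // null means the URL was rejected
    System.out.println(result == null ? "REJECTED: " + args[0] : "ACCEPTED: " + result);
  }
}

If a seed comes back REJECTED, look first at the regex-urlfilter.txt rules (or crawl-urlfilter.txt when using the crawl command).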

I've inserted code at Generator.java:424, which says:
if (readers == null || readers.length == 0
    || !readers[0].next(new FloatWritable())) {
  LOG.warn("Generator: 0 records selected for fetching, exiting ...");

essentially at the decision point, to see which of the conditions
triggered the "0 records selected" message. The "readers" object is
perfectly fine, but the SequenceFileOutputFormat is reporting that
there are no values (of URL scores, I suppose) to be retrieved at
all, which causes the generator to stop.
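
The same check can be made by hand by counting the records in the generator's temporary output with the plain SequenceFile reader API. A rough sketch (the class name SeqFileCount is made up; point it at one of the part-NNNNN files):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(file.getFileSystem(conf), file, conf);
    // Instantiate the key/value classes the file was written with
    // (for the generator's temp output the key is the FloatWritable score).
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long n = 0;
    while (reader.next(key, value)) n++;
    reader.close();
    System.out.println(file + ": " + n + " records");
  }
}

Zero records here means the select job really emitted nothing, matching the warning above.
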
There is a problem with the Generator: a fix was committed after 0.9
was released. I applied that change and it fixed my problem:

http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01991.html
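
For context, the shape of that change, as described later in this thread (NUTCH-503: the Generator fails if the first output part is empty but later ones are not), is to test every reader instead of only readers[0]. This is a paraphrase of the fix, not the committed diff:

boolean empty = true;
if (readers != null && readers.length > 0) {
  // Check every output part, not just the first: with several reduce
  // tasks the first part file can legitimately be empty.
  for (int i = 0; i < readers.length; i++) {
    if (readers[i].next(new FloatWritable())) {
      empty = false;
      break;
    }
  }
}
if (empty) {
  LOG.warn("Generator: 0 records selected for fetching, exiting ...");
}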

JohnM

--
john mendenhall
john@surfutopia.net
surf utopia
internet services
= = = = = = = = = = = = = = = = = = = =


  • Jiaqi Tan at Feb 24, 2008 at 9:28 pm
    Hi Ivannie,

    This is what I did:
    1. Started from the nutch-0.9 release.
    2. Applied the NUTCH-503 patch (fixes a Generator bug that makes it
    fail if the first segment is empty but subsequent ones are not) and
    the NUTCH-467 patch (fixes a dedup failure when the index directory
    is empty).
    3. Recompiled.
    4. Ran the following configurations: 1, 2, 3, 4, and 5 slave nodes,
    each with 10 map threads and 10 reduce threads.

    So I don't think it's an issue with the number of map/reduce
    threads--I've also had it working with 5 map threads and other
    random small prime numbers.

    Where did your crawl fail?

    Jiaqi

    P.S. I'm no Nutch developer either, just a user.

