Hi Ivannie,

This is what I did:
1. nutch-0.9 release
2. applied NUTCH-503 (fixes generator bug that causes it to fail if
first segment is empty but subsequent ones are not), NUTCH-467 (fixes
dedup failure if index directory is empty) patches
3. recompiled
4. ran with the following configurations: 1, 2, 3, 4, and 5 slave nodes,
each with 10 map threads and 10 reduce threads
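For reference, the patch-and-rebuild steps above could be sketched roughly as follows. The patch filenames here are hypothetical; the actual attachments have to be downloaded from the NUTCH-503 and NUTCH-467 JIRA issues, and the exact `-p` level depends on how each patch was generated:

```shell
# Sketch only: patch filenames are assumptions; fetch the real
# attachments from the NUTCH-503 and NUTCH-467 JIRA issues first.
cd nutch-0.9

# NUTCH-503: generator fails if the first segment is empty
# but subsequent ones are not
patch -p0 < NUTCH-503.patch

# NUTCH-467: dedup fails if the index directory is empty
patch -p0 < NUTCH-467.patch

# Rebuild with the bundled Ant build file
ant
```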

So I don't think it's an issue with the number of map/reduce threads;
I've also had it working with 5 map threads and with other random small
prime numbers.

Where did your crawl fail?

Jiaqi

P.S. I'm also no nutch developer, just a user.
On Fri, Feb 22, 2008 at 10:56 PM, Ivannie wrote:
Hi Jiaqi Tan and John Mendenhall,

I have encountered the same problem. I have already tried correcting the log4j bug and applying the change from

http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01991.html

and it still did not work. I was working on a cluster of 4 boxes running Red Hat AS4.

I also checked hadoop.log and found nothing more of note.

So I think the problem is in the Generator. I saw someone say it might be caused by bad settings for mapred.map.tasks and mapred.reduce.tasks. I have 4 PCs, and following the explanation of mapred.map.tasks and mapred.reduce.tasks I set them to 17 and 7. Was that right? Can someone help me?
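In the Hadoop versions bundled with Nutch 0.9, these two properties are set in conf/hadoop-site.xml. A minimal sketch of the settings being discussed (the values 17 and 7 are the ones from this message, not a recommendation):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>17</value>
    <!-- total map tasks per job; commonly a multiple of the
         number of slave nodes -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
    <!-- total reduce tasks per job; commonly close to the
         number of slave nodes -->
  </property>
</configuration>
```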

Thanks,

Ivannie

08/02/20 15:38:09 WARN crawl.Generator: Generator: 0 records selected
for fetching, exiting ...
08/02/20 15:38:09 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
08/02/20 15:38:09 WARN crawl.Crawl: No URLs to fetch - check your seed
list and URL filters.

I've inserted debugging code at Generator.java:424, around the check that reads:

    if (readers == null || readers.length == 0
        || !readers[0].next(new FloatWritable())) {
      LOG.warn("Generator: 0 records selected for fetching, exiting ...");

essentially at the decision point, to see which of the conditions
triggered the "0 records selected" message. The "readers" object is
perfectly fine, but SequenceFileOutputFormat is reporting that there are
no values at all to be retrieved (of URL scores, I suppose), which
causes the generator to stop.
There is a problem with the Generator. There was a change committed
after 0.9 was released. I implemented this change and it fixed my
problem:

http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01991.html
JohnM

--
john mendenhall
john@surfutopia.net
surf utopia
internet services
= = = = = = = = = = = = = = = = = = = =


Discussion Overview
group: nutch-user @ lucene
posted: Feb 23, 2008 at 3:56a
active: Feb 24, 2008 at 9:28p
posts: 2
users: 2 (Jiaqi Tan: 1 post, Ivannie: 1 post)
website: nutch.apache.org
