5,032 discussions - 16,458 posts

  • Hi, when trying to index four segments (~5 GB) with solrindexer, I get this error in hadoop.log. There is no error in the logs of Tomcat, where I deployed Solr. I crawled with "crawl"-command. I`ve ...
    Felix Zimmermann
    Dec 6, 2009 at 12:35 am
    Dec 6, 2009 at 12:35 am
  • My fetch cycle failed on the following initial error: java.io.IOException: Task process exit with nonzero status of 65. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425) Then it makes ...
    MilleBii
    Dec 5, 2009 at 8:50 am
    Dec 5, 2009 at 12:18 pm
  • Hi, I'm developing my own set of tools, plugins and some minor code changes to Nutch. I still want to get updates from the main Nutch repository, but I would like to keep my own SVN for tracking my ...
    Eran Zinman
    Dec 5, 2009 at 8:41 am
    Dec 5, 2009 at 8:41 am
  • Hi, I am using Nutch to crawl data from the web. Now I want to extract the images using Nutch. Can somebody please suggest a way to do that, or point me to a URL? Regards, Manish Bawne ...
    Manishkbawne
    Dec 5, 2009 at 7:36 am
    Dec 5, 2009 at 7:36 am
  • Hi guys, I'm looking at whether I can optimize the size occupied on disk by my segments. I have implemented a topical-scoring plugin... this means I know at that step if I should keep that page content or ...
    MilleBii
    Dec 4, 2009 at 10:18 pm
    Dec 5, 2009 at 8:42 am
  • I am going over mailing list and still didn't find an answer. For a project, I need to crawl the web, index it and merge that content with another site's content which is stored inside the key-value ...
    Mr Hadoop
    Dec 4, 2009 at 7:52 pm
    Dec 4, 2009 at 8:20 pm
  • To whom it may concern, Hello! Because I will use this E-mail for special purpose. I will use another E-mail to subscribe in nutch-user. So I want to unsubscribe from nutch-user. Thank you! -- ...
    Rengan xu
    Dec 4, 2009 at 2:51 pm
    Dec 5, 2009 at 8:03 am
  • I am using Nutch 1.0. I want to perform a 'clean' crawl. I see the force option in this patch: NUTCH-601v1.0.patch <https://issues.apache.org/jira/secure/attachment/12375717/NUTCH-601v1.0.patch> Do I ...
    Peters, Vijaya
    Dec 4, 2009 at 1:18 pm
    Dec 4, 2009 at 3:36 pm
  • Hallo, I hope someone can help me. I installed nutch on 2 Amazon EC2 computers. Everything is fine but I can't put data in the hdfs. I formatted the namenode and start the hdfs with start all. All ...
    Tom Landvoigt
    Dec 4, 2009 at 12:25 pm
    Dec 4, 2009 at 7:16 pm
  • I am just starting to learn Nutch. One thing I wanted to know: can Nutch pause, stop and start indexing a site on an incremental daily basis? My concern with nutch is that nutch behaving like ...
    Mr Hadoop
    Dec 4, 2009 at 12:11 pm
    Dec 4, 2009 at 2:19 pm
  • Hi, I am new to Nutch. I want to crawl and search office 2007 documents (.docx, .pptx etc) from Nutch. But when I try to crawl, crawler throws following error: fetching ...
    Rupesh Mankar
    Dec 4, 2009 at 10:59 am
    Dec 4, 2009 at 10:59 am
  • Hello, I am using Nutch 1.0. I followed the tutorial for getting Nutch up and running almost verbatim - I used a different site. When I deploy my war file as ROOT.war in Tomcat and search, I don't see ...
    Tom MacKenzie
    Dec 4, 2009 at 9:44 am
    Dec 4, 2009 at 1:55 pm
  • Why does a url with a fetch status of 'fetch_gone' show up as 'db_unfetched'? Shouldn't the crawldb entry have a status of 'db_gone'? This is happening in nutch-1.0 Here is one example of what I'm ...
    J.G.Konrad
    Dec 3, 2009 at 11:15 pm
    Dec 3, 2009 at 11:15 pm
  • Hi, I'm crawling my intranet, and I have set db.fetch.interval.default to 5 hours, but it seems that it doesn't work correctly: <property> <name>db.fetch.interval.default</name> <value> ...
    BELLINI ADAM
    Dec 3, 2009 at 9:29 pm
    Dec 3, 2009 at 9:58 pm
  • hi, i'm performing a RECRAWL using the recrawl.sh script, and i had this error when inverting the links: FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file crawl/linkdb/.locked already ...
    BELLINI ADAM
    Dec 3, 2009 at 4:15 pm
    Dec 3, 2009 at 4:15 pm
  • Observing my fetch cycles perf. It looks like there is always a rather long tail. I saw it on 10k, 150k, 450k fetch runs. Of course you can cut-off the tail with the patch 770 made by Julien (thx), I ...
    MilleBii
    Dec 3, 2009 at 5:49 am
    Dec 3, 2009 at 9:15 pm
  • hi, i have this error when crawling.... org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for ...
    BELLINI ADAM
    Dec 2, 2009 at 2:41 pm
    Dec 2, 2009 at 11:43 pm
  • I'm looking for advice where to locate the search.dir I saw post stating to put on the OS file system some speak of locating under hdfs... So far I have only used the OS FS which means you have to ...
    MilleBii
    Dec 2, 2009 at 8:41 am
    Dec 2, 2009 at 8:41 am
  • i'm observing crawl dates, which have fetch interval with value 0. when i dump the segment, i see Recno:: 33 URL:: http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ CrawlDatum:: ...
    Reinhard schwab
    Dec 1, 2009 at 11:27 pm
    Dec 2, 2009 at 12:12 pm
  • Hello, For those living in or near NYC, you may be interested in joining (and/or presenting?) at the NYC Search & Discovery Meetup. Topics are: search, machine learning, data mining, NLP, information ...
    Otis Gospodnetic
    Dec 1, 2009 at 8:39 pm
    Dec 1, 2009 at 8:39 pm
  • Hello everyone, in the application I'm developing I use Nutch for normal searches with the AND operator, and Lucene directly (for lack of support in Nutch) for searches with the OR operator. However, the ...
    Julianum
    Dec 1, 2009 at 7:30 pm
    Dec 1, 2009 at 7:30 pm
  • Hi, I am using nutch - 1.0 under windows xp with cygwin, and ran a test crawl. It apparently worked, as I can see some data in my crawl directory with the luke tool (also looking for documentation of ...
    Brian
    Dec 1, 2009 at 8:45 am
    Dec 1, 2009 at 10:58 am
  • I am getting warnings in hadoop.log that segments.gen and segments_2 are not directories, and as you can see by the listing, they are in fact files not directories. I'm not sure what stage of the ...
    Jesse Hires
    Nov 30, 2009 at 4:49 pm
    Dec 2, 2009 at 4:27 pm
  • hello can someone help me with this: i am using nutch-0-9 with hadoop and want use bw-filter from patch nutch-249. after using ant i get some errors about import problems. as i read somewhere, nutch ...
    Myname To
    Nov 28, 2009 at 10:03 pm
    Nov 30, 2009 at 11:36 am
  • My nutch crawl just stopped. The process is still there, and doesn't respond to a "kill -TERM" or a "kill -HUP", but it hasn't written anything to the log file in the last 40 minutes. The last thing ...
    Paul Tomblin
    Nov 28, 2009 at 9:22 pm
    Nov 29, 2009 at 1:30 am
  • Although I have applied https://issues.apache.org/jira/browse/NUTCH-719 (& 769) I get my fetcher job hang-up at the end : ... -finishing thread FetcherThread, activeThreads=2 -finishing thread ...
    MilleBii
    Nov 28, 2009 at 5:27 pm
    Nov 28, 2009 at 5:34 pm
  • Hi all, I'm trying to figure out ways to improve Nutch focused crawling efficiency. I'm looking for certain pages inside each domain which contain the content I'm looking for. I'm unable to know that a ...
    Eran Zinman
    Nov 27, 2009 at 3:16 pm
    Nov 29, 2009 at 3:09 pm
  • Hi, I have to add a parse-wml plugin to Nutch. If one has already been written, please give me some advice. Thanks!
    Yangfeng
    Nov 27, 2009 at 1:23 am
    Nov 27, 2009 at 1:23 am
  • Hej, I am a newbie in Nutch and I need some help with a problem because I cannot find clear documentation. In the crawling process, when each FetcherThread gets the content, this is in ...
    Santiago Pérez
    Nov 26, 2009 at 12:04 pm
    Nov 27, 2009 at 9:17 am
  • hi all, there are 4 document fields in my index that i am not indexing anymore; then i have 4 new fields i need to add to my index, so i created a new indexing filter. how i can add these new fields ...
    Fadzi Ushewokunze
    Nov 26, 2009 at 10:31 am
    Nov 26, 2009 at 10:31 am
  • Hi, I'm running recrawl.sh and it stops every time at depth 7/10 without any error! But when I run bin/crawl with the same crawl-urlfilter and the same seeds file, it finishes smoothly in 1h50. I ...
    BELLINI ADAM
    Nov 25, 2009 at 3:43 pm
    Dec 1, 2009 at 4:06 pm
  • Hi, dedup doesn't work for me. I have read that duplicates have either the same contents (via MD5 hash) or the same URL. In my case I don't have the same URLs but still have the same contents for those ...
    BELLINI ADAM
    Nov 24, 2009 at 8:57 pm
    Nov 25, 2009 at 3:37 pm
  • I just observed that on my set-up hadoop pseudo distributed I hardly get overlap between Map & Reduce phases... sounds strange to me especially when I have plenty of spare CPU. Would there be a ...
    MilleBii
    Nov 24, 2009 at 8:43 pm
    Nov 24, 2009 at 8:43 pm
  • Hi, guys, my goal is to do my crawls at 100 fetches per second, observing, of course, polite crawling. But, when URLs are all different domains, what theoretically would stop some software from ...
    Mark Kerzner
    Nov 24, 2009 at 5:58 am
    Nov 28, 2009 at 1:15 pm
  • Does "bin/nutch merge" only create a whole new index out of several smaller indexes, or can it be used to incrementally update a single large index with newly fetched and indexed smaller segments? ...
    Jesse Hires
    Nov 24, 2009 at 4:13 am
    Nov 24, 2009 at 9:37 am
  • Hi, I want to exclude some of Yahoo Answers URLs from crawling. Few examples are as follows: 1. http://answers.yahoo.com/question/?link=answer&qid=20091122033318AA3huLM 2. ...
    VidyaMN
    Nov 22, 2009 at 2:13 pm
    Nov 22, 2009 at 2:13 pm
  • Hi, I want to use Nutch in EC2 to crawl around 100 million URLs, extracting only questions and answers from http://answers.yahoo.com. I'm a Nutch newbie so apologies for any basic queries, I've the ...
    VidyaMN
    Nov 22, 2009 at 9:56 am
    Nov 22, 2009 at 9:56 am
  • There is a piece of code I don't understand: public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) { // pages are never truly GONE - we have to check them from time to time. // pages ...
    Reinhard schwab
    Nov 22, 2009 at 12:44 am
    Nov 22, 2009 at 7:18 pm
  • Hi, We've been using Nutch for focused crawling (right now we are crawling about 50 domains). We've encountered the long-tail problem - We've set TopN to 100,000 and generate.max.per.host to about ...
    Eran Zinman
    Nov 21, 2009 at 8:23 am
    Nov 24, 2009 at 6:18 am
  • Hi, I installed Nutch 1.0, configured the various XML configuration files, put a test URL flat file in <urls>, and ran a test crawl: bin/nutch crawl urls -dir crawl -depth 2 -topN 10 It runs but ...
    Brian
    Nov 20, 2009 at 4:46 pm
    Nov 20, 2009 at 5:53 pm
  • This is the first time I have received this error while crawling. During a crawl of 100K pages, one of the nodes had a task failed and cited "Too Many Fetch Failures" as the reason. The job completed ...
    Eric Osgood
    Nov 19, 2009 at 8:50 pm
    Nov 20, 2009 at 8:06 pm
  • I'm using nutch-1.0 and have noticed after running some tests that the robot rules parser does not support wildcard (a.k.a globbing) in rules. This means the rule will not work like it was expected ...
    J.G.Konrad
    Nov 19, 2009 at 7:32 pm
    Nov 19, 2009 at 7:43 pm
  • Does anybody know of any concrete plans to update Nutch to Hadoop 0.20, 0.21? Something like a Nutch 1.1 release, get in some bug fixes and get current on Hadoop? I think that should be one of the ...
    John Martyniak
    Nov 19, 2009 at 7:04 pm
    Nov 22, 2009 at 3:19 am
  • hello can somebody help me with urlfilter. i need to fetch sites with this pattern: http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/ first folder can vary, whereas host name and second ...
    Myname To
    Nov 19, 2009 at 12:54 am
    Nov 19, 2009 at 6:49 pm
  • Hi, I am working for a company and we want to customize nutch for our needs. So we need some experts. We want to make a backlink/inlink analysis for the German internet. So what we need: We want to ...
    Tom Landvoigt
    Nov 18, 2009 at 3:11 pm
    Nov 18, 2009 at 3:11 pm
  • Has anybody else had any trouble running nutch 0.19.2 with Ganglia 3.1.3? I was surfing through Jira and it seems that there were some issues but they have been resolved. Any thoughts would be ...
    John Martyniak
    Nov 18, 2009 at 12:05 am
    Nov 18, 2009 at 1:29 am
  • Hi all, What's the best way to get the total hit count returned excluding deduped documents? At the moment nutch bean returns only the full total.
    Fadzi Ushewokunze
    Nov 17, 2009 at 8:58 pm
    Nov 17, 2009 at 8:58 pm
  • Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and can I coordinate the crawlers so as not ...
    Mark Kerzner
    Nov 16, 2009 at 7:48 pm
    Nov 16, 2009 at 8:20 pm
  • Hi all, I'm using Nutch 1.0 on a 12-node cluster. When using the crawler to index an intranet, it crashed after 12 hours of crawling. One of my slaves crashed too. Following are the logs of the crashed node (tasktracker ...
    Xiao yang
    Nov 16, 2009 at 2:51 am
    Nov 16, 2009 at 2:51 am
  • Hi, I have spent more than a week fetching using Nutch. Near the end I get a message in the console indicating aborting with 100 hung threads as below. Has anyone seen this before... Does anyone know ...
    Kalaimathan Mahenthiran
    Nov 15, 2009 at 9:54 pm
    Nov 17, 2009 at 9:00 am
Group Overview
group: nutch-user @
categories: lucene
discussions: 5,032
posts: 16,458
users: 1,902
website: nutch.apache.org

Top users

  • Andrzej Bialecki: 739 posts
  • Stefan Groschupf: 375 posts
  • Dennis Kubes: 306 posts
  • Ogjunk-nutch: 280 posts
  • Doğacan Güney: 247 posts
  • Nutch-dev: 177 posts
  • Jérôme Charron: 154 posts
  • Raghavendra Prabhu: 146 posts
  • Gal Nitzan: 143 posts
  • Sami siren: 139 posts
  • Doug Cutting: 135 posts
  • TDLN: 129 posts
  • Susam Pal: 126 posts
  • MilleBii: 108 posts
  • Piotr Kosiorowski: 103 posts
  • Sean Dean: 98 posts
  • Alexander Aristov: 96 posts
  • Vanderdray, Jacob: 91 posts
  • Ken Krugler: 91 posts
  • Feng (Michael) Ji: 88 posts