Grokbase Groups Nutch agent
106 discussions - 167 posts

  • Hi Team, I have a question about the fetcher.queue.mode property. It takes one of three values: byHost, byDomain, byIP. I want to know how each one works. How will Nutch treat byHost and byDomain ...
    Manish Bassi
    Feb 26, 2016 at 7:56 pm
    Feb 26, 2016 at 7:56 pm
  • Mihai, can you use the newest version of Nutch? This log is not enough to understand what happened. Can you share the full log?
    Talat Uyarer
    Oct 21, 2014 at 3:38 am
    Oct 21, 2014 at 3:38 am
  • Hello, I'm trying to set up a Nutch+Solr to crawl a list of domains. I want to get 50 pages per seed in the list (no external links) and save the seed each page came from in the result. The goal is ...
    Pablo Ovelleiro
    Oct 20, 2014 at 11:35 am
    Oct 20, 2014 at 11:35 am
  • Re: Hi! http://bbc-recycl.com/_redirect?burjpegy494650
    Rahul_0996
    Sep 22, 2014 at 9:55 pm
    Sep 22, 2014 at 9:55 pm
  • I have successfully implemented Nutch as the crawler for a Solr index on the http://szukaj.ug.edu.pl site. But there is a problem with the HTTP Referer header. Nutch is not sending referer ...
    SebaZ
    Jun 6, 2012 at 11:17 am
    Jun 6, 2012 at 11:24 am
  • Your IP address 79.125.13.163 has used 1.7 gigabytes of my site's bandwidth (www.zahawi.com) in the past few days, taking it over quota. I believe you may have an issue. Regards, Simon
    Simon Smethurst-McIntyre
    Apr 13, 2010 at 5:02 pm
    Apr 13, 2010 at 5:02 pm
  • Hi, I'm new to Nutch. I use the latest version (1.0) and I'm getting those errors a lot ...
    Asaf halfon
    Mar 28, 2010 at 10:15 am
    Mar 28, 2010 at 10:15 am
  • Hi, everyone! I use Nutch 1.0 to crawl 10M URLs. I found that the fetch slows down when the URLs fall below about 3000 or 4000. I changed the number of threads (1800, 1000, 600), but it made no ...
    陈俊龙
    Jan 18, 2010 at 8:12 pm
    Jan 18, 2010 at 8:12 pm
  • Hello fellow Nutch users, In a few days we'll start crawling a long list of Thai websites. With previous crawls we noticed there were A LOT of poorly formatted html pages and the crawler would ...
    Kirk Gillock
    Dec 6, 2009 at 9:24 pm
    Dec 6, 2009 at 9:24 pm
  • Hi fellow Nutch users. Long time crawler, first time poster. :-) We're 23m pages into a 100m page crawl and our preliminary tests have shown that a lot of pages contain our agent name, description, ...
    Kirk Gillock
    Dec 5, 2009 at 2:29 pm
    Dec 5, 2009 at 5:55 pm
  • Dear Sir: Firstly, thank you for reading my mail. My questions are as follows: I had successfully installed and run Nutch, and I deployed the Nutch project to Tomcat's webapps. But when I re-crawled ...
    Samttsch
    Jun 17, 2009 at 8:18 am
    Jun 17, 2009 at 8:18 am
  • hi ! I Have : Injector: Converting injected urls to crawl db entries. Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at ...
    admin Local Serveur
    Jun 17, 2009 at 8:18 am
    Jun 17, 2009 at 8:18 am
  • Hi all. Is there any way to extend ParseResult in order to be able to feed it with my own Parse object implementation? I need to generate a summary text from each crawled HTML document, but the ...
    Rodrigo Reyes C.
    Jun 4, 2009 at 11:11 pm
    Jun 4, 2009 at 11:11 pm
  • Hi All, I have some starting Nutch questions that I am hoping to gain insight about. I want to start at Dmoz.org and follow links for entertainment (like concerts, art gallery events, etc) and ...
    Jason Todd Slack-Moehrle
    Apr 19, 2009 at 11:24 pm
    Apr 19, 2009 at 11:24 pm
  • Hello! Is it possible to get the indexed word list from Nutch? Is it possible to get fetched URLs from Nutch? Is it possible to get cached pages from Nutch? Thanks
    Ilia chachkhunashvili
    Apr 19, 2009 at 5:43 pm
    Apr 19, 2009 at 5:43 pm
  • I'm using Nutch 1.0. My subcollections.xml config file is configured like this: <?xml version="1.0" encoding="UTF-8"?> <subcollections> <subcollection> <name>sub1</name> <id>sub1</id> <whitelist ...
    Filipe Antunes
    Apr 8, 2009 at 2:10 pm
    Apr 8, 2009 at 2:10 pm
  • I have been tasked by my boss with finding out whether Nutch indexes content in an image in a PDF document via OCR and then recognizes it as text. So in other words, if someone uploads a PDF document to our ...
    Robert Edmiston
    Feb 26, 2009 at 7:40 pm
    Feb 27, 2009 at 5:08 pm
  • Hi, I am planning to do a huge crawl using Nutch (billions of URLs) and so need to understand whether Nutch can handle restarts after a crash. For single system, if I do Ctrl+C while Nutch is running ...
    Hrishikesh Agashe
    Feb 18, 2009 at 7:15 am
    Feb 18, 2009 at 1:36 pm
  • I'm interested in writing an application that analyzes sources every time they are updated, and uses the parsedText, tags, title, etc to perform some operations and export the finished data to a ...
    John Crepezzi
    Feb 7, 2009 at 1:11 pm
    Feb 7, 2009 at 1:11 pm
  • Hi This program is probably the fastest and the easiest way to create an income stream that can gradually replace your current job or career! Get $349,859.00 For A One-Time Fee Of $29.99! No ...
    James redden
    Sep 8, 2008 at 8:47 pm
    Sep 8, 2008 at 8:47 pm
  • Hi, I know the properties in nutch-0.9/conf/nutch-default.xml boost the weight of certain elements on a page when that page is getting ranked in the index. I need to understand all the factors in how ...
    Djimmy
    May 6, 2008 at 5:29 pm
    May 6, 2008 at 5:29 pm
  • please can you STOP sitesell from leaching and crawling all over my site www.georgiosi.com , i am receiving false statistics and this is NOT good. just take it off my site. : (
    Georgiosi ...
    Jan 17, 2008 at 6:05 pm
    Jan 17, 2008 at 6:28 pm
  • Hello all, I was trying to figure out the best method to crawl a site without getting any of the irrelevant bits such as flash widgets, javascript, links to ad networks, and others. The objective is ...
    Viksit Gaur
    Jan 7, 2008 at 3:53 am
    Jan 7, 2008 at 3:53 am
  • Didn't I block this with http://jidanni.org/robots.txt ?: 124.115.4.226 - - [01/Jan/2008:02:13:46 -0800] "GET /geo/antipodes/images/tai_par_arg.png HTTP/1.1" 200 4773 "http://image.soso.com" ...
    Jidanni
    Jan 2, 2008 at 12:12 pm
    Jan 2, 2008 at 6:13 pm
  • Hi Everybody, We are planning to use Nutch-0.9 Web-Crawler. It works fine with any static website that has some static content. It crawls and creates the binary DB. We have another CMS that's content ...
    Chandra shekher gupta
    Dec 26, 2007 at 8:40 am
    Dec 26, 2007 at 8:40 am
  • Hi, I just saw that my emails to you appear on this page http://www.nabble.com/Fw:-Blocked-nutch-spider-accessing-pages-t4877480.html It was not my intent for these emails to be made available for ...
    Bluebrit
    Dec 10, 2007 at 8:02 am
    Dec 12, 2007 at 12:20 am
  • Hi, I am identifying Nutch bot hits by looking for "nutch" in my user logs. How do I identify users viewing my pages as a result of the Nutch index? John Sankey http://sankey.ws/searchbots.html ...
    John Sankey
    Dec 6, 2007 at 4:51 pm
    Dec 7, 2007 at 4:32 pm
  • Hi Can you please stop your robot indexing carpages.co.uk Many Thanks DiV Support the World Aids Awareness campaign this month with Yahoo! For Good http://uk.promotions.yahoo.com/forgood/
    Div div
    Dec 3, 2007 at 10:39 am
    Dec 3, 2007 at 4:28 pm
  • I sent the below original email to you without reply two weeks ago and as you can see my domain is still being crawled by your spider. Please advise me how to block it permanently from my domain or i ...
    Bluebrit
    Nov 26, 2007 at 7:38 pm
    Nov 27, 2007 at 2:15 pm
  • Hello, I am writing this email to you because of the following. Blocked spider in robots.txt found in log file. User-agent: Nutch Disallow: / To date this month Nutch has appeared in site log an ...
    Bluebrit
    Nov 14, 2007 at 9:08 am
    Nov 14, 2007 at 9:08 am
  • Hi, I have finished a detailed Latest step by Step Installation guide for dummies: Nutch 0.9. http://www.thechristianlife.com/z/NutchGuideForDummies.htm Please add this link to the homepage if it is ...
    Peter Wang
    Oct 23, 2007 at 10:26 am
    Oct 23, 2007 at 10:26 am
  • Hi everyone, Does anyone know if there is an option to fetch a single URL or a group of wanted URLs using the fetcher, but (!!) without fetching previous links extracted from URLs which have already been fetched ...
    Eyal edri
    Sep 3, 2007 at 9:14 am
    Sep 3, 2007 at 10:14 am
  • Hi, Can anyone explain what is different in fetch2 vs fetch? I've run fetch2, and I see it is restricted by the number of threads given to it (in practice, when I run it with 1000 threads, it's much ...
    Eyal edri
    Sep 3, 2007 at 5:52 am
    Sep 3, 2007 at 5:52 am
  • Hi, I am trying to use the nutch fetcher for d/l EXE/ZIP files from web pages. i've removed the suffixes from the regex-urlfilter & automation-urlfilter(files identical): regex-urlfilter.txt ...
    Eyal edri
    Sep 2, 2007 at 4:01 pm
    Sep 2, 2007 at 4:01 pm
  • Hello, I'm testing Nutch 0.9 in the "Whole-Web" approach, where I use a set of commands to run the engine instead of just running "crawl", i.e. nutch inject, nutch generate, nutch fetch, nutch updatedb. ...
    Eyal edri
    Aug 30, 2007 at 7:49 am
    Aug 30, 2007 at 8:22 am
  • Hi, I was looking at my log files. I had the Nutch spider blocked many, many months ago, but it doesn't appear to be obeying. http://www.pa-roots.org/robots.txt is my robots file. Here are examples ...
    Nathan Zipfel
    Aug 30, 2007 at 5:58 am
    Aug 30, 2007 at 5:58 am
  • Hello- My configuration and stats are at the end of this email. I have set up nutch to crawl 100,000 urls. The first pass (of 100,000) items went well, but problems started after this. 1. Generate ...
    Misc
    Aug 29, 2007 at 1:28 am
    Aug 30, 2007 at 7:25 pm
  • Hi, I want to write a plugin which makes use of the content of the retrieved documents. So which extension point should I use? For example, if I want to expand the query by a local feedback method, I ...
    Srinivasarao Vundavalli
    Aug 7, 2007 at 3:17 pm
    Aug 20, 2007 at 11:27 am
  • It seems that Nutch won't index pages in UTF-16. If I change the page to UTF-8, then it works correctly. Any help, please? Regards
    Blaž Smolnikar
    Jul 26, 2007 at 6:51 am
    Jul 26, 2007 at 6:51 am
  • Dear Nutch developers, I have had problems with a Nutch based robot during the last 12 hours, which I have now solved by banning this particular bot from my server (not Nutch completely for the ...
    Lutz Zetzsche
    Jun 3, 2007 at 10:15 am
    Jun 3, 2007 at 10:15 am
  • Hi All: Can I make nutch to crawl and create separate indices based on scope , where scope is determined from the querystring? For example: Let's assume that I'm having URL like ...
    Vikas
    May 7, 2007 at 12:50 pm
    May 7, 2007 at 12:50 pm
  • Hi all, I'm trying to make Nutch support Chinese and got a funny issue: the crawler printed out the following log information: Indexing [http://sc.yfsz.com/cat.asp?catid=120] with analyzer ...
    Songjue
    Apr 16, 2007 at 2:37 pm
    Apr 16, 2007 at 2:37 pm
  • When I run nutch as a standard user the crawl barfs at 893M (1021 disk space used in total). Java jumps up to 100% CPU time at this point. I have no quotas or limits on this user account. A ...
    James redden
    Mar 28, 2007 at 4:10 pm
    Mar 28, 2007 at 4:10 pm
  • Hi, I am basically new to Nutch, and I have to build a local search engine for a city. For example, a search engine for the city of Bangalore, where I can look for a restaurant or anything depending on ...
    Rahul garg
    Mar 27, 2007 at 10:30 am
    Mar 28, 2007 at 9:22 am
  • It would seem that one could use http://wiki.apache.org/lucene-hadoop/AmazonEC2 to run many hours of spidering in a single hour by having a bunch of xen virtual machine instances set to do this. If ...
    D e
    Mar 11, 2007 at 2:50 pm
    Mar 11, 2007 at 2:50 pm
  • Hi, I've got a few questions about customizing the crawling process. I tried checking out the Wiki, but many of the pages linked from "Becoming a Nutch Developer" are still unwritten, so any pointers ...
    Ricardo J. Méndez
    Feb 21, 2007 at 3:12 pm
    Feb 21, 2007 at 3:12 pm
  • I wish to use Nutch so that it would crawl the URLs contained in a file (let's say urls/urls.txt) but would stay only within these. I have been using Nutch for a few weeks now but it bothers me to ...
    Pierre-Luc Bacon
    Feb 13, 2007 at 5:22 am
    Mar 28, 2009 at 9:43 pm
  • Nutch spidered one of our sites last night and when it encountered a URL that contained a space character it would ignore everything after the space which caused our application to fail with the ...
    Rick Flosi
    Nov 17, 2006 at 4:13 pm
    Nov 17, 2006 at 4:13 pm
  • Hi, Can you tell me how indexing takes place in Lucene (in depth)? If a document has 1....n indices, which algorithm does it use, and which information retrieval model does it use... Thanks & Regards, Akil Ajani ...
    Ajani, Akil (Cognizant)
    Oct 3, 2006 at 9:27 am
    Oct 3, 2006 at 10:03 am
  • Our web server has been receiving a lot of failing traffic from shopping.com and irl.cs.tamu.edu I believe your crawler is seeing "&section" and replacing it with "§ion" ...
    Fred Tyre
    Jul 26, 2006 at 10:29 pm
    Jul 27, 2006 at 3:15 am
Group Overview
group: agent @
categories: nutch, lucene
discussions: 106
posts: 167
users: 112
website: nutch.apache.org

Top users

  • Eyal edri: 6 posts
  • Shahinul Islam: 5 posts
  • Misc: 5 posts
  • Fuad Efendi: 4 posts
  • Richard Braman: 4 posts
  • WebExpertsAmerica: 4 posts
  • Jack Tang: 4 posts
  • Mahesh Raman: 4 posts
  • Adriano50: 3 posts
  • Doug Cutting: 3 posts
  • Dennis Kubes: 3 posts
  • Kirk Gillock: 3 posts
  • Bluebrit: 3 posts
  • Fred Tyre: 2 posts
  • Ajani, Akil (Cognizant): 2 posts
  • Ricardo J. Méndez: 2 posts
  • Ken Krugler: 2 posts
  • Srinivasarao Vundavalli: 2 posts
  • James redden: 2 posts
  • Martin Kuen: 2 posts