96 discussions - 157 posts

  • Hi fellow Nutch users. Long time crawler, first time poster. :-) We're 23m pages into a 100m page crawl and our preliminary tests have shown that a lot of pages contain our agent name, description, ...
    Kirk Gillock
    Dec 5, 2009 at 2:29 pm
    Dec 5, 2009 at 5:55 pm
  • Dear Sir: Firstly, thank you for reading my mail. My questions are as follows: I had successfully installed and run Nutch, and I deployed the Nutch project to Tomcat's webapps. But when I re-crawled ...
    Samttsch
    Jun 17, 2009 at 8:18 am
    Jun 17, 2009 at 8:18 am
  • hi ! I Have : Injector: Converting injected urls to crawl db entries. Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at ...
    admin Local Serveur
    Jun 17, 2009 at 8:18 am
    Jun 17, 2009 at 8:18 am
  • Hi all. Is there any way to extend ParseResult in order to be able to feed it with my own Parse Object Implementation? I need to generate a summary text from each crawled HTML Document, but the ...
    Rodrigo Reyes C.
    Jun 4, 2009 at 11:11 pm
    Jun 4, 2009 at 11:11 pm
  • Hi All, I have some starting Nutch questions that I am hoping to gain insight about. I want to start at Dmoz.org and follow links for entertainment (like concerts, art gallery events, etc) and ...
    Jason Todd Slack-Moehrle
    Apr 19, 2009 at 11:24 pm
    Apr 19, 2009 at 11:24 pm
  • Hello! Is it possible to get indexed wordlist from nutch? Is it possible to get fetched URLS from nutch? Is it Possible to get cached pages from nutch? Thanks
    Ilia chachkhunashvili
    Apr 19, 2009 at 5:43 pm
    Apr 19, 2009 at 5:43 pm
  • I'm using nutch 1.0. My subcollections.xml config file is configured like this: <?xml version="1.0" encoding="UTF-8"?> <subcollections> <subcollection> <name>sub1</name> <id>sub1</id> <whitelist> ...
    Filipe Antunes
    Apr 8, 2009 at 2:10 pm
    Apr 8, 2009 at 2:10 pm
  • I have been tasked by my boss with finding out whether Nutch indexes content in an image in a PDF document via OCR and then recognizes it as text. So in other words, if someone uploads a PDF document to our ...
    Robert Edmiston
    Feb 26, 2009 at 7:40 pm
    Feb 27, 2009 at 5:08 pm
  • Hi, I am planning to do a huge crawl using Nutch (billions of URLs) and so need to understand whether Nutch can handle restarts after a crash. For a single system, if I press Ctrl+C while Nutch is running ...
    Hrishikesh Agashe
    Feb 18, 2009 at 7:15 am
    Feb 18, 2009 at 1:36 pm
  • I'm interested in writing an application that analyzes sources every time they are updated, and uses the parsedText, tags, title, etc to perform some operations and export the finished data to a ...
    John Crepezzi
    Feb 7, 2009 at 1:11 pm
    Feb 7, 2009 at 1:11 pm
  • Hi This program is probably the fastest and the easiest way to create an income stream that can gradually replace your current job or career! Get $349,859.00 For A One-Time Fee Of $29.99! No ...
    James redden
    Sep 8, 2008 at 8:47 pm
    Sep 8, 2008 at 8:47 pm
  • Hi, I know the properties in nutch-0.9/conf/nutch-default.xml boost the weight of certain elements on a page when that page is getting ranked in the index. I need to understand all the factors in how ...
    Djimmy
    May 6, 2008 at 5:29 pm
    May 6, 2008 at 5:29 pm
  • please can you STOP sitesell from leaching and crawling all over my site www.georgiosi.com , i am receiving false statistics and this is NOT good. just take it off my site. : (
    Georgiosi ...
    Jan 17, 2008 at 6:05 pm
    Jan 17, 2008 at 6:28 pm
  • Hello all, I was trying to figure out the best method to crawl a site without getting any of the irrelevant bits such as flash widgets, javascript, links to ad networks, and others. The objective is ...
    Viksit Gaur
    Jan 7, 2008 at 3:53 am
    Jan 7, 2008 at 3:53 am
  • Didn't I block this with http://jidanni.org/robots.txt ?: 124.115.4.226 - - [01/Jan/2008:02:13:46 -0800] "GET /geo/antipodes/images/tai_par_arg.png HTTP/1.1" 200 4773 "http://image.soso.com" ...
    Jidanni
    Jan 2, 2008 at 12:12 pm
    Jan 2, 2008 at 6:13 pm
  • Hi Everybody, We are planning to use Nutch-0.9 Web-Crawler. It works fine with any static website that has some static content. It crawls and creates the binary DB. We have another CMS that's content ...
    Chandra shekher gupta
    Dec 26, 2007 at 8:40 am
    Dec 26, 2007 at 8:40 am
  • Hi, I just saw that my emails to you appear on this page http://www.nabble.com/Fw:-Blocked-nutch-spider-accessing-pages-t4877480.html It was not my intent for these emails to be made available for ...
    Bluebrit
    Dec 10, 2007 at 8:02 am
    Dec 12, 2007 at 12:20 am
  • Hi, I am identifying Nutch bot hits by looking for "nutch" in my user logs. How do I identify users viewing my pages as a result of the Nutch index? John Sankey http://sankey.ws/searchbots.html ...
    John Sankey
    Dec 6, 2007 at 4:51 pm
    Dec 7, 2007 at 4:32 pm
  • Hi Can you please stop your robot indexing carpages.co.uk Many Thanks DiV Support the World Aids Awareness campaign this month with Yahoo! For Good http://uk.promotions.yahoo.com/forgood/
    Div div
    Dec 3, 2007 at 10:39 am
    Dec 3, 2007 at 4:28 pm
  • Hello, I am writing this email to you because of the following. Blocked spider in robots.txt found in log file. User-agent: Nutch Disallow: / To date this month Nutch has appeared in site log an ...
    Bluebrit
    Nov 14, 2007 at 9:08 am
    Nov 27, 2007 at 2:15 pm
  • Hi, I have finished a detailed Latest step by Step Installation guide for dummies: Nutch 0.9. http://www.thechristianlife.com/z/NutchGuideForDummies.htm Please add this link to the homepage if it is ...
    Peter Wang
    Oct 23, 2007 at 10:26 am
    Oct 23, 2007 at 10:26 am
  • Hi everyone, Does anyone know if there is an option to fetch a single URL or a group of wanted URLs using the fetcher, but (!!) without fetching previous links extracted from URLs which have already been fetched ...
    Eyal edri
    Sep 3, 2007 at 9:14 am
    Sep 3, 2007 at 10:14 am
  • Hi, Can anyone explain what is different in fetch2 vs fetch? I've run fetch2, and I see it is restricted by the number of threads given to it (in practice, when I run it with 1000 threads, it's much ...
    Eyal edri
    Sep 3, 2007 at 5:52 am
    Sep 3, 2007 at 5:52 am
  • Hi, I am trying to use the Nutch fetcher to download EXE/ZIP files from web pages. I've removed the suffixes from the regex-urlfilter & automation-urlfilter (files identical): regex-urlfilter.txt: ...
    Eyal edri
    Sep 2, 2007 at 4:01 pm
    Sep 2, 2007 at 4:01 pm
  • Hello, I'm testing Nutch 0.9 in the "Whole-Web" approach, where I use a set of commands to run the engine instead of just running "crawl", i.e. nutch inject, nutch generate, nutch fetch, nutch updatedb.. ...
    Eyal edri
    Aug 30, 2007 at 7:49 am
    Aug 30, 2007 at 8:22 am
  • Hi, I was looking at my log files. I blocked the Nutch spider many, many months ago, but it doesn't appear to be obeying. http://www.pa-roots.org/robots.txt is my robots file. Here are examples ...
    Nathan Zipfel
    Aug 30, 2007 at 5:58 am
    Aug 30, 2007 at 5:58 am
  • Hello, My configuration and stats are at the end of this email. I have set up Nutch to crawl 100,000 URLs. The first pass (of 100,000 items) went well, but problems started after this. 1. Generate ...
    Misc
    Aug 29, 2007 at 1:28 am
    Aug 30, 2007 at 7:25 pm
  • Hi, I want to write a plugin which makes use of the content of the retrieved documents. So which extension point should I use ? For example if I want to expand the query by local feed back method, I ...
    Srinivasarao Vundavalli
    Aug 7, 2007 at 3:17 pm
    Aug 20, 2007 at 11:27 am
  • It seems that Nutch won't index pages in UTF-16. If I change the page to UTF-8, then it works correctly. Any help, please? Regards
    Blaž Smolnikar
    Jul 26, 2007 at 6:51 am
    Jul 26, 2007 at 6:51 am
  • Dear Nutch developers, I have had problems with a Nutch based robot during the last 12 hours, which I have now solved by banning this particular bot from my server (not Nutch completely for the ...
    Lutz Zetzsche
    Jun 3, 2007 at 10:15 am
    Jun 3, 2007 at 10:15 am
  • Hi All: Can I make nutch to crawl and create separate indices based on scope , where scope is determined from the querystring? For example: Let's assume that I'm having URL like: ...
    Vikas
    May 7, 2007 at 12:50 pm
    May 7, 2007 at 12:50 pm
  • Hi all, I'm trying to make Nutch support Chinese and got a funny issue: the crawler printed out the following log information: Indexing [http://sc.yfsz.com/cat.asp?catid=120] with analyzer ...
    Songjue
    Apr 16, 2007 at 2:37 pm
    Apr 16, 2007 at 2:37 pm
  • When I run nutch as a standard user the crawl barfs at 893M (1021 disk space used in total). Java jumps up to 100% CPU time at this point. I have no quotas or limits on this user account. A ...
    James redden
    Mar 28, 2007 at 4:10 pm
    Mar 28, 2007 at 4:10 pm
  • Hi, I am basically new to Nutch, and I have to build a local search engine for a city, for example a search engine for the city of Bangalore, where I can look for a restaurant or anything depending on ...
    Rahul garg
    Mar 27, 2007 at 10:30 am
    Mar 28, 2007 at 9:22 am
  • It would seem that one could use http://wiki.apache.org/lucene-hadoop/AmazonEC2 to run many hours of spidering in a single hour by having a bunch of xen virtual machine instances set to do this. If ...
    D e
    Mar 11, 2007 at 2:50 pm
    Mar 11, 2007 at 2:50 pm
  • Hi, I've got a few questions about customizing the crawling process. I tried checking out the Wiki, but many of the pages linked from "Becoming a Nutch Developer" are still unwritten, so any pointers ...
    Ricardo J. Méndez
    Feb 21, 2007 at 3:12 pm
    Feb 21, 2007 at 3:12 pm
  • I wish to use Nutch so that it would crawl the URLs contained in a file (let's say urls/urls.txt) but would stay only within these. I have been using Nutch for a few weeks now but it bothers me to ...
    Pierre-Luc Bacon
    Feb 13, 2007 at 5:22 am
    Mar 28, 2009 at 9:43 pm
  • Nutch spidered one of our sites last night and when it encountered a URL that contained a space character it would ignore everything after the space which caused our application to fail with the ...
    Rick Flosi
    Nov 17, 2006 at 4:13 pm
    Nov 17, 2006 at 4:13 pm
  • Hi, Can you tell me how indexing takes place in Lucene (in depth)? If a document has 1...n indices, which algorithm does it use, and which information retrieval model does it use? Thanks & Regards, Akil Ajani ...
    Ajani, Akil (Cognizant)
    Oct 3, 2006 at 9:27 am
    Oct 3, 2006 at 10:03 am
  • Our web server has been receiving a lot of failing traffic from shopping.com and irl.cs.tamu.edu I believe your crawler is seeing "&section" and replacing it with "§ion" ...
    Fred Tyre
    Jul 26, 2006 at 10:29 pm
    Jul 27, 2006 at 3:15 am
  • Hi there, I'm wondering if anyone can help. We injected 1000 seed URLs into Nutch 0.7.2 (basic configuration + 1000 URLs in regexp filter) and it processed them in just few hours. We just switched to ...
    Vasja Ocvirk
    Jul 26, 2006 at 2:19 pm
    Jul 26, 2006 at 2:19 pm
  • Hello, I am trying to use Nutch to look up specific URLs like these: http://server.domain/appname/get?id=34&view=content http://server.domain/appname/get?id=35&view=content ...
    SKUHRA, Milan
    Jul 20, 2006 at 11:11 am
    Jul 23, 2006 at 10:43 am
  • Hello, your Nutch Crawler has a bug. It tries to read new links from JavaScript parts of websites. Unfortunately, the things it detects are not links. An example would be ...
    Nighthawk
    Jun 21, 2006 at 6:20 pm
    Jun 21, 2006 at 6:20 pm
  • Netopia operates a web hosting service for small web sites. I am seeking a site search function that will enable searches of the small sites from which the search is initiated. Perhaps the "small ...
    Evan Solley
    Jun 19, 2006 at 8:57 pm
    Jun 19, 2006 at 8:57 pm
  • Dear Nutch Project Gurus, I'm the webmaster of http://swisspig.net/, and I have noticed periodic access by the Nutch crawler at U Washington. However, today's access was strange, in that it attempted ...
    Brian Ziman
    Jun 13, 2006 at 10:44 pm
    Jun 13, 2006 at 10:44 pm
  • Hi, This email is to inform you that your crawler has apparantely misbehaved in our site www.carmelwebdesigns.com on May 30, 8:00 to 8:45 pm. It has filled our contact form and sent more that 100 ...
    Info
    May 31, 2006 at 5:14 am
    May 31, 2006 at 5:14 am
  • Nutch filled out a form for some reason at allpar.com - not a big deal if it's for a good cause. REMOTE_ADDR: 128.208.6.227 HTTP_USER_AGENT: NutchCVS/0.8-dev (Nutch running at UW; ...
    Dave
    May 30, 2006 at 8:57 pm
    May 30, 2006 at 8:57 pm
  • Hi, it appears that Nutch doesn't obey the "Crawl-Delay:" robots.txt statement. Our robots.txt defines a crawl-delay of 30, and most robots seem to obey it, unlike this Nutch from tonight: 209.235.6.4 ...
    Rainer M. Canavan
    May 30, 2006 at 8:57 pm
    May 31, 2006 at 12:10 am
  • Maybe it would be smart if your bloody Nutch bot doesn't submit forms? / /
    Jop Brocker - Yes2web
    May 24, 2006 at 7:16 pm
    May 24, 2006 at 7:16 pm
  • Your bot seems to submit contact forms, so I get blank emails daily. It would be ideal if your bot only followed links, and didn't follow form actions.
    John Masone
    May 22, 2006 at 2:45 am
    May 22, 2006 at 6:31 am
Group Overview
group: nutch-agent @
categories: lucene
discussions: 96
posts: 157
users: 103
website: nutch.apache.org

Top users

  • Eyal edri: 6 posts
  • Shahinul Islam: 5 posts
  • Misc: 5 posts
  • Richard Braman: 4 posts
  • Fuad Efendi: 4 posts
  • Jack Tang: 4 posts
  • Mahesh Raman: 4 posts
  • WebExpertsAmerica: 4 posts
  • Bluebrit: 3 posts
  • Adriano50: 3 posts
  • Doug Cutting: 3 posts
  • Dennis Kubes: 3 posts
  • Martin Kuen: 2 posts
  • Daniele Menozzi: 2 posts
  • Srinivasarao Vundavalli: 2 posts
  • Fred Tyre: 2 posts
  • Richard Z. Ward: 2 posts
  • Matthias Jaekle: 2 posts
  • Gal Nitzan: 2 posts
  • Ajani, Akil (Cognizant): 2 posts