-
Hi, I have been trying to run a program that takes the first 10 hits of a Nutch query and writes the parse text of the respective urls in separate files ... The code: import ...
Devj
Feb 5, 2008 at 7:35 am
Feb 7, 2008 at 5:30 pm -
Does anybody know when/how the nutch would catch up with the hadoop versions? Currently the nutch trunk uses hadoop 0.15.0, and result in a runtime no-method error when run with the .17 hadoop. We ...
Kenji Kawai
Feb 9, 2008 at 12:18 am
Feb 21, 2008 at 7:05 am -
Hi, I'm having trouble getting Nutch-0.9 (recompiled with NUTCH-467 applied) to crawl, and have tried many of the fixes that have been suggested here on the mailing list. The following is my Nutch ...
Jiaqi Tan
Feb 20, 2008 at 8:53 pm
Feb 20, 2008 at 11:19 pm -
hello, I have written a parser and indexer for dublin core metadata. is there anyone who has worked on it and can help me out where i have gone wrong. I have followed the instructions on the write ...
Syed Ahmed
Feb 26, 2008 at 7:28 pm
Feb 28, 2008 at 7:51 pm -
Hi all I've implemented a Nutch Bean into my web application (based on Appfuse). I'm pretty sure I've got all the prerequisites I need but when I try to build my application, some of my tests are now ...
Aled Rhys Jones
Feb 3, 2008 at 5:30 pm
Feb 10, 2008 at 6:02 pm -
I found a few things in the org.apache.nutch.crawl package which I want to ask about. I have three questions. (1) In Injector.java, normalize() happens first and then filter() happens, whereas in ...
Susam Pal
Feb 5, 2008 at 5:51 pm
Feb 6, 2008 at 7:10 pm -
Stats?
Hi folks... Is there a way to retrieve stats from Nutch - meaning how many webpages are indexed, to be indexed etc?? When I was working with AspSeek and Mnogosearch in the past I could run a command ...
Paul Stewart
Feb 1, 2008 at 2:52 am
Feb 6, 2008 at 3:39 pm -
First of all, a question on stemming. We've tried applying the patches from the main wiki ( http://wiki.apache.org/nutch/Stemming ) and that seems to work fine for the most part. We are seeing one ...
Nick Tkach
Feb 11, 2008 at 7:02 pm
May 7, 2008 at 9:03 pm -
Hi Guys, I've updated my nutch version to use the latest trunk with the new TIKA jar. I run a crawl and i've got a lot of error like that 2008-02-14 22:02:51,494 INFO conf.Configuration - found ...
Emmanuel
Feb 14, 2008 at 2:08 pm
Feb 17, 2008 at 6:39 am -
Hi, I need to know about another feature of Nutch which is important for me. Is it possible with Nutch to change the weight of a word? To explain: If a word of the query is in the URL of a document, is it ...
Jean-Christophe Alleman
Feb 27, 2008 at 3:51 pm
Feb 28, 2008 at 6:47 pm -
Hello, As per my earlier mails I could not deploy Nutch on Linux . Now am attempting the same using cygwin as per the tutorial by Peter Wang. Can someone from the list help me resolving the attached ...
Jaya Ghosh
Feb 20, 2008 at 11:22 am
Feb 21, 2008 at 4:45 pm -
Hi I'm trying to get a nutch crawl to work, and it keeps stopping at depth 1 even though there should be more data to fetch. I can download a list of urls without any problem using FreeGenerator, but ...
Barry Haddow
Feb 14, 2008 at 4:31 pm
Feb 18, 2008 at 6:18 pm -
Hello - I am using latest nutch trunk on a Linux machine (single file system) - I am trying to fetch about 5-10K pages and every time I run fetch command, after fetching few hundred pages, it starts ...
DS jha
Feb 8, 2008 at 5:17 am
Feb 11, 2008 at 2:10 pm -
I have been working on improving the Generator for the last couple of days and here are the discussion areas I have come up with so far: 1) Would resolving IP addresses inside of the generator be ...
Dennis Kubes
Feb 6, 2008 at 6:59 pm
Feb 7, 2008 at 1:37 pm -
Hi Guys, I've been trying to understand the way we are getting the search results based on all parameters input. My objectives are the following: - sort my results by score - limit the nb of dup - ...
Emmanuel
Feb 17, 2008 at 7:05 am
Mar 17, 2008 at 1:42 pm -
When running a real world search engine, we will have a script to do the fetching all the time and re-index periodically. I am wondering how people manage their segments/indexes data: do you let your ...
Yawl 62952928
Feb 27, 2008 at 9:53 pm
Feb 27, 2008 at 10:18 pm -
Hi All, Anyone seen this before? Exception in thread "main" java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.UTF8 at ...
Euan Clark
Feb 27, 2008 at 12:43 am
Feb 27, 2008 at 2:23 am -
Hello I am trying to setup nutch in a clustered environment using the tutorial at http://wiki.apache.org/nutch/NutchHadoopTutorial I am seeing errors to verify setup on single machine. When I run ...
Developer Developer
Feb 25, 2008 at 4:33 pm
Feb 25, 2008 at 6:51 pm -
Hi, I noticed that nutch 0.8 has a proposed spell checker plugin, but will it work with nutch version 0.9? Has anyone installed the spell checker on 0.9? This is where I found the plugin ...
Ned
Feb 21, 2008 at 5:15 pm
Feb 22, 2008 at 11:15 pm -
Hello, Greetings! Using Peter Wang's tutorial I could finally perform a crawl and deploy Nutch 0.9. To test the search I entered: bin/nutch org.apache.nutch.server.NutchBean apache Got 31 hits Next I ...
Jaya Ghosh
Feb 21, 2008 at 8:53 am
Feb 21, 2008 at 10:22 am -
Hello, I've been doing some crawling to understand how the crawl filter works but I cannot figure out, I followed the tutorial inside the wiki, and even I have added the urls I want to crawl to a ...
Mario Méndez Villegas
Feb 20, 2008 at 12:44 am
Feb 20, 2008 at 10:17 pm -
Dear Nutchers! I would like to ask some newbie question (after reading docs for about a day): * How hard it would be to add support for adaptive refetching of pages depending on how often they ...
Oleg Mürk
Feb 11, 2008 at 6:32 pm
Feb 18, 2008 at 9:55 am -
I am running nutch 0.9. I have run nutch mergesegs many times before. The last couple times I have run, I get the following errors: ----- Merging 14 segments to ...
John Mendenhall
Feb 5, 2008 at 10:06 pm
Feb 13, 2008 at 7:27 pm -
Has anyone tried to apply/use the patches to the Nutch trunk from NUTCH-442? Between that code and the example from Sami's FooFactory weblog I've been able to at least get things running, but still ...
Nick Tkach
Feb 12, 2008 at 4:58 pm
Feb 12, 2008 at 8:53 pm -
Anybody know how to delete an index document in a distributed search server? Is that even possible? Dennis
Dennis Kubes
Feb 7, 2008 at 11:38 pm
Feb 8, 2008 at 3:50 am -
Hi, I would like to know the ratio between (index size)/(collection size) for collections larger than 1 TB. My objective is to have all the index in memory, so having I x GB of memory, what is the ...
Miguel Costa
Feb 29, 2008 at 8:22 pm
Feb 29, 2008 at 10:43 pm -
Hi everybody! I have a problem with boost. I found this on the wiki: http://wiki.apache.org/nutch/WritingPluginExample But I don't understand everything. 1) What's the [Source_Here] in ...
Jean-Christophe Alleman
Feb 28, 2008 at 3:26 pm
Feb 28, 2008 at 6:03 pm -
Nutch experts: Here’s the problem: 1. downloaded Nutch 0.9 from site. 2. Modified the required files to crawl on Linux. 3. http crawl successful and index was created. 4. Modified the files to run a ...
Garnier Garnier
Feb 28, 2008 at 6:41 am
Feb 28, 2008 at 11:38 am -
Hi Guys, We manage a counter to check how many times the URL has been consecutively in state Retry following some trouble to get the page. Here is a sample of the code: case ProtocolStatus.RETRY: // ...
Emmanuel
Feb 26, 2008 at 2:55 pm
Feb 26, 2008 at 4:00 pm -
Hello I am trying to setup nutch in a clustered environment using the tutorial at http://wiki.apache.org/nutch/NutchHadoopTutorial I am seeing the following error in the file ...
Developer Developer
Feb 25, 2008 at 6:54 pm
Feb 25, 2008 at 7:49 pm -
hi, Jiaqi Tan & John Mendenhall i have encountered the same problem, i have tried correct the log4j bug and http://www.mail-archive.com/nutch-co[email protected]/msg01991.html already, and it ...
Ivannie
Feb 23, 2008 at 3:56 am
Feb 24, 2008 at 9:28 pm -
I've searched and searched the archives for any mention of this particular IO error. I suspect this is another newbie error, but most of those I've found in the archives and we've worked through. I ...
Fred Gilmore
Feb 22, 2008 at 10:29 pm
Feb 22, 2008 at 10:51 pm -
Okay, I think I may be missing something here. I'm trying to use the regex-urlfilter.txt and/or crawl-urlfilter.txt to make sure that only a few url roots are accepted and several are rejected. As a ...
Nick Tkach
Feb 22, 2008 at 12:21 am
Feb 22, 2008 at 1:11 am -
after running 3 hours without problems now i am getting NPE: java.lang.NullPointerException at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48) at ...
Lindenblatt
Feb 20, 2008 at 1:55 pm
Feb 20, 2008 at 4:56 pm -
Hi All, I am using nutch 0.9. I want to crawl the webpage in a manner that it should give me the no. of links and the corresponding links in that webpage. But nutch is doing all the things like ...
Naveen Goswami
Feb 18, 2008 at 11:02 am
Feb 18, 2008 at 5:50 pm -
Hello Frens, Are there any instructions or information available on how to install Nutch on an existing Hadoop Cluster on a set of linux boxes. I look at the nutch wiki instructions ...
Developer Developer
Feb 14, 2008 at 1:20 pm
Feb 15, 2008 at 9:19 pm -
Hi, I have just started using hadoop for performing nutch crawls on a cluster of 5 servers. I am using nutch 0.9. I have gone through the initial setup as told in ...
Karthik Ramesh
Feb 10, 2008 at 10:20 am
Feb 11, 2008 at 1:53 am -
Hi Guys, I have a need to run apache (front end search page actually a portal) with nutch (backend search engine). What strategy can I employ to either have apache as my ONLY webserver (i.e. no ...
Hilkiah Lavinier
Feb 8, 2008 at 12:14 pm
Feb 8, 2008 at 4:21 pm -
Hi, Is it possible to control Nutch's indexing and scoring mechanism ?? What are the various classes that should be modified or added ?? -- View this message in context ...
Devj
Feb 7, 2008 at 4:53 pm
Feb 7, 2008 at 5:54 pm -
Hi, I have setup nutch and hadoop successfully. No problem at start.sh and stop.sh. I create a dir name urls with a txt file as seed. After I run the command bin/hadoop dfs -put urls urls it works. I ...
Volkan Ebil
Feb 6, 2008 at 12:23 pm
Feb 6, 2008 at 2:10 pm -
I am using the latest trunk. Whenever I search something in it and click on the cached link, I get this error from cached.jsp:- java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException ...
Mubey N.
Feb 4, 2008 at 7:12 pm
Feb 4, 2008 at 7:15 pm -
I've been running nutch-0.9 on a cluster of 3 linux machines. I've been able to crawl about 2M pages in several segments of around 200-300K pages each. The updatedb job now fails ...
Sandeep Tata
Feb 2, 2008 at 9:27 pm
Feb 2, 2008 at 10:59 pm -
Heritrix uses 1 thread per site/domain during crawling. So if I designate 25 threads for a crawl job and the seedlist has 25,000 URLs that share the same domain, only one thread will be used for the ...
Daniel Clark
Feb 1, 2008 at 4:57 pm
Feb 2, 2008 at 10:01 am -
Hi all, wondering if anybody else had been having problem with the script at: http://wiki.apache.org/nutch/MergeCrawl with nutch-0.9? I am doing the simple crawl like this: bin/nutch url1 -dir crawl1 ...
Boris Lau
Feb 29, 2008 at 7:10 pm
Feb 29, 2008 at 7:10 pm -
Hi all, I am having problem with using parse-xml plugin with nutch 0.9 with a 5-node hadoop to process some XMl documents. It is causing a huge slow down at the crawl-reduce stage (to the point that ...
Boris Lau
Feb 28, 2008 at 9:09 pm
Feb 28, 2008 at 9:09 pm -
Two things: 1- Today, every time we parse a page, we generate many Outlinks. Those Outlinks can be either related links to the same website or links to external website (different hostname). Those ...
Emmanuel
Feb 28, 2008 at 2:08 pm
Feb 28, 2008 at 2:08 pm -
please i want to post my question because somebody ruined it by asking some other irrelevant question in reply to my question. the plugin compiles, but it doesn't index the dc meta fields for some ...
Syed Ahmed
Feb 27, 2008 at 12:23 pm
Feb 27, 2008 at 12:23 pm -
hello, I have written a parser and indexer for dublin core metadata. is there anyone who has worked on it and can help me out where i have gone wrong. I have followed the instructions on the write ...
Syed Ahmed
Feb 27, 2008 at 11:55 am
Feb 27, 2008 at 11:55 am -
Hi nutchers! I am attempting to run the NutchBean.java (0.9 release) using the plugin described in: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 In this example, the meta-tag indexed and ...
Nutchvf
Feb 21, 2008 at 11:39 am
Feb 21, 2008 at 11:39 am -
I am writing a plugin and trying to use a class in the plugin jar file.. and I got the following error.. I searched around and found that there are some problems about classloading. But I don't ...
Guanyu
Feb 20, 2008 at 2:47 am
Feb 20, 2008 at 2:47 am
Group Overview
group | user |
categories | nutch, lucene |
discussions | 64 |
posts | 196 |
users | 63 |
website | nutch.apache.org |
63 users for February 2008