64 discussions - 196 posts

  • Hi, I have been trying to run a program that takes the first 10 hits of a Nutch query and writes the parse text of the respective URLs to separate files. The code: import ...
    Feb 5, 2008 at 7:35 am
    Feb 7, 2008 at 5:30 pm
  • Does anybody know when/how Nutch will catch up with the Hadoop versions? Currently the Nutch trunk uses Hadoop 0.15.0, which results in a runtime no-method error when run with Hadoop 0.17. We ...
    Kenji Kawai
    Feb 9, 2008 at 12:18 am
    Feb 21, 2008 at 7:05 am
  • Hi, I'm having trouble getting Nutch-0.9 (recompiled with NUTCH-467 applied) to crawl, and have tried many of the fixes that have been suggested here on the mailing list. The following is my Nutch ...
    Jiaqi Tan
    Feb 20, 2008 at 8:53 pm
    Feb 20, 2008 at 11:19 pm
  • Hello, I have written a parser and indexer for Dublin Core metadata. Is there anyone who has worked on it and can help me find where I have gone wrong? I have followed the instructions on the write ...
    Syed Ahmed
    Feb 26, 2008 at 7:28 pm
    Feb 28, 2008 at 7:51 pm
  • Hi all I've implemented a Nutch Bean into my web application (based on Appfuse). I'm pretty sure I've got all the prerequisites I need but when I try to build my application, some of my tests are now ...
    Aled Rhys Jones
    Feb 3, 2008 at 5:30 pm
    Feb 10, 2008 at 6:02 pm
  • I found a few things in the org.apache.nutch.crawl package that I want to ask about. I have three questions. (1) In Injector.java, normalize() happens first and then filter() happens, whereas in ...
    Susam Pal
    Feb 5, 2008 at 5:51 pm
    Feb 6, 2008 at 7:10 pm
  • Hi folks... Is there a way to retrieve stats from Nutch - meaning how many webpages are indexed, to be indexed etc?? When I was working with AspSeek and Mnogosearch in the past I could run a command ...
    Paul Stewart
    Feb 1, 2008 at 2:52 am
    Feb 6, 2008 at 3:39 pm
  • First of all, a question on stemming. We've tried applying the patches from the main wiki ( http://wiki.apache.org/nutch/Stemming ) and that seems to work fine for the most part. We are seeing one ...
    Nick Tkach
    Feb 11, 2008 at 7:02 pm
    May 7, 2008 at 9:03 pm
  • Hi Guys, I've updated my Nutch version to use the latest trunk with the new TIKA jar. I ran a crawl and got a lot of errors like this: 2008-02-14 22:02:51,494 INFO conf.Configuration - found ...
    Feb 14, 2008 at 2:08 pm
    Feb 17, 2008 at 6:39 am
  • Hi, I need to know about another feature of Nutch which is important for me. Is it possible with Nutch to change the weight of a word? To explain: if a word of the query is in the URL of a document, is it ...
    Jean-Christophe Alleman
    Feb 27, 2008 at 3:51 pm
    Feb 28, 2008 at 6:47 pm
  • Hello, As per my earlier mails, I could not deploy Nutch on Linux. Now I am attempting the same using Cygwin as per the tutorial by Peter Wang. Can someone from the list help me resolve the attached ...
    Jaya Ghosh
    Feb 20, 2008 at 11:22 am
    Feb 21, 2008 at 4:45 pm
  • Hi I'm trying to get a nutch crawl to work, and it keeps stopping at depth 1 even though there should be more data to fetch. I can download a list of urls without any problem using FreeGenerator, but ...
    Barry Haddow
    Feb 14, 2008 at 4:31 pm
    Feb 18, 2008 at 6:18 pm
  • Hello - I am using the latest Nutch trunk on a Linux machine (single file system) - I am trying to fetch about 5-10K pages and every time I run the fetch command, after fetching a few hundred pages, it starts ...
    DS jha
    Feb 8, 2008 at 5:17 am
    Feb 11, 2008 at 2:10 pm
  • I have been working on improving the Generator for the last couple of days and here are the discussion areas I have come up with so far: 1) Would resolving IP addresses inside of the generator be ...
    Dennis Kubes
    Feb 6, 2008 at 6:59 pm
    Feb 7, 2008 at 1:37 pm
  • Hi Guys, I've been trying to understand how we get the search results based on all the parameters that are input. My objectives are the following: - sort my results by score - limit the number of duplicates - ...
    Feb 17, 2008 at 7:05 am
    Mar 17, 2008 at 1:42 pm
  • When running a real-world search engine, we will have a script to do the fetching all the time and re-index periodically. I am wondering how people manage their segments/indexes data: do you let your ...
    Yawl 62952928
    Feb 27, 2008 at 9:53 pm
    Feb 27, 2008 at 10:18 pm
  • Hi All, Anyone seen this before? Exception in thread "main" java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.UTF8 at ...
    Euan Clark
    Feb 27, 2008 at 12:43 am
    Feb 27, 2008 at 2:23 am
  • Hello, I am trying to set up Nutch in a clustered environment using the tutorial at http://wiki.apache.org/nutch/NutchHadoopTutorial I am seeing errors while verifying the setup on a single machine. When I run ...
    Developer Developer
    Feb 25, 2008 at 4:33 pm
    Feb 25, 2008 at 6:51 pm
  • Hi, I noticed that nutch 0.8 has a proposed spell checker plugin, but will it work with nutch version 0.9? Has anyone installed the spell checker on 0.9? This is where I found the plugin ...
    Feb 21, 2008 at 5:15 pm
    Feb 22, 2008 at 11:15 pm
  • Hello, Greetings! Using Peter Wang's tutorial I could finally perform a crawl and deploy Nutch 0.9. To test the search I entered: bin/nutch org.apache.nutch.server.NutchBean apache Got 31 hits Next I ...
    Jaya Ghosh
    Feb 21, 2008 at 8:53 am
    Feb 21, 2008 at 10:22 am
  • Hello, I've been doing some crawling to understand how the crawl filter works, but I cannot figure it out. I followed the tutorial in the wiki, and I have even added the URLs I want to crawl to a ...
    Mario Méndez Villegas
    Feb 20, 2008 at 12:44 am
    Feb 20, 2008 at 10:17 pm
  • Dear Nutchers! I would like to ask some newbie questions (after reading docs for about a day): * How hard would it be to add support for adaptive refetching of pages depending on how often they ...
    Oleg Mürk
    Feb 11, 2008 at 6:32 pm
    Feb 18, 2008 at 9:55 am
  • I am running nutch 0.9. I have run nutch mergesegs many times before. The last couple times I have run, I get the following errors: ----- Merging 14 segments to ...
    John Mendenhall
    Feb 5, 2008 at 10:06 pm
    Feb 13, 2008 at 7:27 pm
  • Has anyone tried to apply/use the patches to the Nutch trunk from NUTCH-442? Between that code and the example from Sami's FooFactory weblog I've been able to at least get things running, but still ...
    Nick Tkach
    Feb 12, 2008 at 4:58 pm
    Feb 12, 2008 at 8:53 pm
  • Anybody know how to delete an index document in a distributed search server? Is that even possible? Dennis
    Dennis Kubes
    Feb 7, 2008 at 11:38 pm
    Feb 8, 2008 at 3:50 am
  • Hi, I would like to know the ratio between (index size)/(collection size) for collections larger than 1 TB. My objective is to have all the index in memory, so having I x GB of memory, what is the ...
    Miguel Costa
    Feb 29, 2008 at 8:22 pm
    Feb 29, 2008 at 10:43 pm
  • Hi everybody! I have a problem with boost. I found this on the wiki: http://wiki.apache.org/nutch/WritingPluginExample But I don't understand everything. 1) What's the [Source_Here] in ...
    Jean-Christophe Alleman
    Feb 28, 2008 at 3:26 pm
    Feb 28, 2008 at 6:03 pm
  • Nutch experts: Here’s the problem: 1. downloaded Nutch 0.9 from site. 2. Modified the required files to crawl on Linux. 3. http crawl successful and index was created. 4. Modified the files to run a ...
    Garnier Garnier
    Feb 28, 2008 at 6:41 am
    Feb 28, 2008 at 11:38 am
  • Hi Guys, We maintain a counter to check how many times in a row a URL has been in the Retry state after problems fetching the page. Here is a sample of the code: case ProtocolStatus.RETRY: // ...
    Feb 26, 2008 at 2:55 pm
    Feb 26, 2008 at 4:00 pm
  • Hello, I am trying to set up Nutch in a clustered environment using the tutorial at http://wiki.apache.org/nutch/NutchHadoopTutorial I am seeing the following error in the file ...
    Developer Developer
    Feb 25, 2008 at 6:54 pm
    Feb 25, 2008 at 7:49 pm
  • Hi, Jiaqi Tan & John Mendenhall, I have encountered the same problem. I have already tried correcting the log4j bug and http://www.mail-archive.com/nutch-co[email protected]/msg01991.html, and it ...
    Feb 23, 2008 at 3:56 am
    Feb 24, 2008 at 9:28 pm
  • I've searched and searched the archives for any mention of this particular IO error. I suspect this is another newbie error, but most of those I've found in the archives and we've worked through. I ...
    Fred Gilmore
    Feb 22, 2008 at 10:29 pm
    Feb 22, 2008 at 10:51 pm
  • Okay, I think I may be missing something here. I'm trying to use the regex-urlfilter.txt and/or crawl-urlfilter.txt to make sure that only a few url roots are accepted and several are rejected. As a ...
    Nick Tkach
    Feb 22, 2008 at 12:21 am
    Feb 22, 2008 at 1:11 am
  • After running for 3 hours without problems, I am now getting an NPE: java.lang.NullPointerException at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:48) at ...
    Feb 20, 2008 at 1:55 pm
    Feb 20, 2008 at 4:56 pm
  • Hi All, I am using nutch 0.9. I want to crawl the webpage in a manner that it should give me the no. of links and the corresponding links in that webpage. But nutch is doing all the things like ...
    Naveen Goswami
    Feb 18, 2008 at 11:02 am
    Feb 18, 2008 at 5:50 pm
  • Hello Friends, Are there any instructions or information available on how to install Nutch on an existing Hadoop cluster on a set of Linux boxes? I looked at the Nutch wiki instructions ...
    Developer Developer
    Feb 14, 2008 at 1:20 pm
    Feb 15, 2008 at 9:19 pm
  • Hi, I have just started using hadoop for performing nutch crawls on a cluster of 5 servers. I am using nutch 0.9. I have gone through the initial setup as told in ...
    Karthik Ramesh
    Feb 10, 2008 at 10:20 am
    Feb 11, 2008 at 1:53 am
  • Hi Guys, I have a need to run apache (front end search page actually a portal) with nutch (backend search engine). What strategy can I employ to either have apache as my ONLY webserver (i.e. no ...
    Hilkiah Lavinier
    Feb 8, 2008 at 12:14 pm
    Feb 8, 2008 at 4:21 pm
  • Hi, Is it possible to control Nutch's indexing and scoring mechanism? What are the various classes that should be modified or added?
    Feb 7, 2008 at 4:53 pm
    Feb 7, 2008 at 5:54 pm
  • Hi, I have set up Nutch and Hadoop successfully. No problem with start.sh and stop.sh. I created a dir named urls with a txt file as seed. After I run the command bin/hadoop dfs -put urls urls it works. I ...
    Volkan Ebil
    Feb 6, 2008 at 12:23 pm
    Feb 6, 2008 at 2:10 pm
  • I am using the latest trunk. Whenever I search something in it and click on the cached link, I get this error from cached.jsp:- java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException ...
    Mubey N.
    Feb 4, 2008 at 7:12 pm
    Feb 4, 2008 at 7:15 pm
  • I've been running nutch-0.9 on a cluster of 3 Linux machines. I've been able to crawl about 2M pages in several segments of around 200-300K pages each. The updatedb job now fails ...
    Sandeep Tata
    Feb 2, 2008 at 9:27 pm
    Feb 2, 2008 at 10:59 pm
  • Heritrix uses 1 thread per site/domain during crawling. So if I designate 25 threads for a crawl job and the seedlist has 25,000 URLs that share the same domain, only one thread will be used for the ...
    Daniel Clark
    Feb 1, 2008 at 4:57 pm
    Feb 2, 2008 at 10:01 am
  • Hi all, wondering if anybody else has been having problems with the script at: http://wiki.apache.org/nutch/MergeCrawl with nutch-0.9? I am doing the simple crawl like this: bin/nutch url1 -dir crawl1 ...
    Boris Lau
    Feb 29, 2008 at 7:10 pm
    Feb 29, 2008 at 7:10 pm
  • Hi all, I am having a problem using the parse-xml plugin with Nutch 0.9 on a 5-node Hadoop cluster to process some XML documents. It is causing a huge slowdown at the crawl-reduce stage (to the point that ...
    Boris Lau
    Feb 28, 2008 at 9:09 pm
    Feb 28, 2008 at 9:09 pm
  • Two things: 1- Today, every time we parse a page, we generate many Outlinks. Those Outlinks can be either links within the same website or links to external websites (different hostname). Those ...
    Feb 28, 2008 at 2:08 pm
    Feb 28, 2008 at 2:08 pm
  • Please, I want to repost my question because somebody ruined it by asking some other irrelevant question in reply to my question. The plugin compiles, but it doesn't index the DC meta fields for some ...
    Syed Ahmed
    Feb 27, 2008 at 12:23 pm
    Feb 27, 2008 at 12:23 pm
  • Hello, I have written a parser and indexer for Dublin Core metadata. Is there anyone who has worked on it and can help me find where I have gone wrong? I have followed the instructions on the write ...
    Syed Ahmed
    Feb 27, 2008 at 11:55 am
    Feb 27, 2008 at 11:55 am
  • Hi nutchers! I am attempting to run NutchBean.java (0.9 release) using the plugin described in: http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 In this example, the meta-tag indexed and ...
    Feb 21, 2008 at 11:39 am
    Feb 21, 2008 at 11:39 am
  • I am writing a plugin and trying to use a class in the plugin jar file, and I got the following error. I searched around and found that there are some problems with classloading. But I don't ...
    Feb 20, 2008 at 2:47 am
    Feb 20, 2008 at 2:47 am
Group Overview
group: user @
categories: nutch, lucene

63 users for February 2008

  • Dennis Kubes: 30 posts
  • John Mendenhall: 10 posts
  • Payo: 10 posts
  • Susam Pal: 10 posts
  • Andrzej Bialecki: 8 posts
  • Devj: 8 posts
  • Emmanuel: 6 posts
  • Lyndon Maydwell: 6 posts
  • Nick Tkach: 6 posts
  • Jaya Ghosh: 5 posts
  • Jiaqi Tan: 5 posts
  • Volkan Ebil: 5 posts
  • Aled Rhys Jones: 4 posts
  • Barry Haddow: 4 posts
  • Developer Developer: 4 posts
  • Nick Duan: 4 posts
  • Otis Gospodnetic: 4 posts
  • Syed Ahmed: 4 posts
  • DS jha: 3 posts
  • Jasper Kamperman: 3 posts