69 discussions - 249 posts

  • Hi to all, I'm testing nutch 1-2 in pseudo distributed and local mode. I have a database with around 126M urls. They are all injected and prepared to fetch. When generating segments, there is always ...
    Axierr
    Feb 2, 2011 at 5:52 pm
    Feb 5, 2011 at 7:25 pm
  • When I crawl a site with a PDF link containing Arabic words, it doesn't return the Arabic words in the PDF when I search with Nutch. What can I do? Please help me. -- View this message in context ...
    Hala
    Feb 13, 2011 at 5:22 pm
    Feb 20, 2011 at 5:26 pm
  • Hello list, Whilst using Nutch-1.2 on ubuntu 10.04 and undertaking a crawl either using crawl command or separate commands I can never seem to crawl the following site www.scotland.gov.uk My logs ...
    McGibbney, Lewis John
    Feb 20, 2011 at 7:17 pm
    Feb 25, 2011 at 7:38 am
  • Hi all, I was looking at the following example, http://wiki.apache.org/nutch/WritingPluginExample In the example, the author sets a boost of 5.0f for the recommended tag. In this same way, can I also ...
    .: Abhishek :.
    Feb 7, 2011 at 1:28 am
    Feb 8, 2011 at 2:19 am
  • Hi, We are using nutch 1.2 to crawl our intranet pages that require authentication. We followed the steps listed on nutch Wiki http://wiki.apache.org/nutch/HttpAuthenticationSchemes we have ...
    Carl Zha
    Feb 23, 2011 at 11:54 pm
    Feb 25, 2011 at 6:25 pm
  • Hi everyone, I've built a little hadoop program to build an inverted index from a text collection. It performs basic analysis: tokenization, lowercasing, stopword removal. I was wondering if I could ...
    Marco Didonna
    Feb 8, 2011 at 10:07 am
    Feb 8, 2011 at 12:27 pm
  • Hi all, I am just trying to figure out if there is some way I can set Nutch crawls between a time interval say like crawl from 12:00 AM to 12:00 PM and then start the further processing(start process ...
    .: Abhishek :.
    Feb 9, 2011 at 1:17 am
    Feb 11, 2011 at 4:23 am
  • Dear all, I'm crawling the Internet (injecting 100 URLs and crawling to depth 10) with Nutch 1.2. After some time I receive the following (at depth 3): -activeThreads=1500, spinWaiting=131, ...
    Andrey Sapegin
    Feb 3, 2011 at 8:56 am
    May 30, 2011 at 5:28 pm
  • Has anybody installed Nutch on ec2 with using aws elastic map reduce underneath? -- Amin Bandeali Cell: 714.757.9544 Follow me on twitter http://twitter.com/aminbandeali DISCLAIMER This e-mail is ...
    Amin Bandeali
    Feb 7, 2011 at 2:35 am
    Apr 27, 2011 at 1:23 pm
  • I recently installed Nutch and have spent some time trying to get it working with limited success. ./nutch crawl urls -dir crawl -depth 5 -topN 50 After the crawl completes I am trying to run the web ...
    Jeremy Arnold
    Feb 24, 2011 at 10:14 pm
    Feb 25, 2011 at 1:49 am
  • Hi all, I have a question concerning updating a site's score in Nutch 1.2. In org.apache.nutch.crawl.CrawlDbReducer's reduce-method I found a call to scfilters.updateDbScore((Text)key, oldSet ? old ...
    David Saile
    Feb 2, 2011 at 12:19 pm
    Feb 7, 2011 at 7:02 am
  • Hi, When I look through the fetched results, I find some URLs were fetched and some weren't. How can I make sure that every URL is fetched? Thanks, Jeff
    Jeff Zhou
    Feb 18, 2011 at 1:48 pm
    Feb 20, 2011 at 2:59 am
  • Hi all, I want to know whether the indexing method used by Nutch is a well-accepted standard, like an RFC standard? Thanks in advance, Amna Waqar
    Amna Waqar
    Feb 12, 2011 at 10:04 am
    Feb 12, 2011 at 11:15 am
  • Hi all, I am crawling a really huge site, and the crawl has been running for almost 5 days now and it's still continuing. So until this crawl ends, I will not be able to see the results? What do ...
    .: Abhishek :.
    Feb 4, 2011 at 1:42 am
    Feb 5, 2011 at 3:52 am
  • Hi all, I am writing a custom HtmlParserFilter by implementing the HtmlParseFilter. And, I am using the ParserChecker for testing the filter. I could see by some Syso's in the HTMLParseFilters class ...
    .: Abhishek :.
    Feb 2, 2011 at 2:49 am
    Feb 4, 2011 at 1:37 am
  • Hello, I'm using nutch-1.2 and would like to run a search like... Query q = NumericRangeQuery.newLongRange("tstamp", 20101216171805246L, 20101217024017851L, true, true); ...but 0 results. If I check ...
    Eggebrecht, Thomas (GfK Marktforschung)
    Feb 16, 2011 at 2:24 pm
    May 2, 2011 at 1:33 pm
  • Hi, I want to separate parsing from crawling in Nutch. In other words, I want to crawl thousands of URLs and save the contents in local drive, and parse the contents later after crawling is ...
    Jeff Zhou
    Feb 18, 2011 at 1:41 pm
    Feb 20, 2011 at 6:53 am
  • I followed the NutchTutorial and got search working, but I have several questions. 1st, is it possible for a website to set up some restriction so that nutch cannot fetch its pages or the pages ...
    Thomas Anderson
    Feb 18, 2011 at 10:11 am
    Feb 19, 2011 at 8:44 am
  • Hi, We are using nutch 1.2 we have overridden the 'plugin.includes' property of 'conf/nutch-default.xml' with 'conf/nutch-site.xml' and replaced 'protocol-http' with 'protocol-httpclient'. please see ...
    Carl Zha
    Feb 16, 2011 at 11:04 pm
    Feb 16, 2011 at 11:38 pm
  • But is there any way to programmatically modify the config files behind Nutch? I am talking specifically about crawl-urlfilter.txt and the Solr mapping file. My inquiring mind wants to know ;-) ...
    Adam Estrada
    Feb 11, 2011 at 3:00 am
    Feb 12, 2011 at 4:09 pm
  • Hi all, When do we use the -solr param for the nutch crawl? And is it a mandate that solr should be running in the solr URL passed in the -solr? Should I be using it as, bin/nutch crawl ..... -solr ...
    .: Abishek :.
    Feb 10, 2011 at 3:18 am
    Feb 10, 2011 at 2:28 pm
  • Hi list, I am at Solr indexing stage and seem to have hit trouble when sending crawldb linkdb and segments/* to Solr to be indexed. I have added xml file to $CATALINA_HOME/cong/catalina/localhost ...
    McGibbney, Lewis John
    Feb 9, 2011 at 11:36 pm
    Feb 10, 2011 at 1:34 pm
  • Running version 1.2. A very simple page I'm using to seed some URLs but don't want to return in the index itself has this metatag: <head <META http-equiv="Content-Type" content="text/html ...
    Joshua J Pavel
    Feb 7, 2011 at 9:42 pm
    Feb 9, 2011 at 2:35 pm
  • I am using the following command [root@Amna search]# bin/nutch readseg -dump /user/root/crawl/segments/20110124205537/ amna_out but output is SegmentReader: dump segment ...
    Amna Waqar
    Feb 2, 2011 at 11:47 am
    Feb 2, 2011 at 12:27 pm
  • Hi all, I am planning to implement a negative keyword indexer such that if a negative keyword appears in a segment, it should never show up during the search. I have the following steps in mind, ...
    .: Abhishek :.
    Feb 1, 2011 at 4:14 am
    Feb 2, 2011 at 2:01 am
  • Hello, I would like to use nutch to crawl and index some sites in local network, but the server require client certificate. How can I configure nutch crawler to use some X.509 certificate for secured ...
    Slavo
    Feb 21, 2011 at 6:28 pm
    Apr 14, 2011 at 2:16 pm
  • Hi guys, I'm using nutch-1.0 for Chinese web search. I changed NutchDocumentAnalyzer.java to use imdict-chinese-analyzer, which is dedicated to Chinese word segmentation. After successfully crawled ...
    Jason Shi
    Feb 28, 2011 at 2:54 am
    Feb 28, 2011 at 9:21 am
  • Does anybody know if Hadoop still does a fork to run whoami? -- http://www.linkedin.com/in/paultomblin http://careers.stackoverflow.com/ptomblin
    Paul Tomblin
    Feb 23, 2011 at 2:09 pm
    Feb 26, 2011 at 11:17 pm
  • I would like to do a large crawl and let nutch run to index up to 10-100 million webpages. I know on http://wiki.apache.org/nutch/NutchTutorial the nutch crawl command will do all steps with just ...
    Firespin
    Feb 26, 2011 at 5:39 pm
    Feb 26, 2011 at 6:53 pm
  • Hello everyone, I'm currently thinking of using Nutch in a new website project. My aim is to index files (HTML, TXT, PDF ...) stored on a filesystem (which Nutch can ), but some of the files may have ...
    Kaola
    Feb 25, 2011 at 11:41 am
    Feb 25, 2011 at 4:58 pm
  • I learn setting up nutch to crawl a website through http://wiki.apache.org/nutch/NutchHadoopTutorial. When testing to crawl the url http://lucene.apache.org as described in tutorial, I keep getting ...
    Thomas Anderson
    Feb 21, 2011 at 5:17 am
    Feb 21, 2011 at 9:50 am
  • I follow the tutorial at http://wiki.apache.org/nutch/NutchTutorial to start crawling web pages. The usage with crawl command works bin/nutch crawl ../test-domain/urls -dir ../test-domain -depth 3 ...
    Chia-Hung Lin
    Feb 16, 2011 at 9:54 am
    Feb 16, 2011 at 10:59 am
  • Hi all, I am new to Nutch. I want to use Nutch's MapReduce indexer to index files on a local filesystem. And I want to customize the field adding to the index. I searched the Internet for a while, ...
    Wenhao Xu
    Feb 10, 2011 at 3:32 am
    Feb 13, 2011 at 12:23 am
  • Hi all, I want to know whether the ASF license of nutch allows us to modify its code, build a new search engine, and then earn revenue from it. Thanks in advance. Regards, Amna Waqar
    Amna Waqar
    Feb 12, 2011 at 9:28 am
    Feb 12, 2011 at 9:53 am
  • hello everybody, I write a plugin for checking the meta tags http-equiv and do some processing based on its content values like content="text/html ; charset=UTF-8" Properties HttpMetaTags = ...
    Amna Waqar
    Feb 11, 2011 at 9:11 am
    Feb 11, 2011 at 3:50 pm
  • Hi, I tried upgrading to hadoop-0.21.0 in nutch-1.2. 'ant package' does not report any errors and the build is successful, but nutch crawl is failing with the following errors: ./bin/nutch crawl ...
    Rishi pathak
    Feb 3, 2011 at 2:06 pm
    Feb 6, 2011 at 4:32 pm
  • Hi list, I am Arjun. I am trying to develop an application in which I'll give a constrained set of urls to the urls file in Nutch. I am able to crawl these urls and get the contents of them by ...
    Arjun Kumar Reddy
    Feb 2, 2011 at 7:52 am
    Feb 2, 2011 at 3:23 pm
  • Hi all, I would like to know whether parsing and application of parsing filters happens after the fetch of the pages or during the process of fetching itself? Thanks, Abi
    .: Abhishek :.
    Feb 2, 2011 at 5:48 am
    Feb 2, 2011 at 10:53 am
  • From the javadocs for CrawlDatum.getFetchTime() (Nutch 1.1): "Returns either the time of the last fetch, or the next fetch time, depending on whether Fetcher or CrawlDbReducer set the time." So is ...
    Mike Baranczak
    Feb 1, 2011 at 10:16 pm
    Feb 1, 2011 at 11:39 pm
  • to add to this... please try Solr for search functionality. Solr.war Thank you Lewis From: Alexander Aristov [alexander.aristov@gmail.com] Sent: 28 February 2011 09:20 To: user@nutch.apache.org Cc ...
    McGibbney, Lewis John
    Feb 28, 2011 at 11:25 am
    Mar 7, 2011 at 3:29 am
  • Hi there, I need to read some pages from segments to get the raw HTML. I do it like: nutch-1.2/bin nutch readseg -get /path/to/segment http://key.value.html -nofetch -nogenerate -noparse -noparsedata ...
    Eggebrecht, Thomas (GfK Marktforschung)
    Feb 28, 2011 at 5:37 pm
    Feb 28, 2011 at 9:11 pm
  • Hello everybody, I want to check the language of the doc, and if lang != ur then I want to delete that doc before it can be accessed. I've used ur.ngp to detect the language of the doc and the plugin is ...
    Amna Waqar
    Feb 25, 2011 at 4:46 am
    Feb 28, 2011 at 3:30 pm
  • Hi, I was looking at nutch as a crawler for indexing into Indri. In Indri's docs, it lists "warc" as a corpus class option described as "WARC (Web ARChive) format, such as is output by the Nutch ...
    Michael Lee
    Feb 28, 2011 at 2:46 pm
    Feb 28, 2011 at 3:14 pm
  • Hello everybody We have a cluster of five nodes, is there a way of telling that slaves are crawling after issuing the command bin/nutch crawl urls -dir ... Ibrahim
    Ibrahim Alkharashi
    Feb 20, 2011 at 4:43 am
    Feb 20, 2011 at 5:22 am
  • Hi everybody, I wrote a plugin named description which only indexes those pages containing a content-type meta tag with the value "text/html; charset=UTF-8" package org.apache.nutch.parse.description; // JDK ...
    Amna Waqar
    Feb 12, 2011 at 6:25 am
    Feb 16, 2011 at 3:36 pm
  • Hi Folks, A while back I nominated Alexis Detreglode for Nutch committership and PMC membership. The VOTE tallies in Nutch PMC-ville have occurred and I'm happy to announce that Alexis is now an ...
    Mattmann, Chris A (388J)
    Feb 15, 2011 at 4:50 pm
    Feb 15, 2011 at 7:32 pm
  • Hi all, I need someone to advise on how to change the search result page from search.jsp to something like php etc. Regards Ronny
    Muwonge Ronald
    Feb 15, 2011 at 4:15 pm
    Feb 15, 2011 at 4:28 pm
  • Hi all, I've read the indexer code in nutch-1.2, which states that store, index and vector are used for each field in the index. What is the reason for using 'vector'? Also, I've understood the ...
    Amna Waqar
    Feb 8, 2011 at 8:49 am
    Feb 14, 2011 at 11:01 pm
  • Hello everyone. I currently have Nutch 1.0. I would like to upgrade to the latest version 1.2. How do I upgrade? Are there any instructions detailing how to upgrade?
    Terrell James
    Feb 12, 2011 at 2:37 am
    Feb 14, 2011 at 10:55 pm
  • Hi folks, From your experience...Could you please let me know how long does nutch take to fetch, parse and index a single web page(approx)? Cheers, Abi
    .: Abishek :.
    Feb 11, 2011 at 4:06 am
    Feb 11, 2011 at 8:04 am
Group Overview
group: user @
categories: nutch, lucene
discussions: 69
posts: 249
users: 54
website: nutch.apache.org

54 users for February 2011

  • Markus Jelsma: 38 posts
  • .: Abishek :.: 37 posts
  • Amna Waqar: 16 posts
  • McGibbney, Lewis John: 14 posts
  • Julien Nioche: 10 posts
  • Estrada Groups: 8 posts
  • Alexander Aristov: 7 posts
  • Axierr: 7 posts
  • Carl Zha: 6 posts
  • Jeff Zhou: 6 posts
  • Thomas Anderson: 6 posts
  • A a: 5 posts
  • David Saile: 4 posts
  • Hala: 4 posts
  • Ken Krugler: 4 posts
  • Marco Didonna: 4 posts
  • Alxsss: 3 posts
  • Arkadi Kosmynin: 3 posts
  • Arjun Kumar Reddy: 3 posts
  • Eggebrecht, Thomas (GfK Marktforschung): 3 posts