Grokbase Groups Nutch user

8,906 discussions - 32,231 posts

  • Hi Folks, 72 hours has come and gone. Thank you to everyone that was able to VOTE and thank you to everyone who has contributed to our continually growing community. The RESULTS are as follows [8] +1 ...
    Lewis John Mcgibbney
    Jun 18, 2016 at 5:21 pm
    Jun 18, 2016 at 5:21 pm
  • Hi folks, I am curious as to whether Nutch 2.x might solve some of the problems we are experiencing with Nutch 1.11 at a very large scale (multiple billions of URLs). For now, the primary issue is the ...
    Joseph Naegele
    Jun 17, 2016 at 1:00 pm
    Jun 17, 2016 at 8:40 pm
  • Hi, On the seed page there are a few hundred links (approx. 400) in a large list of items that must be indexed. I already made sure that the number of inbound and outbound links in the settings are ...
    Jigal van Hemert | alterNET internet BV
    Jun 16, 2016 at 2:56 pm
    Jun 16, 2016 at 2:56 pm
  • Hi Folks, A first candidate for the Nutch 1.12 release is available at: https://dist.apache.org/repos/dist/dev/nutch/1.12/ The release candidate is a zip and tar archive of the sources tag available ...
    Lewis john mcgibbney
    Jun 15, 2016 at 5:15 am
    Jun 16, 2016 at 2:06 pm
  • Hi Guys, I am attempting to run nutch using cygwin, and I am having the following problem: Ps. I added Hadoop-core to the lib folder already - I appreciate any insight or comment you guys may have - ...
    Jamal, Sarfaraz
    Jun 13, 2016 at 9:36 pm
    Jun 16, 2016 at 3:35 pm
  • I have installed Nutch 2.3.1 with MongoDB and successfully crawled thousands of pages. But suddenly, the Nutch 2.3.1 Generator is not generating any URLs. Seed list URLs are accepted ...
    Jean Vence
    Jun 13, 2016 at 8:57 pm
    Jun 15, 2016 at 8:29 am
  • Hi folks, I'm in the process of indexing a large number of docs using Nutch 1.11 and the indexer-elastic plugin. I've observed slow indexing performance and narrowed it down to the map phase and ...
    Joseph Naegele
    Jun 13, 2016 at 4:56 pm
    Jun 14, 2016 at 8:27 pm
  • Hello, Sorry if this is a dumb question. I can't find good info. I've installed Nutch 1.11 and Solr 6.0.1 on a computer. As I'm new to Nutch/Solr, I've followed the old tutorial at ...
    Jose-Marcio Martins da Cruz
    Jun 13, 2016 at 1:07 pm
    Jun 14, 2016 at 11:48 am
  • I would like to "groom" the crawldb. My guess is that it should be easy to build upon the function that removes the 404 status and duplicates. But where do I find these? Thank you
    BlackIce
    Jun 13, 2016 at 12:19 pm
    Jun 15, 2016 at 6:09 pm
  • Hi All - When I change storage.schema.webpage to something other than webpage, nutch 2.x still uses the table webpage in HBase. For example, I changed it to testCollect1, and I get this message ...
    Joseph Obernberger
    Jun 10, 2016 at 11:26 pm
    Jun 14, 2016 at 8:28 pm
  • Hi all, Just to update this..I did correct a silly problem I had in the url, and now am using the following. In short though, I still get a URISyntaxException. bin/nutch index crawl/crawldb -linkdb ...
    Tim Johnson
    Jun 10, 2016 at 1:47 pm
    Jun 10, 2016 at 1:47 pm
  • Hi all, I've been having some problems trying to get this to work. I started off with solr 4.9.1 and using the basic nutch tutorial at https://wiki.apache.org/nutch/NutchTutorial, followed the steps ...
    Tim Johnson
    Jun 9, 2016 at 7:57 pm
    Jun 9, 2016 at 7:57 pm
  • <http://stackoverflow.com/questions/37731716/indexing-nutch-crawled-data-in-bluemix-solr> I'm trying to index the Nutch-crawled data with Bluemix Solr and I cannot find any way to do ...
    Shakiba davari
    Jun 9, 2016 at 5:11 pm
    Jun 16, 2016 at 9:04 pm
  • Hi All, I'm getting the following errors when running updatedb. Can someone tell me what's going wrong and how to solve it? Thanks. 16/06/04 00:58:42 INFO mapreduce.Job: map 0% reduce 0% 16/06/04 00:59:27 INFO ...
    Nana Pandiawan
    Jun 6, 2016 at 1:27 am
    Jun 7, 2016 at 6:30 am
  • Hi, We are trying to run Nutch with Selenium and are getting the error "GDK_BACKEND does not match available displays". We have tried a lot to resolve this. Can anyone help? I am getting this error only ...
    Deepa Jayaveer
    Jun 3, 2016 at 9:59 am
    Jun 3, 2016 at 9:59 am
  • Hi folks, I'm looking for clarification on the index "-nocommit" option: The description says: "do the commits once and for all the reducers in one go (optional)", which sounds unintuitive. The ...
    Joseph Naegele
    May 26, 2016 at 2:40 pm
    May 27, 2016 at 1:15 pm
  • Hi - I'm trying to make a new indexer plugin for Nutch, but I'm having trouble with the classpath. The error that I'm getting is: Error: java.lang.ClassNotFoundException: org.jaxen.JaxenException My ...
    Joseph Obernberger
    May 26, 2016 at 2:28 pm
    May 31, 2016 at 8:19 pm
  • Hi, I'm running Nutch 1.9 on Hadoop & YARN, 3 nodes. Is there a guide anywhere to an optimized configuration so that Nutch runs as efficiently as possible? This is my current nutch-site: <?xml ...
    Chaushu, Shani
    May 26, 2016 at 1:08 pm
    May 26, 2016 at 1:08 pm
  • I am trying to crawl a single site and have used db.ignore.external.links=true flag. But it seems to fail because it will crawl sites with a different country extension so for example: if the seed is ...
    Jean Vence
    May 25, 2016 at 9:45 am
    May 31, 2016 at 8:33 pm
  • Hi, I've just seen on a website which tracks bots that "Tarantula", our Nutch 1.11 based crawler, is being classified as not obeying robots.txt. What's the solution?
    BlackIce
    May 24, 2016 at 10:17 pm
    May 27, 2016 at 9:06 am
  • In April 2015 Google rolled out their mobile-friendly update <https://webmasters.googleblog.com/2015/04/rolling-out-mobile-friendly-update.html> which boosts the ranking of mobile-friendly pages on ...
    Fengtan
    May 24, 2016 at 3:20 am
    May 24, 2016 at 4:05 pm
  • I am using master branch, solr is version 6 and in cloud configuration. I am using 'cloud' for solr.server.type in nutch-site.xml. does this make sense to anybody? 16/05/23 13:52:27 INFO ...
    Kaveh minooie
    May 23, 2016 at 9:00 pm
    May 24, 2016 at 12:03 am
  • Dear all, it is my pleasure to announce that Thamme Gowda N. has joined us as committer and member of the Nutch PMC. Congratulations on your new role within the Apache Nutch community! Thamme, would ...
    Sebastian Nagel
    May 22, 2016 at 8:02 pm
    May 24, 2016 at 9:48 am
  • Dear all, on behalf of the Nutch PMC it is my pleasure to announce that Karanjeet Singh has joined the Nutch team as committer and PMC member. Karanjeet, would you mind to introduce yourself and tell ...
    Sebastian Nagel
    May 22, 2016 at 7:52 pm
    May 24, 2016 at 9:49 am
  • Hi, Is there a possibility for the "headings" plug-in to define the field where the data should be stored? We have wildcard fields defined in the schema.xml and it would be nice if we could use such ...
    Jigal van Hemert | alterNET internet BV
    May 20, 2016 at 7:34 am
    May 25, 2016 at 9:32 am
  • Is there any REST client skeleton in Java with all the required steps up to a full crawl? Or can I send a job with multiple values in the type field (probably not)?
    Eyal
    May 18, 2016 at 8:31 pm
    May 18, 2016 at 8:31 pm
  • Was that implemented? Currently GET /config/default just contains the names of those configuration files
    Eyal
    May 18, 2016 at 6:46 pm
    May 18, 2016 at 6:46 pm
  • Hi, I have crawled PDFs using Nutch 1.7. I found that "content" field has no line breaks. It grabbed all the paragraphs in the PDF as one aggregated paragraph without line breaks. Is it possible to ...
    A Laxmi
    May 18, 2016 at 5:46 pm
    May 18, 2016 at 6:20 pm
  • Hi, I followed this tutorial and now I get this when I try to inject. I didn't quite understand from the web what my next steps are here: 2016-05-17 19:46:12,986 INFO crawl.InjectorJob - InjectorJob ...
    Eyal
    May 17, 2016 at 7:49 pm
    May 19, 2016 at 8:21 am
  • Hi folks, Would anyone be willing to share a few pros/cons of using many nodes vs. 1 very powerful machine for large-scale crawling? Of course many advantages and disadvantages overlap with Hadoop ...
    Joseph Naegele
    May 16, 2016 at 6:40 pm
    May 17, 2016 at 8:32 am
  • Hi Folks, I recently worked with Infra to make Docker images available for our community. https://hub.docker.com/r/apache/nutch/ Master branch will always be latest, with the Nutch 2.X Cassandra and ...
    Lewis John Mcgibbney
    May 16, 2016 at 4:27 am
    May 16, 2016 at 4:35 am
  • Hi All, I know this has probably been asked many times but it is easier to ask again than search the entire archive. I would like to create a search and have the results display in a similar fashion ...
    Sheon banks
    May 12, 2016 at 5:03 am
    May 12, 2016 at 6:36 am
  • Hi, Any expected release date for Nutch 1.12? Really looking forward to the new feature: expose Tika support for Boilerpipe. Thanks! AL
    A Laxmi
    May 10, 2016 at 5:04 pm
    May 11, 2016 at 3:07 pm
  • I am trying Nutch for the first time. I created an automated docker setup to load Nutch 2 + Hbase (i had tried cassandra but could not get it to work so i thought i start with Hbase to give it a try) ...
    Diego gullo
    May 7, 2016 at 8:41 am
    May 16, 2016 at 7:32 pm
  • Hi folks, I'm using Nutch 1.11. Is it possible to implement plugin instance startUp/shutDown methods for normal extension points? This would allow for cleaning up resources at the end of a plugin ...
    Joseph Naegele
    May 6, 2016 at 2:04 pm
    May 11, 2016 at 9:03 am
  • Hi, (a) Is it possible to crawl URL of a Zip file using Nutch and index in Solr? (pls see example below) (b) Also, if a zip file URL has PDF files in them, is it possible to use Nutch to crawl the ...
    A Laxmi
    May 6, 2016 at 2:00 am
    May 9, 2016 at 4:33 pm
  • Hi Folks, A heads up about my presentation at ApacheCon Big Data next week in Vancouver, BC. I will be giving a presentation titled "Experiences Using Apache HTrace (Incubating) in Distributed Web ...
    Lewis John Mcgibbney
    May 5, 2016 at 5:03 am
    May 5, 2016 at 5:03 am
  • Hi Bin, Hope you are doing well! Please see response below. Mike Joyce and I were previously working on the following (currently stalled): 1. Upgrade entire MR API to 'New' MR API within master ...
    Lewis John Mcgibbney
    May 3, 2016 at 8:42 pm
    May 3, 2016 at 8:42 pm
  • Hi Yulio, Correct. OK. It is worth stating that the LinkDB [0] is a data structure which maintains an inverted link map, listing incoming links for each URL. This is not generated until after an initial ...
    Lewis John Mcgibbney
    May 3, 2016 at 8:14 pm
    May 3, 2016 at 8:14 pm
  • Hello - I'm working with nutch 2.3.1 with HBase for the webpage table. I have all the phases (inject, generate, fetch, parse, and updatedb) working fine. Nutch is a crawling beast! On our cluster, ...
    Joseph Obernberger
    May 3, 2016 at 1:04 pm
    May 20, 2016 at 11:37 pm
  • Hi there, Is there a state of the art visualization tool that is Nutch friendly? I am planning to get the crawldb information into a better format that can be digested by Neo4j or Gephi for ...
    Bin Wang
    May 2, 2016 at 7:26 pm
    May 3, 2016 at 8:43 pm
  • Hi, I want to upgrade from Nutch 1.9 to Nutch 1.11. I saw that in the bin/crawl script there is no solrindex step. Do I need to run the Solr index command separately after the whole crawl is complete? ...
    Chaushu, Shani
    May 2, 2016 at 12:47 pm
    May 4, 2016 at 2:06 pm
  • Hi Yulio, Marcus wrote the MimeAdaptiveFetchSchedule [0] implementation for exactly this purpose. You can utilize it as per [1] [0] ...
    Lewis John Mcgibbney
    May 1, 2016 at 7:40 pm
    May 1, 2016 at 9:59 pm
  • Hi. I'm using Nutch 1.9 with Solr 4.10 in a local environment. I need a way to prioritize some links in the fetching steps, by filtering the new links identified in the last crawls by some ...
    Yulio Aleman Jimenez
    Apr 29, 2016 at 8:47 pm
    Apr 29, 2016 at 8:47 pm
  • Here's an odd one (Nutch 1.11): I haven't tested this with other extension points, but if you extend or depend on the "protocol-http" plugin in a new plugin, the name of the new plugin is significant ...
    Joseph Naegele
    Apr 26, 2016 at 1:59 pm
    Apr 27, 2016 at 6:53 am
  • Hi folks, I'm using Nutch 1.11. What I'd like to do is use parse-tika for HTML and maybe a select few other content types, but nothing else. This doesn't appear to be possible without making changes ...
    Joseph Naegele
    Apr 26, 2016 at 1:41 pm
    Apr 26, 2016 at 1:41 pm
  • Hello. I have a problem using the Solr indexer on Nutch 1.11 in deploy mode. I got some error messages like the pictures below: error1 error2 I didn't get any errors in the logs. I'm so confused because when I ...
    Tkg_cangkul
    Apr 24, 2016 at 4:28 pm
    Apr 27, 2016 at 2:24 am
  • Hi, I am trying to use Solr as the backend in Nutch 2.3.1. This is my config in gora.properties: gora.datastore.default=org.apache.gora.solr.store.SolrStore gora.solrstore.solr.url=http://localhost:8983/solr ...
    Tkg_cangkul
    Apr 21, 2016 at 9:29 am
    Apr 28, 2016 at 2:53 pm
  • Hi, I've tried to build Nutch 2.3.1 with HBase 0.98 and it succeeded. Now I want to try building Nutch without HBase, so that the crawl results are not stored in HBase. I want to store them in HDFS. Is it ...
    Tkg_cangkul
    Apr 21, 2016 at 6:17 am
    Apr 21, 2016 at 5:40 pm
  • Dear All, I use Nutch 2.3.1 with HBase. How can I find a command like this in Nutch 2.3.1: "bin/nutch dump -outputDir DATA_DUMP -segment TestCrawl/segments -mimetype image/jpeg image/png ...
    Nana Pandiawan
    Apr 21, 2016 at 3:45 am
    Apr 21, 2016 at 5:36 pm
Group Overview
group: user @
categories: nutch, lucene
discussions: 8,906
posts: 32,231
users: 3,267
website: nutch.apache.org

Top users

  • Markus Jelsma: 1638 posts
  • Lewis Mcgibbney: 1574 posts
  • Andrzej Bialecki: 886 posts
  • Julien Nioche: 709 posts
  • Stefan Groschupf: 375 posts
  • Dennis Kubes: 337 posts
  • Sebastian Nagel: 329 posts
  • Ogjunk-nutch: 289 posts
  • Mattmann, Chris A: 276 posts
  • Tejas Patil: 266 posts
  • Dogacan Guney: 256 posts
  • Alxsss: 236 posts
  • MilleBii: 213 posts
  • Bai Shen: 183 posts
  • Nutch-dev: 177 posts
  • Alexander Aristov: 161 posts
  • Ken Krugler: 161 posts
  • Remi tassing: 160 posts
  • Jérôme Charron: 157 posts
  • Raghavendra Prabhu: 146 posts