Help on this would be greatly appreciated!

I am trying to modify Nutch so that recrawling becomes more incremental. This requires a more iterative algorithm like OPIC, instead of creating an entire WebGraph.


Begin forwarded message:
From: David Saile <david@uni-koblenz.de>
Date: 4 February 2011 16:03:41 CET
To: user@nutch.apache.org
Subject: Re: ScoringFilter always increasing a fetched site's score

Thanks for pointing me to that information.

However, the OPIC-algorithm seems more suitable for my needs, as it creates scores w/o the need to compute an entire WebGraph.

I think I still don't understand the nature of the problem with the OPIC algorithm. It seems to me that the problem Tim described, of scores converging to an infimum, is avoided in the OPIC algorithm for dynamic graphs, where the score is reset after a certain time window.

Inspecting the Nutch code, I could not find any mechanism to start a new time window. Was Nutch using the algorithm for static graphs prior to Dennis' new scoring tools?
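
To make the time-window idea above concrete, here is a minimal sketch in plain Java. It is not code from Nutch and not the algorithm from the OPIC paper; the class `WindowedScoreSketch` and all of its members are hypothetical names chosen for illustration. It only shows why resetting the accumulated score when a new window starts keeps the value bounded by what arrived within one window.

```java
/** Hypothetical illustration only; not part of Nutch or the OPIC paper. */
public class WindowedScoreSketch {
    private final long windowMillis;   // assumed window length
    private long windowStart;          // start time of the current window
    private float accumulated;         // contributions seen in the current window
    private float lastWindowScore;     // total recorded when the window last rolled over

    public WindowedScoreSketch(long windowMillis, long startMillis) {
        this.windowMillis = windowMillis;
        this.windowStart = startMillis;
    }

    /** Adds a contribution, starting a new window first if the old one has expired. */
    public void add(float contribution, long nowMillis) {
        if (nowMillis - windowStart >= windowMillis) {
            // Keep the finished window's total and reset the accumulator.
            lastWindowScore = accumulated;
            accumulated = 0.0f;
            windowStart = nowMillis;
        }
        accumulated += contribution;
    }

    public float currentWindowTotal() {
        return accumulated;
    }

    public float lastWindowScore() {
        return lastWindowScore;
    }

    public static void main(String[] args) {
        WindowedScoreSketch s = new WindowedScoreSketch(100L, 0L);
        s.add(0.5f, 10L);
        s.add(0.25f, 50L);
        s.add(1.0f, 120L); // window expired, accumulator is reset before adding
        System.out.println(s.lastWindowScore() + " " + s.currentWindowTotal());
    }
}
```

With such a reset, re-fetching the same pages in every window would reproduce roughly the same total each time instead of growing without limit.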

Thanks for all your help!

On 03.02.2011 at 14:10, Julien Nioche wrote:
Dennis' new scoring tools have been designed to replace the OPIC
implementation. See http://wiki.apache.org/nutch/NewScoring and



On 3 February 2011 12:40, David Saile wrote:

On 02.02.2011 at 17:04, Tim Pease wrote:
On Feb 2, 2011, at 5:18 AM, David Saile wrote:

Hi all,

I have a question concerning updating a site's score in Nutch 1.2.

In org.apache.nutch.crawl.CrawlDbReducer's reduce method I found a call
scfilters.updateDbScore((Text)key, oldSet ? old : null, result,
During debugging, I discovered that this method is executed in the
org.apache.nutch.scoring.opic.OPICScoringFilter class. The code for this
method is the following:
/** Increase the score by a sum of inlinked scores. */
public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
    List inlinked) throws ScoringFilterException {
  float adjust = 0.0f;
  // Sum the scores contributed by all inlinks seen in this update.
  for (int i = 0; i < inlinked.size(); i++) {
    CrawlDatum linked = (CrawlDatum)inlinked.get(i);
    adjust += linked.getScore();
  }
  // A page without a previous entry starts from its own datum.
  if (old == null) old = datum;
  datum.setScore(old.getScore() + adjust);
}

To my understanding, this code would increase a site's score based on
its inlinks every time the site is crawled. So even if the site has
neither been modified nor gained a new inlink, the site's score will
still increase.
Is my understanding of this mechanism correct?
If so, could anyone explain to me why a site's score is increased in any
case? I would expect it to change only if either its content has changed or
a new inlink has been discovered.
Your observations are correct. We recently ran into this exact same issue
and have determined that the OPICScoringFilter is not suitable for crawls
where pages will be re-fetched / re-parsed. The page score will continually
be increased each time it is fetched eventually resulting in a score of
The "Online Page Importance Computation" (OPIC) score algorithm is
described in this paper =>
The purpose of the algorithm is that you do not have to maintain the
entire link graph in memory to compute the score imparted to inlinks and
outlinks. The downside is that you cannot determine if a page's score has
already been included in the outlinks to another page. Hence the infinite
score growth you have observed.
This behavior only appears if you are re-fetching / re-parsing pages.
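
The growth described above can be reproduced with a small standalone sketch. It does not use any Nutch classes; `OpicRefetchDemo`, `updateScore`, and `simulate` are hypothetical names, and the sketch assumes that the same inlink contributions are delivered again on every re-fetch cycle. It applies the additive rule quoted earlier in the thread (new score = old score + sum of inlinked scores) once per cycle.

```java
import java.util.Arrays;
import java.util.List;

/** Standalone illustration; mirrors the additive rule quoted in the thread. */
public class OpicRefetchDemo {

    /** new score = old score + sum of inlinked scores */
    public static float updateScore(float oldScore, List<Float> inlinkedScores) {
        float adjust = 0.0f;
        for (float s : inlinkedScores) {
            adjust += s;
        }
        return oldScore + adjust;
    }

    /** Applies the update once per simulated re-fetch cycle and records each result. */
    public static float[] simulate(float initialScore, List<Float> inlinkedScores, int cycles) {
        float[] history = new float[cycles];
        float score = initialScore;
        for (int i = 0; i < cycles; i++) {
            score = updateScore(score, inlinkedScores);
            history[i] = score;
        }
        return history;
    }

    public static void main(String[] args) {
        // Assumption: the same two inlinks contribute unchanged scores in every cycle.
        List<Float> inlinks = Arrays.asList(0.5f, 0.25f);
        System.out.println(Arrays.toString(simulate(1.0f, inlinks, 4)));
        // prints [1.75, 2.5, 3.25, 4.0]
    }
}
```

Nothing about the page or its inlinks changes between cycles, yet the score rises by the same amount each time, because the rule cannot tell that a contribution was already counted.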

Thank you very much for your reply, Tim!

Is it correct to assume that you could make the OPIC score algorithm more
precise by only updating the score in two cases:

1) If a site has a modified outlink (i.e. the outlink was added or
deleted since the last fetch), update the score of the target site of this
outlink.

2) If a site's score has changed since the last fetch, you have to
update the score of all targets of outlinks on this site.

(given the case you actually had the required information at hand)?
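
The two cases above could be sketched as a delta-based update, assuming the "required information" is a record of what each source page has already contributed to each target. This is a hypothetical illustration in plain Java, not Nutch code: `DeltaScoreSketch` and its methods are made-up names, and the even split of a page's score across its outlinks is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: apply only the change in a page's contribution on re-fetch. */
public class DeltaScoreSketch {
    private final Map<String, Float> scores = new HashMap<>();
    // source URL -> (target URL -> contribution already applied to that target)
    private final Map<String, Map<String, Float>> applied = new HashMap<>();

    public float score(String url) {
        return scores.getOrDefault(url, 0.0f);
    }

    public void setScore(String url, float value) {
        scores.put(url, value);
    }

    /** Re-processes a fetched page with its current set of outlinks. */
    public void refetch(String source, Set<String> outlinks) {
        Map<String, Float> previous = applied.computeIfAbsent(source, k -> new HashMap<>());
        // Assumption: the source's score is split evenly over its outlinks.
        float share = outlinks.isEmpty() ? 0.0f : score(source) / outlinks.size();

        // Case 1 (deleted outlink): withdraw what was contributed earlier.
        Iterator<Map.Entry<String, Float>> it = previous.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Float> e = it.next();
            if (!outlinks.contains(e.getKey())) {
                scores.merge(e.getKey(), -e.getValue(), Float::sum);
                it.remove();
            }
        }

        // Case 1 (added outlink) and case 2 (changed source score): apply the difference only.
        for (String target : outlinks) {
            float delta = share - previous.getOrDefault(target, 0.0f);
            if (delta != 0.0f) {
                scores.merge(target, delta, Float::sum);
                previous.put(target, share);
            }
        }
    }

    public static void main(String[] args) {
        DeltaScoreSketch g = new DeltaScoreSketch();
        g.setScore("a", 1.0f);
        Set<String> links = new HashSet<>(Arrays.asList("b", "c"));
        g.refetch("a", links);
        g.refetch("a", links); // unchanged page: no score is added again
        System.out.println(g.score("b") + " " + g.score("c"));
    }
}
```

The price is the extra bookkeeping per link, which is the kind of state OPIC was meant to avoid, so this is a trade-off rather than a free improvement.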

