
On 02.02.2011 at 17:04, Tim Pease wrote:

On Feb 2, 2011, at 5:18 AM, David Saile wrote:

Hi all,

I have a question concerning updating a site's score in Nutch 1.2.

In the reduce method of org.apache.nutch.crawl.CrawlDbReducer I found a call to
scfilters.updateDbScore((Text)key, oldSet ? old : null, result, linkList);

During debugging, I discovered that this method is executed in the org.apache.nutch.scoring.opic.OPICScoringFilter class. The code for this method is the following:
/** Increase the score by a sum of inlinked scores. */
public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List inlinked) throws ScoringFilterException {
    float adjust = 0.0f;
    for (int i = 0; i < inlinked.size(); i++) {
        CrawlDatum linked = (CrawlDatum)inlinked.get(i);
        adjust += linked.getScore();
    }
    if (old == null) old = datum;
    datum.setScore(old.getScore() + adjust);
}

To my understanding, this code increases a site's score based on its inlinks every time the site is crawled. So even if the site has not been modified and no new inlink has been discovered, the site's score will increase.

Is my understanding of this mechanism correct?
If so, could anyone explain to me why a site's score is increased in any case? I would expect it to change only if either its content has changed or a new inlink has been discovered.
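
To make the question concrete, here is a minimal sketch that applies the same arithmetic as the quoted method several times with an unchanged set of inlinks. The Datum class is a hypothetical stand-in for CrawlDatum that models only the score, and the starting values (1.0 for the page, 0.25 per inlink) are made up for illustration; none of this is Nutch's actual pipeline.

```java
import java.util.Arrays;
import java.util.List;

class RepeatedUpdateDemo {
    /** Hypothetical stand-in for Nutch's CrawlDatum; only the score is modelled. */
    static class Datum {
        private float score;
        Datum(float score) { this.score = score; }
        float getScore() { return score; }
        void setScore(float score) { this.score = score; }
    }

    /** Same arithmetic as the updateDbScore method quoted above. */
    static void updateDbScore(Datum old, Datum datum, List<Datum> inlinked) {
        float adjust = 0.0f;
        for (Datum linked : inlinked) {
            adjust += linked.getScore();
        }
        if (old == null) old = datum;
        datum.setScore(old.getScore() + adjust);
    }

    /** Applies the update `rounds` times; the inlinks never change. */
    static float scoreAfter(int rounds) {
        List<Datum> inlinks = Arrays.asList(new Datum(0.25f), new Datum(0.25f));
        Datum page = new Datum(1.0f);
        for (int i = 0; i < rounds; i++) {
            Datum result = new Datum(0.0f);       // overwritten by the update
            updateDbScore(page, result, inlinks); // old = page, datum = result
            page = result;
        }
        return page.getScore();
    }

    public static void main(String[] args) {
        for (int rounds = 0; rounds <= 3; rounds++) {
            System.out.println(rounds + " update(s): " + scoreAfter(rounds));
        }
    }
}
```

In this toy setup the score grows by 0.5 on every update, although neither the page nor its inlinks have changed.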
Your observations are correct. We recently ran into this exact same issue and have determined that the OPICScoringFilter is not suitable for crawls where pages will be re-fetched / re-parsed. The page score is increased each time the page is fetched, eventually resulting in a score of Infinity.

The "Online Page Importance Computation" (OPIC) score algorithm is described in this paper => http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html

The purpose of the algorithm is that you do not have to maintain the entire link graph in memory to compute the score imparted to inlinks and outlinks. The downside is that you cannot determine whether a page's score has already been included in the outlinks to another page. Hence the infinite score growth you have observed.

This behavior only appears if you are re-fetching / re-parsing pages.
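
As a rough illustration of how a float score can end up at Infinity, here is a simplified model (an assumption for illustration, not Nutch code): two hypothetical pages, A and B, each with a single outlink to the other, so each passes its full score along. On every re-fetch round each page adds the other's score to its own, following the "old score + sum of inlinked scores" rule.

```java
class OpicFeedbackDemo {
    /**
     * Returns the first round at which page A's score is no longer finite,
     * or -1 if that does not happen within maxRounds.
     */
    static int roundsUntilInfinite(float start, int maxRounds) {
        float a = start;
        float b = start;
        for (int round = 1; round <= maxRounds; round++) {
            float nextA = a + b; // A re-counts B's whole score
            float nextB = b + a; // B re-counts A's whole score
            a = nextA;
            b = nextB;
            if (Float.isInfinite(a)) {
                return round;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Infinite after round: " + roundsUntilInfinite(1.0f, 1000));
    }
}
```

In this model both scores double every round, so starting from 1.0 the float overflows at round 128. Real link graphs are less extreme, but the same re-counting is at work.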

Blessings,
TwP
Thank you very much for your reply, Tim!

Is it correct to assume that you could make the OPIC score algorithm more precise by only updating the score in two cases:

1) If a site has a modified outlink (i.e. the outlink was added or deleted since the last fetch), update the score of the target-site of this outlink.

2) If a site's score has changed since the last fetch, you have to update the score of all targets of outlinks on this site.

(given the case you actually had the required information at hand)?
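
To sketch what I mean: the two cases could be handled by remembering, per link, the contribution that was last passed on, and applying only the difference on a re-fetch. The class, its maps and the "source -> target" key are hypothetical bookkeeping invented for this example; nothing like this exists in Nutch as far as I know, and it assumes the per-link history can be stored.

```java
import java.util.HashMap;
import java.util.Map;

class DeltaScoreDemo {
    /** Last contribution recorded per "source -> target" link (hypothetical). */
    private final Map<String, Float> lastContribution = new HashMap<>();
    /** Current score per target URL. */
    private final Map<String, Float> scores = new HashMap<>();

    float score(String url) {
        return scores.getOrDefault(url, 0.0f);
    }

    /**
     * Records what `source` currently passes to `target`. A contribution of 0
     * means the outlink was removed. Only the change relative to the previously
     * recorded value is applied, so an unchanged link adds nothing on re-fetch.
     */
    void updateLink(String source, String target, float contribution) {
        String link = source + " -> " + target;
        float previous = lastContribution.getOrDefault(link, 0.0f);
        float delta = contribution - previous;
        if (delta == 0.0f) {
            return; // neither the outlink nor the source's score changed
        }
        scores.merge(target, delta, Float::sum);
        if (contribution == 0.0f) {
            lastContribution.remove(link);
        } else {
            lastContribution.put(link, contribution);
        }
    }

    public static void main(String[] args) {
        DeltaScoreDemo db = new DeltaScoreDemo();
        db.updateLink("a", "t", 0.5f);  // new outlink (case 1)
        db.updateLink("a", "t", 0.5f);  // re-fetch, nothing changed
        db.updateLink("a", "t", 0.75f); // source score changed (case 2)
        System.out.println(db.score("t"));
    }
}
```

The obvious cost is one stored value per link, which is the kind of state OPIC is designed to avoid keeping.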

Cheers
David

Discussion Overview
group: user @
categories: nutch, lucene
posted: Feb 2, '11 at 12:19p
active: Feb 7, '11 at 7:02a
posts: 6
users: 3
website: nutch.apache.org