Thanks for pointing me to that information.

However, the OPIC algorithm seems more suitable for my needs, as it produces scores without the need to compute an entire WebGraph.

I think I still don't understand the nature of the problem with the OPIC algorithm. It seems to me that the problem Tim described, scores growing toward infinity, is avoided in the OPIC variant for dynamic graphs, where the score is reset after a certain time window.

Inspecting the Nutch code, I could not find any mechanism to start a new time window. Was Nutch using the algorithm for static graphs prior to Dennis' new scoring tools?

Thanks for all your help!
David



On 03.02.2011 at 14:10, Julien Nioche wrote:
Dennis' new scoring tools have been designed to replace the OPIC
implementation. See http://wiki.apache.org/nutch/NewScoring and
http://wiki.apache.org/nutch/NewScoringIndexingExample

HTH

Julien

On 3 February 2011 12:40, David Saile wrote:


On 02.02.2011 at 17:04, Tim Pease wrote:
On Feb 2, 2011, at 5:18 AM, David Saile wrote:

Hi all,

I have a question concerning updating a site's score in Nutch 1.2.

In the reduce method of org.apache.nutch.crawl.CrawlDbReducer I found a call to
    scfilters.updateDbScore((Text)key, oldSet ? old : null, result, linkList);
During debugging, I discovered that this method is executed in the
org.apache.nutch.scoring.opic.OPICScoringFilter class. The code for this
method is the following:
    /** Increase the score by a sum of inlinked scores. */
    public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
        List inlinked) throws ScoringFilterException {
      float adjust = 0.0f;
      for (int i = 0; i < inlinked.size(); i++) {
        CrawlDatum linked = (CrawlDatum)inlinked.get(i);
        adjust += linked.getScore();
      }
      if (old == null) old = datum;
      datum.setScore(old.getScore() + adjust);
    }

To my understanding, this code would increase a site's score based on
its inlinks every time the site is crawled. So even if the site has not
been modified and no new inlink has been discovered, the site's score will
increase.
Is my understanding of this mechanism correct?
If so, could anyone explain to me why a site's score is increased in any
case? I would expect it to change only if either its content has changed or
a new inlink has been discovered.
Your observations are correct. We recently ran into this exact same issue
and have determined that the OPICScoringFilter is not suitable for crawls
where pages will be re-fetched / re-parsed. The page score will continually
be increased each time it is fetched, eventually resulting in a score of
Infinity.
The "Online Page Importance Computation" (OPIC) score algorithm is
described in this paper =>
http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html
The purpose of the algorithm is that you do not have to maintain the
entire link graph in memory to compute the score imparted to inlinks and
outlinks. The downside is that you cannot determine if a page's score has
already been included in the outlinks to another page. Hence the infinite
score growth you have observed.
This behavior only appears if you are re-fetching / re-parsing pages.
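
[Editor's illustration] The growth described above can be reproduced with a small stand-alone sketch. It is not Nutch code: the class OpicGrowthDemo and its methods are hypothetical, and it assumes that every crawl cycle re-delivers the same inlink contributions to the page, which is what re-parsing the linking pages would cause. Only the additive update itself is taken from the quoted updateDbScore.

```java
import java.util.List;

/** Hypothetical demo class; it does not use or depend on any Nutch API. */
public class OpicGrowthDemo {

    /** Same additive rule as the quoted updateDbScore: old score plus the sum of inlinked scores. */
    public static float updateScore(float oldScore, List<Float> inlinkScores) {
        float adjust = 0.0f;
        for (float s : inlinkScores) {
            adjust += s;
        }
        return oldScore + adjust;
    }

    /** Applies the update once per simulated crawl cycle (assumption: the inlink set never changes). */
    public static float simulate(float initialScore, List<Float> inlinkScores, int cycles) {
        float score = initialScore;
        for (int i = 0; i < cycles; i++) {
            score = updateScore(score, inlinkScores);
        }
        return score;
    }

    public static void main(String[] args) {
        List<Float> inlinks = List.of(0.5f, 0.25f);
        for (int cycles = 0; cycles <= 3; cycles++) {
            System.out.println(cycles + " cycles -> " + simulate(1.0f, inlinks, cycles));
        }
    }
}
```

With these inputs the score rises by 0.75 per cycle (1.0, 1.75, 2.5, 3.25) even though nothing about the page or its inlinks changed, because the update keeps no record of which contributions were already counted.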

Blessings,
TwP
Thank you very much for your reply, Tim!

Is it correct to assume that you could make the OPIC score algorithm more
precise by only updating the score in two cases:

1) If a site has a modified outlink (i.e. the outlink was added or
deleted since the last fetch), update the score of the target site of this
outlink.

2) If a site's score has changed since the last fetch, update the
score of all targets of outlinks on this site.

(given that you actually had the required information at hand)?
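
[Editor's illustration] The two cases above amount to applying only the change in each link's contribution. A minimal sketch of that idea follows, assuming the last contribution of every link can be stored somewhere. The class DeltaScoreTracker and its methods are hypothetical and not part of Nutch, and keeping per-link state is exactly the bookkeeping OPIC was designed to avoid.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a target's score changes only when a link's contribution changes. */
public class DeltaScoreTracker {
    private final Map<String, Float> scores = new HashMap<>();
    // Last contribution applied per link, keyed by "source -> target" (assumed to be available).
    private final Map<String, Float> lastContribution = new HashMap<>();

    public float getScore(String url) {
        return scores.getOrDefault(url, 0.0f);
    }

    /** Covers a newly added outlink and a changed source score: only the difference is applied. */
    public void setContribution(String source, String target, float contribution) {
        String link = source + " -> " + target;
        float previous = lastContribution.getOrDefault(link, 0.0f);
        float delta = contribution - previous;
        if (delta != 0.0f) {
            scores.merge(target, delta, Float::sum);
            lastContribution.put(link, contribution);
        }
    }

    /** Covers a deleted outlink: the contribution applied earlier is taken back. */
    public void removeLink(String source, String target) {
        Float previous = lastContribution.remove(source + " -> " + target);
        if (previous != null) {
            scores.merge(target, -previous, Float::sum);
        }
    }
}
```

Re-submitting an unchanged contribution leaves the target's score untouched, so re-fetching a page whose links and score have not changed no longer inflates anything.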

Cheers
David



--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Discussion Overview
group: user @
categories: nutch, lucene
posted: Feb 2, '11 at 12:19p
active: Feb 7, '11 at 7:02a
posts: 6
users: 3
website: nutch.apache.org
