crawl stops at depth 1
Hi

I'm trying to get a Nutch crawl to work, but it keeps stopping at depth 1 even
though there should be more data to fetch. I can download a list of URLs
without any problem using FreeGenerator, but the recursive crawl is not
working for me.

I have crawl-urlfilter.txt set up to accept any URL, and the plugins are
configured to use this filter:

<name>plugin.includes</name>
<value>protocol-http|urlfilter-(crawl|regex)|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>
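For context, a recursive crawl like the one described here is normally launched
with Nutch's one-step crawl tool; a typical invocation, with placeholder
directory names, looks like:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000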

The only other Nutch configs that I've changed are the robot settings.

If I inspect the crawldb after a run, I see that it has fetched the 3 seed pages
and refused to fetch anything else:

TOTAL urls: 248
retry 0: 248
min score: 0.0090
avg score: 0.03530645
max score: 2.029
status 1 (db_unfetched): 245
status 2 (db_fetched): 3
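
Stats like these come from Nutch's readdb tool, e.g. (assuming the crawl
output lives in crawl/):

    bin/nutch readdb crawl/crawldb -stats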

How can I get Nutch to fetch the rest of the URLs?

Thanks in advance for your help,

Barry

PS: here's my crawl-urlfilter.txt:
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*apache.org/

# skip everything else
#-.
+.*
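
A filter file like this can be sanity-checked from the command line. Assuming a
Nutch checkout that ships the URLFilterChecker tool (which reads URLs on stdin
and prints a + or - verdict for each), something like this works:

    echo "http://lucene.apache.org/nutch/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined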


  • Alxsss at Feb 14, 2008 at 6:27 pm
    How does this FreeGenerator work?

    Thanks.
    Alex.
  • Barry Haddow at Feb 14, 2008 at 6:41 pm

    On Thursday 14 February 2008, alxsss@aim.com wrote:
    > How does this FreeGenerator work?

    nutch freegen
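
    FreeGenerator (org.apache.nutch.tools.FreeGenerator) builds fetch segments
    directly from a plain-text list of URLs, skipping the crawldb generate
    step. A typical invocation, with placeholder paths, looks like this; the
    optional -filter and -normalize switches run the configured URL filters
    and normalizers over the input first:

        bin/nutch freegen urls/ segments/ -filter -normalize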
  • Barry Haddow at Feb 18, 2008 at 6:18 pm
    Hi

    I think I've solved the problem. When I turned up the logging, I found that
    the Generator's FetchSchedule was rejecting all the candidate URLs because
    they had a fetch time in the future. This was because the clocks on the
    slaves were all slightly ahead of the master's clock. So the moral of the
    story is: make sure you synchronise the clocks on your cluster, otherwise
    Nutch may fail.
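
    A minimal way to do that, assuming NTP is installed and master.example.com
    stands in for your master node, is a one-shot sync on each slave (or,
    better, run ntpd on every node so they stay synchronised):

        # one-shot clock sync against the master (hypothetical hostname)
        sudo ntpdate master.example.com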

    regards
    Barry

Discussion Overview

group: nutch-user @ lucene
posted: Feb 14, 2008 at 4:31 pm
active: Feb 18, 2008 at 6:18 pm
posts: 4
users: 2 (Barry Haddow: 3 posts, Alxsss: 1 post)
website: nutch.apache.org
