Nutch user mailing list archive, March 2011
Can't Crawl Through Home Page, but crawling through inner page
I am using Nutch 1.1 for crawling.
I am able to crawl many sites without any issue, but when I crawl
www.magicbricks.com
the crawl stops at depth=1.
I am using "bin/nutch crawl urls/magicbricks/url.txt -dir crawl/magicbricks
-threads 10 -depth 3 -topN 10".
But if I put links like "http://www.magicbricks.com/bricks/cityIndex.html"
or "http://www.magicbricks.com/bricks/propertySearch.html" in
urls/magicbricks/url.txt instead, it crawls without any issue.
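
For illustration, a minimal sketch of the two seed files (assuming the failing case seeds only the home page URL). The seed that stops at depth=1:

http://www.magicbricks.com/

and the seed that crawls fine:

http://www.magicbricks.com/bricks/cityIndex.html
http://www.magicbricks.com/bricks/propertySearch.html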

In robots.txt I have allowed my crawler, named Propertybot, full access to
crawl, as can be seen at http://magicbricks.com/robots.txt.

Please suggest what the reasons for this might be.

Thanks in advance
Hemant Verma



  • Julien Nioche at Mar 1, 2011 at 2:50 pm
    The root page redirects to http://www.m.magicbricks.com/mbs/wapmb.
    Does your URLFilter configuration allow that host?
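
    As a minimal sketch (assuming the default regex-urlfilter plugin is in use), an
    accept rule covering both the original host and the redirect target host could
    look like this in conf/regex-urlfilter.txt:

    +^http://([a-z0-9]*\.)*magicbricks\.com/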


    --

    Open Source Solutions for Text Engineering

    http://digitalpebble.blogspot.com/
    http://www.digitalpebble.com
  • Alxsss at Mar 1, 2011 at 4:47 pm
    For some reason Nutch only starts to crawl inner links at depth 4 for domains with redirects.
  • Julien Nioche at Mar 1, 2011 at 5:34 pm
    This is the default behaviour: redirects are treated like normal links,
    i.e. they are fetched in subsequent rounds.
    This can be changed using the following parameter:

    <property>
    <name>http.redirect.max</name>
    <value>0</value>
    <description>The maximum number of redirects the fetcher will follow when
    trying to fetch a page. If set to negative or 0, fetcher won't immediately
    follow redirected URLs, instead it will record them for later fetching.
    </description>
    </property>
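
    As a minimal sketch (assuming the override is placed in conf/nutch-site.xml),
    making the fetcher follow up to two redirects within the same round would
    look like:

    <property>
    <name>http.redirect.max</name>
    <value>2</value>
    <description>Follow up to 2 redirects immediately instead of recording
    the redirect targets for a later fetch round.</description>
    </property>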

    Julien
  • Patricio Galeas at Mar 1, 2011 at 11:06 pm
    Hello,

    I have some questions related to the Nutch statistics.
    I ran five crawls with topN=12500 and depth=2,4,7,10,11, with the following results:
    https://spreadsheets.google.com/ccc?key=0AvF8Ig446DzEdGNxaDNLLTgtUzdoTVNzQTJIcVFSZXc&hl=es#gid=0


    Why is the number of TOTAL URLs not equal to (db_fetched + db_unfetched +
    db_gone)?

    I expected to get about 125,000 TOTAL URLs (using topN=12500, depth=10),
    but I got only about 34,000 URLs (about 27% of the expected total). Is this
    difference due to the regex-urlfilters only?

    When db_gone decreases (for example, comparing crawl2 with crawl3), does that
    mean that some URLs which were unavailable in the past have now been fetched?

    Thanks for your help!

    Regards
    Patricio
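
    For reference, a minimal sketch of how these counts are typically obtained
    (assuming a crawl directory layout like the one used earlier in the thread):

    bin/nutch readdb crawl/magicbricks/crawldb -stats

    The -stats output lists TOTAL urls together with per-status counts; redirect
    statuses (db_redir_temp, db_redir_perm), if present, are counted in the total
    but not in db_fetched, db_unfetched or db_gone.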
  • Anurag at Mar 2, 2011 at 6:49 am
    depth=10 does not imply that total URLs = 12500*10 = 125,000.
    Depth means that crawling is performed recursively over the newly discovered
    and previously known URLs for 10 rounds; the number of URLs actually fetched
    depends on the content of the web pages.
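
    To make the bound explicit (a rough sketch of the arithmetic): topN=12500 only
    caps each generate/fetch round, so depth=10 gives an upper bound of
    10 x 12,500 = 125,000 fetched URLs. Any round that discovers fewer eligible
    outlinks (for example after regex-urlfilter.txt filtering and crawldb
    deduplication) fetches fewer, which is why the observed total (about 34,000
    here) can be far below the bound.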

    --
    Kumar Anurag


  • Hemantverma09 at Mar 2, 2011 at 1:22 pm
    Thanks to all

    I made the following changes and it worked :-)

    crawl-urlfilter.txt
    # skip URLs containing certain characters as probable queries, etc.
    #-[?*!@=]

    # accept hosts in MY.DOMAIN.NAME
    +^*magicbricks.com*


    regex-urlfilter.txt
    # skip URLs containing certain characters as probable queries, etc.
    #-[?*!@=]

    # accept hosts in MY.DOMAIN.NAME
    -^*magicbricks.com*
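
    For comparison, a minimal sketch of the domain-accept rule as it appears in
    Nutch's default filter configuration (a leading "+" accepts and a leading "-"
    rejects matching URLs):

    +^http://([a-z0-9]*\.)*magicbricks.com/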


    nutch-default.xml

    http.redirect.max

    The maximum number of redirects the fetcher will follow when
    trying to fetch a page. If set to negative or 0, fetcher won't immediately
    follow redirected URLs, instead it will record them for later fetching.




    Thanks Again
    Hemant Verma

