Hi,
I have successfully configured Nutch 0.9; it crawls a number of
sites, and searching works properly afterwards.
However, I now want to crawl password-protected pages using Nutch.
Accessing those pages requires a valid user name and password, which
I have configured in my nutch-site.xml and httpclient-auth.xml.
However, it is not crawling them.
I have attached nutch-site.xml, httpclient-auth.xml and hadoop.log in
the Zip file for your reference. Kindly check and let me know what is
missing on my end.
CONFIGURATION:
nutch-2008-07-10_04-01-48.tar (downloaded from
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/, which contains
your patch for HttpAuthentication)

Windows XP
Cygwin
jdk1.6.0

Thanks in advance...
Please help....

Best regards,
Biswajit


  • Kunthar at Sep 15, 2008 at 12:58 pm
    Stop trolling me...




  • Susam Pal at Sep 15, 2008 at 1:04 pm
    Hi Biswajit,

    Could you please tell us how you added support for authentication
    in Nutch 0.9? Nutch 0.9 cannot do authentication properly by
    default. The authentication feature in Nutch 0.9 is buggy; this
    was fixed in this ticket:
    https://issues.apache.org/jira/browse/NUTCH-559

    The feature is documented here:
    http://wiki.apache.org/nutch/HttpAuthenticationSchemes

    The easiest way to use it is to check out the latest version of Nutch
    and build it as it contains the authentication feature. If you want to
    use it with Nutch 0.9, you have to download the latest patch present
    in the ticket page and apply it to the source code and build it. You
    might have to resolve some conflicts manually.

    I would suggest that you do not send the same mail multiple
    times. We have received the same mail from you 4 times. It takes
    some time for members to reply to a mail. :-)

    Regards,
    Susam Pal

  • Biswajit_rout at Sep 15, 2008 at 1:21 pm
    Hi Susam,

    In order to crawl password-protected pages, I am using
    nutch-2008-07-10_04-01-48.tar (downloaded from
    http://hudson.zones.apache.org/hudson/job/Nutch-trunk/, which contains your
    patch for HttpAuthentication).

    I have modified nutch-site.xml and httpclient-auth.xml.

    Please find the attached zip file, which contains
    nutch-site.xml and httpclient-auth.xml.

    Kindly provide me with a solution for this.

    Best regards,
    Biswajit


    http://www.nabble.com/file/p19492846/Nutch.zip Nutch.zip
    --
    View this message in context: http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19492846.html
    Sent from the Nutch - User mailing list archive at Nabble.com.
  • Susam Pal at Sep 15, 2008 at 5:49 pm
    The logs show that it is fetching http://localhost:8080/ but you have
    set credentials for 10.222.18.113:8080 which is never being fetched.
    So, no authentication takes place.

    Regards,
    Susam Pal

  • Biswajit_rout at Sep 16, 2008 at 8:03 am
    Hi Susam,

    The IP 10.222.18.113 is simply the IP address of my machine (localhost).
    I have now also changed http://localhost:8080/ to http://10.222.18.113:8080.
    However, there is no change: I am still not able to crawl the
    password-protected pages.

    Kindly assist me to resolve this issue.

    Thanks in advance...

    Best regards,
    Biswajit.




    http://www.nabble.com/file/p19507146/hadoop.log hadoop.log
  • Biswajit_rout at Sep 16, 2008 at 8:06 am
    I have also attached hadoop.log for your reference...


  • Susam Pal at Sep 16, 2008 at 8:08 am
    Hi Biswajit,

    The authscope specifies which IP address or domain name the
    credentials are used for. If you provide 10.222.18.113 in the
    authscope, the credentials will not be used for localhost, even
    though both refer to the same machine.
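
The point about authscope can be illustrated with a minimal Python sketch. This is a simplified model, not Nutch's or HttpClient's actual code: the `CREDENTIALS` table, the `credentials_for` helper and the user name and password are all hypothetical. It assumes that credentials are looked up by the host string and port exactly as they appear in the fetched URL, with no name resolution.

```python
# Illustrative model only: this is NOT Nutch's or HttpClient's code.
# It assumes credentials are looked up by the literal host string and
# port of the URL being fetched.
from urllib.parse import urlsplit

# Hypothetical credential table keyed by (host, port), standing in for
# the authscope entries of httpclient-auth.xml. Values are placeholders.
CREDENTIALS = {
    ("10.222.18.113", 8080): ("user", "secret"),
}


def credentials_for(url):
    """Return the (username, password) pair for url, or None if no scope matches."""
    parts = urlsplit(url)
    # No DNS resolution happens here: "localhost" and "10.222.18.113"
    # are different keys even if they name the same machine.
    return CREDENTIALS.get((parts.hostname, parts.port))


print(credentials_for("http://10.222.18.113:8080/dao/"))  # ('user', 'secret')
print(credentials_for("http://localhost:8080/dao/"))      # None
```

Under that assumption, the seed URL and the authscope host have to use the same spelling of the host, either both `localhost` or both `10.222.18.113`.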

    Please provide logs with DEBUG enabled.

    Regards,
    Susam Pal

  • Biswajit_rout at Sep 16, 2008 at 12:33 pm
    Hi Susam,

    Thanks for your immediate response...
    I am attaching the debug-enabled log file (debugenabled_hadoop.log).
    Kindly go through the file and let me know what is missing on my end...

    Best regards,
    Biswajit.


    http://www.nabble.com/file/p19510820/debugenabled_hadoop.log
    debugenabled_hadoop.log
  • Biswajit_rout at Sep 16, 2008 at 3:33 pm
    Hi Susam,

    Please find the latest log file (latest.log), which shows a different error.

    2008-09-16 20:46:16,102 DEBUG httpclient.Http - url: http://10.222.18.113:8080/robots.txt; status code: 404; bytes received: 985; Content-Length: 985
    2008-09-16 20:46:16,384 DEBUG httpclient.Http - url: http://10.222.18.113:8080/dao/; status code: 200; bytes received: 1941; Content-Length: 1941

    Thanks in advance...

    Best regards,
    Biswajit.


    http://www.nabble.com/file/p19514374/latest.log latest.log
  • Susam Pal at Sep 16, 2008 at 4:39 pm
    The latest log shows that the page from the URL:
    http://10.222.18.113:8080/dao/ has been fetched successfully.

    Regards,
    Susam Pal

    On Tue, Sep 16, 2008 at 3:33 PM, biswajit_rout wrote:
    Hi Susam,

    Please find the latest log file (latest.log), which shows a different error.

    2008-09-16 20:46:16,102 DEBUG httpclient.Http - url: http://10.222.18.113:8080/robots.txt; status code: 404; bytes received: 985; Content-Length: 985
    2008-09-16 20:46:16,384 DEBUG httpclient.Http - url: http://10.222.18.113:8080/dao/; status code: 200; bytes received: 1941; Content-Length: 1941

    Thanks in advance...

    Best regards,
    Biswajit.


    biswajit_rout wrote:
    Hi Susam,

    Thanks for your quick response.
    I am attaching the DEBUG-enabled log file (debugenabled_hadoop.log).
    Kindly go through it and let me know what is missing on my end.

    Best regards,
    Biswajit.


    Susam Pal wrote:
    Hi Biswajit,

    The authscope specifies which IP address or domain name the
    credentials are used for. If you provide 10.222.18.113 in the
    authscope, the credentials are not used for localhost, even though
    both refer to the same machine.
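
    As an illustration of the point above, here is a minimal sketch of an
    httpclient-auth.xml that lists both names, assuming the file format
    described on the HttpAuthenticationSchemes wiki page. The username and
    password are placeholders, and leaving out the realm attribute is an
    assumption; none of these values come from the attached files.

    ```xml
    <?xml version="1.0"?>
    <!-- Hypothetical credentials: replace with the real user name and password. -->
    <auth-configuration>
      <credentials username="crawler" password="secret">
        <!-- The host is compared with the host in the fetched URL as written,
             so each name used in the seed URLs needs its own authscope. -->
        <authscope host="10.222.18.113" port="8080"/>
        <authscope host="localhost" port="8080"/>
      </credentials>
    </auth-configuration>
    ```

    With a file like this, it should not matter whether the seed list says
    localhost or 10.222.18.113, as long as the port matches.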

    Please provide logs with DEBUG enabled.

    Regards,
    Susam Pal

    On Tue, Sep 16, 2008 at 1:33 PM, biswajit_rout wrote:
    Hi Susam,

    The IP 10.222.18.113 is simply the IP address of my own machine
    (localhost). I have now changed http://localhost:8080/ to
    http://10.222.18.113:8080, but with no result: I am still not able
    to crawl the password-protected pages.

    Kindly assist me to resolve this issue.

    Thanks in advance...

    Best regards,
    Biswajit.




    Susam Pal wrote:
    The logs show that it is fetching http://localhost:8080/ but you have
    set credentials for 10.222.18.113:8080 which is never being fetched.
    So, no authentication takes place.

    Regards,
    Susam Pal

    On Mon, Sep 15, 2008 at 1:20 PM, biswajit_rout wrote:
    Hi Susam,

    In order to crawl password-protected pages, I am using
    nutch-2008-07-10_04-01-48.tar (downloaded from
    http://hudson.zones.apache.org/hudson/job/Nutch-trunk/), which
    contains your patch for HttpAuthentication.

    I have modified nutch-site.xml and httpclient-auth.xml.

    Please find the attached zip file, which contains nutch-site.xml and
    httpclient-auth.xml.

    Kindly provide me with a solution for this.

    Best regards,
    Biswajit


    Susam Pal wrote:
    Hi Biswajit,

    Could you please tell us how you added support for authentication
    to Nutch 0.9? Nutch 0.9 cannot do authentication properly by
    default. The authentication feature is buggy in Nutch 0.9; this was
    fixed in the following ticket:
    https://issues.apache.org/jira/browse/NUTCH-559

    The feature is documented here:
    http://wiki.apache.org/nutch/HttpAuthenticationSchemes
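
    For orientation, a hedged sketch of the nutch-site.xml change that
    switches fetching to the protocol-httpclient plugin, which is the plugin
    named in the DEBUG log lines in this thread. Only the
    protocol-httpclient entry is the point here; the rest of the value is
    illustrative and should be copied from the plugin.includes default in
    your own nutch-default.xml.

    ```xml
    <!-- Sketch only: keep your existing plugin list and replace
         protocol-http with protocol-httpclient. -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
    </property>
    ```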

    The easiest way to use it is to check out the latest version of Nutch
    and build it, as it contains the authentication feature. If you want
    to use it with Nutch 0.9, you have to download the latest patch from
    the ticket page, apply it to the source code, and build it. You
    might have to resolve some conflicts manually.

    I would suggest that you do not send the same mail multiple
    times. We have received the same mail from you 4 times. It takes
    some time for members to reply to a mail. :-)

    Regards,
    Susam Pal

  • Biswajit_rout at Sep 16, 2008 at 5:25 pm
    But it is still not crawling the password-protected pages...

    Regards,
    Biswajit.


  • Susam Pal at Sep 16, 2008 at 5:36 pm
    The log file shows only one fetching line:

    2008-09-16 20:46:15,321 INFO fetcher.Fetcher - fetching http://10.222.18.113:8080/dao/

    This has been fetched successfully. There is no other page being
    fetched. Have you set up Nutch properly so that it can fetch all the
    pages you need? If it tries to fetch a page but fails due to
    authentication, then it is a problem with authentication.

    In this case, it is not even attempting to fetch those pages. So, the
    problem lies elsewhere. You need to first find out why it is fetching
    only one page and not others.

    Regards,
    Susam Pal

  • Biswajit_rout at Sep 18, 2008 at 1:10 pm
    Hi,

    There is nothing to crawl on the home page of
    http://10.222.18.113:8080/dao/.

    So this time I have crawled another site. I successfully crawled all
    the public pages but was not able to crawl the private pages.
    I have attached a log file (new.log). Can you please check it and let me
    know what needs to be done on my end?

    Best regards,
    Biswajit.


    http://www.nabble.com/file/p19552519/new.txt new.txt
  • Biswajit_rout at Sep 19, 2008 at 5:38 am
    Hi Susam,

    Please take a look at new.txt and suggest a solution. This time I
    have crawled another site. I am able to crawl all the public pages, but
    the password-protected pages are not being crawled...

    Best regards,
    Biswajit.


  • Biswajit_rout at Sep 19, 2008 at 5:38 am
    Hi Susam,

    Please take a look at the attached file (new.txt) and suggest a solution.
    This time I crawled another site. I am able to crawl all the public
    pages, but the password-protected pages are not being crawled.

    Best regards,
    Biswajit.


    biswajit_rout wrote:
    Hi,

    There is nothing to crawl on the home page of
    http://10.222.18.113:8080/dao/.

    So this time I crawled another site. I successfully crawled all the
    public pages but was not able to crawl the private pages.
    I have attached a log file (new.log). Could you please check it and let me
    know what needs to be done from my end?

    Best regards,
    Biswajit.


    Susam Pal wrote:
    The log file shows only one fetching line:

    2008-09-16 20:46:15,321 INFO fetcher.Fetcher - fetching http://10.222.18.113:8080/dao/

    This has been fetched successfully. There is no other page being
    fetched. Have you set up Nutch properly so that it can fetch all the
    pages you need? If it tries to fetch a page but fails due to
    authentication, then it is a problem with authentication.

    In this case, it is not even attempting to fetch those pages. So, the
    problem lies elsewhere. You need to first find out why it is fetching
    only one page and not others.

    Regards,
    Susam Pal

    On Tue, Sep 16, 2008 at 5:24 PM, biswajit_rout
    wrote:
    But still it is not crawling the password protected pages...

    Regards,
    Biswajit.


    Susam Pal wrote:
    The latest log shows that the page from the URL:
    http://10.222.18.113:8080/dao/ has been fetched successfully.

    Regards,
    Susam Pal

    On Tue, Sep 16, 2008 at 3:33 PM, biswajit_rout
    wrote:
    Hi Susam,

    Please find the latest log file(latest.log), which shows different
    error.

    2008-09-16 20:46:16,102 DEBUG httpclient.Http - url: http://10.222.18.113:8080/robots.txt; status code: 404; bytes received: 985; Content-Length: 985
    2008-09-16 20:46:16,384 DEBUG httpclient.Http - url: http://10.222.18.113:8080/dao/; status code: 200; bytes received: 1941; Content-Length: 1941

    Thanks in advance...

    Best regards,
    Biswajit.


    biswajit_rout wrote:
    Hi Susam,

    Thanks for your immediate response.
    I am attaching the debug-enabled log file
    (debugenabled_hadoop.log). Kindly go through the file and let me
    know what is missing from my end.

    Best regards,
    Biswajit.


    Susam Pal wrote:
    Hi Biswajit,

    The authscope specifies which IP address or domain name the
    credentials are used for. If you provide 10.222.18.113 in the
    authscope, the credentials are not used for localhost, even though
    both refer to the same machine.

    Please provide logs with DEBUG enabled.

    Regards,
    Susam Pal
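
The authscope behaviour Susam describes can be illustrated with a small sketch. This is not Nutch's or HttpClient's actual code; the function name `scope_matches` and its parameters are hypothetical, and the sketch only assumes what the reply states: the host in the authscope is compared literally with the host in the fetched URL, with no attempt to work out that two names point at the same machine.

```python
from urllib.parse import urlsplit


def scope_matches(scope_host: str, scope_port: int, url: str) -> bool:
    """Hypothetical check: do credentials scoped to host:port apply to url?

    The host is compared as a literal string (case-insensitively). No DNS
    lookup is done, so "localhost" never equals "10.222.18.113" even when
    both refer to the same machine.
    """
    parts = urlsplit(url)
    if parts.hostname is None:
        return False
    # Fall back to the scheme's default port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname.lower() == scope_host.lower() and port == scope_port


if __name__ == "__main__":
    # Credentials scoped to the IP address are not offered for localhost URLs.
    print(scope_matches("10.222.18.113", 8080, "http://localhost:8080/"))
    print(scope_matches("10.222.18.113", 8080, "http://10.222.18.113:8080/dao/"))
```

Under this assumption, the seed URL and the authscope entry have to use exactly the same host name or IP address for the credentials to be considered at all.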

    On Tue, Sep 16, 2008 at 1:33 PM, biswajit_rout
    wrote:
    Hi Susam,

    The IP 10.222.18.113 is simply the IP address of my own machine
    (localhost). I have now changed http://localhost:8080/ to
    http://10.222.18.113:8080, but with no result: I am still not able
    to crawl the password-protected pages.

    Kindly assist me to resolve this issue.

    Thanks in advance...

    Best regards,
    Biswajit.




    Susam Pal wrote:
    The logs show that it is fetching http://localhost:8080/ but you have
    set credentials for 10.222.18.113:8080, which is never being fetched.
    So, no authentication takes place.

    Regards,
    Susam Pal

    On Mon, Sep 15, 2008 at 1:20 PM, biswajit_rout
    wrote:
    Hi Susam,

    In order to crawl password-protected pages, I am using
    nutch-2008-07-10_04-01-48.tar (which I downloaded from
    http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ and which
    contains your patch for HttpAuthentication).

    I have modified nutch-site.xml and httpclient-auth.xml.

    Please find the attached zip file, which contains
    nutch-site.xml and httpclient-auth.xml.

    Kindly provide me a solution for this.

    Best regards,
    Biswajit


    Susam Pal wrote:
    Hi Biswajit,

    Could you please tell us how you have added the support for
    authentication in Nutch 0.9? Nutch 0.9 cannot do authentication
    properly by default. The authentication feature is buggy in Nutch 0.9;
    this was fixed with this ticket:
    https://issues.apache.org/jira/browse/NUTCH-559

    The feature is documented here:
    http://wiki.apache.org/nutch/HttpAuthenticationSchemes

    The easiest way to use it is to check out the latest version of Nutch
    and build it, as it contains the authentication feature. If you want to
    use it with Nutch 0.9, you have to download the latest patch present
    in the ticket page, apply it to the source code, and build it. You
    might have to resolve some conflicts manually.

    I would suggest that you do not send the same mail multiple
    times. We have received the same mail from you 4 times. It takes
    some time for members to reply to a mail. :-)

    Regards,
    Susam Pal

    On Mon, Sep 15, 2008 at 6:07 PM, Rout Biswajit-B16078
    wrote:
    Hi,

    I have successfully configured NUTCH 0.9, which is crawling a number of
    sites, and searching is also working properly after that.

    However, now I want to crawl password-protected pages using NUTCH. In
    order to access those pages I need a valid user name and password. I
    have configured the user name and password in my nutch-site.xml and
    httpclient-auth.xml.

    However, it is not crawling.

    I have attached nutch-site.xml, httpclient-auth.xml and hadoop.log in
    the Zip file for your reference. Kindly check and let me know what is
    missing from my end.

    CONFIGURATION:

    nutch-2008-07-10_04-01-48.tar (which I downloaded from
    http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ and which
    contains your patch for HttpAuthentication)

    Windows XP
    Cygwin
    jdk1.6.0

    Thanks in advance…

    Please help....

    Best regards,
    Biswajit
    http://www.nabble.com/file/p19492846/Nutch.zip Nutch.zip
    --
    View this message in context:
    http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19492846.html
    Sent from the Nutch - User mailing list archive at Nabble.com.
    http://www.nabble.com/file/p19507146/hadoop.log hadoop.log
    http://www.nabble.com/file/p19510820/debugenabled_hadoop.log debugenabled_hadoop.log
    http://www.nabble.com/file/p19514374/latest.log latest.log
    http://www.nabble.com/file/p19552519/new.txt new.txt
  • Susam Pal at Sep 19, 2008 at 2:57 pm
    Hi Biswajit,

    I don't find a single error caused by an authentication problem in
    the 'new.txt' file you attached to an earlier mail. Most of the
    entries are HTTP 404 or HTTP 302 responses, which mean either that the
    page is not available or that it has moved to another location, which
    the crawler would then try to fetch. There's nothing I can do to help
    you in this matter. You have access to the network, so you are better
    placed to analyze why this is happening. Please do not send the same
    mail multiple times. As I have told you before, it takes time for
    members to respond, as they do so only in their free time.

    Regards,
    Susam Pal
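
The distinction Susam draws between authentication failures and 404/302 responses can be checked mechanically. The sketch below is only an illustration: it assumes the DEBUG line layout quoted earlier in this thread (`url: ...; status code: ...`), and the names `LOG_PATTERN`, `classify` and `summarize` are hypothetical helpers, not part of Nutch. The contents of new.txt are not shown in the thread, so the sample uses the two latest.log lines that were posted.

```python
import re

# Matches the "url: <url>; status code: <nnn>" part of a DEBUG line.
LOG_PATTERN = re.compile(r"url:\s*(?P<url>\S+);\s*status code:\s*(?P<status>\d{3})")


def classify(status: int) -> str:
    """Map an HTTP status code to a coarse category."""
    if status in (401, 407):
        return "authentication required"
    if status == 404:
        return "not found"
    if status in (301, 302, 303, 307, 308):
        return "redirect"
    if 200 <= status < 300:
        return "ok"
    return "other"


def summarize(log_text: str):
    """Return (url, status, category) for every matching log entry."""
    return [
        (m.group("url"), int(m.group("status")), classify(int(m.group("status"))))
        for m in LOG_PATTERN.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "2008-09-16 20:46:16,102 DEBUG httpclient.Http - url: "
        "http://10.222.18.113:8080/robots.txt; status code: 404; "
        "bytes received: 985; Content-Length: 985\n"
        "2008-09-16 20:46:16,384 DEBUG httpclient.Http - url: "
        "http://10.222.18.113:8080/dao/; status code: 200; "
        "bytes received: 1941; Content-Length: 1941\n"
    )
    for entry in summarize(sample):
        print(entry)
```

If no entry falls into the "authentication required" category, the server never issued an HTTP authentication challenge, which is consistent with the diagnosis above.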

  • Biswajit_rout at Sep 22, 2008 at 8:10 am
    Hi Susam,

    I saw this thread:
    http://www.nabble.com/Problems-testing-Authentication-td13991771.html#a13995888

    There, your patch is used for web server (Tomcat) manager authentication.

    However, my requirement is different.

    I am trying to crawl sites that are password protected in the same way as Gmail.
    That is, I have to submit a correct user name and password, and only after that
    can I see all the pages / modules. Is there a different configuration
    mechanism for this?

    Could you please let me know?

    Best regards,
    Biswajit.



  • Susam Pal at Sep 22, 2008 at 8:17 am
    Replies inline.

    On Mon, Sep 22, 2008 at 1:40 PM, biswajit_rout wrote:
    > Hi Susam,
    >
    > I saw,
    > http://www.nabble.com/Problems-testing-Authentication-td13991771.html#a13995888
    >
    > Where your patch is used for web server (Tomcat) manager authentication.

    My patch supports the Basic, Digest and NTLM authentication schemes
    only. It probably works with Tomcat because Tomcat uses Basic
    authentication to log in to the manager. I hope you have read this:
    http://wiki.apache.org/nutch/HttpAuthenticationSchemes

    > However my requirement is different…
    >
    > I am trying to crawl sites which are password protected just like gmail.
    > That means I have to pass correct user name and password, after that I will
    > be able to see all the pages / modules. So for this, is there any different
    > configuration mechanism?
    >
    > Could you please let me know???
    >
    > Best regards,
    > Biswajit.

    POST-based authentication is not supported. You might want to read this:
    http://wiki.apache.org/nutch/HttpPostAuthentication

    Regards,
    Susam Pal
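
One way to tell which kind of protection a site uses is to look at the response to an unauthenticated request. The sketch below is a hedged illustration, not Nutch code: `challenge_scheme`, `covered_by_http_auth` and `SUPPORTED_SCHEMES` are hypothetical names, and the only assumptions are the standard HTTP behaviour that a server using Basic, Digest or NTLM answers with status 401 and a `WWW-Authenticate` header, whereas a form-based (POST) login page usually answers with 200 or a redirect to a login form.

```python
from typing import Mapping, Optional

# The schemes named in the reply above.
SUPPORTED_SCHEMES = {"basic", "digest", "ntlm"}


def challenge_scheme(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Return the lowercase auth scheme of a 401 challenge, or None."""
    if status != 401:
        return None
    for name, value in headers.items():
        if name.lower() == "www-authenticate" and value.strip():
            # The scheme is the first token, e.g. 'Basic realm="..."'.
            return value.split()[0].lower()
    return None


def covered_by_http_auth(status: int, headers: Mapping[str, str]) -> bool:
    """True if the response is a Basic, Digest or NTLM challenge."""
    return challenge_scheme(status, headers) in SUPPORTED_SCHEMES


if __name__ == "__main__":
    # A Tomcat-manager-style Basic challenge.
    print(covered_by_http_auth(
        401, {"WWW-Authenticate": 'Basic realm="Tomcat Manager Application"'}))
    # A form login page: plain 200 with HTML, no challenge header.
    print(covered_by_http_auth(200, {"Content-Type": "text/html"}))
```

A site that logs users in through an HTML form, as described in the question, would fall into the second case, so credentials configured for HTTP authentication schemes would never be requested by the server.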



    Susam Pal wrote:
    Hi Biswajit,

    I don't find a single error caused due to authentication problem in
    the 'new.txt' file you have attached in some mail before.. Most of
    them are HTTP 404 or HTTP 302 errors, which means either the page is
    not available or the page has been moved to another location, which
    the crawler would try to fetch. There's nothing I can do to help you
    in this matter. You have access to the network and you can analyze
    better why this is happening. Please do not send the same mail
    multiple time. As, I have told you before, it takes time for members
    to respond as they do so only in their free time.

    Regards,
    Susam Pal

    On Fri, Sep 19, 2008 at 5:38 AM, biswajit_rout
    wrote:
    Hi Susam,

    Please give a look into the attached file (new.txt) and suggest a
    solution
    for this. This time i have crawled another site. I am able to crawl all
    the
    public pages but password protected pages crawling is not happening...

    Best regards,
    Biswajit.


    biswajit_rout wrote:
    Hi,

    There is nothing to crawl in the home page of
    http://10.222.18.113:8080/dao/.

    So this time i have crawled another site. I have successfully crawled
    all
    the public pages but not able to crawl private pages.
    I have attached a log file(new.log). Can you please check and let me
    know
    what needs to be done from my end???

    Best regards,
    Biswajit.


    Susam Pal wrote:
    The log file shows only one fetching line:

    2008-09-16 20:46:15,321 INFO fetcher.Fetcher - fetching
    http://10.222.18.113:8080/dao/

    This has been fetched successfully. There is no other page being
    fetched. Have you set up Nutch properly so that it can fetch all the
    pages you need? If it tries to fetch a page but fails due to
    authentication, then it is a problem with authentication.

    In this case, it is not even attempting to fetch those pages. So, the
    problem lies elsewhere. You need to first find out why it is fetching
    only one page and not others.

    Regards,
    Susam Pal

    On Tue, Sep 16, 2008 at 5:24 PM, biswajit_rout
    wrote:
    But still it is not crawling the password protected pages...

    Regards,
    Biswajit.


    Susam Pal wrote:
    The latest log shows that the page from the URL:
    http://10.222.18.113:8080/dao/ has been fetched successfully.

    Regards,
    Susam Pal

    On Tue, Sep 16, 2008 at 3:33 PM, biswajit_rout
    wrote:
    Hi Susam,

    Please find the latest log file(latest.log), which shows different
    error.

    2008-09-16 20:46:16,102 DEBUG httpclient.Http - url:
    http://10.222.18.113:8080/robots.txt; status code: 404; bytes
    received:
    985;
    Content-Length: 985
    2008-09-16 20:46:16,384 DEBUG httpclient.Http - url:
    http://10.222.18.113:8080/dao/; status code: 200; bytes received:
    1941;
    Content-Length: 1941

    Thanks in advance...

    Best regards,
    Biswajit.


    biswajit_rout wrote:
    Hi Susam,

    Thanks for your immediate response...
    Herewith i am attaching the debug enabled log
    file(debugenabled_hadoop.log). Kindly go through the file and let
    me
    know
    what is missing from my end...

    Best regards,
    Biswajit.


    Susam Pal wrote:
    Hi Biswajit,

    The authscope specifies which IP address or domain-name would the
    credentials be used for. If you provide 10.222.18.113 in the
    authscope, the credentials would not be used for localhost even
    though
    both represent the same machine.

    Please provide logs with DEBUG enabled.

    Regards,
    Susam Pal

    On Tue, Sep 16, 2008 at 1:33 PM, biswajit_rout
    wrote:
    Hi Susam,

    The ip 10.222.18.113 is nothing but the ip address of my
    machine(localhost).
    Now also i changed http://localhost:8080/ to
    http://10.222.18.113:8080.
    However no result, i mean to say still not able to crawl password
    protected
    pages.

    Kindly assist me to resolve this issue.

    Thanks in advance...

    Best regards,
    Biswajit.




    Susam Pal wrote:
    The logs show that it is fetching http://localhost:8080/ but you
    have
    set credentials for 10.222.18.113:8080 which is never being
    fetched.
    So, no authentication takes place.

    Regards,
    Susam Pal

    On Mon, Sep 15, 2008 at 1:20 PM, biswajit_rout wrote:
    Hi Susam,

    In order to crawl password-protected pages, I am using
    nutch-2008-07-10_04-01-48.tar (downloaded from
    http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ and containing
    your patch for HttpAuthentication).

    I have modified nutch-site.xml and httpclient-auth.xml.

    Please find the attached zip file, which contains nutch-site.xml
    and httpclient-auth.xml.

    Kindly provide me with a solution for this.

    Best regards,
    Biswajit


    Susam Pal wrote:
    Hi Biswajit,

    Could you please tell us how you added support for
    authentication to Nutch 0.9? Nutch 0.9 cannot do authentication
    properly by default. The authentication feature is buggy in Nutch
    0.9; this was fixed with this ticket:
    https://issues.apache.org/jira/browse/NUTCH-559

    The feature is documented here:
    http://wiki.apache.org/nutch/HttpAuthenticationSchemes

    The easiest way to use it is to check out the latest version of
    Nutch and build it, as it contains the authentication feature. If
    you want to use it with Nutch 0.9, you have to download the latest
    patch from the ticket page, apply it to the source code and build
    it. You might have to resolve some conflicts manually.

    I would suggest that you do not send the same mail multiple
    times. We have received the same mail from you 4 times. It takes
    some time for members to reply to a mail. :-)

    Regards,
    Susam Pal

    http://www.nabble.com/file/p19492846/Nutch.zip Nutch.zip
    --
    View this message in context:
    http://www.nabble.com/Not-able-to-crawl-password-protected-pages-using-NUTCH-0.9-tp19492246p19492846.html
    Sent from the Nutch - User mailing list archive at Nabble.com.
    http://www.nabble.com/file/p19507146/hadoop.log hadoop.log
    http://www.nabble.com/file/p19510820/debugenabled_hadoop.log
    debugenabled_hadoop.log
    http://www.nabble.com/file/p19514374/latest.log latest.log
    http://www.nabble.com/file/p19552519/new.txt new.txt
  • Biswajit_rout at Sep 25, 2008 at 6:34 am
    Hi,

    This is regarding the indexing that NUTCH performs.
    From the log below we can see that it has indexed 75 docs. My
    question is: what is the maximum number of docs it will index? Can we
    limit this value?

    2008-09-24 13:13:23,390 INFO indexer.Indexer - Optimizing index.
    2008-09-24 13:13:23,406 INFO indexer.Indexer - merging segments _ram_1e (1 docs) _ram_1f (1 docs) _ram_1g (1 docs) _ram_1h (1 docs) _ram_1i (1 docs) _ram_1j (1 docs) _ram_1k (1 docs) _ram_1l (1 docs) _ram_1m (1 docs) _ram_1n (1 docs) _ram_1o (1 docs) _ram_1p (1 docs) _ram_1q (1 docs) _ram_1r (1 docs) _ram_1s (1 docs) _ram_1t (1 docs) _ram_1u (1 docs) _ram_1v (1 docs) _ram_1w (1 docs) _ram_1x (1 docs) _ram_1y (1 docs) _ram_1z (1 docs) _ram_20 (1 docs) _ram_21 (1 docs) _ram_22 (1 docs) into _1 (25 docs)
    2008-09-24 13:13:23,437 INFO indexer.Indexer - merging segments _0 (50 docs) _1 (25 docs) into _2 (75 docs)
    2008-09-24 13:13:24,216 INFO indexer.Indexer - Indexer: done
    2008-09-24 13:13:24,216 INFO indexer.DeleteDuplicates - Dedup: starting
    2008-09-24 13:13:24,232 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
    2008-09-24 13:13:27,723 INFO indexer.DeleteDuplicates - Dedup: done
    2008-09-24 13:13:27,723 INFO indexer.IndexMerger - merging indexes to: crawl/index
    2008-09-24 13:13:27,738 INFO indexer.IndexMerger - Adding crawl/indexes/part-00000
    2008-09-24 13:13:27,816 INFO indexer.IndexMerger - done merging
    2008-09-24 13:13:27,832 INFO crawl.Crawl - crawl finished: crawl

    Best regards,
    Biswajit.


    Susam Pal wrote:
    Replies inline.

    On Mon, Sep 22, 2008 at 1:40 PM, biswajit_rout wrote:
    Hi Susam,

    I saw
    http://www.nabble.com/Problems-testing-Authentication-td13991771.html#a13995888

    where your patch is used for web server (Tomcat) manager authentication.

    My patch is used for Basic, Digest or NTLM authentication schemes
    only. It works with Tomcat probably because it uses Basic
    authentication to log in to the manager. I hope you have read this:
    http://wiki.apache.org/nutch/HttpAuthenticationSchemes.

    However, my requirement is different.

    I am trying to crawl sites that are password protected just like Gmail.
    That means I have to submit the correct user name and password, and only
    then will I be able to see all the pages / modules. Is there a different
    configuration mechanism for this?

    POST-based authentication is not done. You might want to read this:
    http://wiki.apache.org/nutch/HttpPostAuthentication

    Regards,
    Susam Pal

    Could you please let me know?

    Best regards,
    Biswajit.
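
For context on the distinction drawn above: with Basic authentication the credentials travel in a request header, whereas a login form typically submits them in a POST body. A minimal Python sketch of how a Basic `Authorization` header value is built follows. The `basic_auth_header` helper is hypothetical, and the user name and password are the sample pair from the HTTP Basic authentication specification, not values from this thread.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

print(basic_auth_header("Aladdin", "open sesame"))
```

A site that expects a form login ignores such a header, which is consistent with the patch not helping for Gmail-style logins.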



    Susam Pal wrote:
    Hi Biswajit,

    I don't find a single error caused by an authentication problem in
    the 'new.txt' file you attached to an earlier mail. Most of
    them are HTTP 404 or HTTP 302 responses, which mean either that the
    page is not available or that the page has moved to another location,
    which the crawler would then try to fetch. There's nothing I can do to
    help you in this matter. You have access to the network and are better
    placed to analyze why this is happening. Please do not send the same
    mail multiple times. As I have told you before, it takes time for
    members to respond, as they do so only in their free time.

    Regards,
    Susam Pal
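
To make the distinction above concrete, here is a small Python sketch that labels a status code by whether it signals an authentication challenge. The `describe` helper and its labels are hypothetical; only the status codes and reason phrases come from the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus

def describe(status_code):
    """Label a status code by whether it indicates an authentication challenge."""
    status = HTTPStatus(status_code)
    if status in (HTTPStatus.UNAUTHORIZED, HTTPStatus.PROXY_AUTHENTICATION_REQUIRED):
        kind = "authentication required"
    elif 300 <= status < 400:
        kind = "redirect"
    elif status == HTTPStatus.NOT_FOUND:
        kind = "page not available"
    else:
        kind = "other"
    return f"{status.value} {status.phrase}: {kind}"

for code in (302, 404, 401):
    print(describe(code))
```

A log containing only 302 and 404 responses, and no 401, would therefore not point to a failed Basic, Digest or NTLM challenge.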

    On Fri, Sep 19, 2008 at 5:38 AM, biswajit_rout wrote:
    Hi Susam,

    Please take a look at the attached file (new.txt) and suggest a
    solution. This time I have crawled another site. I am able to crawl
    all the public pages, but the password-protected pages are not being
    crawled.

    Best regards,
    Biswajit.


    biswajit_rout wrote:
    Hi,

    There is nothing to crawl on the home page of
    http://10.222.18.113:8080/dao/.

    So this time I have crawled another site. I have successfully crawled
    all the public pages but am not able to crawl the private pages.
    I have attached a log file (new.log). Can you please check and let me
    know what needs to be done on my end?

    Best regards,
    Biswajit.


    Susam Pal wrote:
    The log file shows only one fetching line:

    2008-09-16 20:46:15,321 INFO fetcher.Fetcher - fetching http://10.222.18.113:8080/dao/

    This has been fetched successfully. There is no other page being
    fetched. Have you set up Nutch properly so that it can fetch all the
    pages you need? If it tries to fetch a page but fails due to
    authentication, then it is a problem with authentication.

    In this case, it is not even attempting to fetch those pages. So, the
    problem lies elsewhere. You need to first find out why it is fetching
    only one page and not others.

    Regards,
    Susam Pal
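
One way to check which pages the crawler actually attempted is to scan the log for fetch lines and status lines. The Python sketch below is a hedged example: the regular expressions are modelled only on the log lines quoted in this thread, the `summarize` helper is hypothetical, and it assumes each log entry sits on a single line, as it would in the original hadoop.log rather than in a mail-wrapped copy.

```python
import re

# Patterns based on the two kinds of log lines quoted in this thread.
FETCH_RE = re.compile(r"INFO\s+fetcher\.Fetcher - fetching\s+(\S+)")
STATUS_RE = re.compile(r"DEBUG\s+httpclient\.Http - url:\s*(\S+?);\s*status code:\s*(\d+)")

def summarize(log_text):
    """Return (fetched_urls, {url: status_code}) found in the log text."""
    fetched = FETCH_RE.findall(log_text)
    statuses = {url: int(code) for url, code in STATUS_RE.findall(log_text)}
    return fetched, statuses

SAMPLE = """\
2008-09-16 20:46:15,321 INFO fetcher.Fetcher - fetching http://10.222.18.113:8080/dao/
2008-09-16 20:46:16,102 DEBUG httpclient.Http - url: http://10.222.18.113:8080/robots.txt; status code: 404; bytes received: 985; Content-Length: 985
2008-09-16 20:46:16,384 DEBUG httpclient.Http - url: http://10.222.18.113:8080/dao/; status code: 200; bytes received: 1941; Content-Length: 1941
"""

fetched, statuses = summarize(SAMPLE)
print(fetched)
print(statuses)
```

If the list of fetched URLs contains only the seed page, the protected pages were never requested, so authentication was never exercised for them.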

    On Tue, Sep 16, 2008 at 5:24 PM, biswajit_rout wrote:
    But still it is not crawling the password protected pages...

    Regards,
    Biswajit.


    Susam Pal wrote:
    The latest log shows that the page from the URL:
    http://10.222.18.113:8080/dao/ has been fetched successfully.

    Regards,
    Susam Pal

    On Tue, Sep 16, 2008 at 3:33 PM, biswajit_rout wrote:
    Hi Susam,

    Please find the latest log file (latest.log), which shows a different
    error.

    2008-09-16 20:46:16,102 DEBUG httpclient.Http - url: http://10.222.18.113:8080/robots.txt; status code: 404; bytes received: 985; Content-Length: 985
    2008-09-16 20:46:16,384 DEBUG httpclient.Http - url: http://10.222.18.113:8080/dao/; status code: 200; bytes received: 1941; Content-Length: 1941

    Thanks in advance...

    Best regards,
    Biswajit.



Discussion Overview
group: nutch-user @
categories: lucene
posted: Sep 15, '08 at 12:45p
active: Sep 25, '08 at 6:34a
posts: 20
users: 4
website: nutch.apache.org
