hadoop or nutch problem?
I'm doing web crawling with Nutch, which runs on Hadoop in distributed
mode. Now that the crawldb has tens of millions of URLs, I have started to
see strange failures in generating new segments and updating the crawldb.
For segment generation, the Hadoop select job completes successfully and
generate-temp-1285641291765 is created, but the partition job never starts
and no segment is created in the segments directory. I'm trying to
understand where it fails. There is no error message except for a few WARN
messages about "connection reset by peer". Hadoop fsck and dfsadmin show
that the nodes and directories are healthy. Is this a Hadoop problem or a
Nutch problem? I'd appreciate any suggestions on how to debug this fatal
problem.

A similar problem is seen in the updatedb step, which creates the temp dir
but never actually updates the crawldb.
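
The health checks and directory listings referred to above look roughly like
this (a minimal sketch; crawl/ is only a placeholder for the actual crawl
directory on HDFS):

# overall HDFS health and datanode status
hadoop fsck /
hadoop dfsadmin -report

# the temp output exists, but no new segment appears
hadoop fs -ls crawl/generate-temp-1285641291765
hadoop fs -ls crawl/segments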

thanks,
aj
--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
web2express.org
twitter: @web2express
Palo Alto, CA, USA


  • AJ Chen at Oct 2, 2010 at 5:29 pm
    More observations: while the Hadoop job is running, this "Filesystem
    closed" error happens consistently:
    2010-10-02 05:29:58,951 WARN mapred.TaskTracker - Error running child
    java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:226)
        at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:67)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1678)
        at java.io.FilterInputStream.close(FilterInputStream.java:155)
        at org.apache.hadoop.io.SequenceFile$Reader.close(SequenceFile.java:1584)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.close(SequenceFileRecordReader.java:125)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:362)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
    2010-10-02 05:29:58,979 WARN mapred.TaskRunner - Parent died. Exiting attempt_201009301134_0006_m_000074_1

    Could this error turn on safemode in Hadoop? I suspect this because the
    next Hadoop job is supposed to create a segment directory and write out
    segment results, but it never creates the directory. What else could be
    happening to HDFS?
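
    Whether safemode is actually on can be checked (and, if necessary, cleared)
    directly; a minimal sketch using the standard dfsadmin commands:

    # ask the NameNode whether safemode is currently on
    hadoop dfsadmin -safemode get

    # if it is stuck on, it can be forced off (use with care)
    hadoop dfsadmin -safemode leave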

    thanks,
    -aj
  • Sudhir Vallamkondu at Oct 4, 2010 at 4:29 pm
    fs -du has an 'h' option for human-readable values, but it doesn't seem to
    work. Instead you can use something like this to print sizes in gigabytes;
    adjust the 1024 multiplier for other units.

    hadoop fs -du / | awk '{print ($1/(1024*1024*1024))"g" "\t" $2}'
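
    For the cluster-wide totals the original question asked about, dfsadmin
    also prints a df-style summary directly (the exact labels vary a bit
    between Hadoop versions):

    # configured capacity, DFS used and DFS remaining for the whole cluster,
    # followed by a per-datanode breakdown
    hadoop dfsadmin -report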



    On 10/4/10 2:04 AM, "common-user-digest-help@hadoop.apache.org" wrote:
    From: Sandhya E <sandhyabhaskar@gmail.com>
    Date: Sat, 2 Oct 2010 23:36:55 +0530
    To: <common-user@hadoop.apache.org>
    Subject: Re: Total Space Available on Hadoop Cluster Or Hadoop version of "df".

    There is an fs -du command that can be useful, and the Hadoop DFS web
    interface also shows the stats.
    On Sat, Oct 2, 2010 at 9:44 AM, rahul wrote:
    Hi,

    I am using Hadoop 0.20.2 for data processing, with a Hadoop cluster set up
    on two nodes, and I am continuously adding more space to the nodes.

    Can somebody let me know how to get the total space available on the
    Hadoop cluster from the command line, i.e. a Hadoop equivalent of the
    Unix "df" command?

    Any input is helpful.

    Thanks
    Rahul

  • Sudhir Vallamkondu at Oct 4, 2010 at 6:27 pm
    Worth checking whether this is caused by the open file limit issue:

    http://sudhirvn.blogspot.com/2010/07/hadoop-error-logs-orgapachehadoophdfsse.html
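
    A quick check is to compare the descriptors actually in use against the
    limit for the user running the datanode/tasktracker processes (a sketch;
    the "hadoop" user name and the 32768 limit are only examples):

    # limit in effect for the current shell/user
    ulimit -n

    # descriptors currently held by the hadoop user (user name is an example)
    lsof -u hadoop | wc -l

    # to raise the limit persistently, add a line like this to
    # /etc/security/limits.conf and log in again:
    #   hadoop  -  nofile  32768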


    On 10/4/10 2:04 AM, "common-user-digest-help@hadoop.apache.org" wrote:
    From: AJ Chen <ajchen@web2express.org>
    Date: Sat, 2 Oct 2010 10:28:29 -0700
    To: <common-user@hadoop.apache.org>, nutch-user <nutch-user@lucene.apache.org>
    Subject: Re: hadoop or nutch problem?

