FAQ
Hello all,

I got the timeout error mentioned below -- after 600 seconds, that attempt was killed and deemed a failure. I searched around about this error, and one of the suggestions was to include "progress" statements in the reducer -- it might be taking longer than 600 seconds and so is timing out. I added calls to context.progress() and context.setStatus(str) in the reducer. Now it works fine -- there are no timeout errors.
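
As an illustration, here is a minimal sketch of a reducer that reports progress in the way described above, written against the Hadoop 0.20 org.apache.hadoop.mapreduce API; the Text key/value types, the class name and the inner loop are placeholders, not the actual job:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ProgressReportingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long processed = 0;
        for (Text value : values) {
            // ... the actual (long-running) per-value work goes here ...
            processed++;
            if (processed % 1000 == 0) {
                // Tell the framework this attempt is still alive so it is not
                // killed after mapred.task.timeout (600 seconds by default).
                context.progress();
                context.setStatus("processed " + processed + " values for key " + key);
            }
        }
        context.write(key, new Text(Long.toString(processed)));
    }
}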

But, for a few jobs, it takes an awfully long time to move from "Map 100%, Reduce 99%" to Reduce 100%. For some jobs it is 15 minutes and for some it was more than an hour. The reduce code is not complex -- a two-level loop and a couple of if-else blocks. The input size is also not huge; the job that gets stuck for an hour at reduce 99% takes in 130 inputs. Some of them are 1-3 MB in size and a couple of them are 16 MB in size.

Has anyone encountered this problem before? Any pointers? I use Hadoop 0.20.2 on a Linux cluster of 16 nodes.

Thank you.

Regards,
Raghava.
On Thu, Apr 1, 2010 at 2:24 AM, Raghava Mutharaju wrote:

Hi all,

I am running a series of jobs one after another. While executing the 4th job, the job fails. It fails in the reducer --- the progress percentage would be map 100%, reduce 99%. It gives the following message:

10/04/01 01:04:15 INFO mapred.JobClient: Task Id :
attempt_201003240138_0110_r_000018_1, Status : FAILED
Task attempt_201003240138_0110_r_000018_1 failed to report status for 602
seconds. Killing!

It makes several attempts to execute it again but fails with a similar message. I couldn't get anything from this error message and wanted to look at the logs (located in the default directory, ${HADOOP_HOME}/logs). But I don't find any files which match the timestamp of the job. Also, I did not find history and userlogs in the logs folder. Should I look at some other place for the logs? What could be the possible causes for the above error?

I am using Hadoop 0.20.2 and I am running it on a cluster with 16
nodes.

Thank you.

Regards,
Raghava.


  • Eric Arenas at Apr 8, 2010 at 6:28 pm
    Yes Raghava,

    I have experienced that issue before, and the solution you mentioned also solved my issue (adding a context.progress() or context.setStatus() call to tell the JobTracker that my jobs are still running).

    regards
    Eric Arenas
  • Prashant ullegaddi at Apr 8, 2010 at 7:03 pm
    Dear Raghava,

    I also faced this problem. It mostly happens when the computation on the data a reduce task receives takes more time than the default time-out of 600s and so cannot finish in time. You can also increase the time-out, by setting the property "mapred.task.timeout", to ensure that all reduces complete.
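
    For illustration, a sketch of setting that property per job from the driver, assuming the Hadoop 0.20 Configuration/Job API; the 30-minute value and the class/job names are only examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TimeoutDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Raise the per-task timeout from the default 600000 ms (10 min) to 30 min.
            conf.setLong("mapred.task.timeout", 30 * 60 * 1000L);
            Job job = new Job(conf, "timeout-example");
            // ... set the jar, mapper, reducer and input/output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }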

    --
    Thanks and Regards,
    Prashant Ullegaddi,
    Search and Information Extraction Lab,
    IIIT-Hyderabad, India.
  • Gregory Lawrence at Apr 8, 2010 at 8:28 pm
    Hi,

    I have also experienced this problem. Have you tried speculative execution? Also, I have had jobs that took a long time for one mapper / reducer because of a record that was significantly larger than those contained in the other filesplits. Do you know if it always slows down for the same filesplit?

    Regards,
    Greg Lawrence

  • Raghava Mutharaju at Apr 8, 2010 at 8:14 pm
    Hi,

    Thank you Eric, Prashant and Greg. Although the timeout problem was resolved, reduce is getting stuck at 99%. As of now, it has been stuck there for about 3 hours. That is too high a wait time for my task. Do you guys see any reason for this?

    Speculative execution is "on" by default, right? Or should I enable it?

    Regards,
    Raghava.
  • Ted Yu at Apr 8, 2010 at 8:40 pm
    You need to turn it on yourself (in hadoop-site.xml):

    <property>
      <name>mapred.reduce.tasks.speculative.execution</name>
      <value>true</value>
    </property>

    <property>
      <name>mapred.map.tasks.speculative.execution</name>
      <value>true</value>
    </property>
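
    For anyone who cannot edit hadoop-site.xml, a sketch of turning the same two properties on per job from driver code, assuming the Hadoop 0.20 Configuration/Job API (class and job names are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculativeExecutionDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same properties as the hadoop-site.xml snippet above, set per job.
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
            Job job = new Job(conf, "speculative-example");
            // ... configure mapper, reducer and input/output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }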

  • Raghava Mutharaju at Apr 8, 2010 at 9:47 pm
    Hi Ted,

    Thank you for the suggestion. I enabled it using the Configuration class because I cannot change the hadoop-site.xml file (I am not an admin). The situation is still the same --- it gets stuck at reduce 99% and does not move further.

    Regards,
    Raghava.
  • Ted Yu at Apr 8, 2010 at 9:52 pm
    Raghava:
    Are you able to share the last segment of the reducer log? You can get it from the web UI:
    http://snv-it-lin-012.pr.com:50060/tasklog?taskid=attempt_201003221148_1211_r_000003_0&start=-8193

    Adding more logging in your reducer task would help pinpoint where the issue is. Also look in the JobTracker log.

    Cheers
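
    A sketch of the kind of extra reducer-side logging suggested here, using the commons-logging API that ships with Hadoop; with the default log4j setup the output should land in the task attempt's userlogs (class name and messages are placeholders):

    import java.io.IOException;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LoggingReducer extends Reducer<Text, Text, Text, Text> {
        private static final Log LOG = LogFactory.getLog(LoggingReducer.class);

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            LOG.info("reduce() entered for key " + key);
            long count = 0;
            for (Text value : values) {
                count++;
            }
            LOG.info("reduce() finished for key " + key + " after " + count + " values");
            context.write(key, new Text(Long.toString(count)));
        }
    }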
  • Raghava Mutharaju at Apr 9, 2010 at 2:41 am
    Hi Ted,

    Thank you for all the suggestions. I went through the JobTracker logs and I have attached the exceptions found in the logs. I found two exceptions:

    1) org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
    complete write to file (DFS Client)

    2) org.apache.hadoop.ipc.RemoteException:
    org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
    /user/raghava/MR_EL/output/_temporary/_attempt_201004060646_0057_r_000014_0/part-r-00014
    File does not exist. Holder DFSClient_attempt_201004060646_0057_r_000014_0
    does not have any open files.


    The exception occurs at the point of writing out <K,V> pairs in the reducer, and it occurs only in certain task attempts. I am not using any custom output format or record writers, but I do use a custom input reader.

    What could have gone wrong here?

    Thank you.

    Regards,
    Raghava.

  • Ted Yu at Apr 18, 2010 at 1:36 am
    Hi,
    Putting this thread back on the list to leverage collective intelligence.

    If you get the full command line of the java processes, it wouldn't be
    difficult to correlate reduce task(s) with a particular job.

    Cheers
    On Sat, Apr 17, 2010 at 2:20 PM, Raghava Mutharaju wrote:

    Hello Ted,

    Thank you for the suggestions :). I haven't come across any other serious issue before this one. In fact, the same MR job runs for a smaller input size, although a lot slower than we expected.

    I will use jstack to get a stack trace. I had a question in this regard: how would I know which MR job (job id) is related to which Java process (pid)? I can get a list of Hadoop jobs with "hadoop job -list" and a list of Java processes with "jps", but I couldn't determine how to get the connection between these two lists.


    Thank you again.

    Regards,
    Raghava.
    On Fri, Apr 16, 2010 at 11:07 PM, Ted Yu wrote:

    If you look at https://issues.apache.org/jira/secure/ManageAttachments.jspa?id=12408776, you can see that hdfs-127-branch20-redone-v2.txt (https://issues.apache.org/jira/secure/attachment/12431012/hdfs-127-branch20-redone-v2.txt) was the latest.

    You need to download the source code corresponding to your version of Hadoop, apply the patch and rebuild.

    If you haven't experienced serious issues with Hadoop in other scenarios, we should try to find out the root cause of the current problem without the 127 patch.

    My advice is to use jstack to find what each thread was waiting for after the reducers get stuck. I would expect a deadlock in either your code or HDFS; I would think it should be the former.

    You can replace sensitive names in the stack traces and paste them if you cannot determine the deadlock.

    Cheers


    On Fri, Apr 16, 2010 at 5:46 PM, Raghava Mutharaju <
    m.vijayaraghava@gmail.com> wrote:
    Hello Ted,

    Thank you for the reply. Will this change fix my issue? I asked
    this because I again need to convince my admin to make this change.

    We have a gateway to the cluster-head. We generally run our MR jobs
    on the gateway. Should this change be made to the hadoop installation on the
    gateway?

    1) I am confused about which patch should be applied. There are 4 patches listed at https://issues.apache.org/jira/browse/HDFS-127

    2) How do we apply the patch? Should we change the lines of code specified and rebuild Hadoop? Or is there any other way?

    Thank you again.

    Regards,
    Raghava.

    On Fri, Apr 16, 2010 at 6:42 PM, wrote:

    That patch is very important.

    Please apply it.

    Sent from my Verizon Wireless BlackBerry
    ------------------------------
    From: Raghava Mutharaju <m.vijayaraghava@gmail.com>
    Date: Fri, 16 Apr 2010 17:27:11 -0400
    To: Ted Yu <yuzhihong@gmail.com>
    Subject: Re: Reduce gets struck at 99%

    Hi Ted,

    It took some time to contact my department's admin (he was on leave) and ask him to make the ulimit changes effective in the cluster (just adding an entry in /etc/security/limits.conf was not sufficient, so it took some time to figure out). Now the ulimit is 32768. I ran the set of MR jobs, and the result is the same --- it gets stuck at Reduce 99%. But this time, there are no exceptions in the logs. I view the JobTracker logs through the Web UI. I checked "Running Jobs" as well as "Failed Jobs".

    I haven't asked the admin to apply the patch
    https://issues.apache.org/jira/browse/HDFS-127 that you mentioned
    earlier. Is this important?

    Do you have any suggestions?

    Thank you.

    Regards,
    Raghava.
    On Fri, Apr 9, 2010 at 3:35 PM, Ted Yu wrote:

    For the user under whom you launch MR jobs.


    On Fri, Apr 9, 2010 at 12:02 PM, Raghava Mutharaju <
    m.vijayaraghava@gmail.com> wrote:
    Hi Ted,

    Sorry to bug you again :) but I do not have an account on all the datanodes; I just have one on the machine from which I start the MR jobs. So is it required to increase the ulimit on all the nodes (in which case the admin may have to increase it for all the users)?


    Regards,
    Raghava.
    On Fri, Apr 9, 2010 at 11:43 AM, Ted Yu wrote:

    ulimit should be increased on all nodes.

    The link I gave you lists several actions to take. I think they're
    not specifically for hbase.
    Also make sure the following is applied:
    https://issues.apache.org/jira/browse/HDFS-127


    On Thu, Apr 8, 2010 at 10:13 PM, Raghava Mutharaju <
    m.vijayaraghava@gmail.com> wrote:
    Hello Ted,

    Should the increase in ulimit to 32768 be applied on all the datanodes (it's a 16-node cluster)? Is this related to HBase? I ask because I am not using HBase.
    Are the exceptions & delay (at Reduce 99%) due to this?

    Regards,
    Raghava.

    On Fri, Apr 9, 2010 at 1:01 AM, Ted Yu wrote:

    Your ulimit is low.
    Ask your admin to increase it to 32768

    See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6


    On Thu, Apr 8, 2010 at 9:46 PM, Raghava Mutharaju <
    m.vijayaraghava@gmail.com> wrote:
    Hi Ted,

    I am pasting below the timestamps from the log.

    Lease-exception:

        Task attempt:     attempt_201004060646_0057_r_000014_0
        Machine:          /default-rack/nimbus15
        Status:           FAILED (progress 0.00%)
        Start time:       8-Apr-2010 07:38:53
        Shuffle finished: 8-Apr-2010 07:39:21 (27sec)
        Sort finished:    8-Apr-2010 07:39:21 (0sec)
        Finish time:      8-Apr-2010 09:54:33 (2hrs, 15mins, 39sec)

    -------------------------------------

    DFS Client Exception:

        Task attempt:     attempt_201004060646_0057_r_000006_0
        Machine:          /default-rack/nimbus3.cs.wright.edu
        Status:           FAILED (progress 0.00%)
        Start time:       8-Apr-2010 07:38:47
        Shuffle finished: 8-Apr-2010 07:39:10 (23sec)
        Sort finished:    8-Apr-2010 07:39:11 (0sec)
        Finish time:      8-Apr-2010 08:51:33 (1hrs, 12mins, 46sec)

    The file limit is set to 1024. I checked a couple of datanodes; I haven't checked the headnode though.

    The number of currently open files under my username, on the system on which I started the MR jobs, is 346.


    Thank you for your help :)

    Regards,
    Raghava.

    On Fri, Apr 9, 2010 at 12:14 AM, Ted Yu wrote:

    Can you give me the timestamps of the two exceptions?
    I want to see if they're related.

    I saw DFSClient$DFSOutputStream.close() in the first stack trace.

    On Thu, Apr 8, 2010 at 9:09 PM, Ted Yu wrote:

    Just to double check that it's not a file limits issue, could you run the following on each of the hosts:

    $ ulimit -a
    $ lsof | wc -l

    The first command will show you (among other things) the file limits; it should be above the default 1024. The second will tell you how many files are currently open...


  • Raghava Mutharaju at Apr 18, 2010 at 8:25 am
    Hi,

    Thank you, Ted. I will just describe the problem again, so that it is easier for anyone reading this email chain.

    I run a series of jobs one after another. Starting from the 4th job, the reducer gets stuck at 99% (Map 100% and Reduce 99%). It stays stuck at 99% for many hours and then the job fails. Earlier there were 2 exceptions in the logs --- a DFSClient exception (could not completely write into a file <file name>) and a LeaseExpiredException. Then I increased ulimit -n (the maximum number of open files) from 1024 to 32768 on the advice of Ted. After this, there are no exceptions in the logs but the reduce still gets stuck at 99%.

    Do you have any suggestions?

    Thank you.

    Regards,
    Raghava.
