FAQ
Hi,

I continuously run a series of batch jobs using Hadoop MapReduce. I also
have a managing daemon that moves data around on HDFS, making way for
more jobs to be run.
I use the Capacity Scheduler to schedule many jobs in parallel.

I see an issue on the Hadoop web monitoring UI at port 50030 which I believe
may be causing a performance bottleneck and wanted to get more information.

Approximately 10% of the reduce tasks show up as "Killed" in the UI. The
logs say that the killed tasks are in the shuffle phase when they are killed,
but the logs don't show any exception.
My understanding is that these killed tasks would be started again, and this
slows down the whole Hadoop job.
I was wondering what the possible causes might be and how to debug this issue.

I have tried both Hadoop 0.20.2 and the latest version of Hadoop from
Yahoo's GitHub.
I've monitored the nodes, and there is plenty of free disk space and memory on
all nodes (more than 1 TB of free disk and 5 GB of free memory at all times on
all nodes).

Since there are no exceptions or other visible issues, I am finding it
hard to figure out what the problem might be. Could anybody help?

Thanks,
-aniket


  • Cliff palmer at Sep 23, 2010 at 11:45 am
    Aniket, I wonder if these tasks were run as Speculative Execution. Have you
    been able to determine whether the job runs successfully?
    HTH
    Cliff
  • Aniket ray at Sep 24, 2010 at 4:13 am
    Hi Cliff,

    Thanks, it did turn out to be speculative execution. When I turned it off, no
    more tasks were killed, but the performance degraded.

    So my initial assumptions were incorrect after all. I guess I'll have to
    look at other ways to improve performance.

    Thanks for the help.
    -aniket
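    For readers hitting the same symptom: the "Killed" attempts Aniket saw are the
    losing copies of speculatively executed tasks, and the feature can be toggled
    per job or cluster-wide. A minimal sketch of the 0.20-era settings (the
    property names below are the old `mapred.*` keys; later Hadoop releases
    renamed them to `mapreduce.map.speculative` and `mapreduce.reduce.speculative`):

    ```xml
    <!-- mapred-site.xml: disable speculative execution cluster-wide.
         Sketch only; as this thread shows, leaving it enabled was faster here. -->
    <property>
      <name>mapred.map.tasks.speculative.execution</name>
      <value>false</value>
    </property>
    <property>
      <name>mapred.reduce.tasks.speculative.execution</name>
      <value>false</value>
    </property>
    ```

    The same flags can be set per job from the old API via
    `JobConf.setMapSpeculativeExecution(false)` and
    `JobConf.setReduceSpeculativeExecution(false)`, or on the command line with
    `-Dmapred.reduce.tasks.speculative.execution=false`.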
  • Cliff palmer at Sep 24, 2010 at 3:28 pm
    I'm glad it helped, Aniket. I would recommend that you start working on
    performance improvement with your network infrastructure and the balance of
    data across your logical racks.
    Cliff
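    Cliff's suggestion about rack balance maps to two standard knobs in the
    0.20-era tooling. A hedged sketch (the commands and property names are the
    old keys, and the script path shown is hypothetical):

    ```shell
    # 1) Rebalance HDFS block distribution across DataNodes; -threshold is the
    #    allowed deviation (in percent) from average datanode utilization.
    hadoop balancer -threshold 10

    # 2) Make HDFS rack-aware by pointing it at a topology script that maps an
    #    IP or hostname to a rack id (e.g. /rack1). Set in core-site.xml
    #    (0.20-era key; later renamed net.topology.script.file.name):
    #      topology.script.file.name = /etc/hadoop/topology.sh
    ```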

Discussion Overview
group: common-user @
categories: hadoop
posted: Sep 23, 2010 at 4:53 AM
active: Sep 24, 2010 at 3:28 PM
posts: 4
users: 2
website: hadoop.apache.org...
irc: #hadoop

2 users in discussion: Aniket ray (2 posts), Cliff palmer (2 posts)
