FAQ
Is there any way to run a particular job in Hadoop on only a subset of
datanodes?

My problem is that I don't want to use all the nodes to run some job;
I am trying to make a "job completion time vs. number of nodes" graph for a
particular job.
One way to do this is to remove datanodes and then see how much time the job
takes.

Just for curiosity's sake, I want to know whether there is any other way to do
this without removing datanodes.
I am afraid that if I remove datanodes, I may lose some data blocks that reside
on those machines, as I have some files with replication = 1.

Thanks,
Praveenesh


  • Harsh J at Sep 21, 2011 at 1:10 pm
    Praveenesh,

    TaskTrackers run your jobs' tasks for you, not DataNodes directly. So
    you can statically control the load on nodes by removing
    TaskTrackers from your cluster.

    I.e., if you run "service hadoop-0.20-tasktracker stop" or
    "hadoop-daemon.sh stop tasktracker" on the specific nodes, jobs won't
    run there anymore.

    Is this what you're looking for?

    (There are ways to achieve the exclusion dynamically, by writing a
    scheduler, but it is hard to say without knowing what you need
    specifically, and why you require it.)
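    The stop command can be scripted across a set of nodes. A minimal sketch, assuming passwordless SSH to the workers, the Apache tarball layout with `hadoop-daemon.sh` on each node's PATH, and hypothetical hostnames; the `service hadoop-0.20-tasktracker stop` form is the packaged-install equivalent:

    ```shell
    # Take these (hypothetical) workers out of the compute pool only;
    # their DataNodes keep running, so no HDFS blocks are lost.
    for node in worker05 worker06 worker07; do
      ssh "$node" 'hadoop-daemon.sh stop tasktracker'
    done
    ```

    When the measurement is done, the same loop with `start tasktracker` brings the nodes back.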


    --
    Harsh J
  • Praveenesh kumar at Sep 21, 2011 at 1:23 pm
    Oh wow, I didn't know that.
    Actually, for me the datanodes/tasktrackers are running on the same machines.
    I mentioned datanodes because if I delete those machines from the slaves list,
    chances are the data will also be lost.
    So I don't want to do that.
    But now I guess that by stopping tasktrackers individually, I can decrease the
    strength of my cluster by decreasing the number of nodes that run a
    tasktracker, right? This way I won't lose my data either, right?


  • Harsh J at Sep 21, 2011 at 1:28 pm
    Praveenesh,

    Absolutely right. Just stop them individually :)
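    As an alternative to logging in to each node, 0.20-era MapReduce also supports an exclude file, so the JobTracker itself can be told to stop scheduling on listed hosts. A sketch, assuming `mapred.hosts.exclude` in mapred-site.xml already points at a (hypothetical) `/etc/hadoop/conf/mapred.excludes` file:

    ```shell
    # Exclude two hypothetical workers from MapReduce scheduling only;
    # HDFS, and therefore the blocks on those nodes, is untouched.
    echo worker05 >> /etc/hadoop/conf/mapred.excludes
    echo worker06 >> /etc/hadoop/conf/mapred.excludes

    # Tell the JobTracker to re-read the exclude file.
    hadoop mradmin -refreshNodes
    ```

    Emptying the file and refreshing again restores the nodes, which makes sweeping over cluster sizes easy to script.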


    --
    Harsh J
  • Robert Evans at Sep 21, 2011 at 3:02 pm
    Praveen,

    If you are doing performance measurements, be aware that having more datanodes than tasktrackers will also affect performance (though it is hard to say exactly how); it will not perform the same as a cluster that simply has fewer nodes overall. Also, if you shut off datanodes as well as tasktrackers, you will need to give the cluster a while for re-replication to finish before you collect your performance numbers.

    --Bobby Evans
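    One way to check that re-replication has settled before taking timings is HDFS's fsck report; a sketch (the exact output wording varies slightly between versions):

    ```shell
    # fsck summarizes block health for the whole namespace; wait until the
    # under-replicated count drops to the expected level before benchmarking.
    hadoop fsck / | grep -i 'under-replicated'
    ```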



Discussion Overview
group: common-user
categories: hadoop
posted: Sep 21, '11 at 1:02p
active: Sep 21, '11 at 3:02p
posts: 5
users: 3
website: hadoop.apache.org...
irc: #hadoop
