I am reporting on the performance of a Hadoop task on a cluster with about 50
nodes. I would like to be able to report performance on clusters of 5, 10, and
20 nodes without changing the current cluster. Is there a way to limit the
number of nodes used by a job, and if so, how?

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


  • Alex Kozlov at Dec 16, 2011 at 12:09 am
    Hi Steve, there is no simple way to just limit the number of nodes, since it
    would involve moving the data: you want to have the 3 replicas on the 5, 10,
    or 20 nodes, correct?

    You could potentially just stop the TaskTrackers (TTs) on the extra nodes,
    but your jobs will likely have to fetch data from remote nodes and will run
    slower than they would on a real cluster of that size. Shutting down the
    DataNodes (DNs) would cause unnecessary replication and redistribution of
    data (unless your data are small and you can afford to reload them or
    reformat HDFS each time).
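
    As a rough sketch (assuming a Hadoop 0.20/1.x tarball install where
    hadoop-daemon.sh is available on every node), taking a node out of the
    MapReduce layer while leaving HDFS alone might look like this, run on each
    of the extra nodes:

      # stop the TaskTracker so no map/reduce tasks get scheduled here
      $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker

      # ... run the benchmark job from the client/JobTracker side ...

      # bring the node back into the compute pool afterwards
      $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

    The DataNodes stay up the whole time, so the blocks stay where they are and
    no re-replication is triggered; the cost is the remote reads described
    above.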

    Moving computation to the data is a big part of MapReduce, and restricting
    the job to a subset of nodes is likely to skew the results.
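
    One way to quantify that skew (assuming a Hadoop 1.x / MR1 setup; the
    counter group and names below come from the old mapred API and may differ
    in other versions) is to compare the data-local and rack-local map counts
    reported for the finished job:

      # how many map tasks read their split from a local disk vs. the same rack
      hadoop job -counter <job-id> 'org.apache.hadoop.mapred.JobInProgress$Counter' DATA_LOCAL_MAPS
      hadoop job -counter <job-id> 'org.apache.hadoop.mapred.JobInProgress$Counter' RACK_LOCAL_MAPS

    Maps that are neither data-local nor rack-local pulled their input over the
    network, which is the extra remote traffic that skews the timings.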

    --
    Dr. Alex Kozlov
