This is pretty predictable.
Determine the average time it takes to process an m/r task.
If you can process 100 m/r tasks simultaneously, and that capacity is cut to 50 simultaneous tasks, your job will take roughly twice as long to run.
Granted, this only gives you a rough estimate of how long your job will take to run.
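The back-of-envelope arithmetic above can be sketched like this (the task count and average task time are hypothetical numbers, and it assumes tasks are roughly uniform and all slots stay busy):

```python
import math

def estimate_job_time(num_tasks, avg_task_secs, slots):
    """Rough job runtime: tasks run in waves of `slots` at a time."""
    waves = math.ceil(num_tasks / slots)
    return waves * avg_task_secs

# Hypothetical: 2000 tasks at ~30 s each
full = estimate_job_time(2000, 30, slots=100)  # 100 simultaneous tasks
half = estimate_job_time(2000, 30, slots=50)   # capacity halved

print(full, half)  # halving the slots roughly doubles the runtime
```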
Date: Wed, 7 Jul 2010 09:00:47 +0200
Subject: Re: decomission a node
Yes the effect of "scaling down" was the first thing I wanted to look at.
To process X GB it currently takes Y seconds with Z nodes.
If I process X GB with Z/2 nodes, does it take 2*Y seconds?
How about Z-1,Z-2,Z-3,.... nodes?
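Under ideal linear scaling the expectation would be T(n) = Y * Z / n, i.e. the same total work spread over fewer nodes. A quick sketch with hypothetical numbers (real jobs will deviate from this because of skew, scheduling overhead, and data locality):

```python
def ideal_runtime(y_secs, z_nodes, n_nodes):
    """Ideal linear scaling: total work of Y*Z node-seconds spread over n nodes."""
    return y_secs * z_nodes / n_nodes

# Hypothetical baseline: 600 s on 10 nodes
Y, Z = 600.0, 10
for n in [Z, Z - 1, Z - 2, Z // 2]:
    print(n, ideal_runtime(Y, Z, n))
```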
Right now my MR job processes a lot of small files (2000 files, ~2.5 MB each)
individually, so the next test would involve changing my MR job to combine
the small files into bigger pieces (closer to the HDFS block size) and see
if that is more effective.
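One simple way to pre-combine the files before loading them into HDFS is plain concatenation into chunks near the block size. A sketch (the 64 MB target and file naming are assumptions; doing it inside the job via Hadoop's CombineFileInputFormat would be an alternative):

```python
import os

def combine_files(paths, out_dir, target_bytes=64 * 1024 * 1024):
    """Concatenate small files into chunks close to the HDFS block size."""
    os.makedirs(out_dir, exist_ok=True)
    chunk, size, idx = [], 0, 0
    for p in paths:
        chunk.append(p)
        size += os.path.getsize(p)
        if size >= target_bytes:
            _write_chunk(chunk, out_dir, idx)
            chunk, size, idx = [], 0, idx + 1
    if chunk:  # flush the leftover partial chunk
        _write_chunk(chunk, out_dir, idx)

def _write_chunk(paths, out_dir, idx):
    with open(os.path.join(out_dir, "combined-%04d" % idx), "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                out.write(f.read())
```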
Each line of my small files has a timestamp column and 55 columns with
numerical data and my reducer needs to calc the column averages for
certain time periods (last day, last hour,etc.) based on the timestamp.
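The reducer's averaging step can be sketched as follows, assuming the map side has already bucketed each line under a window key ("last_day", "last_hour", etc.) derived from the timestamp column; the exact keying and column layout here are assumptions:

```python
from collections import defaultdict

def reduce_averages(rows):
    """rows: iterable of (window_key, values), where values is the list of
    numeric columns from one input line.  Returns {window: column_averages}."""
    sums = defaultdict(list)
    counts = defaultdict(int)
    for window, values in rows:
        if not sums[window]:
            sums[window] = [0.0] * len(values)
        counts[window] += 1
        for i, v in enumerate(values):
            sums[window][i] += v
    return {w: [s / counts[w] for s in sums[w]] for w in sums}
```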
On 07/06/2010 08:06 PM, Allen Wittenauer wrote:
On Jul 6, 2010, at 8:35 AM, Michael Segel wrote:
I'm also not sure how dropping a node will test the scalability. You would be testing resilience.
He's testing scale down, not scale up (which is the way we normally think of things... I was confused by the wording too).
In other words, "if I drop a node, how much of a performance hit is my job going to take?"
Also, for this type of timing/testing, I'd probably make sure speculative execution is off. It will likely throw some curve balls into the time.