Alan Miller at Jul 7, 2010 at 7:07 am
Yes, the effect of "scaling down" was the first thing I wanted to look at.
To process X GB it currently takes Y seconds with Z nodes.
If I process X GB with Z/2 nodes, does it take 2*Y seconds?
How about Z-1, Z-2, Z-3, ... nodes?
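Roughly, this is the comparison I have in mind; the sketch below (made-up
numbers, not measurements) just spells out the ideal time under perfect
linear scaling and the per-node efficiency I want to track:

// A rough sketch for comparing measured runtimes against ideal linear
// scaling when shrinking the cluster from Z nodes to N nodes.
public class ScalingCheck {

    // Ideal runtime if the work parallelizes perfectly:
    // fewer nodes -> proportionally longer.
    static double idealSeconds(double baselineSeconds, int baselineNodes, int nodes) {
        return baselineSeconds * baselineNodes / nodes;
    }

    // Parallel efficiency: 1.0 means the smaller cluster lost no
    // per-node throughput.
    static double efficiency(double baselineSeconds, int baselineNodes,
                             double measuredSeconds, int nodes) {
        return idealSeconds(baselineSeconds, baselineNodes, nodes) / measuredSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: Z = 10 nodes take Y = 600 s;
        // Z/2 = 5 nodes measured at 1350 s.
        System.out.printf("ideal: %.0f s, efficiency: %.2f%n",
                idealSeconds(600, 10, 5), efficiency(600, 10, 1350, 5));
    }
}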
Right now my MR job processes a lot of small files (2000 files, ~2.5 MB each)
individually, so the next test would involve changing my MR job to combine
the small files into bigger pieces (closer to the HDFS block size) and see
if that is more effective.
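For the combining step, the driver change I have in mind looks roughly like
the sketch below. It assumes a Hadoop release that ships CombineTextInputFormat
(older releases need a CombineFileInputFormat subclass or a separate merge
step), and PeriodMapper/AverageReducer are just placeholder names for the
classes sketched further down:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AveragesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "column averages");
        job.setJarByClass(AveragesDriver.class);

        // Pack the ~2.5 MB inputs into combined splits capped near one
        // HDFS block (64 MB here).
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        job.setMapperClass(PeriodMapper.class);      // placeholder names
        job.setReducerClass(AverageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}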
Each line of my small files has a timestamp column plus 55 columns of
numerical data, and my reducer needs to calculate the column averages for
certain time periods (last day, last hour, etc.) based on the timestamp.
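The mapper/reducer pair (again just a sketch; it assumes comma-separated
lines whose first field is an epoch-seconds timestamp followed by the 55
numeric columns, so the parsing and the bucketing would need adjusting to
the real format) could look something like:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PeriodMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int COLUMNS = 55;
    private final Text bucket = new Text();
    private final Text columns = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length != COLUMNS + 1) return;        // skip malformed lines
        long ts = Long.parseLong(fields[0].trim());
        bucket.set(Long.toString(ts / 3600));            // hourly bucket; ts / 86400 for daily
        columns.set(line.toString().substring(fields[0].length() + 1));
        ctx.write(bucket, columns);
    }
}

class AverageReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text bucket, Iterable<Text> rows, Context ctx)
            throws IOException, InterruptedException {
        double[] sums = new double[55];
        long count = 0;
        for (Text row : rows) {
            String[] vals = row.toString().split(",");
            for (int i = 0; i < sums.length && i < vals.length; i++) {
                sums[i] += Double.parseDouble(vals[i].trim());
            }
            count++;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < sums.length; i++) {
            if (i > 0) out.append(',');
            out.append(sums[i] / count);
        }
        ctx.write(bucket, new Text(out.toString()));
    }
}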
Alan
On 07/06/2010 08:06 PM, Allen Wittenauer wrote:
On Jul 6, 2010, at 8:35 AM, Michael Segel wrote:
I'm also not sure how dropping a node will test the scalability. You would be testing resilience.
He's testing scale down, not scale up (which is the way we normally think of things... I was confused by the wording too).
In other words, "if I drop a node, how much of a performance hit is my job going to take?"
Also, for this type of timing/testing, I'd probably make sure speculative execution is off. It will likely throw some curve balls into the time.
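(Side note on the speculative-execution suggestion: the switch looks roughly
like the lines below. The property names are the 0.20-era ones, so check them
against the release in use; they can also be passed on the command line with
-D if the driver goes through ToolRunner.)

import org.apache.hadoop.conf.Configuration;

public class NoSpeculation {
    // Call on the job's Configuration before submitting the timing runs.
    public static Configuration disableSpeculation(Configuration conf) {
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return conf;
    }
}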