You could consider two scenarios / sets of requirements for your estimator:
1. Allow it to 'learn' from certain input data and then project running
times of similar (or moderately dissimilar) workloads. So the first step
could be to define a couple of relatively small "control" M/R jobs on a
small-ish dataset and throw them at the unknown (cluster-under-test) HDFS /
M/R cluster. Try to design the "control" M/R jobs so that they
completely load down all of the available DataNodes in the
cluster-under-test for at least a brief period of time. Then you will
have obtained a decent signal on the capabilities of the cluster under test,
which may allow a relatively high degree of predictive accuracy for even much
2. If instead it were your goal to drive the predictions off of a purely
mathematical model - in your terms the "application" and "base file system"
- and without any empirical data - then here is an alternative approach.
- Follow step (1) above against a variety of "applications" and "base
file systems" - especially in configurations for which you wish
to provide high quality predictions.
- Save the results as structured data
- Derive formulas for characterizing the curves of performance via
those variables that you defined (application / base file system)
Now you have a trained model. When it is applied to a new set of
applications / base file systems it can use the curves you have already
determined to provide the result without any runtime requirements.
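
To make the "derive formulas" step concrete, here is a minimal sketch in Python. It assumes the simplest possible curve, a straight line in input size per node, and fits it by ordinary least squares. The Run record, the choice of input_gb / nodes as the single variable, and the sample numbers are all made up for illustration; real measurements may well need more variables and a different curve shape.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One saved benchmark result (hypothetical record layout)."""
    input_gb: float   # size of the job's input data
    nodes: int        # number of DataNodes in the cluster under test
    seconds: float    # measured wall-clock running time


def fit_line(runs):
    """Least-squares fit of seconds = a + b * (input_gb / nodes)."""
    xs = [r.input_gb / r.nodes for r in runs]
    ys = [r.seconds for r in runs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct input_gb/nodes values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # seconds per (GB per node)
    a = mean_y - b * mean_x  # fixed overhead in seconds
    return a, b


def predict(model, input_gb, nodes):
    """Apply the fitted curve; no cluster access is needed at this point."""
    a, b = model
    return a + b * (input_gb / nodes)


# Made-up measurements, for illustration only.
runs = [
    Run(10, 5, 130.0),
    Run(20, 5, 230.0),
    Run(40, 10, 230.0),
    Run(80, 10, 430.0),
]
model = fit_line(runs)
estimate = predict(model, input_gb=100, nodes=10)
```

Once the model is fitted, predict() only needs the description of the new job and cluster, which is the "no runtime requirements" property described above.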
Obviously the value of this second approach is limited by the degree of
similarity of the training data to the applications you attempt to model.
If all of your training data comes from a 50 node cluster of machines with
IDE drives, don't expect good results when asked to model a 1000 node cluster
using SANs / RAID arrays / SCSI drives.
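
One cheap safeguard against this, sketched below in Python, is to flag any prediction request whose variables fall outside the range covered by the training runs. The function name and the dictionary keys (nodes, input_gb) are hypothetical; use whatever variables you actually recorded. Categorical differences such as drive type would need an exact-match check instead of a range check.

```python
def outside_training_range(training, query):
    """Return the names of query variables that fall outside the
    minimum..maximum range observed in the training runs."""
    flagged = []
    for name, value in query.items():
        seen = [row[name] for row in training]
        if not (min(seen) <= value <= max(seen)):
            flagged.append(name)
    return flagged


# Made-up training runs and query, for illustration only.
training = [
    {"nodes": 20, "input_gb": 50},
    {"nodes": 50, "input_gb": 200},
]
query = {"nodes": 1000, "input_gb": 100}
flagged = outside_training_range(training, query)
```

If the returned list is non-empty, the estimate is an extrapolation and should be treated with suspicion.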
2011/4/16 Sonal Goyal <email@example.com>
What is your MR job doing? What is the amount of data it is processing?
What kind of a cluster do you have? Would you be able to share some details
on what you are trying to do?
If you are looking for metrics, you could look at the Terasort run ..
Thanks and Regards,
On Sat, Apr 16, 2011 at 3:31 PM, real great..
As a part of my final year BE project I want to estimate the time
required by a M/R job given an application and a base file system.
Can you folks please help me by posting some thoughts on this issue or
posting some links here.