Hi,
As part of my final-year BE project, I want to estimate the time
required by an M/R job, given an application and a base file system.
Could you folks please help me by posting some thoughts on this issue
or some relevant links here?

--
Regards,
R.V.


  • Sonal Goyal at Apr 16, 2011 at 1:39 pm
    What is your MR job doing? What is the amount of data it is processing? What
    kind of a cluster do you have? Would you be able to share some details about
    what you are trying to do?

    If you are looking for metrics, you could look at the TeraSort benchmark runs.

    Thanks and Regards,
    Sonal
    Hadoop ETL and Data Integration: https://github.com/sonalgoyal/hiho
    Nube Technologies <http://www.nubetech.co>

    <http://in.linkedin.com/in/sonalgoyal>
  • Stephen Boesch at Apr 16, 2011 at 8:09 pm
    You could consider two scenarios / sets of requirements for your estimator:


    1. Allow it to 'learn' from certain input data and then project running
    times of similar (or moderately dissimilar) workloads. So the first step
    could be to define a couple of relatively small "control" M/R jobs on a
    small-ish dataset and throw them at the unknown (cluster-under-test)
    HDFS / M/R cluster. Try to design the "control" M/R jobs so that they
    completely load down all of the available DataNodes in the
    cluster-under-test for at least a brief period of time. You will then
    have a decent signal on the capabilities of the cluster under test,
    which may allow a relatively high degree of predictive accuracy even for
    much larger jobs.
    2. If instead your goal is to drive the predictions off a purely
    mathematical model - in your terms, the "application" and "base file
    system" - without any empirical data, then here is an alternative
    approach (sketched in code below):
    - Follow step (1) above against a variety of "applications" and "base
    file systems" - especially the configurations for which you wish your
    model to provide high-quality predictions.
    - Save the results as structured data.
    - Derive formulas characterizing the performance curves in terms of
    the variables you defined (application / base file system).

    Now you have a trained model. When it is applied to a new set of
    applications / base file systems, it can use the curves you have already
    determined to provide an estimate without running anything on the new
    system.

    Obviously the value of this second approach is limited by how similar
    the training data is to the applications you attempt to model. If all of
    your training data comes from a 50-node cluster with IDE drives, don't
    expect good results when asked to model a 1000-node cluster using SANs,
    RAID arrays, or SCSI drives.
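
    A minimal sketch of what that second approach could look like once the
    control-run results have been collected. The numbers, the field layout, and
    the choice of GB-per-DataNode as the model variable are purely illustrative
    assumptions, not measurements or an established method:

        # (input_gb, data_nodes, runtime_s) from the "control" M/R runs -- made-up numbers.
        control_runs = [
            (10,  5,  320.0),
            (20,  5,  610.0),
            (40, 10,  640.0),
            (80, 10, 1210.0),
        ]

        def fit_linear(points):
            """Least-squares fit of y = a + b*x for a list of (x, y) pairs."""
            n = len(points)
            sx = sum(x for x, _ in points)
            sy = sum(y for _, y in points)
            sxx = sum(x * x for x, _ in points)
            sxy = sum(x * y for x, y in points)
            b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
            a = (sy - b * sx) / n
            return a, b

        # Model runtime against per-node data volume (GB per DataNode).
        a, b = fit_linear([(gb / float(nodes), t) for gb, nodes, t in control_runs])

        # Extrapolate to a much larger job: 500 GB on a 25-node cluster.
        print("predicted runtime: %.0f s" % (a + b * 500.0 / 25))

    With more variables (disk type, network, application class) you would fit a
    multivariate model instead of this single hand-rolled line, but the shape of
    the pipeline stays the same: run controls, save structured results, fit,
    predict.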


  • Stephen Boesch at Apr 16, 2011 at 8:19 pm
    Some additional thoughts about the 'variables' involved in
    characterizing the M/R application itself (a small feature-vector sketch
    follows below):


    - the configuration of the cluster for the number of mappers vs. reducers,
    compared to the characteristics (amount of work/processing) required in
    each of the map/shuffle/reduce stages


    - is the application using multiple chained M/R stages? Multi-stage
    M/R jobs are more difficult to tune properly in terms of keeping all
    workers busy. That may be challenging to model.
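
    Here is a toy illustration of how those variables could be flattened into
    inputs for an estimator; the function, its arguments, and the example values
    are hypothetical, not an established schema:

        def job_features(num_mappers, num_reducers, map_work, shuffle_bytes,
                         reduce_work, chained_stages):
            """Flatten application/cluster characteristics into numeric features."""
            return [
                num_mappers,
                num_reducers,
                float(num_mappers) / max(num_reducers, 1),  # map/reduce slot ratio
                map_work,        # e.g. CPU-seconds of map work per input split
                shuffle_bytes,   # data volume moved between map and reduce
                reduce_work,     # e.g. CPU-seconds of reduce work per partition
                chained_stages,  # multi-stage jobs are harder to keep fully busy
            ]

        # Example: a two-stage job with 200 map tasks and 20 reduce tasks.
        print(job_features(200, 20, 1.5, 5 * 10**9, 3.0, 2))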

  • Ted Dunning at Apr 16, 2011 at 10:04 pm
    Sounds like this paper might help you:

    "Predicting Multiple Performance Metrics for Queries: Better Decisions
    Enabled by Machine Learning" by Archana Ganapathi, Harumi Kuno,
    Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, and David
    Patterson

    http://radlab.cs.berkeley.edu/publication/187
  • Real great.. at Apr 17, 2011 at 2:00 pm
    Thanks a lot, guys. Will go through it all.

    --
    Regards,
    R.V.
  • Matthew Foley at Apr 17, 2011 at 7:08 pm
    Since general M/R jobs vary over a huge (Turing-problem-equivalent!) range of behaviors, a more tractable problem might be to characterize the descriptive parameters needed to answer the question: "If the following problem P runs in time T0 on a certain benchmark platform B0, how long (T1) will it take to run on a differently configured real-world platform B1?"

    Or are you only dealing with one particular M/R job? If so, the above is a good way to look at it: first identify the controlling parameters, then analyze how they co-vary with execution time. Now you've reduced it to a question that can be answered by a series of "make hypothesis" / "do experiment" steps. :-) Pick a parameter you think is a likely candidate, and make a series of measurements of execution time for different values of that parameter. Repeat until you've fully characterized the problem space.
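
    A rough sketch of that measure-and-fit loop. run_job() is only a placeholder
    for however you actually launch and time the benchmark job; the jar name,
    the paths, and the mapred.reduce.tasks hypothesis are made-up examples:

        import subprocess
        import time

        def run_job(param_name, value):
            """Placeholder: run the benchmark M/R job with one parameter
            overridden and return the wall-clock time in seconds."""
            start = time.time()
            subprocess.check_call([
                "hadoop", "jar", "benchmark.jar",
                "-D", "%s=%s" % (param_name, value),
                "/bench/input", "/bench/output-%s" % value,
            ])
            return time.time() - start

        def sweep(param_name, values):
            """Measure execution time for each candidate value of one parameter."""
            return [(v, run_job(param_name, v)) for v in values]

        # Hypothesis: the reduce-task count controls runtime for this job.
        for value, seconds in sweep("mapred.reduce.tasks", [4, 8, 16, 32]):
            print(value, seconds)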

    Good luck,
    --Matt

  • Lance Norskog at Apr 17, 2011 at 11:57 pm
    The ROC Convex Hull (ROCCH) is an analysis technique for optimizing
    parameters for given outputs.

    For example, if a classification technique has tuning knobs, ROCCH
    will find the settings that give a desired failure rate.
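
    A toy sketch of the idea, with made-up operating points: each knob setting
    gives a (false-positive rate, true-positive rate) point, and only the points
    on the upper convex hull are worth operating at - the rest are dominated:

        def cross(o, a, b):
            """Cross product of vectors o->a and o->b (sign gives turn direction)."""
            return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

        def roc_convex_hull(points):
            """Upper convex hull of ROC points, anchored at (0,0) and (1,1)."""
            pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
            hull = []
            for p in pts:
                # Pop points that make a left (or straight) turn: they lie on or
                # below the hull and are dominated by some mixture of other settings.
                while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
                    hull.pop()
                hull.append(p)
            return hull

        # (FPR, TPR) for a few tuning-knob settings -- invented numbers.
        operating_points = [(0.05, 0.45), (0.10, 0.70), (0.20, 0.75), (0.30, 0.90)]
        print(roc_convex_hull(operating_points))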


    --
    Lance Norskog
    goksron@gmail.com
  • Ted Dunning at Apr 18, 2011 at 12:08 am
    Turing completeness isn't the central question here, really. The truth
    is, map-reduce programs are under considerable pressure to be written in
    a scalable fashion, which limits them to fairly simple behaviors that
    result in pretty linear dependence of run-time on input size for a
    given program.

    The cool thing about the paper I linked to the other day is that
    there are enough cues about the expected runtime of the program
    available to make good predictions *without* looking at the details.
    No doubt the estimation facility could make good use of something as
    simple as the hash of the jar in question, but even without that it is
    possible to produce good estimates.

    I suppose this means that all of us Hadoop programmers are really
    just kind of boring folk. On average, anyway.
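
    A quick way to sanity-check that "pretty linear" claim for one particular
    job: fit its past runs against input size and look at the fit quality. The
    run history below is invented for illustration:

        # (input_gb, runtime_s) from past runs of the same job -- invented numbers.
        past_runs = [(50, 410.0), (100, 790.0), (200, 1560.0), (400, 3150.0)]

        n = len(past_runs)
        mean_x = sum(x for x, _ in past_runs) / float(n)
        mean_y = sum(y for _, y in past_runs) / float(n)
        b = sum((x - mean_x) * (y - mean_y) for x, y in past_runs) / \
            sum((x - mean_x) ** 2 for x, _ in past_runs)
        a = mean_y - b * mean_x

        ss_res = sum((y - (a + b * x)) ** 2 for x, y in past_runs)
        ss_tot = sum((y - mean_y) ** 2 for _, y in past_runs)
        r2 = 1 - ss_res / ss_tot

        print("runtime ~= %.1f + %.2f * GB   (R^2 = %.3f)" % (a, b, r2))
        print("predicted for 800 GB: %.0f s" % (a + b * 800))

    An R^2 close to 1 supports simple linear extrapolation for that job; a poor
    fit suggests the job has stages (heavy shuffles, skewed reducers) that need
    their own terms in the model.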
  • James Seigel Tynt at Apr 18, 2011 at 12:22 am
    Yup. I'm boring


  • Real great.. at Apr 18, 2011 at 1:39 am
    @Matthew: Initially I wanted to concentrate on a generic class of
    applications, but I wouldn't mind sticking to one now. Could you tell me
    something more about the descriptive parameters?

    @all: Does anybody have results from doing something similar?

    --
    Regards,
    R.V.
  • Matthew Foley at Apr 18, 2011 at 6:29 am
    R.V.,
    I was only suggesting one way to tackle the problem; I don't have a list of appropriate parameters.
    I think Ted has much more experience in this area, and he is encouraging you to stay with the generic approach. You should study the paper he recommended; the approach looks really powerful.
    --Matt

  • Real great.. at Apr 18, 2011 at 8:50 am
    Sure, will do. :)

    --
    Regards,
    R.V.
