Nathan Marz at Jan 3, 2010 at 7:31 pm

I did some analysis on the performance of Hadoop-based workflows. Some of
the results are counter-intuitive so I thought the community at large would
be interested:

http://nathanmarz.com/blog/hadoop-mathematics/

Would love to hear any feedback or comments you have.


  • Yuri Pradkin at Jan 6, 2010 at 1:30 am

    On Sunday 03 January 2010 11:30:29 Nathan Marz wrote:
    I did some analysis on the performance of Hadoop-based workflows. Some of
    the results are counter-intuitive so I thought the community at large would
    be interested:

    http://nathanmarz.com/blog/hadoop-mathematics/

    Would love to hear any feedback or comments you have.

    Just thinking out loud: runtime generally is not equal to the "hours of
    data". In your processing model it's your next iteration that will have this
    amount of data. By equating them, you're assuming an equilibrium case. But
    if you're in an equilibrium, what would increasing the capacity mean? I
    think it'd mean that you process your so-far accumulated data faster, so your
    next iteration has less data to process, and so on until you reach a new
    equilibrium (but how fast will you get there?). Lesson learned: increasing
    capacity shifts the equilibrium. It's kind of like the reaction time of your
    system... Sometimes, I imagine, as long as you're keeping up with the arrival
    rate you don't care whether a run takes a week's worth of data or a day's. In
    fact there may be some constraints on the minimum size of the input.

    Another comment: you're assuming that processing time is linear in the input
    size. Of course it depends on the processing you're doing, but even if YOUR
    processing is linear, Hadoop needs to sort keys, and that is at best
    O(n*log(n)), so there is inherent non-linearity present.

    A similar point applies to the number of nodes: doubling your nodes will not
    double your service rate, for a variety of reasons.

    -Yuri
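
To make the equilibrium-shift point concrete, here is a minimal sketch in
Python, not taken from either post. It assumes the simple model under
discussion: each run has a fixed overhead plus time proportional to its input,
and the next run's input is whatever data arrived during the current run. The
overhead, arrival rate, and capacity figures are made up for illustration.

    # Runtime recurrence for an incremental workflow (illustrative numbers only).
    # arrival  = hours of new data arriving per hour of wall-clock time
    # capacity = hours of data the cluster can process per hour of wall-clock time
    # overhead = fixed per-run cost in hours (job setup, scheduling, etc.)

    def runtimes(capacity, overhead=0.5, arrival=1.0, t0=24.0, runs=10):
        """Yield successive run times: each run handles the data that
        accumulated during the previous run, plus a fixed overhead."""
        t = t0
        for _ in range(runs):
            yield t
            t = overhead + (arrival * t) / capacity

    print(list(runtimes(capacity=2.0)))  # settles near 1.0 h  (= 0.5 / (1 - 1/2))
    print(list(runtimes(capacity=4.0)))  # more capacity, lower equilibrium (~0.67 h)
    print(list(runtimes(capacity=0.9)))  # capacity below arrival rate: runtimes keep growing

Under this assumed model the stable runtime is overhead / (1 - arrival/capacity),
which only exists when capacity exceeds the arrival rate; below that there is no
equilibrium at all. Above it, the gap to the fixed point shrinks by a factor of
arrival/capacity on each run, so the runtime converges geometrically.
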
  • Nathan Marz at Jan 6, 2010 at 6:45 am

    That's a great way of putting it: "increasing capacity shifts the
    equilibrium". If you work through some examples, you'll find that it doesn't
    take many iterations for a workflow to converge to its stable runtime. There
    is some minimum capacity you need for an equilibrium to exist, though, or
    else the runtime never stabilizes.

    Regarding the sorting, I believe the complexity in Hadoop is more like
    (n/R) * log(n/R), where R is the number of reducers. And in practice, I've
    found that sorting doesn't take very long. But you're right - fundamentally
    this model is an approximation.

    --
    Nathan Marz
    Twitter: @nathanmarz
    http://nathanmarz.com
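
To put rough numbers on the sorting discussion, here is a small
back-of-the-envelope sketch in Python, again not taken from the thread. It
compares one global sort of n records, roughly n * log2(n) comparisons, with
the per-partition figure given above, (n/R) * log(n/R) per reducer. The record
counts are arbitrary, and this ignores how Hadoop actually interleaves
map-side sorting with reduce-side merging.

    import math

    def global_sort_cost(n):
        """Comparisons to sort all n records in one place: ~ n * log2(n)."""
        return n * math.log2(n)

    def per_reducer_sort_cost(n, r):
        """Total comparisons when the n records are split across r reducers
        and each reducer sorts its own n/r share: ~ r * (n/r) * log2(n/r)."""
        return r * (n / r) * math.log2(n / r)

    n = 10**9  # one billion records (arbitrary)
    for r in (1, 10, 100, 1000):
        ratio = per_reducer_sort_cost(n, r) / global_sort_cost(n)
        print(f"R = {r:4d}: relative sort work ~ {ratio:.2f}")

Partitioning trims the log factor, and the reducers sort in parallel, so the
wall-clock term per reducer is roughly (n/R) * log2(n/R). The total work is
still superlinear in n, which is consistent with both observations above: the
sort is formally non-linear, yet log(n/R) grows so slowly that in practice it
rarely dominates the runtime.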

Discussion Overview

group: common-user @ hadoop.apache.org
categories: hadoop
posted: Jan 3, '10 at 7:31p
active: Jan 6, '10 at 6:45a
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop

2 users in discussion

Nathan Marz: 2 posts
Yuri Pradkin: 1 post
