On Sunday 03 January 2010 11:30:29 Nathan Marz wrote:I did some analysis on the performance of Hadoop-based workflows. Some of

the results are counter-intuitive so I thought the community at large would

be interested:

http://nathanmarz.com/blog/hadoop-mathematics/Would love to hear any feedback or comments you have.

Just thinking out loud: runtime generally is not equal to the "hours of

data". In your processing model it's your next iteration that will have this

amount of data. By equating them, you're assuming an equilibrium case. But

if you're in an equilibrium, what would increasing the capacity mean? - I

think it'd mean that you process your so-far accumulated data faster, so your

next iteration will have less data to process and so on until you get to a new

equilibrium (but how fast will you get there?). Lesson learned: increasing

capacity shifts equilibrium. It's kind of like a reaction time of your

system... Sometimes, I imagine, as long as you're keeping up with the arrival

rate you don't care if it takes a week's full or a day's. In fact there may

be some constraints on the minimum size of the input.

Another comment: you're assuming that processing time is linear in the input

size. Of course it depends on the processing you're doing, but even if YOUR

processing is linear, Hadoop needs to sort keys, and that is at best

O(n*log(n)), so there is inherent non-linearity present.

Similarly about the number of nodes: doubling your nodes will not double your

service rate for a variety of reasons.

-Yuri