Is there a timeframe for when rack-locality will be available?
- Aaron

Owen O'Malley wrote:
On Nov 9, 2007, at 8:49 AM, Stu Hood wrote:

Currently there is no sanctioned method of 'piping' the reduce output
of one job directly into the map input of another (although it has
been discussed: see the thread I linked before:
http://www.nabble.com/Poly-reduce--tf4313116.html ).
Did you read the conclusion of the previous thread? The performance
gains in avoiding the second map input are trivial compared the gains
in simplicity of having a single data path and re-execution story.
During a reasonably large job, roughly 98% of your maps are reading
data on the _same_ node. Once we put in rack locality, it will be even

I'd much much rather build the map/reduce primitive and support it
very well than add the additional complexity of any sort of
poly-reduce. I think it is very appropriate for systems like Pig to
include that kind of optimization, but it should not be part of the
base framework.

I watched the front of the Dryad talk and was struck by how complex it
quickly became. It does give the application writer a lot of control,
but to do the equivalent of a map/reduce sort with 100k maps and 4k
reduces with automatic spill-over to disk during the shuffle seemed
_really_ complicated.

On a side note, in the part of the talk that I watched, the scaling
graph went from 2 to 9 nodes. Hadoop's scaling graphs go to 1000's of
nodes. Did they ever suggest later in the talk that it scales up higher?

-- Owen

Search Discussions

Discussion Posts


Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 10 of 12 | next ›
Discussion Overview
groupcommon-user @
postedNov 9, '07 at 7:01a
activeNov 13, '07 at 3:01a



site design / logo © 2022 Grokbase