- I am on EC2, what would be the advantage of using Zookeeper over
JavaSpaces? Either would have to be maintained by me, as they are not
provided on EC2 directly;
- pack that with a map-local counter into a global ID - you mean, just
take the global counter and make the local instance counter equal to it?
- 2^53 is quite sufficient for my purposes, but where does the number
2^53 come from?
- Looking at your last point, I saw what I had previously missed: I need
numbers consecutive within each reducer, and then I need them consecutive
between reducers. I assume that reducers are sorted. For example, if my
records are sorted 1,2,...6, then one reducer would get maps 1,2,3, and the
other one - maps 4,5,6. If that's the case, I need to know how the reducers
are sorted. Then I could simply run the second stage.
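The second stage described here boils down to a prefix sum over the
per-reducer record counts. A minimal sketch, with hypothetical class and
method names, assuming the counts arrive already ordered by reducer index:

```java
public class ConsecutiveIds {
    // Given the number of records each reducer produced, in reducer order,
    // compute the starting global ID for each reducer (1-based numbering).
    static long[] baseOffsets(long[] countsPerReducer) {
        long[] base = new long[countsPerReducer.length];
        long running = 1;                 // first global ID is 1
        for (int i = 0; i < countsPerReducer.length; i++) {
            base[i] = running;            // this reducer starts here
            running += countsPerReducer[i];
        }
        return base;
    }

    // A record's consecutive global ID, from its reducer index and its
    // zero-based position within that reducer's output.
    static long globalId(long[] base, int reducer, long localIndex) {
        return base[reducer] + localIndex;
    }
}
```

For the example above (two reducers with three records each), reducer 0
starts at global ID 1 and reducer 1 starts at global ID 4, so its records
get 4, 5, 6.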
On Wed, Oct 28, 2009 at 1:07 PM, brien colwell wrote:
Another approach is to initialize each map task with an ID (using
JavaSpaces, something like Zookeeper, or some aspect of the input data) and
then pack that with a map-local counter into a global ID. This makes
assumptions such as: the number of map tasks is less than 2^10, and the
number of records per mapper is less than 2^53. The packed global IDs are
consecutive per map task. If globally consecutive IDs are needed, a second
stage can create a histogram of map task ID -> number of records and use it
to transform the packed IDs into globally consecutive ones.
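The packing itself can be sketched as bit arithmetic on a long (hypothetical
names; the 10-bit and 53-bit widths are the assumptions stated above, and
together they fit in the 63 non-sign bits of a Java long):

```java
public class PackedId {
    // Pack a 10-bit map task ID and a 53-bit map-local counter into one
    // non-negative long, laid out as <taskId><counter>.
    static final int COUNTER_BITS = 53;

    static long pack(long mapTaskId, long localCounter) {
        if (mapTaskId >= (1L << 10) || localCounter >= (1L << COUNTER_BITS)) {
            throw new IllegalArgumentException("ID component out of range");
        }
        return (mapTaskId << COUNTER_BITS) | localCounter;
    }

    // Recover the two components, e.g. for the second-stage transform.
    static long taskId(long packed)  { return packed >>> COUNTER_BITS; }
    static long counter(long packed) { return packed & ((1L << COUNTER_BITS) - 1); }
}
```

Because the task ID occupies the high bits, IDs from one map task sort
together and are consecutive within it, as described.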
Mark Kerzner wrote:
environmental variables are available in Java, but the environment itself is
not shared between instances. I read your code - you are solving exactly the
same problem I am interested in - but I did not see how it works in Java.
By the way, it occurs to me that JavaSpaces, which is a different approach
to distributed computing, trumped by Hadoop, could be used here! Just run
one instance with GigaSpaces at all times, and you have got your
self-increment for any number of jobs. It is perfect for concurrent
processing and very fast.
On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <email@example.com> wrote:
I posted an approach to this using streaming, but if the environment
variables are available in the standard Java interface, this may work for
you as well. You'll have to be able to tolerate some small gaps in the IDs.
Mark Kerzner wrote:
Aaron, although your notes are not a ready solution, they are a good
starting point.
On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <firstname.lastname@example.org> wrote:
There is no in-MapReduce mechanism for cross-task synchronization. You'll
need to use something like Zookeeper for this, or another external service.
Note that this will greatly complicate your life.
If I were you, I'd try to either redesign my pipeline to eliminate this
need, or maybe get really clever. For example, do your numbers need to be
sequential, or just unique?
If the latter, then take the byte offset into the reducer's current
file and combine that with the reducer id (e.g.,
<current-byte-offset><zero-padded-reducer-id>) to guarantee that you are
building unique sequences. If the former... rethink your pipeline? :)
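This unique-but-not-consecutive scheme is simple to sketch. A hypothetical
helper, assuming a fixed 5-digit reducer field (enough for up to 100,000
reducers; the width is an illustrative choice, not part of the suggestion):

```java
public class UniqueIds {
    // Concatenate the byte offset into the reducer's current output file
    // with a zero-padded reducer ID: <current-byte-offset><reducer-id>.
    // The fixed-width suffix keeps IDs from different reducers distinct,
    // and offsets within one file never repeat.
    static String uniqueId(long byteOffset, int reducerId) {
        return String.format("%d%05d", byteOffset, reducerId);
    }
}
```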
On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <email@example.com> wrote:
I need to number all output records consecutively, like 1,2,3...
This is no problem with one reducer: make recordId an instance variable of
the Reducer class, and set conf.setNumReduceTasks(1).
However, it is an architectural decision forced by processing needs, and
the reducer becomes a bottleneck. Can I have a global variable for all
reducers, which would give each one the next consecutive recordId? In the
database scenario, this would be the unique autokey. How can I do it in
Hadoop?