I need to solve the following problem in the MapReduce paradigm and am looking for suggestions.
I have a million entries in a file, one per line, which is my input.
I have a series of Hadoop jobs that work on these entries one at a time, but for one
specific job I need to look at 50 entries at a time, analyze them, and then apply some
business logic. My problem has been accessing exactly 50 entries in one go.
The options I am considering are:
1. Do the processing as part of the REDUCE step. Inside the MAP I would use the same
intermediate key for each batch of 50 entries (keep a counter and change the
intermediate key after every 50 entries, and so on), so that REDUCE receives an
iterator over 50 entries. (See the first sketch below.)
2. The option above involves a lot of I/O, sorting, etc. So instead, inside the MAP,
keep an in-memory pool (initialized in configure()), and once 50 entries have been
collected, run the business logic and clear the pool. (See the second sketch below.)
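For reference, here is a rough sketch of option 1, assuming the newer org.apache.hadoop.mapreduce API; the class names, BATCH_SIZE, and the pass-through reducer body are made up for illustration. One detail I added: folding the map task id into the synthetic key, since otherwise batches from different mappers could share a key and be merged into one reduce call.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BatchByKeyJob {

    // Mapper: tag every 50 consecutive entries with the same synthetic key
    // so the reducer receives them together as one batch.
    public static class BatchKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final int BATCH_SIZE = 50;
        private long count = 0;
        private String taskId;

        @Override
        protected void setup(Context context) {
            // Task id keeps batch keys from colliding across mappers.
            taskId = context.getTaskAttemptID().getTaskID().toString();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            long batch = count / BATCH_SIZE;  // same batch id for 50 consecutive entries
            count++;
            context.write(new Text(taskId + "_" + batch), value);
        }
    }

    // Reducer: each reduce() call sees the (up to) 50 entries sharing a batch key.
    public static class BatchReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Apply the business logic to the whole batch here.
            for (Text entry : values) {
                context.write(key, entry);    // placeholder: pass entries through
            }
        }
    }
}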
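And a minimal sketch of option 2 as a map-only job, again using the newer API where setup()/cleanup() play the role of configure()/close(); the processBatch hook and the emitted output are placeholders for the business logic.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PooledMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private static final int BATCH_SIZE = 50;
    private final List<String> pool = new ArrayList<String>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Buffer entries until a full batch of 50 is available.
        pool.add(value.toString());
        if (pool.size() >= BATCH_SIZE) {
            processBatch(pool, context);
            pool.clear();
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Flush the trailing partial batch (fewer than 50 entries).
        if (!pool.isEmpty()) {
            processBatch(pool, context);
            pool.clear();
        }
    }

    // Hypothetical business-logic hook: emits one record per processed batch.
    private void processBatch(List<String> batch, Context context)
            throws IOException, InterruptedException {
        context.write(new Text("processed batch of " + batch.size()), NullWritable.get());
    }
}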
I was looking to see if there is a better way to group entries into batches of a
predetermined size and process them in Hadoop.