FAQ
Hi,

I need to solve the following problem in the MapReduce paradigm and am
looking for advice.

I have a million entries in a file, line by line, which is my input.
I have a series of Hadoop jobs that work on these entries, one entry at a
time, but for one specific Hadoop job I need to look at 50 entries at a
time, analyze them, and then apply some business logic. My problem has
been accessing exactly 50 entries in one go.

The options I am considering are:

1. Do the processing as part of REDUCE. I will ensure that I use the same
intermediate key for a batch of 50 entries inside MAP (keep a static
counter and change the intermediate key after every 50 entries), so that
REDUCE will receive an iterator over 50 values.
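The key assignment in option 1 boils down to integer division of a running counter. A minimal stand-alone sketch of that idea (plain Java, no Hadoop dependencies; the class and method names are illustrative only):

```java
// Sketch of option 1's key scheme: every 50 consecutive map-input entries
// share one intermediate key, so the shuffle delivers them to reduce()
// together as a single group of 50.
public class BatchKey {
    static final int BATCH_SIZE = 50;
    private long counter = 0;

    // Called once per map input record; returns the intermediate key
    // that map() would emit for this record.
    long keyFor() {
        return counter++ / BATCH_SIZE;
    }

    public static void main(String[] args) {
        BatchKey k = new BatchKey();
        long first = 0, last = 0, next = 0;
        for (int i = 0; i < 51; i++) {
            long key = k.keyFor();
            if (i == 0) first = key;   // entry 0  -> key 0
            if (i == 49) last = key;   // entry 49 -> key 0
            if (i == 50) next = key;   // entry 50 -> key 1
        }
        System.out.println(first + " " + last + " " + next);
    }
}
```

Note that in a real Mapper the counter would be an instance field rather than truly static, since Hadoop may run several map tasks in separate JVMs, each producing its own key space.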

2. The option above involves a lot of I/O, sorting, etc. So instead:
inside MAP, maintain an in-memory pool (initialized in configure()), and
when 50 entries have accumulated, apply the business logic and clear the
pool.
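Option 2's buffer-and-flush pattern can be sketched stand-alone as well (plain Java, no Hadoop dependencies; class and method names are illustrative). One detail the option as stated leaves open: the last batch may hold fewer than 50 entries, so it needs an explicit flush, which in a Mapper would live in close():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of option 2: buffer entries inside the map task and run the
// business logic whenever 50 have accumulated. In a real Mapper the pool
// would be created in configure() and flush() called from close().
public class BatchPool {
    static final int BATCH_SIZE = 50;
    private final List<String> pool = new ArrayList<>();
    private final Consumer<List<String>> businessLogic;

    BatchPool(Consumer<List<String>> businessLogic) {
        this.businessLogic = businessLogic;
    }

    // Called once per map input record.
    void add(String entry) {
        pool.add(entry);
        if (pool.size() == BATCH_SIZE) flush();
    }

    // Processes whatever is buffered, including a final short batch.
    void flush() {
        if (!pool.isEmpty()) {
            businessLogic.accept(new ArrayList<>(pool));
            pool.clear();
        }
    }
}
```

One caveat with this approach: each map task only sees the entries of its own input split, so batches never span split boundaries and the tail of every split produces a short batch.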

I was looking to see if there is a better way to group entries by a
pre-determined number and process them in Hadoop.

thanks
Anis


  • Doug Cutting at Jan 25, 2007 at 7:35 pm
    [ This question is probably more appropriate for hadoop-user. ]

    Anis Ahmed wrote:
    1. Do the processing as part of REDUCE. I will ensure that I use the same
    intermediate key for a batch of 50 entries inside MAP (keep a static
    counter and change the intermediate key after every 50 entries), so that
    REDUCE will receive an iterator over 50 values.
    This is probably the simplest approach. Has it proven too slow?
    2. The option above involves a lot of I/O, sorting, etc. So instead:
    inside MAP, maintain an in-memory pool (initialized in configure()), and
    when 50 entries have accumulated, apply the business logic and clear
    the pool.
    Alternately you could define an InputFormat that reads 50 lines at a
    time instead of a single line.

    Doug
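Doug's suggestion amounts to a RecordReader that hands each map() call a 50-line value instead of a single line. A minimal stand-alone sketch of that reading behavior (plain Java, no Hadoop dependencies; the class name is illustrative; later Hadoop releases ship NLineInputFormat for exactly this purpose, though it did not exist in 2007):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Sketch of a reader that returns 50 lines as one record, the way a
// custom InputFormat's RecordReader would present them to map().
public class FiftyLineReader {
    static final int LINES_PER_RECORD = 50;
    private final BufferedReader in;

    FiftyLineReader(BufferedReader in) {
        this.in = in;
    }

    // Returns the next record of up to 50 lines, or null at end of input.
    // The final record may be shorter when the line count is not a
    // multiple of 50.
    List<String> next() throws IOException {
        List<String> record = new ArrayList<>();
        String line;
        while (record.size() < LINES_PER_RECORD
                && (line = in.readLine()) != null) {
            record.add(line);
        }
        return record.isEmpty() ? null : record;
    }
}
```

In a real InputFormat the harder part is split handling: either make the file unsplittable or align split boundaries so that no 50-line group straddles two splits.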

Discussion Overview
group: common-dev
categories: hadoop
posted: Jan 24, '07 at 4:03p
active: Jan 25, '07 at 7:35p
posts: 2
users: 2
website: hadoop.apache.org...
irc: #hadoop

