Jan 14, 2011 at 4:47 pm
I have a sort job consisting of only the Mapper (no Reducer) task. I want my
results to contain only the top n records. Is there any way of restricting
the number of records that are emitted by the Mappers?
Basically, I am looking for an equivalent of the LIMIT clause in SQL
queries.
I think I understand your goal. However, the question is aimed at what I
think is the wrong solution.
A mapper gets one record at a time as input and only knows about that one
record, so there is no way to limit there.
If you implement a simple reducer, you can very easily let it stop
reading the input iterator after N records and limit the output that way.
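To make the idea concrete, here is a minimal sketch in plain Java of a reducer that stops after N records. It deliberately does not use the Hadoop API so that it runs on its own; the class name LimitingReducerSketch, the String key/value types, and the in-memory output list are assumptions made for illustration. The counter is kept as a field because reduce is called once per key, so the limit has to survive across calls.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a Hadoop reducer; not the real Reducer class.
class LimitingReducerSketch {
    private final int limit;          // N, the maximum number of records to emit
    private int emitted = 0;          // records emitted so far, across all keys
    private final List<String> output = new ArrayList<>();

    LimitingReducerSketch(int limit) {
        this.limit = limit;
    }

    // Called once per key, with that key's values as an iterator.
    void reduce(String key, Iterator<String> values) {
        // Stop reading the iterator as soon as the limit is reached.
        while (emitted < limit && values.hasNext()) {
            output.add(key + "\t" + values.next());
            emitted++;
        }
    }

    List<String> getOutput() {
        return output;
    }
}
```

Note that this only gives a global limit if all records pass through a single reduce task; with several reduce tasks, each one would emit up to N records of its own.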
Doing it in the reducer also allows you to easily add a concept of
"Top N" by using the "Secondary Sort" trick to sort the input before
it arrives at the reducer.
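A rough sketch of how the sorted input turns the limit into a "Top N" is shown here, again in plain Java rather than the Hadoop API. The Rec type, the score field, and the comparator name are hypothetical; in a real job the ordering would come from the framework sorting on a composite key, which is simulated here with an explicit sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopNSketch {
    // Hypothetical record: a name plus the numeric score to rank by.
    static final class Rec {
        final String name;
        final long score;

        Rec(String name, long score) {
            this.name = name;
            this.score = score;
        }
    }

    // Stand-in for the sort comparator on the composite key: highest score first.
    static final Comparator<Rec> BY_SCORE_DESC =
            Comparator.comparingLong((Rec r) -> r.score).reversed();

    // Sorts a copy of the records, then reads only the first n of them.
    static List<String> topN(List<Rec> records, int n) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(BY_SCORE_DESC); // simulates the sort done before the reducer
        List<String> out = new ArrayList<>();
        for (Rec r : sorted) {
            if (out.size() >= n) {
                break; // same early stop as in the limiting reducer
            }
            out.add(r.name);
        }
        return out;
    }
}
```

Because the records arrive already ordered, the reducer itself stays as simple as the plain limit: it just stops after the first N.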