Hi,

I have a sort job consisting of only the Mapper (no Reducer) task. I want my
results to contain only the top n records. Is there any way of restricting
the number of records that are emitted by the Mappers?

Basically, I am looking for an equivalent of the LIMIT clause in SQL
queries.

Thanks & Regards,
Rakesh

  • Anthony Urso at Jan 12, 2011 at 7:14 pm
    Either use an instance variable or a Combiner. The latter is correct
    if you want the top-n per key from the mapper.
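
The instance-variable approach can be sketched roughly as follows. This is a plain-Java stand-in rather than the Hadoop Mapper API: the class `LimitingMapperSketch`, its `map` method, and the `runJob` helper are hypothetical names made up for illustration, and `runJob` only simulates one independent task per input split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a map task; this is NOT the Hadoop Mapper API.
class LimitingMapperSketch {
    private final int limit;   // n: maximum number of records this task may emit
    private int emitted = 0;   // instance variable, private to this one task

    LimitingMapperSketch(int limit) {
        this.limit = limit;
    }

    // Called once per input record; passes records through until the limit is hit.
    void map(String record, Consumer<String> output) {
        if (emitted < limit) {
            output.accept(record);
            emitted++;
        }
    }

    // Simulates a job: one independent task per split, outputs concatenated.
    static List<String> runJob(List<List<String>> splits, int limit) {
        List<String> out = new ArrayList<>();
        for (List<String> split : splits) {
            LimitingMapperSketch task = new LimitingMapperSketch(limit);
            for (String record : split) {
                task.map(record, out::add);
            }
        }
        return out;
    }
}
```

Note that the limit applies per task: with two splits and a limit of 2, this sketch lets four records through in total, so a job-wide limit would still need a later step that sees all of the output.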
  • Hari Sreekumar at Jan 14, 2011 at 3:06 pm
    Ideally, mappers should be independent of other mappers. Still, you can use
    a counter and start skipping records once it exceeds some value to achieve
    similar behavior. It will not be very reliable if you want exact results,
    though.
  • Alex Kozlov at Jan 14, 2011 at 4:11 pm
    Hi Rakesh, what do you mean by the top N? Do you mean the first N records,
    or do you need to sort them in memory? You can always output records in the
    cleanup() method at the end of the mapper run.
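
If "top N" means the N largest values, one possible shape for the in-memory variant is a bounded min-heap that is only written out at the end. The sketch below is plain Java, not the Hadoop Mapper API: `TopNMapperSketch` and its `map` and `cleanup` methods are hypothetical names that merely mirror the per-record and end-of-task hooks, and it assumes the records can be reduced to `long` values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

// Hypothetical stand-in for a map task; this is NOT the Hadoop Mapper API.
class TopNMapperSketch {
    private final int n;
    // Min-heap: the smallest of the retained values sits at the head.
    private final PriorityQueue<Long> heap = new PriorityQueue<>();

    TopNMapperSketch(int n) {
        this.n = n;
    }

    // Called once per record; emits nothing, only updates the heap.
    void map(long value) {
        if (n <= 0) {
            return;
        }
        if (heap.size() < n) {
            heap.add(value);
        } else if (value > heap.peek()) {
            heap.poll();      // drop the current smallest retained value
            heap.add(value);
        }
    }

    // Called once after the last record; emits the retained values, largest first.
    void cleanup(Consumer<Long> output) {
        List<Long> sorted = new ArrayList<>(heap);
        sorted.sort(Collections.reverseOrder());
        sorted.forEach(output);
    }
}
```

Memory stays bounded by N values per task. As with any per-task approach, each task would produce its own top N, so the results from several tasks would still have to be merged to get a single overall list.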
  • Niels Basjes at Jan 14, 2011 at 4:47 pm
    I think I understand your goal. However, the question is aimed at what I
    think is the wrong solution.

    A mapper gets 1 record as input and only knows about that one record.
    There is no way to limit there.

    If you implement a simple reducer, you can very easily let it stop
    reading the input iterator after N records and limit the output
    that way.

    Doing it in the reducer also allows you to easily add a concept of
    "Top N" by using the "Secondary Sort" trick to sort the input before
    it arrives at the reducer.

    HTH

    Niels Basjes
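
The "stop reading after N records" idea might look roughly like the following. It is a plain-Java sketch and not the Hadoop Reducer API: `LimitingReducerSketch` and `firstN` are hypothetical names, and the iterator is assumed to deliver values already in the desired order (for example, as a result of the sort that runs before the reduce step).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a reduce call; this is NOT the Hadoop Reducer API.
class LimitingReducerSketch {
    // Copies at most n values from an already sorted iterator, then stops reading.
    static <T> List<T> firstN(Iterator<T> sortedValues, int n) {
        List<T> out = new ArrayList<>();
        while (out.size() < n && sortedValues.hasNext()) {
            out.add(sortedValues.next());
        }
        return out;
    }
}
```

The loop checks the size before calling `hasNext()`, so nothing beyond the first N values is consumed. For a single job-wide limit this would presumably need to run in one reducer, since several reducers would each emit up to N records.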

Discussion Overview
group: common-user @
categories: hadoop
posted: Jan 12, '11 at 6:45p
active: Jan 14, '11 at 4:47p
posts: 5
users: 5
website: hadoop.apache.org...
irc: #hadoop