FAQ
Hi all,

I'd like to select N random records from a large amount of data using
Hadoop; I just wonder how I can achieve this. Currently my idea is to let
each mapper task select N / mapper_number records. Does anyone have such
experience?


--
Best Regards

Jeff Zhang


  • Niels Basjes at Jun 27, 2011 at 7:29 pm
    The only solution I can think of is by creating a counter in Hadoop
    that is incremented each time a mapper lets a record through.
    As soon as the value reaches a preselected value the mappers simply
    discard the additional input they receive.

    Note that this will not at all be random.... yet it's the best I can
    come up with right now.

    HTH
    On Mon, Jun 27, 2011 at 09:11, Jeff Zhang wrote:

    Hi all,
    I'd like to select N random records from a large amount of data using
    Hadoop; I just wonder how I can achieve this. Currently my idea is to let
    each mapper task select N / mapper_number records. Does anyone have such
    experience?

    --
    Best Regards

    Jeff Zhang


    --
    Best regards / Met vriendelijke groeten,

    Niels Basjes
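    The cut-off idea above can be sketched outside Hadoop as a small
    pass-through limiter (class and method names here are made up for the
    sketch; note that in practice each mapper would keep its own local
    count, since live counter values from other running tasks are not
    readily visible to a mapper mid-job):

```java
// Illustrative standalone sketch of the cut-off logic: pass records
// through until a preselected limit is reached, then discard the rest.
// A real mapper would hold this count as a field and consult it in map().
public class FirstNLimiter {
    private final int limit;
    private int passed = 0;

    public FirstNLimiter(int limit) {
        this.limit = limit;
    }

    // Returns true while fewer than `limit` records have been let through.
    public boolean letThrough() {
        if (passed < limit) {
            passed++;
            return true;
        }
        return false;
    }
}
```

    As Niels notes, this keeps the first N records a task sees, so the
    result is not random at all.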
  • David Rosenstrauch at Jun 27, 2011 at 8:35 pm
    Building on this, you could do something like the following to make it
    more random:

    if (numRecordsWritten < NUM_RECORDS_DESIRED) {
        int n = generateARandomNumberBetween1and100();
        if (n == 100) {
            context.write(key, value);
            numRecordsWritten++;
        }
    }

    The above would output roughly 1 in every 100 records, somewhat
    randomly, up to the specified maximum, and discard all the rest.

    HTH,

    DR
    On 06/27/2011 03:28 PM, Niels Basjes wrote:
    The only solution I can think of is by creating a counter in Hadoop
    that is incremented each time a mapper lets a record through.
    As soon as the value reaches a preselected value the mappers simply
    discard the additional input they receive.

    Note that this will not at all be random.... yet it's the best I can
    come up with right now.

    HTH

    On Mon, Jun 27, 2011 at 09:11, Jeff Zhang wrote:
    Hi all,
    I'd like to select N random records from a large amount of data using
    Hadoop; I just wonder how I can achieve this. Currently my idea is to let
    each mapper task select N / mapper_number records. Does anyone have such
    experience?

    --
    Best Regards

    Jeff Zhang
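    As a standalone illustration of David's capped 1-in-100 sampler (the
    class and method names are made up for this sketch, and a plain
    java.util.Random stands in for whatever random source a real mapper
    would use):

```java
import java.util.Random;

// Simulation of the capped probabilistic sampler from the post above,
// outside of Hadoop so it can run standalone.
public class CappedSampler {
    private final int cap;              // NUM_RECORDS_DESIRED in the post
    private final Random rng;
    private int numRecordsWritten = 0;

    public CappedSampler(int cap, long seed) {
        this.cap = cap;
        this.rng = new Random(seed);
    }

    // "Emits" (returns true for) roughly 1 in 100 records, up to the cap.
    public boolean maybeEmit() {
        if (numRecordsWritten < cap) {
            int n = rng.nextInt(100) + 1;   // random number in [1, 100]
            if (n == 100) {
                numRecordsWritten++;
                return true;
            }
        }
        return false;
    }

    public int written() {
        return numRecordsWritten;
    }
}
```

    Because the 1-in-100 test is applied to records in input order, early
    records still have a better chance of being kept than late ones once
    the cap starts binding, so the sample is only approximately uniform.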
  • Anthony Urso at Jun 27, 2011 at 8:11 pm

    On Mon, Jun 27, 2011 at 12:11 AM, Jeff Zhang wrote:
    Hi all,
    I'd like to select N random records from a large amount of data using
    Hadoop; I just wonder how I can achieve this. Currently my idea is to let
    each mapper task select N / mapper_number records. Does anyone have such experience?
    I've done this before, and it will work fine as long as all of your
    splits have identical numbers of records.
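    If exactly k = N / mapper_number records are wanted from each split
    regardless of how long the split turns out to be, one standard
    technique (not mentioned in the thread, added here as an illustration)
    is per-mapper reservoir sampling, with the reservoir flushed when the
    mapper finishes its split:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch: reservoir sampling (Algorithm R) keeps a uniform
// sample of k records from a stream of unknown length in one pass.
public class Reservoir<T> {
    private final int k;
    private final Random rng;
    private final List<T> sample = new ArrayList<>();
    private long seen = 0;

    public Reservoir(int k, long seed) {
        this.k = k;
        this.rng = new Random(seed);
    }

    public void offer(T record) {
        seen++;
        if (sample.size() < k) {
            sample.add(record);             // fill the reservoir first
        } else {
            // Replace a random slot with probability k / seen.
            long j = (long) (rng.nextDouble() * seen);
            if (j < k) {
                sample.set((int) j, record);
            }
        }
    }

    public List<T> sample() {
        return sample;
    }
}
```

    Combining the per-mapper reservoirs gives exactly N records when the
    splits hold equal record counts, which matches Anthony's caveat.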

Discussion Overview
group: mapreduce-user
categories: hadoop
posted: Jun 27, '11 at 7:12a
active: Jun 27, '11 at 8:35p
posts: 4
users: 4
website: hadoop.apache.org...
irc: #hadoop
