How to select random n records using mapreduce?
Hi all,

I'd like to select N random records from a large amount of data using
Hadoop, and I'm wondering how I can achieve this. My current idea is to let
each mapper task select N / mapper_number records. Does anyone have
experience with this?


--
Best Regards

Jeff Zhang


  • Niels Basjes at Jun 27, 2011 at 7:29 pm
    The only solution I can think of is to create a counter in Hadoop
    that is incremented each time a mapper lets a record through.
    As soon as the value reaches a preselected limit, the mappers simply
    discard any additional input they receive.

    Note that this will not be random at all... yet it's the best I can
    come up with right now.

    HTH

    --
    Best regards / Met vriendelijke groeten,

    Niels Basjes
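The counter idea above can be sketched in plain Java. Note a caveat: real Hadoop counters are aggregated by the framework and are not reliably visible across concurrently running mappers, so this single-JVM sketch (class and method names are illustrative, not Hadoop API) only simulates the intended behaviour of "let records through until the limit is hit, then discard".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of the counter approach: a shared counter caps how many records pass. */
public class CounterSampler {
    private final AtomicLong emitted = new AtomicLong();
    private final long limit;

    public CounterSampler(long limit) {
        this.limit = limit;
    }

    /** Returns true while fewer than `limit` records have been let through. */
    public boolean accept() {
        // incrementAndGet stands in for incrementing a Hadoop counter per record
        return emitted.incrementAndGet() <= limit;
    }

    /** Drives the sampler over an in-memory "input split". */
    public static List<String> selectFirstN(List<String> records, long n) {
        CounterSampler sampler = new CounterSampler(n);
        List<String> out = new ArrayList<>();
        for (String r : records) {
            if (sampler.accept()) {
                out.add(r);            // emit the record
            }
            // otherwise discard the additional input
        }
        return out;
    }
}
```

As Niels says, the result is simply the first N records seen, not a random sample.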
  • Habermaas, William at Jun 27, 2011 at 7:55 pm
    I did something similar. Basically, I had a random sampling algorithm that I called from the mapper: if it returned true I collected the record, otherwise I discarded it.

    Bill
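A minimal sketch of the accept/reject sampling Bill describes (the thread doesn't show his actual code, so names here are illustrative): each record is kept independently with probability p, so the expected sample size is p times the input size, though the exact count varies from run to run.

```java
import java.util.Random;

/** Bernoulli (accept/reject) sampling: keep each record with probability p. */
public class RandomSampler {
    private final Random rng;
    private final double p;

    public RandomSampler(double p, long seed) {
        this.p = p;
        this.rng = new Random(seed); // fixed seed here only for reproducibility
    }

    /** Call once per record from the mapper: collect when true, discard when false. */
    public boolean sample() {
        return rng.nextDouble() < p;
    }
}
```

If an exact N is required rather than an expected N, this approach alone is not enough; a second pass (or over-sampling and trimming) would be needed.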

  • Jeff Schmitz at Jun 27, 2011 at 8:01 pm
    Wait - Habermaas, as in Critical Theory?

  • Matt Pouttu-Clarke at Jun 27, 2011 at 8:02 pm
    If the incoming data is unique, you can hash each record and then take
    a modulus of the hash to select a random set. So if you wanted a random
    10% of the data:

    hash % 10 == 0

    gives a random 10%.
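The hash-modulus trick can be sketched as follows (class name is illustrative, and `hashCode()` stands in for whatever hash function the job uses). Because the hash of a given record never changes, the selection is deterministic: the same record is always either in or out, which is why the input needs to be unique for the sample to behave like a random one.

```java
/** Hash-modulus sampling: keep roughly 1/m of unique records, deterministically. */
public class HashModSampler {
    /** Keep a record iff its hash falls in residue class 0 mod m (m = 10 gives ~10%). */
    public static boolean keep(String record, int m) {
        // Math.floorMod avoids the negative results `%` gives for negative hash codes
        return Math.floorMod(record.hashCode(), m) == 0;
    }
}
```

A nice side effect of the determinism: the same ~10% subset is reproduced on every run, with no coordination between mappers.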


Discussion Overview
group: common-user
categories: hadoop
posted: Jun 27, '11 at 7:12a
active: Jun 27, '11 at 8:02p
posts: 5
users: 5
website: hadoop.apache.org...
irc: #hadoop
