FAQ
Hi,

I have many <key,value> pairs and want to get all the different values
for each key. Which way is efficient for this?

For example, input: <1,2> <1,3> <1,4> <1,3> <2,1> <2,2>
output: <1,2/3/4> <2,1/2>

Thanks!

walter


  • Harsh J at Aug 3, 2011 at 5:17 am
    Use MapReduce :)

    If the map outputs (key, value), then the reduce input becomes
    (key, [iterator of values across all maps with that key]).

    I believe this is very similar to the wordcount example, just minus the
    summing. For a given key, you get all the values that carry that key
    in the reducer. Have you tried to run a simple program to achieve this
    before asking? Or is something specifically not working?
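    [Editor's note: the grouping that the shuffle phase performs, as Harsh describes it, can be sketched outside Hadoop in plain Java. The class and method names below are hypothetical standalone code, not the Hadoop API.]

    ```java
    import java.util.*;

    // Group (key, value) pairs so each key maps to the list of all its
    // values, mimicking what the shuffle phase hands to a reducer.
    public class GroupByKey {
        public static Map<Integer, List<Integer>> group(int[][] pairs) {
            Map<Integer, List<Integer>> grouped = new TreeMap<>();
            for (int[] p : pairs) {
                grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
            }
            return grouped;
        }

        public static void main(String[] args) {
            // The input from the original question.
            int[][] input = {{1,2},{1,3},{1,4},{1,3},{2,1},{2,2}};
            System.out.println(group(input)); // {1=[2, 3, 4, 3], 2=[1, 2]}
        }
    }
    ```

    Note the duplicate 3 for key 1 survives grouping; removing it is exactly the extra step beyond wordcount that this thread is about.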


    --
    Harsh J
  • Jianxin Wang at Aug 3, 2011 at 6:23 am
    Hi Harsh,

    After the map I can get all values for one key, but I want to dedup
    these values and keep only the unique ones. Right now I do it as shown
    below.

    I think the following code is not efficient (it uses a HashSet to dedup).
    Thanks :)

    private static class MyReducer extends
            Reducer<LongWritable, LongWritable, LongWritable, LongsWritable> {
        private final HashSet<Long> uids = new HashSet<Long>();
        private final LongsWritable unique_uids = new LongsWritable();

        @Override
        public void reduce(LongWritable key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            uids.clear();
            // Collect the distinct values for this key.
            for (LongWritable v : values) {
                uids.add(v.get());
            }
            // Copy the set into a primitive array for the output writable.
            long[] l = new long[uids.size()];
            int i = 0;
            for (long uid : uids) {
                l[i++] = uid;
            }
            unique_uids.set(l); // lower-case "set": "Set(l)" would not compile
            context.write(key, unique_uids);
        }
    }


  • Matthew John at Aug 3, 2011 at 10:27 am
    Hey,

    I feel a HashSet is a good way to dedup. To increase the overall
    efficiency, you could also look into a Combiner running the same Reducer
    code. That would ensure less data in the sort-shuffle phase.
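    [Editor's note: the effect Matthew describes can be sketched outside Hadoop. Running the same dedup step on each map's local output shrinks what the shuffle must transfer; the plain-Java class below is a hypothetical illustration, not Hadoop's Combiner API.]

    ```java
    import java.util.*;

    // Dedup one mapper's local output before the shuffle -- the job a
    // Combiner does when it reuses the Reducer's logic.
    public class CombinerSketch {
        public static List<Long> dedup(List<Long> values) {
            // LinkedHashSet drops duplicates while keeping first-seen order.
            return new ArrayList<>(new LinkedHashSet<>(values));
        }

        public static void main(String[] args) {
            // One mapper's output for a single key, with local duplicates.
            List<Long> mapOutput = Arrays.asList(2L, 3L, 4L, 3L, 3L);
            List<Long> combined = dedup(mapOutput);
            System.out.println(combined.size() + " of " + mapOutput.size()
                    + " values survive the combine step"); // 3 of 5
        }
    }
    ```

    A Combiner only removes duplicates within a single map task's output; the Reducer must still dedup across maps, since the same value can come from different mappers.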

    Regards,
    Matthew
  • Jianxin Wang at Aug 3, 2011 at 10:37 am
    Thanks, Matthew!

    How about using secondary sort to get <key, values>, with the values
    sorted for every key, then traversing the sorted values to collect all
    the unique values?

    I am not sure which way is more efficient; I suspect a HashSet is a
    complicated data structure.
  • Harsh J at Aug 3, 2011 at 12:54 pm
    Secondary sort is the way to go; it is easier to dedup a sorted input
    set. You can also filter in the map and combine phases to the extent
    safely possible (sets, etc.), to speed up the process and reduce data
    transfers.
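    [Editor's note: with secondary sort, values arrive at the reducer already ordered, so dedup becomes a single pass that compares each value to the previous one, with O(1) extra state instead of a HashSet. A minimal plain-Java sketch of that pass, not the Hadoop secondary-sort plumbing itself:]

    ```java
    import java.util.*;

    // Given values the framework has already sorted (secondary sort),
    // unique values fall out of one comparison per element.
    public class SortedDedup {
        public static List<Long> uniqueFromSorted(List<Long> sorted) {
            List<Long> unique = new ArrayList<>();
            Long prev = null;
            for (Long v : sorted) {
                if (!v.equals(prev)) { // a new run starts: keep the value
                    unique.add(v);
                }
                prev = v;
            }
            return unique;
        }

        public static void main(String[] args) {
            System.out.println(uniqueFromSorted(Arrays.asList(2L, 3L, 3L, 4L)));
            // [2, 3, 4]
        }
    }
    ```

    The trade-off: the HashSet reducer buffers every distinct value per key in memory, while the sorted pass needs only the previous value, which matters when a single key can have millions of distinct values.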


    --
    Harsh J

Discussion Overview
group: common-user
category: hadoop
posted: Aug 3, '11 at 3:51a
active: Aug 3, '11 at 12:54p
posts: 6
users: 3
website: hadoop.apache.org
irc: #hadoop
