TOP N items
Hello,

I am new to Hadoop. Can anybody suggest an example or procedure for outputting the top N items with the maximum total count, where the input file has an (Item, count) pair on each line?

Items can repeat.

Thanks and Regards
Neil
http://neilghosh.com


  • James Seigel at Sep 10, 2010 at 9:12 pm
    Welcome to the land of the fuzzy elephant!

    Of course there are many ways to do it. Here is one; it might not be brilliant or the right way, but I am sure you will get more suggestions :)

    Use the identity mapper...

    job.setMapperClass(Mapper.class);

    then have one reducer....

    job.setNumReduceTasks(1);

    then have a reducer that has something like this around your reducing code...

    Counter counter = context.getCounter("ME", "total output records");
    if (counter.getValue() < LIMIT) {
        // <do your reducey stuff here>
        context.write(key, value);
        counter.increment(1);
    }
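    Stripped of the Hadoop types, that guard simply caps the output at LIMIT records in arrival order (it does not by itself pick the largest ones). A plain-Java sketch of the same logic, with hypothetical names:

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class CappedEmit {
        // Emit at most `limit` of the incoming records, in arrival order.
        // This bounds the output size, exactly like the counter guard above.
        static List<String> emitAtMost(List<String> records, int limit) {
            List<String> out = new ArrayList<>();
            long counter = 0;              // stands in for the Hadoop Counter
            for (String r : records) {
                if (counter < limit) {     // same test as counter.getValue() < LIMIT
                    out.add(r);            // context.write(key, value)
                    counter++;             // counter.increment(1)
                }
            }
            return out;
        }
    }
    ```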


    Cheers
    James.



  • Neil Ghosh at Sep 10, 2010 at 10:51 pm
    Thanks James,

    This gives me only N results for sure, but not necessarily the top N.

    I have used the Item as key and the Count as value as input to the reducer,
    and my reducing logic is to sum the counts for a particular item.

    Now my output comes out grouped, but not in order.

    Do I need to use a custom comparator?

    Thanks
    Neil
  • Aaron Baff at Sep 10, 2010 at 11:08 pm
    I'm still fairly new at MapReduce, but here are my thoughts on a solution.

    Use the Item as the key and the Count as the value; in the Reducer, sum all of the Counts and output (Item, sum(Count)). To make it more efficient, use the same Reducer as the Combiner.

    Then run a second job where you map the Count as the key and the Item as the value, use one Reducer, and identity-reduce it (i.e. don't do any reducing, just output (Count, Item)).
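    In plain Java (outside Hadoop; this is only a sketch of the data flow, with hypothetical names), the two jobs correspond to a sum-by-key step followed by a sort-by-count step:

    ```java
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class TwoJobTopN {
        // "Job 1": sum the counts per item, as the combiner/reducer would.
        static Map<String, Long> sumCounts(List<Map.Entry<String, Long>> pairs) {
            return pairs.stream().collect(Collectors.groupingBy(
                    Map.Entry::getKey, Collectors.summingLong(Map.Entry::getValue)));
        }

        // "Job 2": key by count and emit in descending order, as a single
        // reducer with a descending key comparator would see the data.
        static List<Map.Entry<String, Long>> sortByCountDesc(Map<String, Long> totals) {
            return totals.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                    .collect(Collectors.toList());
        }
    }
    ```

    Taking the first N entries of the second step's output then gives the top N.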

    Aaron Baff | Developer | Telescope, Inc.

    email: aaron.baff@telescope.tv | office: 424 270 2913 | www.telescope.tv

    Bored with summer reruns? Spice up your TV week by watching and voting for your favorite act on America's Got Talent, 9pm ET/CT Tuesday nights on NBC.

    The information contained in this email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. Any views expressed in this message are those of the individual and may not necessarily reflect the views of Telescope Inc. or its associated companies.


  • Neil Ghosh at Sep 11, 2010 at 12:15 am
    Thanks Aaron. I employed two jobs and solved the problem.

    I was just wondering whether there is any way it can be done in a single job, so that disk/network I/O is lower and no temporary storage is required between the first and second jobs.

    Neil
  • Alex Kozlov at Sep 11, 2010 at 12:27 am
    Hi Neil,

    Uniques and Top N, as well as percentiles, are inherently difficult to distribute/parallelize, since you need a global view of the dataset. You can optimize the computation given some assumptions about the input (the number of unique values, the most frequent value's prevalence exceeding a certain threshold, etc.), but there is no way to avoid two jobs in the general case.

    Can you specify your problem, and any assumptions, more precisely?

    --
    Alex Kozlov
    Solutions Architect
    Cloudera, Inc
    twitter: alexvk2009

    Hadoop World 2010, October 12, New York City - Register now:
    http://www.cloudera.com/company/press-center/hadoop-world-nyc/
  • Neil Ghosh at Sep 11, 2010 at 12:38 am
    Hi Alex,

    Thanks so much for the reply. As of now I don't have any issue with two jobs; I was just making sure that I am not missing an obvious way of writing the program as one job. I will get back if I need to optimize performance based on a specific input pattern.

    Thank you all for helping me with this issue.
    Neil
  • Runping Qi at Sep 11, 2010 at 5:00 am
    Assuming N is not too large, in the sense that your reducers can keep a tree map of N elements, you can have each reducer maintain the top N elements in a TreeMap (or a priority queue, or a heap, whatever), with counts as the keys. As the reducer progresses, you throw away the items with smaller counts, always keeping the top N seen so far. At the end of the reduce phase, each reducer outputs the elements in its tree map, which are the top N for that reducer. If your job has only one reducer, that output is your final answer. Otherwise, you can use a second job (or any tool) to merge/sort the R output files (each holding the top N elements for the corresponding reducer). If N * R << input size, the second step should take much less time than the first.
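    The bounded structure described above can be sketched in plain Java (no Hadoop types; names are illustrative, and ties are handled by keeping a list of items per count):

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class BoundedTopN {
        // Stream (item, totalCount) pairs, keeping at most n items; once full,
        // a new item only enters by evicting one with the smallest count.
        static List<String> topN(Map<String, Long> totals, int n) {
            // TreeMap keyed by count, ascending: firstKey() is the current minimum.
            TreeMap<Long, List<String>> best = new TreeMap<>();
            int size = 0;
            for (Map.Entry<String, Long> e : totals.entrySet()) {
                if (size < n) {
                    best.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
                    size++;
                } else if (e.getValue() > best.firstKey()) {
                    best.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
                    // Evict one item holding the smallest count.
                    List<String> smallest = best.firstEntry().getValue();
                    smallest.remove(smallest.size() - 1);
                    if (smallest.isEmpty()) best.remove(best.firstKey());
                }
            }
            // Emit in descending count order.
            List<String> out = new ArrayList<>();
            for (List<String> items : best.descendingMap().values()) out.addAll(items);
            return out;
        }
    }
    ```

    The result is independent of the order in which pairs arrive, which is what makes the per-reducer top N safe to merge in a small second step.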



Discussion Overview
group: common-user
categories: hadoop
posted: Sep 10, '10 at 9:05p
active: Sep 11, '10 at 5:00a
posts: 8
users: 5
website: hadoop.apache.org
irc: #hadoop
