Hi,

Is it possible to have multiple mappers, where each mapper operates on a
different input file, and whose results (key-value pairs from the
different mappers) are processed by a single reducer?

Regards,
Sahana


  • Harsh J at Sep 7, 2011 at 9:12 am
    Sahana,

    Yes. But isn't that the normal case anyway? What makes you question this
    capability?


    --
    Harsh J
  • Sahana Bhat at Sep 7, 2011 at 9:33 am
    Hi,

    I understand that, given a file, the file is split across 'n' mapper
    instances; that is the normal case.

    The scenario I have is:
    1. Two files that are not identical in terms of number of columns
    (but have similar data in a few columns) need to be processed, and
    after computation a single output file has to be generated.

    Note: CV = computed value

    File1, belonging to one dataset, has data for:
    Date,counter1,counter2,CV1,CV2

    File2, belonging to another dataset, has data for:
    Date,counter1,counter2,CV3,CV4,CV5

    The computation to be carried out on these two files is:
    CV6 = (CV1*CV5)/100

    And the final emitted output file should have data in the sequence:
    Date,counter1,counter2,CV6

    The idea is to have two mappers (not instances), one running on each of the
    files, and a single reducer that emits the final result file.

    Thanks,
    Sahana
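    The per-key computation described above can be sketched in plain Java; the
    class and method names below are hypothetical, and the field positions
    assume CV1 is the fourth column of a File1 record and CV5 the sixth column
    of a File2 record:

    ```java
    // Sketch of the join logic: given one File1 line and one File2 line that
    // share the same Date,counter1,counter2 key, emit Date,counter1,counter2,CV6
    // where CV6 = (CV1 * CV5) / 100.
    public class Cv6Join {
        // File1: Date,counter1,counter2,CV1,CV2       -> CV1 is field index 3
        // File2: Date,counter1,counter2,CV3,CV4,CV5   -> CV5 is field index 5
        public static String join(String file1Line, String file2Line) {
            String[] f1 = file1Line.split(",");
            String[] f2 = file2Line.split(",");
            double cv1 = Double.parseDouble(f1[3].trim());
            double cv5 = Double.parseDouble(f2[5].trim());
            double cv6 = (cv1 * cv5) / 100.0;
            // Key columns can come from either record; they match by assumption.
            return f1[0] + "," + f1[1] + "," + f1[2] + "," + cv6;
        }
    }
    ```

    In a real job this logic would live in the reducer, invoked once per
    Date,counter1,counter2 key with the matching values from both files.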
  • Sudharsan Sampath at Sep 7, 2011 at 9:46 am
    Hi,

    It's possible by setting the number of reduce tasks to 1. Based on your
    example, it looks like you need to group your records by "Date, counter1
    and counter2", so that should go into the logic for building your map
    output key.

    Thanks
    Sudhan S
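    The key-building suggestion above can be sketched as follows; the helper
    below is a hypothetical illustration in plain Java, assuming the first
    three CSV columns are Date, counter1 and counter2:

    ```java
    public class GroupKey {
        // Build the map output key from the first three columns, so records
        // from both files with the same Date,counter1,counter2 values meet
        // at the same reduce call.
        public static String buildKey(String csvLine) {
            String[] fields = csvLine.split(",");
            return fields[0].trim() + "|" + fields[1].trim() + "|" + fields[2].trim();
        }
    }
    ```

    Forcing a single output file is then just `conf.setNumReduceTasks(1)` in
    the driver (or `job.setNumReduceTasks(1)` with the new API).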
  • Harsh J at Sep 7, 2011 at 9:58 am
    Sahana,

    Yes, this is possible as well. Please take a look at the MultipleInputs
    API @ http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/MultipleInputs.html

    It allows you to add each input path with its own mapper
    implementation, and you can then use a common reducer, since the key
    is what you'll be matching against.
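    A minimal driver sketch using the old `org.apache.hadoop.mapred` API from
    the linked docs; the mapper and reducer class names (`File1Mapper`,
    `File2Mapper`, `JoinReducer`) are placeholders for your own
    implementations, and the paths come from the command line:

    ```java
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class TwoFileJoinDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(TwoFileJoinDriver.class);
            conf.setJobName("two-file-join");

            // One mapper implementation per input path; both emit the same
            // Date,counter1,counter2 key so a common reducer can match records.
            MultipleInputs.addInputPath(conf, new Path(args[0]),
                    TextInputFormat.class, File1Mapper.class);
            MultipleInputs.addInputPath(conf, new Path(args[1]),
                    TextInputFormat.class, File2Mapper.class);

            conf.setReducerClass(JoinReducer.class); // computes CV6 = (CV1*CV5)/100
            conf.setNumReduceTasks(1);               // single output file
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(conf, new Path(args[2]));

            JobClient.runJob(conf);
        }
    }
    ```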


    --
    Harsh J
  • Praveenesh kumar at Sep 7, 2011 at 10:05 am
    Harsh, can you please explain how we can use MultipleInputs with the Job
    object on Hadoop 0.20.2? As you can see, MultipleInputs uses the JobConf
    object, but I want to use the Job object as in the new Hadoop 0.21 API.
    I remember you talked about pulling things out of the new API and adding
    them to our project.
    Can you shed more light on how we can do this?

    Thanks ,
    Praveenesh.
  • Harsh J at Sep 7, 2011 at 11:20 am
    Praveenesh,

    The JIRA https://issues.apache.org/jira/browse/MAPREDUCE-369
    introduced it and carries a patch that I think would apply without
    much trouble on your cluster's sources. You can mail me directly if
    you need help applying a patch.

    Alternatively, you can do something like downloading 0.21, where it is
    found, and then pulling out the particular source files and adding
    them to your project's source trees with their license and package
    names intact (which I think is a legal requirement? others can correct
    me if I'm wrong); then you can use it as a regular import.

    HTH.


    --
    Harsh J

Discussion Overview
group: mapreduce-user
categories: hadoop
posted: Sep 7, '11 at 9:07a
active: Sep 7, '11 at 11:20a
posts: 7
users: 4
website: hadoop.apache.org...
irc: #hadoop