Hi,

Can I restrict the output of mappers running on a node to go to reducer(s)
running on the same node?

Let me explain why I want to do this-

I am converting a huge number of XML files into SequenceFiles. In theory I don't even
need reducers: the mappers read the XML files and output SequenceFiles. The problem with
this approach is that I will end up with a huge number of small output files.

To avoid generating a large number of small files, I can use identity reducers. But by
running reducers, I am unnecessarily transferring data over the network. I ran a test
case using a small subset of my data (~90GB). With a map-only job, my cluster finished
the conversion in only 6 minutes, but with maps plus identity reducers the job takes
around 38 minutes.

I have to process close to a terabyte of data, so I was thinking of faster alternatives:

* Writing a custom OutputFormat
* Somehow restricting the output of mappers running on a node to go to reducers running
on the same node. Maybe I can write my own partitioner (that part is simple), but I am
not sure how Hadoop's framework assigns partitions to reduce tasks.

Any pointers?

Or is this not possible at all?

Thanks,
Tarandeep


  • Jothi Padmanabhan at Jun 18, 2009 at 2:30 am
    You could look at CombineFileInputFormat to generate a single split out of
    several files.
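    For illustration (this is a sketch, not code from the thread), an old-API
    CombineFileInputFormat subclass might look roughly like the following. The class
    names XmlCombineInputFormat and WholeFileRecordReader and the use of Text
    keys/values are assumptions for this example.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
    import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
    import org.apache.hadoop.mapred.lib.CombineFileSplit;

    public class XmlCombineInputFormat extends CombineFileInputFormat<Text, Text> {

        @Override
        public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
                Reporter reporter) throws IOException {
            // CombineFileRecordReader creates one WholeFileRecordReader per file in the split.
            return new CombineFileRecordReader<Text, Text>(job, (CombineFileSplit) split,
                    reporter, (Class) WholeFileRecordReader.class);
        }

        // Reads one whole small file as a single (path, contents) record.
        public static class WholeFileRecordReader implements RecordReader<Text, Text> {
            private final Path path;
            private final long length;
            private final Configuration conf;
            private boolean done = false;

            public WholeFileRecordReader(CombineFileSplit split, Configuration conf,
                    Reporter reporter, Integer index) {
                this.path = split.getPath(index);
                this.length = split.getLength(index);
                this.conf = conf;
            }

            public boolean next(Text key, Text value) throws IOException {
                if (done) {
                    return false;
                }
                key.set(path.toString());
                byte[] contents = new byte[(int) length];
                FSDataInputStream in = path.getFileSystem(conf).open(path);
                try {
                    in.readFully(0, contents);
                } finally {
                    IOUtils.closeStream(in);
                }
                value.set(contents, 0, contents.length);
                done = true;
                return true;
            }

            public Text createKey() { return new Text(); }
            public Text createValue() { return new Text(); }
            public long getPos() { return done ? length : 0; }
            public float getProgress() { return done ? 1.0f : 0.0f; }
            public void close() { }
        }
    }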

    Your partitioner would be able to assign keys to specific reducers, but you
    would not have control on which node a given reduce task will run.
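    As a rough illustration of that point (a hypothetical sketch, not code from the
    thread): a partitioner only chooses a partition number for each key; the framework
    still decides which node runs the reduce task that owns that partition.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SimplePartitioner implements Partitioner<Text, Text> {

        public void configure(JobConf job) {
            // nothing to configure in this sketch
        }

        public int getPartition(Text key, Text value, int numPartitions) {
            // Picks the reduce partition (0 .. numPartitions-1); where that reduce
            // task is scheduled is outside the partitioner's control.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }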

    Jothi

  • Jason hadoop at Jun 18, 2009 at 2:57 am
    You can open your sequence file in the mapper's configure method, write to it in
    your map method, and close it in the mapper's close method. Then you end up with one
    sequence file per map. I am assuming that each key/value pair given to your map
    somehow represents a single XML file/item.
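    A minimal sketch of that approach (the output directory, the key/value types, and
    the assumption of one (path, contents) record per XML file are illustrative, not
    from the thread):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class XmlToSequenceFileMapper extends MapReduceBase
            implements Mapper<Text, Text, NullWritable, NullWritable> {

        private SequenceFile.Writer writer;

        public void configure(JobConf job) {
            try {
                FileSystem fs = FileSystem.get(job);
                // One output file per map task; the task id keeps file names unique
                // (assumed output directory).
                Path out = new Path("/converted/" + job.get("mapred.task.id") + ".seq");
                writer = SequenceFile.createWriter(fs, job, out, Text.class, Text.class);
            } catch (IOException e) {
                throw new RuntimeException("could not open SequenceFile writer", e);
            }
        }

        public void map(Text key, Text value,
                OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
                throws IOException {
            // Write the converted record straight to our own file; nothing is emitted
            // to the framework, so the job can run with zero reduces.
            writer.append(key, value);
        }

        public void close() throws IOException {
            writer.close();
        }
    }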

    --
    Pro Hadoop, a book to guide you from beginner to hadoop mastery,
    http://www.amazon.com/dp/1430219424?tag=jewlerymall
    www.prohadoopbook.com a community for Hadoop Professionals
  • Tarandeep Singh at Jun 18, 2009 at 4:44 pm
    Jason, correct me if I am wrong:

    Opening the sequence file in configure (or the setup method in 0.20) and writing to
    it is the same as doing output.collect(), unless you mean I should make the sequence
    file writer a static variable and set the JVM reuse flag to -1. In that case
    subsequent mappers might run in the same JVM and can use the same writer, and hence
    produce one file. But in that case I would need to add a hook to close the writer -
    maybe a shutdown hook.
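    For concreteness, a hedged sketch of that idea (the output directory is an
    assumption; with mapred.job.reuse.jvm.num.tasks set to -1, subsequent map tasks of
    the job on a node may run in the same JVM and so share this writer):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;

    public class SharedSequenceFileWriter {

        private static SequenceFile.Writer writer;

        // Called from the mapper's configure(); opens the file once per JVM, so all
        // map tasks reusing this JVM append to the same file.
        public static synchronized SequenceFile.Writer get(JobConf job) throws IOException {
            if (writer == null) {
                FileSystem fs = FileSystem.get(job);
                // Named after the first task that ran in this JVM (assumed directory).
                Path out = new Path("/converted/" + job.get("mapred.task.id") + ".seq");
                writer = SequenceFile.createWriter(fs, job, out, Text.class, Text.class);
                // Close the shared writer when the reused JVM finally exits.
                Runtime.getRuntime().addShutdownHook(new Thread() {
                    public void run() {
                        try {
                            writer.close();
                        } catch (IOException ignored) {
                        }
                    }
                });
            }
            return writer;
        }
    }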

    Jothi, the idea of a combine input format is good, but I guess I will have to write
    something of my own to make it work in my case.

    Thanks guys for the suggestions... but I feel we should have some support from the
    framework to merge the output of a map-only job, so that we don't end up with a
    large number of small files. Sometimes you just don't want to run reducers and
    unnecessarily transfer a whole lot of data across the network.

    Thanks,
    Tarandeep
  • Jason hadoop at Jun 19, 2009 at 3:07 pm
    Yes, you are correct. I had not thought about sharing a file handle across multiple
    tasks via JVM reuse.

    --
    Pro Hadoop, a book to guide you from beginner to hadoop mastery,
    http://www.amazon.com/dp/1430219424?tag=jewlerymall
    www.prohadoopbook.com a community for Hadoop Professionals

Discussion Overview
group: common-user
categories: hadoop
posted: Jun 17, '09 at 11:40p
active: Jun 19, '09 at 3:07p
posts: 5
users: 3
website: hadoop.apache.org...
irc: #hadoop
