Jason, correct me if I am wrong:
Opening a SequenceFile in the configure method (or setup in 0.20) and writing
to it is the same as doing output.collect(), unless you mean I should make the
SequenceFile writer a static variable and set the JVM reuse flag
(mapred.job.reuse.jvm.num.tasks) to -1. In that case subsequent mappers may
run in the same JVM, share the same writer, and hence produce one file. But
then I need to add a hook to close the writer, maybe a shutdown hook.
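Something like this is what I have in mind, an untested sketch against the old
mapred API with illustrative names, assuming mapred.job.reuse.jvm.num.tasks is
set to -1 so map tasks share a JVM:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SharedWriterMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  // Shared by every map task that runs in this (reused) JVM.
  private static SequenceFile.Writer writer;

  public void configure(JobConf job) {
    synchronized (SharedWriterMapper.class) {
      if (writer != null) return; // an earlier task in this JVM opened it
      try {
        FileSystem fs = FileSystem.get(job);
        // Name the file after the first task that runs in this JVM.
        Path out = new Path(FileOutputFormat.getOutputPath(job),
            "seq-" + job.get("mapred.task.id"));
        writer = SequenceFile.createWriter(fs, job, out,
            LongWritable.class, Text.class);
        // The shutdown hook closes the shared writer when the JVM exits.
        Runtime.getRuntime().addShutdownHook(new Thread() {
          public void run() {
            try { writer.close(); } catch (IOException ignored) { }
          }
        });
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> output,
                  Reporter reporter) throws IOException {
    writer.append(key, value); // instead of output.collect()
  }
}

Note this writes straight into the job output directory, bypassing the normal
task-commit path, so speculative execution would have to be turned off.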
Jothi, the idea of a combine input format is good, but I guess I will have to
write something of my own to make it work in my case.
Thanks guys for the suggestions... but I feel we should have some support
from the framework to merge the output of map-only jobs, so that we don't
end up with a large number of small files. Sometimes you just don't want to
run reducers and unnecessarily transfer a whole lot of data across the network.
Thanks,
Tarandeep
On Wed, Jun 17, 2009 at 7:57 PM, jason hadoop wrote:
You can open your sequence file in the mapper configure method, write to it
in your map, and close it in the mapper close method.
Then you end up with one sequence file per map. I am making the assumption
that each key/value pair given to your map somehow represents a single XML
file/item.
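For instance, a rough, untested sketch against the old mapred API (the file
name and key/value types are just illustrative):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class XmlToSeqMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  private SequenceFile.Writer writer;

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // One SequenceFile per map task, named after the task id.
      Path out = new Path(FileOutputFormat.getOutputPath(job),
          "seq-" + job.get("mapred.task.id"));
      writer = SequenceFile.createWriter(fs, job, out,
          LongWritable.class, Text.class);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text xml,
                  OutputCollector<LongWritable, Text> output,
                  Reporter reporter) throws IOException {
    // Write the converted record directly; no output.collect().
    writer.append(key, xml);
  }

  public void close() throws IOException {
    writer.close();
  }
}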
On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan <[email protected]> wrote:
You could look at CombineFileInputFormat to generate a single split out of
several files.
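For example, something along these lines might work (untested sketch against
0.20's mapred.lib; XmlRecordReader is a hypothetical reader for your XML
records and needs a (CombineFileSplit, Configuration, Reporter, Integer)
constructor, since CombineFileRecordReader instantiates one per file):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

public class CombinedXmlInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  public CombinedXmlInputFormat() {
    // Pack small XML files together until a split reaches ~256MB.
    setMaxSplitSize(256L * 1024 * 1024);
  }

  @SuppressWarnings("unchecked")
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // Hands each file in the combined split to a fresh
    // XmlRecordReader (hypothetical, not shown here).
    return new CombineFileRecordReader(
        job, (CombineFileSplit) split, reporter, XmlRecordReader.class);
  }
}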
Your partitioner would be able to assign keys to specific reducers, but you
would not have control over which node a given reduce task runs on.
Jothi
On 6/18/09 5:10 AM, "Tarandeep Singh" wrote:
Hi,
Can I restrict the output of mappers running on a node to go to reducer(s)
running on the same node?
Let me explain why I want to do this-
I am converting a huge number of XML files into SequenceFiles. So
theoretically I don't even need reducers; mappers would read the XML files
and output SequenceFiles. But the problem with this approach is that I will
end up with a huge number of small output files.
To avoid generating a large number of small files, I can run Identity
reducers. But by running reducers, I am unnecessarily transferring data over
the network. I ran a test using a small subset of my data (~90GB). With a
map-only job, my cluster finished the conversion in only 6 minutes, but with
maps plus Identity reducers the same job takes around 38 minutes.
I have to process close to a terabyte of data, so I was thinking of faster
alternatives-
* Writing a custom OutputFormat
* Somehow restrict the output of mappers running on a node to go to reducers
running on the same node. Maybe I can write my own partitioner (simple; see
the sketch after this list), but I am not sure how Hadoop's framework assigns
partitions to reduce tasks.
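For reference, a partitioner in the old mapred API just picks the partition
number, and partition i always feeds reduce task i; it has no say in which
node that reduce task runs on. A minimal sketch:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class ModuloPartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf job) { }

  // Chooses the partition *number* for each key; the framework,
  // not this class, decides where reduce task i actually runs.
  public int getPartition(Text key, Text value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}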
Any pointers?
Or is this not possible at all?
Thanks,
Tarandeep
--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com - a community for Hadoop Professionals