Map/Reduce and sequence file metadata...

Hi all. I'm struggling a bit to figure this out and wondering if anyone had any pointers.

I'm using SequenceFiles as output from a MapReduce job (using SequenceFileOutputFormat) and then reading the results in a follow-up MapReduce job using SequenceFileInputFormat. All seems to work fine. What I haven't figured out is how to write the SequenceFile.Metadata in SequenceFileOutputFormat and then read the metadata back in SequenceFileInputFormat. Is that possible to do using the new mapreduce.* API?

I have two types of files I want to process in the Mapper. Currently I'm calling context.getInputSplit() and parsing the path of the resulting FileSplit to determine which file I'm processing. It seems cleaner to use the SequenceFile.Metadata if I can. Does that make sense, or am I off in the weeds?
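
For reference, the current approach looks roughly like this (the "typeA" naming convention is just an illustration):

    // Inside the Mapper (new API); FileSplit is
    // org.apache.hadoop.mapreduce.lib.input.FileSplit
    FileSplit split = (FileSplit) context.getInputSplit();
    String fileName = split.getPath().getName();
    boolean isTypeA = fileName.startsWith("typeA"); // naming is illustrative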

Thanks

Andy

  • Tom White at Oct 2, 2009 at 9:26 am

    On Thu, Oct 1, 2009 at 5:10 PM, Andy Sautins wrote:

    > Hi all. I'm struggling a bit to figure this out and wondering if anyone had any pointers.
    >
    > I'm using SequenceFiles as output from a MapReduce job (using SequenceFileOutputFormat) and then reading the results in a follow-up MapReduce job using SequenceFileInputFormat. All seems to work fine. What I haven't figured out is how to write the SequenceFile.Metadata in SequenceFileOutputFormat and then read the metadata back in SequenceFileInputFormat. Is that possible to do using the new mapreduce.* API?

    By default no SequenceFile metadata is written by
    SequenceFileOutputFormat. SequenceFile metadata is written at the
    beginning of the file, so it needs to be passed in when the
    SequenceFile is opened. One way of doing this would be to extend
    SequenceFileOutputFormat and override the getSequenceWriter() method
    to call the SequenceFile.createWriter() factory method that takes
    metadata.
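
    A rough, untested sketch of that idea (assuming the protected
    getSequenceWriter() hook has this signature, and using an
    illustrative "file.type" metadata key):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.SequenceFile.CompressionType;
        import org.apache.hadoop.io.SequenceFile.Metadata;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.DefaultCodec;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;
        import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
        import org.apache.hadoop.util.ReflectionUtils;

        public class MetadataSequenceFileOutputFormat<K, V>
            extends SequenceFileOutputFormat<K, V> {

          @Override
          protected SequenceFile.Writer getSequenceWriter(TaskAttemptContext context,
              Class<?> keyClass, Class<?> valueClass) throws IOException {
            Configuration conf = context.getConfiguration();

            // Mirror the compression handling of the default implementation
            CompressionCodec codec = null;
            CompressionType compressionType = CompressionType.NONE;
            if (getCompressOutput(context)) {
              compressionType = getOutputCompressionType(context);
              Class<? extends CompressionCodec> codecClass =
                  getOutputCompressorClass(context, DefaultCodec.class);
              codec = ReflectionUtils.newInstance(codecClass, conf);
            }

            // Stamp the metadata into the file header; "file.type" is just an example
            Metadata metadata = new Metadata();
            metadata.set(new Text("file.type"), new Text("typeA"));

            Path file = getDefaultWorkFile(context, "");
            FileSystem fs = file.getFileSystem(conf);
            return SequenceFile.createWriter(fs, conf, file, keyClass, valueClass,
                compressionType, codec, context, metadata);
          }
        }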

    > I have two types of files I want to process in the Mapper. Currently I'm calling context.getInputSplit() and parsing the path of the resulting FileSplit to determine which file I'm processing. It seems cleaner to use the SequenceFile.Metadata if I can. Does that make sense, or am I off in the weeds?

    Another approach would be to use MultipleInputs, which allows you
    to use different mappers for different input paths. Could this help?
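
    For example (the paths and mapper classes are placeholders; the
    new-API MultipleInputs lives in org.apache.hadoop.mapreduce.lib.input):

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
        import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

        // Given a configured Job instance `job`: route each input directory
        // to its own mapper; both feed the same reduce
        MultipleInputs.addInputPath(job, new Path("/data/typeA"),
            SequenceFileInputFormat.class, TypeAMapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/typeB"),
            SequenceFileInputFormat.class, TypeBMapper.class);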

    > Thanks
    >
    > Andy
  • Andy Sautins at Oct 2, 2009 at 8:49 pm
    Thanks for the response, Tom. I'll probably try the approach of extending SequenceFileOutputFormat to write the sequence file metadata.

    What I take from your response is that using sequence file metadata isn't that common, especially for sequence files generated as MapReduce output. It sounds like using MultipleInputs, with the different file types in different locations, is the more common way of feeding them into the same job. Does that sound right?

    Thanks again for the insight.
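
    For the record, reading the metadata back in the follow-up job's
    Mapper could look roughly like this (the key/value types and the
    "file.type" key are the illustrative ones from above):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileSplit;

        public class TypeAwareMapper extends Mapper<Text, Text, Text, Text> {

          private Text fileType;

          @Override
          protected void setup(Context context)
              throws IOException, InterruptedException {
            // Open the split's file once and read the header metadata
            FileSplit split = (FileSplit) context.getInputSplit();
            Path path = split.getPath();
            Configuration conf = context.getConfiguration();
            SequenceFile.Reader reader =
                new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
            try {
              SequenceFile.Metadata metadata = reader.getMetadata();
              fileType = metadata.get(new Text("file.type"));
            } finally {
              reader.close();
            }
          }

          @Override
          protected void map(Text key, Text value, Context context)
              throws IOException, InterruptedException {
            // Branch on fileType to handle the two record types
          }
        }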

