  • Vyacheslav Zholudev at Jul 27, 2011 at 7:37 am
    Hi,

    I'm using the Avro format for both input and output, in a mapper and a
    reducer. I would like to output multiple Avro items with different schemata.
    For sequence files I would use the MultipleOutputs class from the mapreduce
    package.

    I looked into the same class in the old "mapred" package and realized
    that I can pass AvroOutputFormat.class as a parameter when adding another
    output. However, I couldn't figure out how to provide an Avro schema
    for each output. Moreover, when writing to an output, I need to provide a
    key and a value, but with Avro we usually just pass a specific Avro
    object. All of the above makes me think that the old MultipleOutputs API
    won't work with Avro files. Am I right?

    Any pointers on how to output multiple Avro records in the same reducer are
    appreciated.

    P.S. Another thought was to create an Avro schema of type union that
    contains all possible output schemata, but I would like to avoid that.

    Thanks in advance!!!

    --
    Best,
    Vyacheslav
  • Jason at Jul 30, 2011 at 5:26 pm
    You can extend/customize MultipleOutputs and pass schema-related settings via properties prefixed with the MultipleOutputs (MO) output name, just as it is done for the format classes there.

    Also, to send a dummy key or value, why not just use NullWritable? It's efficient, as it does not consume any space.

  • Vyacheslav Zholudev at Jul 30, 2011 at 7:09 pm
    Thanks, Jason. I will try that.

    Vyacheslav
  • Vyacheslav Zholudev at Aug 4, 2011 at 8:42 am
    Hi all,

    I tried to follow the suggestions, looked at the code to see how Avro is wired into mappers and reducers, and wrote a simple class for Avro multiple outputs. If you are interested in looking at it or reviewing it, here is the link:
    http://pastebin.com/HMPfgttg

    Any suggestions and comments are highly appreciated.

    Vyacheslav
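
For readers following along, Jason's suggestion (store each output's schema under a property key that is prefixed with the named output) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the pastebin link and not part of Hadoop or Avro: the class name NamedOutputSchemas, the key prefix "example.mo." and the suffix ".avro.schema" are all made up here, and a plain java.util.Properties stands in for the job configuration so that the example runs without Hadoop on the classpath.

```java
import java.util.Properties;

/**
 * Sketch: per-output schema settings stored under keys that are
 * prefixed with the named output. All names here are hypothetical.
 */
public class NamedOutputSchemas {
    // Hypothetical key layout, not taken from Hadoop or Avro.
    private static final String PREFIX = "example.mo.";
    private static final String SCHEMA_SUFFIX = ".avro.schema";

    static String schemaKey(String namedOutput) {
        return PREFIX + namedOutput + SCHEMA_SUFFIX;
    }

    /** Job-setup side: remember the schema JSON for one named output. */
    public static void setSchema(Properties conf, String namedOutput, String schemaJson) {
        // Restricting names to letters and digits keeps the derived key unambiguous.
        if (!namedOutput.matches("[A-Za-z0-9]+")) {
            throw new IllegalArgumentException("Invalid named output: " + namedOutput);
        }
        conf.setProperty(schemaKey(namedOutput), schemaJson);
    }

    /** Task side: look the schema up again before opening a writer for that output. */
    public static String getSchema(Properties conf, String namedOutput) {
        String json = conf.getProperty(schemaKey(namedOutput));
        if (json == null) {
            throw new IllegalStateException("No schema configured for output: " + namedOutput);
        }
        return json;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        setSchema(conf, "users",
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
        setSchema(conf, "counts", "\"long\"");
        System.out.println(getSchema(conf, "counts"));
    }
}
```

In an actual job the same two calls would presumably go against the JobConf (which also stores string properties), and the task would parse the JSON string back into an Avro schema before creating the writer for that output. The record itself can then be written with a NullWritable as the unused half of the key/value pair, as suggested above.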

Discussion Overview
group: mapreduce-user @
categories: hadoop
posted: Jul 26, '11 at 12:47p
active: Aug 4, '11 at 8:42a
posts: 5
users: 2
website: hadoop.apache.org...
irc: #hadoop