Hi,

I am new to Hadoop and need some help with writing bytes from the map and
reduce functions. My map input (K1, V1) is Text, Text. I want map to output
(K2, V2), where K2 is Text and V2 is some format that represents bytes; at the
moment I use BytesWritable for V2. The reduce function then takes Text,
BytesWritable as its input, and I again want to output (K2, V2) as Text plus
some format representing bytes, as in the map case.

Is BytesWritable the correct format for this requirement? Also, what output
format should I provide in the job configuration?

Thank you,
Saliya


  • welman Lu at Mar 31, 2010 at 7:16 am
    Hi, Saliya,

    The data transformation in MapReduce is:

    *map* (k1, v1) -> list(k2, v2)
    *reduce* (k2, list(v2)) -> list(k3, v3)

    The output of the map is sent directly to the reducer as its input, so your
    reduce function can only take k2 and v2 as its input types. In your case,
    the types of the data should be:
    k1 = Text | v1 = Text
    k2 = Text | v2 = BytesWritable
    k3 = Text | v3 = BytesWritable

    Hence, for your code I think you can write the following. In the job
    configuration:
    JobConf conf = new JobConf(YourClass.class);
    conf.setOutputKeyClass(k3.class);
    conf.setOutputValueClass(v3.class);

    Then declare your map class as:

    class YourMapClass extends MapReduceBase
            implements Mapper<k1, v1, k2, v2> {
        ...
    }

    If your map output types (k2, v2) differ from the final output types
    (k3, v3), you can additionally set in the job configuration:
    conf.setMapOutputKeyClass(k2.class);
    conf.setMapOutputValueClass(v2.class);
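
    Putting the above together, a minimal sketch with concrete types could look
    like this (assuming Text keys, BytesWritable values, and the old
    org.apache.hadoop.mapred API; the class names and the choice of input
    format are only illustrative):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class BytesJob {

        // Map: (Text, Text) -> (Text, BytesWritable)
        public static class MyMap extends MapReduceBase
                implements Mapper<Text, Text, Text, BytesWritable> {
            public void map(Text key, Text value,
                            OutputCollector<Text, BytesWritable> output,
                            Reporter reporter) throws IOException {
                // Placeholder: derive the bytes you want from the input value.
                byte[] bytes = value.toString().getBytes();
                output.collect(key, new BytesWritable(bytes));
            }
        }

        // Reduce: (Text, list(BytesWritable)) -> (Text, BytesWritable)
        public static class MyReduce extends MapReduceBase
                implements Reducer<Text, BytesWritable, Text, BytesWritable> {
            public void reduce(Text key, Iterator<BytesWritable> values,
                               OutputCollector<Text, BytesWritable> output,
                               Reporter reporter) throws IOException {
                while (values.hasNext()) {
                    output.collect(key, values.next());
                }
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(BytesJob.class);
            conf.setMapperClass(MyMap.class);
            conf.setReducerClass(MyReduce.class);
            conf.setOutputKeyClass(Text.class);             // k3
            conf.setOutputValueClass(BytesWritable.class);  // v3
            // Map output types equal the final output types here, so the
            // setMapOutputKeyClass/setMapOutputValueClass calls are optional.
            conf.setInputFormat(KeyValueTextInputFormat.class);   // (Text, Text) input
            conf.setOutputFormat(SequenceFileOutputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }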

    Hope this can help you!


    Best Regards
    Jiamin Lu
  • Saliya Ekanayake at Mar 31, 2010 at 1:05 pm
    Hi Jiamin,

    Thank you for the quick reply. I have actually done it like this, and I
    have set the output format to SequenceFileOutputFormat. Everything works
    well and I get the reduced outputs (part files).

    Now I want to read them from a separate Java program to display the content
    in a GUI, but I couldn't find a way to read these part files. Is this
    possible?

    Thank you,
    Saliya
    --
    Saliya Ekanayake
    http://www.esaliya.blogspot.com
    http://www.esaliya.wordpress.com
  • welman Lu at Mar 31, 2010 at 1:35 pm
    Hi, Saliya,

    By the part files, I think you mean the results of the reduce function
    stored in HDFS, right? If so, I think this example from "Hadoop: The
    Definitive Guide" can help you.
    -------------
    Example 3-1. Displaying files from a Hadoop filesystem on standard output
    using a URLStreamHandler

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
    import org.apache.hadoop.io.IOUtils;

    public class URLCat {
        static {
            URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
        }

        public static void main(String[] args) throws Exception {
            InputStream in = null;
            try {
                // Open the URL given as the first argument and copy it to stdout.
                in = new URL(args[0]).openStream();
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }

    Give it a try, good luck!


    Best Regards
    Jiamin Lu
  • Saliya Ekanayake at Mar 31, 2010 at 4:40 pm
    Hi Jiamin,

    Thank you once again. Let me explain my scenario a bit. I am using Amazon
    Elastic MapReduce, so the output file is written to a folder inside S3.

    I have only a single reduce task and inside that I do,

    byte[] bytes = some-code-to-generate-bytes
    output.collect(new Text("key"), new BytesWritable(bytes));

    In the main method I have set the output format of the job configuration to
    SequenceFileOutputFormat.


    Now when I run this, it creates a file in the given S3 output directory as
    expected. I have a Java client on my local machine that downloads this file
    from S3 and tries to read it. The issue comes when reading this file,
    because I am not sure how I can read it to get back the original bytes I
    wrote from the reduce task. I looked into SequenceFileOutputFormat, and it
    seems the file contains a header and a body. So do I have to read it
    manually as bytes and extract the portion I need, or is there a built-in
    API class to read such a file?
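
    For reference, SequenceFile.Reader is the built-in class for reading such
    files. A minimal sketch of reading a downloaded part file with it, assuming
    the key and value types are Text and BytesWritable (the path handling and
    printing are only illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PartFileReader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Read from the local filesystem, e.g. a part-00000 downloaded from S3.
            FileSystem fs = FileSystem.getLocal(conf);
            Path path = new Path(args[0]);
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
                Text key = new Text();
                BytesWritable value = new BytesWritable();
                while (reader.next(key, value)) {
                    // getBytes() returns the backing array padded to capacity;
                    // only the first getLength() bytes are the real payload.
                    byte[] bytes = new byte[value.getLength()];
                    System.arraycopy(value.getBytes(), 0, bytes, 0, value.getLength());
                    System.out.println(key + " : " + bytes.length + " bytes");
                }
            } finally {
                reader.close();
            }
        }
    }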

    Thank you
    Saliya
    --
    Saliya Ekanayake
    http://www.esaliya.blogspot.com
    http://www.esaliya.wordpress.com
  • Saliya Ekanayake at Apr 1, 2010 at 5:39 pm
    Hi Jiamin,

    Thank you for the previous feedback. In fact, I was able to solve the
    problem by writing a custom OutputFormat, which simply writes the byte
    values I want.
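
    A minimal sketch of what such a custom OutputFormat might look like with
    the old org.apache.hadoop.mapred API (the class name and details are
    illustrative, not necessarily the actual code used):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordWriter;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.util.Progressable;

    // Writes only the raw value bytes to the output file: no keys, no header,
    // and no SequenceFile framing.
    public class RawBytesOutputFormat extends FileOutputFormat<Text, BytesWritable> {

        public RecordWriter<Text, BytesWritable> getRecordWriter(
                FileSystem ignored, JobConf job, String name, Progressable progress)
                throws IOException {
            Path file = FileOutputFormat.getTaskOutputPath(job, name);
            FileSystem fs = file.getFileSystem(job);
            final FSDataOutputStream out = fs.create(file, progress);
            return new RecordWriter<Text, BytesWritable>() {
                public void write(Text key, BytesWritable value) throws IOException {
                    // Only the first getLength() bytes of the backing array are valid.
                    out.write(value.getBytes(), 0, value.getLength());
                }
                public void close(Reporter reporter) throws IOException {
                    out.close();
                }
            };
        }
    }

    It would be plugged in with conf.setOutputFormat(RawBytesOutputFormat.class)
    in the job configuration.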

    Regards,
    Saliya
    --
    Saliya Ekanayake
    http://www.esaliya.blogspot.com
    http://www.esaliya.wordpress.com
  • Stephen Watt at Apr 1, 2010 at 8:04 pm
    Hi Folks

    The Hadoop functional tests (hadoop-*-test.jar) that ship with Hadoop and
    can be run as part of the build include a TestTotalOrderPartitioner JUnit
    test case. One of the asserts in that test case, "testmemcmp", passes with
    Sun Java but fails with IBM Java. The test iterates through a number of
    Text objects called splitStrings and writes them serially to a sequence
    file. It then iterates through a DIFFERENT Text array called testStrings
    and passes each text value to the getPartition method to retrieve which
    partition that "key" is in.

    It fails on the assert when it tries to get the partition for the key "z",
    which exists in the testStrings array but not in the splitStrings array
    (the values written to the sequence file). The assert expects a partition
    value of 9 but gets 0. They don't match, so the assert fails.
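
    For context, partition lookup against split points can be pictured with
    this rough sketch (an illustration of the general idea, not Hadoop's actual
    TotalOrderPartitioner code): with N sorted split points there are N + 1
    partitions, and a key such as "z" that sorts after every split point lands
    in the last partition rather than producing no result.

    import java.util.Arrays;

    public class SplitPointLookup {

        // Illustration only: map a key to a partition using binary search over
        // the sorted split points.
        static int findPartition(String key, String[] sortedSplitPoints) {
            int pos = Arrays.binarySearch(sortedSplitPoints, key);
            // For a missing key, binarySearch returns -(insertionPoint) - 1; the
            // insertion point is the number of split points less than the key,
            // which is exactly the partition index.
            return pos < 0 ? -(pos + 1) : pos + 1;
        }

        public static void main(String[] args) {
            // 9 hypothetical split points -> 10 partitions
            String[] splits = {"b", "d", "f", "h", "j", "l", "n", "p", "r"};
            System.out.println(findPartition("a", splits)); // 0
            System.out.println(findPartition("z", splits)); // 9, after every split point
        }
    }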

    To me it seems there is a bug in this test case. I have no idea why Sun
    Java returns a partition for this non-existent key. I'm new to partitions,
    so I wanted to run this by the list before I open a bug.

    Kind regards
    Steve Watt
