Hi all,
I am trying a simple extension of the WordCount example in Hadoop. I want to
get the word counts sorted by frequency in descending order. To do that, I
employ a linear chain of MR jobs. The first MR job (MR-1) does the regular
word count (the usual example). For the next MR job, I set the mapper to swap
each <word, count> pair to <count, word>, then have the identity reducer
simply store the results.

My MR-1 does its job correctly and stores the result in a temp path.

Question 1: The mapper of the second MR job (MR-2) doesn't like the input
format. I have properly declared what MapClass2 expects as input and what its
output must be, yet it seems to expect a LongWritable. I suspect it is trying
to look at some index file, but I am not sure.


It throws an error like this:

<code>
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
be cast to org.apache.hadoop.io.Text
</code>

Some Info:
- I use the old API (org.apache.hadoop.mapred.*); I have been asked to stick
with it for now.
- I use hadoop-0.20.2

For MR-1:
- conf1.setOutputKeyClass(Text.class);
- conf1.setOutputValueClass(IntWritable.class);

For MR-2:
- takes in a Text (word) and an IntWritable (sum)
- conf2.setOutputKeyClass(IntWritable.class);
- conf2.setOutputValueClass(Text.class);

<code>
public class MapClass2 extends MapReduceBase
        implements Mapper<Text, IntWritable, IntWritable, Text> {

    @Override
    public void map(Text word, IntWritable sum,
                    OutputCollector<IntWritable, Text> output,
                    Reporter reporter) throws IOException {

        output.collect(sum, word); // emit <sum, word>
    }
}
</code>

Any suggestions would be helpful. Is my MapClass2 code right in the first
place for swapping? Or should I assume that the mapper reads line by line,
so I must read in one line, use StringTokenizer to split it up, and convert
the second token (sum) from String to int? Or should I mess around with the
OutputKeyComparator class?

Thanks,
PD


  • Bejoy Hadoop at Oct 15, 2011 at 8:06 am
    Hi
    I believe what is happening in your case is this:
    The first MapReduce job runs to completion. When you trigger the second MapReduce job, it runs with the default input format, TextInputFormat, which always hands the mapper a LongWritable key (the byte offset) and a Text value (the line). By default a MapReduce job's output format is TextOutputFormat, which writes the key and value separated by a tab. When another MR job needs to consume that output as key/value pairs, use KeyValueInputFormat, i.e. while setting config parameters for the second job, set
    jobConf.setInputFormat(KeyValueInputFormat.class).
    If your output key/value pairs use a separator other than the default tab, then for the second job you need to specify that as well, using key.value.separator.in.input.line.

    In short, for your case, doing the following in the second MapReduce job would get things in place:
    - use jobConf.setInputFormat(KeyValueInputFormat.class)
    - alter your mapper to accept key/value types of Text, Text
    - swap the key and values within the mapper for output, with the needed type conversions

    To be noted here: AFAIK KeyValueInputFormat is not part of the new mapreduce API.
    Hope it helps.

    Regards
    Bejoy K S
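
    Putting the three steps above together: a minimal sketch of the record-level logic the second mapper would need, shown as plain standalone Java rather than a Hadoop mapper so it runs on its own. (Assumptions: the class as shipped in 0.20.x is spelled KeyValueTextInputFormat, and the `SwapCount`/`swap` names here are illustrative only; in the real job this body would live inside map() and emit via output.collect(...).)

    ```java
    // With KeyValueTextInputFormat the second mapper receives
    // <Text word, Text count>: the count arrives as text and must be
    // parsed back to an int before the swap.
    public class SwapCount {

        // Turn one MR-1 output record ("word", "42") into the swapped
        // pair ("42", "word"), parsing the count on the way.
        static String[] swap(String word, String countText) {
            int count = Integer.parseInt(countText.trim()); // Text -> int
            return new String[] { Integer.toString(count), word };
        }

        public static void main(String[] args) {
            // MR-1 wrote lines like "hadoop<TAB>42"; KeyValueTextInputFormat
            // splits on the first tab, so the mapper sees key="hadoop",
            // value="42".
            String[] swapped = swap("hadoop", "42");
            System.out.println(swapped[0] + "\t" + swapped[1]);
        }
    }
    ```

    In the actual job, the equivalent would be conf2.setInputFormat(KeyValueTextInputFormat.class) plus a Mapper<Text, Text, IntWritable, Text> whose map() does output.collect(new IntWritable(count), word).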

    -----Original Message-----
    From: "Periya.Data" <periya.data@gmail.com>
    Date: Fri, 14 Oct 2011 17:31:27
    To: <common-user@hadoop.apache.org>; <cdh-user@cloudera.org>
    Reply-To: common-user@hadoop.apache.org
    Subject: mapreduce linear chaining: ClassCastException

  • Periya.Data at Oct 15, 2011 at 5:59 pm
    Fantastic! Thanks much, Bejoy. Now I am able to get the output of my MR-2
    nicely. I had to convert the sum from Text to IntWritable, and I am able
    to get all the word frequencies as <Freq, Word> pairs in ascending order.
    I used KeyValueTextInputFormat.class; my program was complaining when I
    used KeyValueInputFormat.

    Now, let me investigate how to do that in descending order, and then
    top-20, etc. I know I must look into RawComparator and more.

    Thanks,
    PD.
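
    For the descending-order follow-up, the usual route in the old API is to register a comparator on the map output key via conf.setOutputKeyComparatorClass(...). A self-contained sketch of just the ordering that comparator would impose, using plain ints rather than IntWritable (the `DescendingCounts` class name is made up for illustration):

    ```java
    import java.util.Arrays;
    import java.util.Comparator;

    // Sketch of the ordering a descending-key comparator would impose.
    // In the real job this compare() would sit in a WritableComparator
    // subclass registered with conf2.setOutputKeyComparatorClass(...).
    public class DescendingCounts {

        // Reverse of the natural int order: larger counts sort first.
        static final Comparator<Integer> DESCENDING =
                (a, b) -> Integer.compare(b, a);

        public static void main(String[] args) {
            Integer[] counts = { 3, 42, 7, 19 };
            Arrays.sort(counts, DESCENDING);
            System.out.println(Arrays.toString(counts)); // [42, 19, 7, 3]
        }
    }
    ```

    Once the keys arrive at the reducer largest-first, a top-20 is just the first 20 records the (single) reducer sees.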
  • Bejoy Hadoop at Oct 15, 2011 at 7:09 pm
    Great!

    Sorry about KeyValueInputFormat; it is KeyValueTextInputFormat. I was replying from my handheld and recalling the class name from memory, so excuse me for that. :)

    For your further requirements, like descending order, playing around with a Comparator is required, I believe.

    Thank you

    Regards
    Bejoy K S


Discussion Overview
group: common-user @ hadoop
posted: Oct 15, '11 at 12:32a
active: Oct 15, '11 at 7:09p
posts: 5
users: 2 (Bejoy Hadoop: 3 posts, Periya.Data: 2 posts)
website: hadoop.apache.org...
irc: #hadoop
