Hi Alex,

My original files are ascii text. I was using <Object, Text, Text, Text>
and everything worked fine.
Because my files are small (~2MB on avg.) I get one map task per file.
For my test I had 2000 files, totalling 5GB and the whole run took
approx 40 minutes.

I read that I could improve performance by merging my original files
into one big SequenceFile.

I did that, and that's why I'm trying to use <Object, BytesWritable, Text, Text>.
My new SequenceFile is only 444MB, so my m/r job triggered 7 map tasks,
but apparently my new map() is computationally more intensive and the
whole run now takes 64 minutes.
In my map(Text key, BytesWritable value, Context context), value contains
the contents of a whole file. I tried to break it down into line-based
records which I send to reduce().

StringBuilder line = new StringBuilder();
char linefeed = '\n';
byte[] buf = value.getBytes();
// Note: getBytes() returns the padded backing array, so only the
// first getLength() bytes are valid.
for (int i = 0; i < value.getLength(); i++) {
    if (buf[i] == (byte) linefeed) {
        process_line(line.toString(), context);
        line.delete(0, line.length());
    } else {
        line.append((char) buf[i]);
    }
}
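[Editor's note: as a sketch of a simpler alternative, using only standard Java and assuming the file contents are UTF-8 text, the whole value can be decoded once and split on newlines. The class and method names here are hypothetical, not part of Hadoop:]

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class BytesToLines {
    // Decode only the valid region of the buffer. BytesWritable.getBytes()
    // may return a padded backing array, so the caller passes getLength().
    static List<String> toLines(byte[] buf, int length) {
        String text = new String(buf, 0, length, StandardCharsets.UTF_8);
        return Arrays.asList(text.split("\n"));
    }

    public static void main(String[] args) {
        // Simulate a padded buffer: 8 valid bytes followed by junk.
        byte[] padded = "a\tb\nc\td\n\0\0\0\0".getBytes(StandardCharsets.UTF_8);
        System.out.println(toLines(padded, 8));
    }
}
```

[In the mapper this would be called as toLines(value.getBytes(), value.getLength()), with each returned line handed to process_line().]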

On 07/08/2010 11:22 PM, Alex Kozlov wrote:
Hi Alan,

Is the content of the original file ascii text? Then you should be
using the <Object, Text, Text, Text> signature. By default 'hadoop fs
-text ...' will just call toString() on the object. You get the
object itself in the map() method and can do whatever you want with
it. If Text or BytesWritable does not work for you, you can always
write your own class implementing Writable.
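[Editor's note: to illustrate the Writable idea, here is a sketch of the serialization contract using only java.io. The class name FileRecord is hypothetical; a real version would declare "implements org.apache.hadoop.io.Writable", whose interface consists of exactly these two methods:]

```java
import java.io.*;

public class FileRecord {
    public String name = "";
    public byte[] contents = new byte[0];

    // Writable contract: serialize the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(contents.length);
        out.write(contents);
    }

    // ...and read them back in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        contents = new byte[in.readInt()];
        in.readFully(contents);
    }

    public static void main(String[] args) throws IOException {
        FileRecord a = new FileRecord();
        a.name = "peemt114.log";
        a.contents = "hello".getBytes("UTF-8");

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        a.write(new DataOutputStream(bos));

        FileRecord b = new FileRecord();
        b.readFields(new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(b.name + " " + new String(b.contents, "UTF-8"));
        // prints: peemt114.log hello
    }
}
```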

Let me know if you need more details on how to do this.

Alex K

On Thu, Jul 8, 2010 at 1:59 PM, Alan Miller wrote:

Hi Alex,

I'm not sure what you mean. I already set my mapper's signature to:

public class MyMapper extends Mapper<Object, BytesWritable,
Text, Text> {
public void map(Text key, BytesWritable value, Context context)

In my map() loop the contents of value is the text from the original
file, and value.toString() returns a String of bytes as hex pairs
separated by spaces. But I'd like the original tab-separated list of
strings (i.e. the lines in my original files).

I see BytesWritable.getBytes() returns a byte[]. I guess I could
write my own RecordReader to convert the byte[] back to text strings,
but I thought this was something the framework would provide.


On 07/08/2010 08:42 PM, Alex Loddengaard wrote:
Hi Alan,

SequenceFiles keep track of the key and value type, so you should
be able to use the Writables in the signature. Though it looks
like you're using the new API, and I admit that I'm not an expert
with the new API. Have you tried using the Writables in the
signature?

On Thu, Jul 8, 2010 at 6:44 AM, Some Body wrote:

To get around the small-file-problem (I have thousands of 2MB
log files) I wrote
a class to convert all my log files into a single SequenceFile in
(Text key, BytesWritable value) format. That works fine. I
can run this:

hadoop fs -text /my.seq |grep peemt114.log | head -1
10/07/08 15:02:10 INFO util.NativeCodeLoader: Loaded the
native-hadoop library
10/07/08 15:02:10 INFO zlib.ZlibFactory: Successfully
loaded & initialized native-zlib library
10/07/08 15:02:10 INFO compress.CodecPool: Got brand-new
peemt114.log 70 65 65 6d 74 31 31 34 09

which shows my file-name key (peemt114.log)
and my file-contents value, which appears to be converted to hex.
The hex values up to the first tab (09) translate to 'peemt114'.
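[Editor's note: the hex pairs shown by 'hadoop fs -text' can be decoded back by hand. A small sketch, assuming single-byte ascii as in these logs; the class and method names are hypothetical:]

```java
public class HexDecode {
    // Turn "70 65 ..." (space-separated hex byte pairs) back into text.
    static String decodeHexPairs(String hexPairs) {
        StringBuilder sb = new StringBuilder();
        for (String h : hexPairs.trim().split("\\s+")) {
            sb.append((char) Integer.parseInt(h, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The value prefix from the -text output above.
        System.out.println(decodeHexPairs("70 65 65 6d 74 31 31 34"));
        // prints: peemt114
    }
}
```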

I'm trying to adapt my mapper to use the SequenceFile as input.

I changed the job's inputFormatClass, and modified my mapper signature to:
public class MyMapper extends Mapper<Object, BytesWritable,
Text, Text> {

but how do I convert the value back to Text? When I print out
the key,values using:
System.out.printf("MAPPER INKEY: [%s]\n", key);
System.out.printf("MAPPER INVAL: [%s]\n", value);
I get:
MAPPER INKEY: [peemt114.log]
MAPPER INVAL: [70 65 65 6d 74 31 31 34 09 .....[snip]......]


Discussion overview: mapreduce-user; posted Jul 8, '10 at 1:49p; active Jul 9, '10 at 6:42p.