I was hoping to use -inputformat SequenceFileAsTextInputFormat to process compressed SequenceFiles in streaming jobs.
However, with a Python mapper that just echoes each line it receives, and numreducetasks=0, here's what the streaming job output looks like:
SEQ^F org.apache.hadoop.io.IntWritable^Yorg.apache.hadoop.io.Text^A^A'org.apache.hadoop.io.compress.GzipCodec^@^@^@^@Z+r������^F�
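So it seems the input file was not treated as a SequenceFile: the mapper received the raw file header (the SEQ magic plus the key, value, and codec class names) followed by compressed bytes, rather than decoded key/value text.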
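
For reference, the mapper is nothing more than an identity filter over stdin. Below is a minimal sketch of that kind of script, not necessarily the exact one that was run; the function name echo is only illustrative. It assumes the streaming framework delivers one record per line on stdin.

```python
#!/usr/bin/env python
import sys


def echo(lines, out):
    """Copy every input line to `out` unchanged and return the line count."""
    count = 0
    for line in lines:
        # Each line already carries its trailing newline, so write it as-is.
        out.write(line)
        count += 1
    return count


if __name__ == "__main__":
    # Streaming feeds records on stdin and collects results from stdout.
    echo(sys.stdin, sys.stdout)
```

With a mapper like this and no reducers, whatever the input format hands to the mapper shows up verbatim in the job output, which is why the raw header above is visible.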
I must be missing some arguments, but I can't figure out which. Any help appreciated.
Thx,
Joydeep