To get around the small-file problem (I have thousands of 2MB log files) I wrote
a class to convert all my log files into a single SequenceFile in
(Text key, BytesWritable value) format. That works fine. I can run this:

hadoop fs -text /my.seq |grep peemt114.log | head -1
10/07/08 15:02:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
10/07/08 15:02:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib
library
10/07/08 15:02:10 INFO compress.CodecPool: Got brand-new decompressor
peemt114.log 70 65 65 6d 74 31 31 34 09 .........[snip].......

which shows my file-name key (peemt114.log)
and the file-contents value, which appears to be converted to hex.
The hex values up to the first tab (09) translate to my hostname.

I'm trying to adapt my mapper to use the SequenceFile as input.

I changed the job's inputFormatClass to:
MyJob.setInputFormatClass(SequenceFileInputFormat.class);
and modified my mapper signature to:
public class MyMapper extends Mapper<Object, BytesWritable, Text, Text> {

but how do I convert the value back to Text? When I print out the keys/values using:
System.out.printf("MAPPER INKEY: [%s]\n", key);
System.out.printf("MAPPER INVAL: [%s]\n", value.toString());
I get:
MAPPER INKEY: [peemt114.log]
MAPPER INVAL: [70 65 65 6d 74 31 31 34 09 .....[snip]......]

Alan
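The hex dump above can be decoded back to a string directly from the raw bytes. A minimal sketch in plain Java (the buffer and valid length here are hypothetical stand-ins for BytesWritable.getBytes() and BytesWritable.getLength(); the backing array may be padded past the valid length):

```java
import java.nio.charset.StandardCharsets;

public class DecodeSketch {
    public static void main(String[] args) {
        // Stand-in for value.getBytes(): the first 9 bytes are valid,
        // the trailing zeros simulate buffer padding.
        byte[] buf = {0x70, 0x65, 0x65, 0x6d, 0x74, 0x31, 0x31, 0x34, 0x09, 0x00, 0x00};
        int len = 9; // stand-in for value.getLength()

        // Decode only the valid region back to text.
        String text = new String(buf, 0, len, StandardCharsets.UTF_8);
        System.out.println(text); // "peemt114" followed by a tab
    }
}
```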


  • Alex Loddengaard at Jul 8, 2010 at 6:43 pm
    Hi Alan,

    SequenceFiles keep track of the key and value type, so you should be able to
    use the Writables in the signature. Though it looks like you're using the
    new API, and I admit that I'm not an expert with the new API. Have you
    tried using the Writables in the signature?

    Alex
    On Thu, Jul 8, 2010 at 6:44 AM, Some Body wrote:

  • Alan Miller at Jul 8, 2010 at 9:05 pm
    Hi Alex,

    I'm not sure what you mean. I already set my mapper's signature to:
    public class MyMapper extends Mapper<Object, BytesWritable, Text, Text> {
        ...
        public void map(Text key, BytesWritable value, Context context) {
            ...
        }
    }

    In my map() loop the contents of value are the text from the original file,
    and value.toString() returns a String of the bytes as hex pairs
    separated by spaces.
    But I'd like the original tab-separated list of strings (i.e. the lines
    in my original files).

    I see BytesWritable.getBytes() returns a byte[]. I guess I could write
    my own
    RecordReader to convert the byte[] back to text strings, but I thought
    this was
    something the framework would provide.

    Alan
  • Alex Kozlov at Jul 8, 2010 at 9:23 pm
    Hi Alan,

    Is the content of the original file ASCII text? Then you should be using the
    <Object, Text, Text, Text> signature. By default 'hadoop fs -text ...' will
    just call toString() on the object. You get the object itself in the map()
    method and can do whatever you want with it. If Text or BytesWritable does
    not work for you, you can always write your own class implementing the
    Writable interface
    (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html).

    Let me know if you need more details on how to do this.

    Alex K
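The custom-Writable route mentioned above amounts to implementing two methods, write(DataOutput) and readFields(DataInput). A minimal sketch in plain Java (the class name and fields are hypothetical; a real class would declare `implements org.apache.hadoop.io.Writable`, omitted here so the sketch runs without Hadoop on the classpath):

```java
import java.io.*;

public class LogRecord /* implements org.apache.hadoop.io.Writable */ {
    public String filename = "";
    public byte[] content = new byte[0];

    // Serialize the fields in a fixed order...
    public void write(DataOutput out) throws IOException {
        out.writeUTF(filename);
        out.writeInt(content.length);
        out.write(content);
    }

    // ...and read them back in exactly the same order.
    public void readFields(DataInput in) throws IOException {
        filename = in.readUTF();
        content = new byte[in.readInt()];
        in.readFully(content);
    }

    public static void main(String[] args) throws IOException {
        LogRecord rec = new LogRecord();
        rec.filename = "peemt114.log";
        rec.content = "host\tmsg\n".getBytes("UTF-8");

        // Round-trip through a byte stream, as the framework would do.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        rec.write(new DataOutputStream(bos));

        LogRecord back = new LogRecord();
        back.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(back.filename + ": " + back.content.length + " bytes"); // peemt114.log: 9 bytes
    }
}
```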
  • Alan Miller at Jul 9, 2010 at 5:16 pm
    Hi Alex,

    My original files are ASCII text. I was using <Object, Text, Text, Text>
    and everything worked fine.
    Because my files are small (~2MB on avg.) I get one map task per file.
    For my test I had 2000 files, totalling 5GB, and the whole run took
    approx. 40 minutes.

    I read that I could improve performance by merging my original files
    into one big SequenceFile.

    I did that, and that's why I'm trying to use <Object, BytesWritable, Text,
    Text>.
    My new SequenceFile is only 444MB, so my m/r job triggered 7 map tasks,
    but apparently my new
    map() is computationally more intensive and the whole run now takes 64
    minutes.

    In my map(Text key, BytesWritable value, Context context) value
    contains the contents
    of a whole file. I tried to break it down into line-based records which
    I send to reduce().

    StringBuilder line = new StringBuilder();
    char linefeed = '\n';
    for (byte byt : value.getBytes()) {
        if ((int) byt == (int) linefeed) {
            line.append((char) byt);
            process_line(line.toString(), context);
            line.delete(0, line.length());
        } else {
            line.append((char) byt);
        }
    }

    Alan
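The byte-at-a-time loop above can also be replaced by decoding the valid region once and splitting on newlines. A sketch in plain Java (buf and len stand in for value.getBytes() and value.getLength() -- note getBytes() can return a buffer padded past the valid length -- and processLine is a hypothetical stand-in for process_line):

```java
import java.nio.charset.StandardCharsets;

public class SplitSketch {
    // Hypothetical stand-in for process_line(line, context).
    static void processLine(String line) {
        System.out.println("LINE: [" + line + "]");
    }

    public static void main(String[] args) {
        byte[] buf = "peemt114\tmsg one\npeemt114\tmsg two\n".getBytes(StandardCharsets.UTF_8);
        int len = buf.length; // stand-in for value.getLength()

        // Decode the valid region once, then split into lines.
        String whole = new String(buf, 0, len, StandardCharsets.UTF_8);
        for (String line : whole.split("\n")) {
            processLine(line);
        }
    }
}
```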
  • Alex Kozlov at Jul 9, 2010 at 6:42 pm
    Hi Alan,

    You don't need to do this complex trickery if you write <Object, Text> to the
    SequenceFile. How do you create the SequenceFile? In your case it might
    make sense to create a <Text, Text> SequenceFile where the first object is
    the file name or complete path and the second is the content.

    Then you just call:

    process_line(value.toString(), context);

    without having to do the StringBuffer thing.

    Alex K

Discussion Overview
group: mapreduce-user
categories: hadoop
posted: Jul 8, '10 at 1:49p
active: Jul 9, '10 at 6:42p
posts: 6
users: 3
website: hadoop.apache.org...
irc: #hadoop
