Hi,
I have a question about using the sequence file input format in the Hadoop
streaming jar with mappers and reducers written in Python.

If I use sequence file as the input format for the streaming jar and use
mappers written in Python, can I take care of serialization and
de-serialization in the mapper/reducer code? For example, if I have complex
data types in the sequence file's values, can I de-serialize them in Python
and run the MapReduce job using the streaming jar?

Thanks in advance,
-JJ

  • Jeremy Lewi at Jun 2, 2011 at 1:13 pm
    JJ

    If you want to use complex types in a streaming job, I think you need to
    encode the values using the typedbytes format within the sequence file;
    i.e. the key and value in the sequence file are both TypedBytesWritable.
    This is independent of the language the mapper and reducer are written
    in, because the values need to be encoded as a byte stream in such a way
    that the binary stream doesn't contain any characters that would cause
    problems when passed in via stdin/stdout.
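
    Roughly, the streaming invocation would then look like the sketch below.
    This assumes a Hadoop build with typed bytes support (HADOOP-1722, in
    0.21+ and some CDH backports); the jar path, flag availability, and the
    input/output paths are placeholders to check against your distribution:

        hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
            -io typedbytes \
            -inputformat org.apache.hadoop.mapred.SequenceFileInputFormat \
            -input /user/jj/input.seq \
            -output /user/jj/output \
            -mapper mapper.py \
            -reducer reducer.py \
            -file mapper.py \
            -file reducer.py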

    In Python, your mapper/reducer will pull in a byte stream from stdin,
    which can be decoded from typedbytes into native Python types.
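
    As a rough sketch, using the typedbytes Python package (written by the
    dumbo author; the exact API below is from memory, so treat it as an
    assumption and check the package docs):

        import sys
        import typedbytes  # pip install typedbytes

        # stdin carries (key, value) records encoded as typed bytes
        tb_in = typedbytes.PairedInput(sys.stdin)
        tb_out = typedbytes.PairedOutput(sys.stdout)

        for key, value in tb_in.reads():
            # key and value are now native Python objects
            # (ints, strings, lists, dicts, ...)
            tb_out.write((key, value))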

    The easiest way to do this is to use dumbo
    (https://github.com/klbostee/dumbo/wiki) to write your Python
    mapper/reducer. The dumbo module handles the
    serialization/deserialization between typedbytes and native Python types.
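
    For example, a minimal dumbo program looks something like this (an
    identity-job sketch; see the short tutorial on the dumbo wiki for the
    real details):

        def mapper(key, value):
            # key and value arrive as native Python types,
            # already decoded from typedbytes by dumbo
            yield key, value

        def reducer(key, values):
            for value in values:
                yield key, value

        if __name__ == "__main__":
            import dumbo
            dumbo.run(mapper, reducer)

    which you would launch with something like
    "dumbo start job.py -hadoop $HADOOP_HOME -input in.seq -output out".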

    J
  • Mapred Learn at Jun 3, 2011 at 1:08 am
    Thanks Jeremy,
    I will look into the details you provided.

    Sent from my iPhone
