FAQ
Dear all,

Is it possible to have a binary input to a map code written in python?

Thank you
Hassen

Search Discussions

  • GOEKE, MATTHEW (AG/1000) at Jun 20, 2011 at 8:39 pm
    Hassen,

    If you would like to use python as input I would suggest looking into the streaming api examples around Hadoop.

    Matt

    -----Original Message-----
    From: Hassen Riahi
    Sent: Monday, June 20, 2011 3:13 PM
    To: mapreduce-user@hadoop.apache.org
    Subject: mapreduce and python

    Dear all,

    Is it possible to have a binary input to a map code written in python?

    Thank you
    Hassen
    This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled
    to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and
    all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited.

    All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its
    subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of "Viruses" or other "Malware".
    Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying
    this e-mail or any attachment.


    The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
    including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
    Treasury, Office of Foreign Asset Controls (OFAC). As a recipient of this information you are obligated to comply with all
    applicable U.S. export laws and regulations.
  • Hassen Riahi at Jun 20, 2011 at 8:51 pm
    Thanks Matt for your reply...

    I would like to use a binary file as input to a map code written in
    python.
    All examples that I found take a text file as input to a map code (in
    python). Any ideas? is it feasible?

    Hassen
    Hassen,

    If you would like to use python as input I would suggest looking
    into the streaming api examples around Hadoop.

    Matt

    -----Original Message-----
    From: Hassen Riahi
    Sent: Monday, June 20, 2011 3:13 PM
    To: mapreduce-user@hadoop.apache.org
    Subject: mapreduce and python

    Dear all,

    Is it possible to have a binary input to a map code written in python?

    Thank you
    Hassen
    This e-mail message may contain privileged and/or confidential
    information, and is intended to be received only by persons entitled
    to receive such information. If you have received this e-mail in
    error, please notify the sender immediately. Please delete it and
    all attachments from any servers, hard drives or any other media.
    Other use of this e-mail by you is strictly prohibited.

    All e-mails and attachments sent and received are subject to
    monitoring, reading and archival by Monsanto, including its
    subsidiaries. The recipient of this e-mail is solely responsible for
    checking for the presence of "Viruses" or other "Malware".
    Monsanto, along with its subsidiaries, accepts no liability for any
    damage caused by any such code transmitted by or accompanying
    this e-mail or any attachment.


    The information contained in this email may be subject to the export
    control laws and regulations of the United States, potentially
    including but not limited to the Export Administration Regulations
    (EAR) and sanctions regulations issued by the U.S. Department of
    Treasury, Office of Foreign Asset Controls (OFAC). As a recipient
    of this information you are obligated to comply with all
    applicable U.S. export laws and regulations.
  • GOEKE, MATTHEW (AG/1000) at Jun 20, 2011 at 9:55 pm
    You might want to chase down leads around https://issues.apache.org/jira/browse/MAPREDUCE-606. It looks like there is a patch for it on Jira but I am not quite sure if it is working. If it is worth it to you to keep it in python then it might be worth it to tinker with the patch...

    HTH,
    Matt

    -----Original Message-----
    From: Hassen Riahi
    Sent: Monday, June 20, 2011 3:50 PM
    To: mapreduce-user@hadoop.apache.org
    Subject: Re: mapreduce and python

    Thanks Matt for your reply...

    I would like to use a binary file as input to a map code written in
    python.
    All examples that I found take a text file as input to a map code (in
    python). Any ideas? is it feasible?

    Hassen
    Hassen,

    If you would like to use python as input I would suggest looking
    into the streaming api examples around Hadoop.

    Matt

    -----Original Message-----
    From: Hassen Riahi
    Sent: Monday, June 20, 2011 3:13 PM
    To: mapreduce-user@hadoop.apache.org
    Subject: mapreduce and python

    Dear all,

    Is it possible to have a binary input to a map code written in python?

    Thank you
    Hassen
    This e-mail message may contain privileged and/or confidential
    information, and is intended to be received only by persons entitled
    to receive such information. If you have received this e-mail in
    error, please notify the sender immediately. Please delete it and
    all attachments from any servers, hard drives or any other media.
    Other use of this e-mail by you is strictly prohibited.

    All e-mails and attachments sent and received are subject to
    monitoring, reading and archival by Monsanto, including its
    subsidiaries. The recipient of this e-mail is solely responsible for
    checking for the presence of "Viruses" or other "Malware".
    Monsanto, along with its subsidiaries, accepts no liability for any
    damage caused by any such code transmitted by or accompanying
    this e-mail or any attachment.


    The information contained in this email may be subject to the export
    control laws and regulations of the United States, potentially
    including but not limited to the Export Administration Regulations
    (EAR) and sanctions regulations issued by the U.S. Department of
    Treasury, Office of Foreign Asset Controls (OFAC). As a recipient
    of this information you are obligated to comply with all
    applicable U.S. export laws and regulations.
  • Joe Stein at Jun 21, 2011 at 1:06 am
    Hassen,

    I have lots of binary data that I parse using Python streaming.

    The way I do this is stream the binary data into sequence files (the binary
    data object I save in the key and (null) as the value).

    Each key then gets written back to me line by line, key by key for an entire
    block when streaming.

    To have this work in streaming on the command line you need to use
    *-inputformat
    SequenceFileAsTextInputFormat*

    To create the sequence files I have a jar file that goes from BufferedReader
    and writes to org.apache.hadoop.io.SequenceFile.Writer

    I am not sure if you can do this for your data but if not then make your own
    InputFormat.

    good luck!

    /*
    Joe Stein
    http://www.linkedin.com/in/charmalloc
    Twitter: @allthingshadoop
    */
    On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi wrote:

    Dear all,

    Is it possible to have a binary input to a map code written in python?

    Thank you
    Hassen
  • Jeremy Lewi at Jun 21, 2011 at 5:15 am
    Hassen,

    I've been very succesful using Hadoop Streaming, Dumbo, and TypedBytes
    as a solution for using python to implement mappers and reducers.

    TypedBytes is a hadoop encoding format that allows binary data
    (including lists and maps) to be encoded in a format that permits the
    serialized data to safely be passed to mappers/reducers via the command
    line through hadoop streaming.

    Dumbo is a python library which makes it easy to implement your mappers
    and reducers in python. In particular, it handles decoding the data
    encoded as typedbytes to native python types.

    J
    On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
    Hassen,


    I have lots of binary data that I parse using Python streaming.


    The way I do this is stream the binary data into sequence files (the
    binary data object I save in the key and (null) as the value).


    Each key then gets written back to me line by line, key by key for an
    entire block when streaming.


    To have this work in streaming on the command line you need to
    use -inputformat SequenceFileAsTextInputFormat


    To create the sequence files I have a jar file that goes from
    BufferedReader and writes to org.apache.hadoop.io.SequenceFile.Writer


    I am not sure if you can do this for your data but if not then make
    your own InputFormat.


    good luck!


    /*
    Joe Stein
    http://www.linkedin.com/in/charmalloc
    Twitter: @allthingshadoop
    */

    On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi wrote:
    Dear all,

    Is it possible to have a binary input to a map code written in
    python?

    Thank you
    Hassen

  • Harsh J at Jun 21, 2011 at 5:46 am
    I'd like to +1 to using Dumbo for all things Python and Hadoop
    MapReduce. Its one of the better ways to do things.

    Do look at the initial conversation here:
    http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.html
    as well.

    The feature/bug fixes specified in the post are present in Apache
    Hadoop 0.21 (which isn't deemed to be suited for production use yet)
    and is also available in other (in-production-use) Hadoop
    distributions such as Cloudera's, which is based off on 0.20.2:
    https://ccp.cloudera.com/display/SUPPORT/Downloads
    On Tue, Jun 21, 2011 at 10:43 AM, Jeremy Lewi wrote:
    Hassen,

    I've been very succesful using Hadoop Streaming, Dumbo, and TypedBytes
    as a solution for using python to implement mappers and reducers.

    TypedBytes is a hadoop encoding format that allows binary data
    (including lists and maps) to be encoded in a format that permits the
    serialized data to safely be passed to mappers/reducers via the command
    line through hadoop streaming.

    Dumbo is a python library which makes it easy to implement your mappers
    and reducers in python. In particular, it handles decoding the data
    encoded as typedbytes to native python types.

    J
    On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
    Hassen,


    I have lots of binary data that I parse using Python streaming.


    The way I do this is stream the binary data into sequence files (the
    binary data object I save in the key and (null) as the value).


    Each key then gets written back to me line by line, key by key for an
    entire block when streaming.


    To have this work in streaming on the command line you need to
    use -inputformat SequenceFileAsTextInputFormat


    To create the sequence files I have a jar file that goes from
    BufferedReader and writes to org.apache.hadoop.io.SequenceFile.Writer


    I am not sure if you can do this for your data but if not then make
    your own InputFormat.


    good luck!


    /*
    Joe Stein
    http://www.linkedin.com/in/charmalloc
    Twitter: @allthingshadoop
    */

    On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi <hassen.riahi@cern.ch>
    wrote:
    Dear all,

    Is it possible to have a binary input to a map code written in
    python?

    Thank you
    Hassen



    --
    Harsh J
  • Hassen Riahi at Jun 22, 2011 at 1:18 pm
    I'm trying these solutions...Thanks for suggestions.
    I'd like to +1 to using Dumbo for all things Python and Hadoop
    MapReduce. Its one of the better ways to do things.

    Do look at the initial conversation here:
    http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.html
    as well.

    The feature/bug fixes specified in the post are present in Apache
    Hadoop 0.21 (which isn't deemed to be suited for production use yet)
    and is also available in other (in-production-use) Hadoop
    distributions such as Cloudera's, which is based off on 0.20.2:
    https://ccp.cloudera.com/display/SUPPORT/Downloads
    On Tue, Jun 21, 2011 at 10:43 AM, Jeremy Lewi wrote:
    Hassen,

    I've been very succesful using Hadoop Streaming, Dumbo, and
    TypedBytes
    as a solution for using python to implement mappers and reducers.

    TypedBytes is a hadoop encoding format that allows binary data
    (including lists and maps) to be encoded in a format that permits the
    serialized data to safely be passed to mappers/reducers via the
    command
    line through hadoop streaming.

    Dumbo is a python library which makes it easy to implement your
    mappers
    and reducers in python. In particular, it handles decoding the data
    encoded as typedbytes to native python types.

    J
    On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
    Hassen,


    I have lots of binary data that I parse using Python streaming.


    The way I do this is stream the binary data into sequence files (the
    binary data object I save in the key and (null) as the value).


    Each key then gets written back to me line by line, key by key for
    an
    entire block when streaming.


    To have this work in streaming on the command line you need to
    use -inputformat SequenceFileAsTextInputFormat


    To create the sequence files I have a jar file that goes from
    BufferedReader and writes to
    org.apache.hadoop.io.SequenceFile.Writer


    I am not sure if you can do this for your data but if not then make
    your own InputFormat.


    good luck!


    /*
    Joe Stein
    http://www.linkedin.com/in/charmalloc
    Twitter: @allthingshadoop
    */

    On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi <hassen.riahi@cern.ch>
    wrote:
    Dear all,

    Is it possible to have a binary input to a map code
    written in
    python?

    Thank you
    Hassen



    --
    Harsh J

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupmapreduce-user @
categorieshadoop
postedJun 20, '11 at 8:13p
activeJun 22, '11 at 1:18p
posts8
users5
websitehadoop.apache.org...
irc#hadoop

People

Translate

site design / logo © 2022 Grokbase