Hi all,

I have an urgent question about processing binary (image) data with
Hadoop streaming.
I am looking for the simplest solution, preferably without making changes to
Hadoop and/or the streaming package.

I got some hints from this mailing list, including using a customized
InputFormat or SequenceFileInputFormat, but nothing has really helped me. Here
is my problem:

1. A lot of binary (image) files are stored on HDFS.
2. A standalone executable takes a binary (e.g., image) filename as input (key)
and outputs small metadata as the value (e.g., the size of the image).

How can we pass this standalone program as a mapper to streaming to
process images across all nodes, given that streaming currently only takes stdin
by default?

Thanks.

-Qiming


--
View this message in context: http://www.nabble.com/streaming-a-binary-processing-file-tp23859645p23859645.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


  • Zak Stone at Jun 3, 2009 at 10:41 pm
    One simple solution is to use Dumbo, a Python interface to Hadoop that
    supports binary streaming:

    http://wiki.github.com/klbostee/dumbo

    Zak


  • Sharad Agarwal at Jun 4, 2009 at 4:45 am
    Binary support has been added for 0.21. One option is to wait for 0.21 to get released, or you might try applying the patch from HADOOP-1722.


    - Sharad
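
For context on what the HADOOP-1722 patch provides: it adds a "typed bytes" encoding to streaming, so keys and values can carry raw bytes instead of newline-delimited text. Below is a minimal sketch of reading such a stream in Python. It assumes the layout described in the Hadoop typedbytes package documentation: a one-byte type code (0 for a byte sequence, 7 for a UTF-8 string), then a 4-byte big-endian length, then the payload. The function names are hypothetical, and only these two type codes are handled.

```python
import io
import struct

TYPE_BYTES = 0   # assumed type code: raw byte sequence
TYPE_STRING = 7  # assumed type code: UTF-8 string


def _read_exact(stream, n):
    """Read exactly n bytes or fail; a short read means a truncated record."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated typed-bytes record")
    return data


def read_value(stream):
    """Read one value; return None at a clean end of stream."""
    code = stream.read(1)
    if not code:
        return None
    if code[0] not in (TYPE_BYTES, TYPE_STRING):
        raise ValueError("unsupported type code: %d" % code[0])
    (length,) = struct.unpack(">i", _read_exact(stream, 4))
    payload = _read_exact(stream, length)
    return payload.decode("utf-8") if code[0] == TYPE_STRING else payload


def write_value(stream, value):
    """Write a str as type 7 and bytes as type 0 (used here for a round trip)."""
    if isinstance(value, str):
        code, payload = TYPE_STRING, value.encode("utf-8")
    else:
        code, payload = TYPE_BYTES, bytes(value)
    stream.write(struct.pack(">bi", code, len(payload)))
    stream.write(payload)


def read_pairs(stream):
    """Yield (key, value) pairs until the stream ends."""
    while True:
        key = read_value(stream)
        if key is None:
            return
        value = read_value(stream)
        if value is None:
            raise EOFError("key without a matching value")
        yield key, value
```

In a real mapper the stream would be `sys.stdin.buffer`, opened in binary mode so image bytes are not altered by text decoding.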
  • Jeff Hammerbacher at Jun 4, 2009 at 6:51 am
    Hey,

    If you don't want to wait for the release, you could try using the latest
    version of Cloudera's Distribution for Hadoop (see
    http://www.cloudera.com/hadoop), which is based off of the 0.18.3 release of
    Apache Hadoop but has the HADOOP-1722 patch backported (see
    http://www.cloudera.com/hadoop-manifest for the detailed manifest). We have
    several customers processing binary data with Hadoop Streaming using our
    distribution.

    Regards,
    Jeff
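
For the setup in the original question (images on HDFS, an executable that takes a filename), one workaround that needs no change to Hadoop or streaming is to make the job input a text file with one HDFS path per line. Each mapper then receives filenames on stdin, copies each file to local disk, and runs the executable on it. The following is a minimal sketch, not a tested recipe: it assumes the `hadoop fs -get` command is available on the task nodes and that the executable (the hypothetical `./image_tool`) prints its metadata to stdout.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def fetch_from_hdfs(hdfs_path, local_path):
    """Copy one HDFS file to local disk with `hadoop fs -get`."""
    subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_path])


def run_tool(local_path, tool="./image_tool"):
    """Run the standalone executable (hypothetical name) and return its stdout."""
    out = subprocess.check_output([tool, local_path])
    return out.decode("utf-8").strip()


def map_filenames(lines, out, fetch=fetch_from_hdfs, analyze=run_tool):
    """For each HDFS path on input, emit 'path<TAB>metadata' on output."""
    for line in lines:
        hdfs_path = line.strip()
        if not hdfs_path:
            continue  # skip blank lines in the path list
        workdir = tempfile.mkdtemp()
        local_path = os.path.join(workdir, os.path.basename(hdfs_path))
        try:
            fetch(hdfs_path, local_path)
            out.write("%s\t%s\n" % (hdfs_path, analyze(local_path)))
        finally:
            # Always remove the local copy so task nodes do not fill up.
            shutil.rmtree(workdir, ignore_errors=True)


# In the mapper script itself, the entry point would be:
#     map_filenames(sys.stdin, sys.stdout)
```

The trade-off is that each image is copied from HDFS by the mapper itself, so the job does not benefit from data locality; the binary-streaming options mentioned in the replies avoid that extra copy.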

Discussion Overview
group: common-user @
categories: hadoop
posted: Jun 3, '09 at 9:35p
active: Jun 4, '09 at 6:51a
posts: 4
users: 4
website: hadoop.apache.org...
irc: #hadoop