how to pass a hdfs file to a c++ process
Hi All,

I'm using hadoop-0.20.2 to try out some simple tasks. I asked a question
about FileInputFormat a few days ago and got some prompt replies from
this forum, which helped a lot. Thanks again! Now I have another
question. I'm trying to invoke a C++ process from my mapper for each
HDFS file in the input directory to achieve some parallel processing.
But how do I pass the file to the program? I would want to do something
like the following in my mapper:

Process lChldProc = Runtime.getRuntime().exec("myprocess -file " + filePath);

How do I pass an HDFS file to an outside process like that? Is
Hadoop Streaming the direction I should go?

Thanks very much for any reply in advance.

Best,
Grace


  • Robert Evans at Aug 23, 2011 at 2:48 pm
    Hadoop streaming is the simplest way to do this, if your program is set up to take stdin as its input, write its output to stdout, and each record ("file" in your case) is a single line of text.

    You need to be able to have it work with the following shell pipeline:

    hadoop fs -cat <input_file> | head -1 | ./myprocess > output.txt

    And ideally what is stored in output.txt should be lines of text whose order can be rearranged without impacting the result. (This is not a requirement unless you want to use a reduce too, but streaming will still try to parse the output that way.)

    If not, there are tricks you can play to make it work, but they are kind of ugly.
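
    A minimal sketch of how such a job might be submitted with the streaming jar (the jar path is typical for a 0.20.2 install but may differ on yours; myprocess is the executable named above; setting mapred.reduce.tasks=0 skips the reduce step):

        hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
            -D mapred.reduce.tasks=0 \
            -input /path/to/input \
            -output /path/to/output \
            -mapper ./myprocess \
            -file myprocess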

    --Bobby Evans


  • Arun C Murthy at Aug 23, 2011 at 2:51 pm

    On Aug 22, 2011, at 12:57 PM, Zhixuan Zhu wrote:

    > I'm trying to invoke a C++ process from my mapper for each
    > HDFS file in the input directory to achieve some parallel processing.

    That seems weird - why aren't you using more maps and one file per map?

    > But how do I pass the file to the program? I would want to do something
    > like the following in my mapper:

    IAC, libhdfs is one way to do HDFS ops via C/C++.
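
    (For reference, a rough sketch of building a program against libhdfs, assuming a 0.20-style tarball layout - the header and library locations may differ on your install, and the Hadoop jars must be on the CLASSPATH at run time; my_reader is an illustrative name for a C/C++ program that uses the hdfs.h API:

        gcc my_reader.c -o my_reader \
            -I$HADOOP_HOME/src/c++/libhdfs \
            -L$HADOOP_HOME/c++/Linux-amd64-64/lib -lhdfs \
            -L$JAVA_HOME/jre/lib/amd64/server -ljvm
    )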

    Arun
  • Zhixuan Zhu at Aug 23, 2011 at 3:00 pm
    I'll actually invoke one executable from each of my maps. Because this
    C++ program has been implemented and used in the past, I just want to
    integrate it into our Hadoop map/reduce without having to re-implement the
    process in Java. So my map is going to be very simple: it just calls the
    process and passes it the input files.

    Thanks,
    Grace

  • Arun Murthy at Aug 23, 2011 at 3:36 pm
    That is a normal use case.

    I'd encourage you to use Java MR (or even Pig/Hive).

    If you really want to use your legacy app, use streaming with a map command
    such as 'hadoop fs -cat <file> | mylegacyexe'.
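
    A minimal sketch of that as a streaming mapper, assuming the job's input is a file listing one HDFS path per line (mapper.sh and mylegacyexe are illustrative names):

        #!/bin/sh
        # mapper.sh: streaming feeds each input line to stdin;
        # here every line is an HDFS path, so cat each file into the legacy exe
        while read -r filepath; do
            hadoop fs -cat "$filepath" | ./mylegacyexe
        done

    It would be shipped with the job via the streaming options '-mapper mapper.sh -file mapper.sh -file mylegacyexe'.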

    Arun

    Sent from my iPhone
  • Zhixuan Zhu at Aug 23, 2011 at 3:51 pm
    Thank you very much!

    'hadoop fs -cat <file> | mylegacyexe' is exactly the kind of method I
    came up with and was going to try out. I'm glad to hear that it's
    actually an "official" alternative.

    Thanks again. This is a great forum!
    Grace


