Using Hadoop with executables and binary data
Dear Hadoop devs,



Please help me figure out how to implement the following problem using
Hadoop.

I have a program that I need to invoke in parallel using Hadoop. The
program takes an input file (binary) and produces an output file (binary):



input.bin -> prog.exe -> output.bin



The input data set is about 1 TB in size. Each input data file is about
33 MB, so I have about 31,000 files.

Each output binary file is about 9 KB in size.



I have implemented this program using Hadoop in the following way.



I keep the input data on a shared parallel file system (Lustre).

Then I collect the input file names and write them to a collection of files
in HDFS (say, hdfs_input_0.txt, ...).

Each hdfs_input file contains roughly an equal number of URIs pointing to
the original input files.

Each map task simply takes a string value, which is the URI of one original
input data file, and runs the program as an external process.

The output of the program is also written to the shared file system
(Lustre).
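
Concretely, each map task does roughly the following (a simplified,
untested sketch; the class name and the convention of passing the input
and output paths as arguments to prog.exe are placeholders of mine):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LustreExecMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // value is one line of an hdfs_input_*.txt file: the shared-FS
    // (Lustre) path of a single ~33 MB input file.
    String inputPath = value.toString().trim();
    String outputPath = inputPath + ".out";          // also on Lustre

    // Run the unmodified executable directly against the shared FS.
    ProcessBuilder pb =
        new ProcessBuilder("./prog.exe", inputPath, outputPath);
    pb.redirectErrorStream(true);
    Process p = pb.start();
    while (p.getInputStream().read() != -1) { }      // drain child's output
    if (p.waitFor() != 0) {
      throw new IOException("prog.exe failed for " + inputPath);
    }
    context.write(new Text(outputPath), NullWritable.get());
  }
}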



The problem with this approach is that I am not exploiting a key benefit of
MapReduce: the use of local disks.

Could you please suggest a way to use local disks for the above problem?



I thought of the following approach (sketched in code after the list
below), but would like to verify with you whether there is a better way.



1. Upload the original data files to HDFS.

2. In the map task, read the data file as a binary object.

3. Save it on the local file system.

4. Call the executable.

5. Push the output from the local file system to HDFS.
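
In code, I imagine the map task would look something like this (a rough,
untested sketch; the class name, the /results/ output directory, and the
prog.exe argument convention are placeholders of mine):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HdfsExecMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());

    // Step 2: value names one ~33 MB input file already uploaded to HDFS.
    Path hdfsIn = new Path(value.toString().trim());

    // Step 3: save it on the task's local disk (the task working dir).
    Path localIn = new Path("input.bin");
    Path localOut = new Path("output.bin");
    fs.copyToLocalFile(hdfsIn, localIn);

    // Step 4: call the unmodified executable on the local copy.
    ProcessBuilder pb = new ProcessBuilder("./prog.exe",
        localIn.toString(), localOut.toString());
    pb.redirectErrorStream(true);
    Process p = pb.start();
    while (p.getInputStream().read() != -1) { }      // drain child's output
    if (p.waitFor() != 0) {
      throw new IOException("prog.exe failed for " + hdfsIn);
    }

    // Step 5: push the ~9 KB result back to HDFS.
    Path hdfsOut = new Path("/results/" + hdfsIn.getName() + ".out");
    fs.copyFromLocalFile(localOut, hdfsOut);
    context.write(new Text(hdfsOut.toString()), NullWritable.get());
  }
}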



Any suggestion is greatly appreciated.


Thank you,

Jaliya


  • Stefan Podkowinski at Aug 10, 2009 at 8:39 am
    Jaliya,

    Did you consider Hadoop Streaming for your case?
    http://wiki.apache.org/hadoop/HadoopStreaming
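
    With Streaming the mapper can be any executable that reads key/value
    lines on stdin and writes lines to stdout, so prog.exe can be wrapped
    without writing Java MapReduce code. A rough, untested sketch of mine
    (the wrapper name and the prog.exe argument convention are made up):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    // Streaming mapper: one input-file URI arrives per stdin line; run
    // prog.exe on it and emit "uri <TAB> exitcode" for Streaming to collect.
    public class StreamingExecWrapper {
      public static void main(String[] args)
          throws IOException, InterruptedException {
        BufferedReader in =
            new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
          String uri = line.trim();
          if (uri.isEmpty()) continue;
          ProcessBuilder pb =
              new ProcessBuilder("./prog.exe", uri, uri + ".out");
          pb.redirectErrorStream(true);
          Process p = pb.start();
          // Drain the child's output so our own stdout stays clean for
          // Streaming and the child cannot block on a full pipe.
          while (p.getInputStream().read() != -1) { }
          System.out.println(uri + "\t" + p.waitFor());  // key \t value
        }
      }
    }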


  • Jaliya Ekanayake at Aug 20, 2009 at 5:32 pm
    Hi Stefan,



    I am sorry for the late reply; somehow the response email slipped past
    me.

    Could you explain a bit about how to use Hadoop Streaming with binary
    data formats?

    I can see explanations on using it with text data, but not with binary
    files.


    Thank you,

    Jaliya


  • Aaron Kimball at Aug 20, 2009 at 10:01 pm
    Look into "typed bytes":
    http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/
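
    The typed bytes format is just a one-byte type code followed by a
    length-prefixed payload, so binary values can pass through Streaming
    unmodified. A small illustration of mine of what one key/value pair
    looks like on the wire (per the HADOOP-1722 spec, type code 0 is a raw
    byte sequence and 7 is a UTF-8 string, each with a 4-byte big-endian
    length):

    import java.io.DataOutputStream;
    import java.io.IOException;

    public class TypedBytesEmit {
      // Type code 7: UTF-8 string, prefixed by a 4-byte big-endian length.
      static void writeString(DataOutputStream out, String s)
          throws IOException {
        byte[] b = s.getBytes("UTF-8");
        out.writeByte(7);
        out.writeInt(b.length);
        out.write(b);
      }

      // Type code 0: raw byte sequence, same length prefix.
      static void writeBytes(DataOutputStream out, byte[] b)
          throws IOException {
        out.writeByte(0);
        out.writeInt(b.length);
        out.write(b);
      }

      public static void main(String[] args) throws IOException {
        DataOutputStream out = new DataOutputStream(System.out);
        // A key/value pair is simply two consecutive typed bytes records.
        writeString(out, "output.bin");         // key: output file name
        writeBytes(out, new byte[] {1, 2, 3});  // value: the binary payload
        out.flush();
      }
    }

    The streaming job then has to be told to expect typed bytes on its
    input and output (the -io typedbytes option, if I remember correctly).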

  • Jaliya Ekanayake at Aug 21, 2009 at 4:22 am
    Thanks for the quick reply.
    I looked at it, but I still could not figure out how to use HDFS to
    store the (binary) input data and call an executable.
    Please note that I cannot modify the executable.

    Maybe I am asking a dumb question, but could you please explain a bit
    how to handle the scenario I have described?

    Thanks,
    Jaliya
