Dear Hadoop devs,
Please help me figure out a way to program the following problem using Hadoop.
I have a program that I need to invoke in parallel. The program takes an input
file (binary) and produces an output file (binary):
input.bin -> prog.exe -> output.bin
The input data set is about 1 TB in size. Each input data file is about 33 MB,
so I have roughly 31,000 files.
Each output binary file is about 9 KB in size.
I have implemented this program using Hadoop in the following way.
I keep the input data in a shared parallel file system (Lustre File System).
Then I collect the input file names and write them to a collection of files
in HDFS (say hdfs_input_0.txt, ...).
Each hdfs_input file contains roughly an equal number of URIs pointing to the
original input files.
The map task simply takes a string value, which is a URI to an original input
data file, and executes the program as an external program.
The output of the program is also written to the shared file system (Lustre).
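For reference, the mapper in my current approach is essentially a thin wrapper around the external program. A minimal sketch of that wrapper (class and method names are my own, the input.bin -> input.out.bin naming convention is illustrative, and prog.exe and the Lustre paths are assumed to be visible on every node):

```java
import java.io.IOException;

// Sketch of the current approach: the map task receives a URI (a path on
// the shared Lustre file system) and invokes the external program on it.
// All names here are illustrative.
public class LustreExecMapper {

    // Build the command line for one input file. The output path is derived
    // from the input path by convention (input.bin -> input.out.bin).
    public static String[] buildCommand(String inputUri) {
        String outputUri = inputUri.replaceAll("\\.bin$", ".out.bin");
        return new String[] { "prog.exe", inputUri, outputUri };
    }

    // Run the external program and wait for it to finish;
    // returns the program's exit code.
    public static int runProgram(String inputUri)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(inputUri));
        pb.inheritIO(); // surface the program's stdout/stderr in the task log
        return pb.start().waitFor();
    }
}
```

Inside the real map() method, runProgram() would be called once per string value read from the hdfs_input files.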
The problem with this approach is that I am not utilizing the true benefit of
MapReduce: the use of local disks.
Could you please suggest a way to use local disks for the above problem?
I thought of the following approach, but would like to verify with you whether
there is a better way.
1. Upload the original data files in HDFS
2. In the map task, read the data file as a binary object.
3. Save it in the local file system.
4. Call the executable
5. Push the output from the local file system to HDFS.
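Steps 2-5 above could be sketched roughly as follows. This is only a sketch under my own assumptions: the class name, path conventions, and tmp directory are illustrative, and I shell out to the real "hadoop fs -get/-put" commands to keep the example self-contained, though inside an actual mapper the FileSystem API (copyToLocalFile / copyFromLocalFile) would be the more natural choice:

```java
import java.io.File;
import java.io.IOException;

// Sketch of the proposed approach: pull one input file out of HDFS onto the
// task's local disk, run the executable there, and push the result back to
// HDFS. All paths and names are illustrative.
public class LocalDiskExecMapper {

    // Derive the local scratch paths and the HDFS output path for one input
    // file (input_0001.bin -> input_0001.out.bin). Pure helper so the
    // convention is easy to test.
    public static String[] plan(String hdfsInput, String localTmpDir,
                                String hdfsOutputDir) {
        String name = hdfsInput.substring(hdfsInput.lastIndexOf('/') + 1);
        String outName = name.replaceAll("\\.bin$", ".out.bin");
        return new String[] {
            localTmpDir + "/" + name,       // local copy of the input
            localTmpDir + "/" + outName,    // local output of prog.exe
            hdfsOutputDir + "/" + outName   // final location in HDFS
        };
    }

    // Run one external command, failing loudly on a nonzero exit code.
    static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + String.join(" ", cmd));
        }
    }

    // Process one input file end to end (steps 2-5).
    public static void processOne(String hdfsInput, String localTmpDir,
                                  String hdfsOutputDir)
            throws IOException, InterruptedException {
        String[] paths = plan(hdfsInput, localTmpDir, hdfsOutputDir);
        run("hadoop", "fs", "-get", hdfsInput, paths[0]); // steps 2-3
        run("prog.exe", paths[0], paths[1]);              // step 4
        run("hadoop", "fs", "-put", paths[1], paths[2]);  // step 5
        new File(paths[0]).delete();                      // clean local scratch
        new File(paths[1]).delete();
    }
}
```

One thing to watch with this design: a 33 MB file is close to a typical HDFS block size, so each map task would read roughly one block, which seems like a reasonable fit.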
Any suggestion is greatly appreciated.