Hi all,
I have been working with MapReduce and HDFS for some time. The procedure
I normally follow is:
1) Copy the input file from the local file system to HDFS
2) Run the MapReduce job
3) Copy the output file from HDFS back to the local file system
But I feel that steps 1 and 3 add a lot of overhead to the entire process.
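
For concreteness, the three steps above correspond roughly to the commands
in the sketch below. The jar name, main class and all paths are hypothetical
placeholders, and it assumes the job takes its input and output paths as its
two arguments, which may not match every setup.

```python
import subprocess


def build_commands(local_in, hdfs_in, jar, main_class, hdfs_out, local_out):
    """Return the three command lines of the workflow, in order.

    All arguments are placeholders supplied by the caller; the job is
    assumed (hypothetically) to accept <input path> <output path>.
    """
    return [
        # Step 1: local file system -> HDFS
        ["hadoop", "fs", "-copyFromLocal", local_in, hdfs_in],
        # Step 2: run the MapReduce job
        ["hadoop", "jar", jar, main_class, hdfs_in, hdfs_out],
        # Step 3: HDFS -> local file system
        ["hadoop", "fs", "-copyToLocal", hdfs_out, local_out],
    ]


def run_workflow(commands, runner=subprocess.check_call):
    """Run each command in order; check_call raises on a non-zero exit."""
    for cmd in commands:
        runner(cmd)
```

Steps 1 and 3 each read and write the full data set once more, which is
where the extra time goes.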
My questions are:
1) I get the files onto the local file system by opening a socket
connection to another node. Can I make sure that the data received on the
Hadoop node is written directly to HDFS, instead of going through the
local file system and then running copyFromLocal?
2) Can I write the reduce output (the final output file) directly to the
local file system instead of writing it to HDFS (where it is effectively
spread across different nodes), so that I can minimize the overhead? I
expect this to take much less time than writing to HDFS and then running
copyToLocal. Finally, I need to be able to send this file on to another
node using socket communication.
Looking forward to your suggestions!