Hello everyone. I’m currently doing some research involving Hadoop and Pig, evaluating the cost of data replication vs. the penalty of node failure with respect to job completion time. As part of this, I’m modifying a MapReduce simulator called “MRPerf” to accommodate a sequence of MapReduce jobs such as those generated by Pig. I have some questions about how failures are handled in a Pig job, so I can reflect that in my implementation. I’m fairly new to Pig, so if some of these questions reflect a misunderstanding, please let me know:
1) How does Pig/Hadoop handle failures which would require re-execution of part or all of a MapReduce job that was already completed earlier in the sequence? The only source of information I could find on this is the paper “Making cloud intermediate data fault-tolerant” by Ko et al. (http://portal.acm.org/citation.cfm?id=1807160), but I’m not sure whether it is accurate.
As an example:
Suppose a sequence of three MapReduce jobs has been generated, J1, J2 and J3, with a replication factor of 1 for the output of J1 and J2 (i.e. one instance of each HDFS block). Suppose J1 and J2 have completed and J3 is almost done. Then a node fails, and the Map and Reduce task outputs stored on that node are lost for all three jobs. It seems to me that some tasks, but not all, from all three jobs would have to be re-executed to complete the overall job. How does Pig/Hadoop handle this kind of backtracking?
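To make that concrete, here is a purely illustrative Pig Latin script (the file names and fields are my own invention); a script of this shape would typically compile into a chain of roughly three MapReduce jobs, though the exact plan depends on the Pig version and its optimizer:

-- roughly J1: the JOIN
events  = LOAD 'input/events' AS (user:chararray, url:chararray);
users   = LOAD 'input/users'  AS (user:chararray, region:chararray);
joined  = JOIN events BY user, users BY user;
-- roughly J2: the GROUP/COUNT over the join output
grouped = GROUP joined BY users::region;
counts  = FOREACH grouped GENERATE group AS region, COUNT(joined) AS hits;
-- roughly J3: the final ORDER BY
ranked  = ORDER counts BY hits DESC;
STORE ranked INTO 'output/hits_by_region';

In that setting, the outputs of J1 and J2 are the single-replica intermediate data that would be at risk if a node fails while J3 is running.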
(This is based on my understanding that when Pig produces a series of MapReduce jobs, the replication factor for the intermediate data produced by each Reduce phase can be set individually. Correct me if I’m wrong here.)
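As an example of the kind of setting I mean, this is what I had in mind (my assumption only, not something I’ve verified against the Pig source or docs):

-- Assumption on my part: I *think* a job-conf property set this way would be
-- picked up for the HDFS files a job writes, but I haven't confirmed that it
-- applies to the temporary files Pig stores between jobs, or that it can be
-- varied per job within the chain.
SET dfs.replication 1;

If replication of that intermediate data is actually controlled somewhere else, a pointer would be very welcome.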
2) On a related note, is the intermediate data produced by the Map tasks in an individual MapReduce job deleted after that job completes and the next in the sequence starts, or is it preserved until the whole Pig job finishes?
Any information or guidance would be greatly appreciated.
Thanks,
Drew