Grokbase Groups Pig user May 2011
FAQ
Hello everyone. I’m currently doing some research involving Hadoop and Pig, evaluating the cost of data replication vs. the penalty of node failure with respect to job completion time. I’m modifying a MapReduce simulator called “MRPerf” to accommodate a sequence of MapReduce jobs such as those generated by Pig. I have some questions about how failures are handled in a Pig job, so I can reflect that in my implementation. I’m fairly new to Pig, so if some of these questions reflect a misunderstanding, please let me know:

1) How does Pig/Hadoop handle failures which would require re-execution of part or all of some MapReduce job that was already completed earlier in the sequence? The only source of information I could find on this is the paper “Making cloud intermediate data fault-tolerant” by Ko et al. (http://portal.acm.org/citation.cfm?id=1807160), but I’m not sure whether it is accurate.

As an example:

Suppose a sequence of three MapReduce jobs has been generated: J1, J2 and J3, with a replication factor of 1 for the output of J1 and J2 (i.e. one instance of each HDFS block). Suppose jobs J1 and J2 complete, and J3 is almost done. Then a failure occurs on some node, and Map and Reduce task outputs from all three jobs are lost on that node. It seems to me it would be necessary to re-execute some tasks, but not all, from all three jobs to complete the overall job. How does Pig/Hadoop handle such backtracking in this situation?

(This is based on my understanding that when Pig produces a series of MapReduce jobs, the replication factor for the intermediate data produced by each Reduce phase can be set individually. Correct me if I’m wrong here.)
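To make sure I model this correctly in the simulator, here is a toy sketch of the backtracking I have in mind (plain Python, not Pig or Hadoop code; treating each job's output as lost or surviving as a whole is my simplifying assumption):

```python
def jobs_to_rerun(chain, lost):
    """Coarse model of a linear job chain J1 -> J2 -> ... -> Jn, where the
    last job is still running. Walk backwards from it: any predecessor whose
    HDFS output was lost must re-run, and its re-run needs *its* predecessor's
    output, so keep walking upstream while outputs are missing."""
    rerun = []
    for job in reversed(chain[:-1]):  # predecessors of the running job
        if job in lost:
            rerun.append(job)
        else:
            break  # this output survived; jobs further upstream need not re-run
    return list(reversed(rerun))

# The scenario above: J3 is running, replication factor 1 on J1/J2 output.
assert jobs_to_rerun(["J1", "J2", "J3"], lost={"J1", "J2"}) == ["J1", "J2"]
assert jobs_to_rerun(["J1", "J2", "J3"], lost={"J2"}) == ["J2"]  # J1's output survived
assert jobs_to_rerun(["J1", "J2", "J3"], lost={"J1"}) == []      # J2's output feeds J3
```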

2) On a related note, is the intermediate data produced by the Map tasks in an individual MapReduce job deleted after that job completes and the next in the sequence starts, or is it preserved until the whole Pig job finishes?

Any information or guidance would be greatly appreciated.

Thanks,
Drew

  • Dmitriy Ryaboy at Jun 1, 2011 at 12:24 am
    Drew,
    Pig relies completely on Hadoop's handling of fault tolerance, and
    does not do anything by itself (so, for example, if a job completely
    fails, Pig won't try to rerun it).
    The way Hadoop deals with failure is pretty well documented in the
    O'Reilly Hadoop book as well as online -- essentially, mappers and
    reducers can be rerun as needed (and thus one needs to be careful
    about side effects and non-determinism in mapper and reducer tasks, as
    they may run multiple times!).
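    To illustrate the hazard (a toy Python sketch, not an actual Pig UDF or Hadoop code):

    ```python
    import hashlib
    import random

    # A "mapper" like this is non-deterministic: if the task is re-executed
    # after a failure, the re-run can emit different records than the first
    # attempt, and downstream results silently diverge.
    def nondeterministic_map(record):
        return (record, random.random())

    # A safe variant derives its pseudo-randomness from the input itself, so
    # re-running the same task over the same input split emits identical output.
    def deterministic_map(record):
        h = int(hashlib.sha256(record.encode("utf-8")).hexdigest(), 16)
        return (record, (h % 1000) / 1000.0)

    assert deterministic_map("row1") == deterministic_map("row1")
    ```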

    Data replication is somewhat orthogonal to the problem of failure of
    a node that's processing data. If HDFS is reading from server A, and
    server A dies, HDFS will just switch to reading from a replica on
    server B, and kick off a background copy to replace the missing
    replica on A. Tasks won't fail.
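    A toy sketch of that failover behavior (Python, not actual HDFS client code):

    ```python
    def read_block(replicas, live_nodes):
        """Return the first live node holding a replica of the block;
        fail only if every replica is on a dead node."""
        for node in replicas:
            if node in live_nodes:
                return node
        raise IOError("block lost: no live replica")

    # Replication factor 3: losing server A doesn't fail the read.
    assert read_block(["A", "B", "C"], live_nodes={"B", "C"}) == "B"

    # Replication factor 1: losing the only holder loses the block,
    # which is what forces re-execution in the scenario you describe.
    try:
        read_block(["A"], live_nodes={"B", "C"})
        assert False, "expected IOError"
    except IOError:
        pass
    ```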

    D
    On Tue, May 31, 2011 at 3:23 PM, David A Boyuka, II wrote:

Discussion Overview
group: user@pig.apache.org
categories: pig, hadoop
posted: May 31, '11 at 10:23p
active: Jun 1, '11 at 12:24a
posts: 2
users: 2
website: pig.apache.org

People

Translate

site design / logo © 2022 Grokbase