[ https://issues.apache.org/jira/browse/PIG-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shravan Matthur Narayanamurthy updated PIG-457:
-----------------------------------------------
Status: Patch Available (was: Open)
There are two issues that this patch tries to address:
1) Exceptions and traces even after a successful completion:
Currently, the success case and the failure case share one code path for getting and printing error messages. This fix splits that path: failures that were resolved by retries in a successful run are logged at debug level, and failures in an unsuccessful run are logged at error level.
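A minimal sketch of that split, not the code in the patch. The class and method names are hypothetical, and it uses java.util.logging only to stay self-contained (FINE stands in for debug, SEVERE for error); the actual patch may use a different logging API.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: names are illustrative, not taken from the patch.
final class TaskFailureReporter {
    private static final Logger LOG =
            Logger.getLogger(TaskFailureReporter.class.getName());

    // Failures in a run that succeeded were fixed by retries, so they are
    // only debug-level detail. Failures in a failed run are real errors.
    static Level levelFor(boolean jobSucceeded) {
        return jobSucceeded ? Level.FINE : Level.SEVERE;
    }

    // Logs every collected task failure message at the level chosen above.
    static void report(boolean jobSucceeded, List<String> failureMessages) {
        Level level = levelFor(jobSucceeded);
        for (String message : failureMessages) {
            LOG.log(level, message);
        }
    }
}
```

With the default logging configuration, FINE messages are not shown, so a successful run stays quiet unless debug output is enabled.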
2) Shows 100% even if there are failures
This is a direct result of what Hadoop does: it marks map and reduce tasks as 100% complete irrespective of their success or failure. In some sense, progress and success are unrelated dimensions. Since it's better to relate the two, we need to make sure that we don't report 100% complete for a failed execution. This is a hack: I check whether the progress has reached 100% and postpone displaying it until I am sure that the job has completed successfully.
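The postponement can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: the class name, the method, and the two boolean flags are hypothetical stand-ins for whatever job state the real code inspects.

```java
import java.util.OptionalInt;

// Hypothetical gate: decides whether a reported percentage may be shown.
final class CompletionDisplayGate {
    // Anything below 100% is shown as reported. A reported 100% is held
    // back until the job is known to have finished successfully.
    static OptionalInt percentToDisplay(int reportedPercent,
                                        boolean jobDone,
                                        boolean jobSucceeded) {
        if (reportedPercent < 100) {
            return OptionalInt.of(reportedPercent);
        }
        if (jobDone && jobSucceeded) {
            return OptionalInt.of(100);
        }
        // Either still running or failed: do not claim 100% complete.
        return OptionalInt.empty();
    }
}
```

An empty result means "display nothing for now", so a failed job never gets a 100% line.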
There are some other fixes to the logic that displays the completion percentage. The code is chasing a moving target: when we assume that a job is in a particular state and do some processing based on that assumption, we might get spurious results. For example, we get the list of running jobs and then try to get the progress for each job. While we are doing this, the state of a job might change from running to something else, and it's not easy to build all the possible scenarios into the code. Thus, when we try to fetch the progress of a previously running job that has changed state, we get spurious results. To mitigate this, we make the simple assumption that a job's progress can't regress; if we see such a condition, we ignore it, as we know it's temporary.
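The "progress can't regress" assumption amounts to a small filter like the one below. This is a sketch with hypothetical names, assuming progress is sampled as a fraction between 0 and 1; the patch may track it differently.

```java
// Hypothetical filter: ignores progress samples lower than the last one seen.
final class MonotonicProgress {
    private double last = 0.0;

    // Returns true if the sample was accepted, false if it was dropped
    // as a temporary regression (for example, a job that changed state
    // between listing the running jobs and querying its progress).
    boolean offer(double sample) {
        if (sample < last) {
            return false;
        }
        last = sample;
        return true;
    }

    // The highest progress value accepted so far.
    double current() {
        return last;
    }
}
```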
Another thing that has been introduced into the logic is an exponential delay scheme, which is useful when a job is not progressing, maybe due to bag spilling or some UDF running. In this case the reported progress stays the same for some time. During this time we can either 1) impose a hard limit, so that if we don't see progress we don't report it, or 2) just keep reporting the same progress. There are cons with both approaches: with 1), if we don't display anything, it might seem like the job is stuck or that no processing is happening; with 2), it is sure to fill the screen with output that adds no information. So we introduce delays between batches of identical progress displays, and the delays increase exponentially with each completed batch. Currently the batch size is half the number of retries, which is 6 since the sleep time is now 5 sec; the effect is like trying to report progress every 30 sec while delaying further displays of the same progress with an exponential delay scheme.
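One way to read the scheme is sketched below. It is an interpretation, not the patch: the class name is hypothetical, and it assumes that unchanged progress is redisplayed after one batch of polls, then after two batches, then four, and so on, with the gap resetting whenever the progress value changes.

```java
// Hypothetical throttle for repeated, unchanged progress values.
final class RepeatedProgressThrottle {
    private final int batchSize;        // polls in the first gap
    private int lastPercent = -1;       // last value displayed
    private int pollsSinceDisplay = 0;  // unchanged polls since last display
    private int gap;                    // polls to wait before redisplaying

    RepeatedProgressThrottle(int batchSize) {
        this.batchSize = batchSize;
        this.gap = batchSize;
    }

    // Called once per poll. Returns true if this poll should be displayed.
    boolean shouldDisplay(int percent) {
        if (percent != lastPercent) {
            // New information: show it at once and reset the delay.
            lastPercent = percent;
            pollsSinceDisplay = 0;
            gap = batchSize;
            return true;
        }
        pollsSinceDisplay++;
        if (pollsSinceDisplay < gap) {
            return false;
        }
        // Redisplay the same value, then double the wait for next time.
        pollsSinceDisplay = 0;
        if (gap <= Integer.MAX_VALUE / 2) {
            gap *= 2;
        }
        return true;
    }
}
```

With a batch size of 6 and a 5 sec poll interval, a stalled job would be redisplayed after 30 sec, then 60 sec, then 120 sec, and so on, until the progress value changes.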
pig produces errors after a job is said to be 100% done
-------------------------------------------------------
Key: PIG-457
URL: https://issues.apache.org/jira/browse/PIG-457
Project: Pig
Issue Type: Bug
Affects Versions: types_branch
Reporter: Olga Natkovich
Assignee: Shravan Matthur Narayanamurthy
Fix For: types_branch
Attachments: 457-2.patch
It is possible that we get errors for all tasks, even the ones we retried. We need to look at the code that handles detecting the end of processing and producing errors.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.