It would be a tradeoff between data locality and the number of tasks
executed. In some of our experiments it performed much worse (I don't have
the actual numbers, but it was in the 2x ballpark, IIRC). Of course, ours
was a highly constrained and specialized experiment anyway!
On the other hand, the reduction in the number of tasks can be
extremely useful for job times. In particular, in environments where
quotas are enforced on the number of tasks, or on the number of files
(for map-only output), the benefits can be pretty good.
I have yet to look at the patch in detail, but from what I recall,
performance could be improved by being more intelligent about
clustering splits based on the 'locations' returned for the combined
multiple-split, to ensure maximal data locality for the contained
splits. Not sure if that made it into the final version ...
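
To make the clustering idea concrete, here is a rough sketch of what I mean. It is NOT what the PIG-1518 patch does (I have not checked); the function name, the (name, size, hosts) tuple layout and the size cap are all made up for illustration. It buckets splits by their first reported host and packs each bucket up to a size cap, so every combined split only contains splits that share a host.

```python
from collections import defaultdict


def combine_by_location(splits, max_combined_size):
    """Greedily group splits that share a host (hypothetical helper).

    splits: list of (name, size_bytes, hosts) tuples; this layout is an
            assumption for illustration, not Pig's or Hadoop's API.
    Returns a list of (host, [split names]) combined splits.
    """
    by_host = defaultdict(list)
    # Simplification: bucket each split under the first host it reports.
    for name, size, hosts in splits:
        key = hosts[0] if hosts else None
        by_host[key].append((name, size))

    combined = []
    for host, members in by_host.items():
        current, current_size = [], 0
        for name, size in members:
            # Close the current group once adding this split would exceed the cap.
            if current and current_size + size > max_combined_size:
                combined.append((host, current))
                current, current_size = [], 0
            current.append(name)
            current_size += size
        if current:
            combined.append((host, current))
    return combined


if __name__ == "__main__":
    example = [
        ("a", 40, ["h1", "h2"]),
        ("b", 40, ["h2"]),
        ("c", 40, ["h1"]),
        ("d", 40, ["h1"]),
    ]
    for host, names in combine_by_location(example, 100):
        print(host, names)
```

A real version would also have to consider all replica locations (not just the first) and rack-level locality, and decide what to do with small leftovers per host, which is where the tradeoff against task count comes back in.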
On Friday 29 October 2010 05:31 PM, Uppuluri, Rohini wrote:
I came across this patch
(https://issues.apache.org/jira/browse/PIG-1518), which supports a multifile input format from Pig 0.8 onwards.
A patch is also available for Pig 0.7. I was wondering if anyone has tried out the patch with Pig 0.7 and could share any notes on the performance improvements it brings.