I propose that Pig develop a standard set of benchmark queries that can
be run from release to release to measure Pig's (hopefully improving)
performance over time. This would be similar in nature to Hadoop's
and http://developer.yahoo.com/blogs/hadoop/). This set should be
relatively small (probably under 10 queries), but it should cover the
range of operations that Pig users actually run.
So, if you have queries that you think would be good candidates and that
you can share (or obfuscate and then share), please do so. In addition
to the query, please give some idea of the type of data it runs over.
In particular, we need to know how much data there is, how many fields
each record has, and the cardinality and distribution of any fields used
as a group, cogroup, or sort key.
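
For anyone unsure how to gather these numbers, here is a rough Python
sketch of one way to summarize a sample of the input. It assumes
tab-delimited text with one record per line; the function name
profile_key, its parameters, and the sample rows are made up for
illustration and are not part of Pig. Running it over a representative
sample is usually enough to describe the key distribution.

```python
import csv
import io
from collections import Counter


def profile_key(lines, key_index, delimiter="\t"):
    """Summarize row count, field count, and the distribution of one key field.

    `lines` is any iterable of text lines (an open file works).
    `key_index` is the zero-based position of the group/cogroup/sort key.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    counts = Counter()
    rows = 0
    max_fields = 0
    for record in reader:
        if not record:
            continue  # skip blank lines
        rows += 1
        max_fields = max(max_fields, len(record))
        # Records too short to contain the key are counted under None.
        key = record[key_index] if key_index < len(record) else None
        counts[key] += 1
    top_key, top_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "rows": rows,
        "fields": max_fields,
        "cardinality": len(counts),  # number of distinct key values
        "top_key": top_key,  # most frequent key value
        "top_key_share": top_count / rows if rows else 0.0,  # skew indicator
    }


# Hypothetical sample data: name, count, country.
sample = io.StringIO("alice\t10\tus\nbob\t20\tus\ncarol\t5\tde\n")
summary = profile_key(sample, key_index=2)
print(summary)
```

A high top_key_share relative to 1/cardinality suggests a skewed key,
which is worth mentioning when you describe your data.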