Yuntao Jia, our intern this summer, ran a simple performance benchmark of Hadoop, Hive and Pig based on the queries in the SIGMOD 2009 paper "A Comparison of Approaches to Large-Scale Data Analysis".
The report and the performance test kit are both attached here:
We tried our best to get good performance out of Hive and Pig, and we kept the Hadoop program as close as possible to the one in the SIGMOD paper. We welcome all suggestions on how to improve performance further, whether by changing the configuration or by improving the code.
While we tried our best to be fair, system settings and environments do affect the results a lot. So we encourage everybody to try out the performance test kit on their own cluster, and we would appreciate it if everybody could share their results.
Here is the summary. The details are in the report hive_benchmark_2009-06-18.pdf from the link above.
Query: GREP SELECT
Query: RANKINGS SELECT
Query: USERVISITS AGGREGATION
Query: RANKINGS USERVISITS JOIN
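For readers who do not have the paper at hand, here is a rough, in-memory Python sketch of the kind of work each of the four queries above does. It is an illustration only, not the code from the test kit: the column names (pageURL, pageRank, sourceIP, destURL, visitDate, adRevenue), the search pattern, the rank threshold, the date window and the toy rows are all assumptions modeled on the task descriptions in the SIGMOD paper, and the function names are made up for this example.

```python
from collections import defaultdict

# Toy data. Schemas are assumed from the SIGMOD paper's task descriptions.
grep_rows = [("k1", "aaaXYZbbb"), ("k2", "nothing here")]   # (key, field)
rankings = [("url1", 5), ("url2", 42), ("url3", 11)]        # (pageURL, pageRank)
uservisits = [                                              # (sourceIP, destURL, visitDate, adRevenue)
    ("1.1.1.1", "url2", "2000-01-16", 1.5),
    ("1.1.1.1", "url3", "2000-01-17", 2.0),
    ("2.2.2.2", "url2", "2000-01-18", 0.25),
    ("2.2.2.2", "url1", "2001-05-01", 9.0),
]


def grep_select(rows, pattern):
    """GREP SELECT: keep rows whose field contains the pattern."""
    return [row for row in rows if pattern in row[1]]


def rankings_select(rows, threshold):
    """RANKINGS SELECT: (pageRank, pageURL) for pages ranked above a threshold."""
    return [(rank, url) for url, rank in rows if rank > threshold]


def uservisits_aggregation(rows):
    """USERVISITS AGGREGATION: total adRevenue per sourceIP."""
    totals = defaultdict(float)
    for source_ip, _dest_url, _visit_date, ad_revenue in rows:
        totals[source_ip] += ad_revenue
    return dict(totals)


def rankings_uservisits_join(rank_rows, visit_rows, start, end):
    """RANKINGS USERVISITS JOIN: among visits inside [start, end], join on
    destURL = pageURL and return (sourceIP, avg pageRank, total adRevenue)
    for the sourceIP with the highest total revenue, or None if no rows match."""
    rank_by_url = dict(rank_rows)
    stats = defaultdict(lambda: [0.0, 0, 0.0])  # [revenue, visit count, rank sum]
    for source_ip, dest_url, visit_date, ad_revenue in visit_rows:
        # ISO dates compare correctly as plain strings.
        if start <= visit_date <= end and dest_url in rank_by_url:
            entry = stats[source_ip]
            entry[0] += ad_revenue
            entry[1] += 1
            entry[2] += rank_by_url[dest_url]
    if not stats:
        return None
    best_ip, (revenue, count, rank_sum) = max(stats.items(), key=lambda kv: kv[1][0])
    return (best_ip, rank_sum / count, revenue)


if __name__ == "__main__":
    print(grep_select(grep_rows, "XYZ"))
    print(rankings_select(rankings, 10))
    print(uservisits_aggregation(uservisits))
    print(rankings_uservisits_join(rankings, uservisits, "2000-01-15", "2000-01-22"))
```

The actual Hive, Pig and Hadoop implementations are in the test kit and the report; this sketch only shows the shape of each query (filter, filter with projection, group-by aggregation, and join followed by aggregation).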
Please take a look at hive_benchmark_2009-06-18.pdf from the link above for details. Let's keep discussions on http://issues.apache.org/jira/browse/HIVE-396 so it's easier to keep track.