------------------------------------------------------------------------------
Key: PIG-1972
URL: https://issues.apache.org/jira/browse/PIG-1972
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.8.0
Environment: Pig 0.8 version with PigMix http://wiki.apache.org/pig/PigMix
Reporter: Rajesh Balamohan
While running scalability benchmarks with Pig 0.8 & PigMix, L14 query listed in http://wiki.apache.org/pig/PigMix showed no scalability characteristics (i.e, for the same problem size response time should decrease as we increase the number of nodes)
Investigating further revealed that L14 query merge-joins small dataset and another large dataset. If the small dataset has many part files with very little amount of data, it causes a huge pressure on NameNode. This is because it is read as a side file in all map slows.
In the environment where I ran the experiment, small dataset was spread across 1900+ part files in HDFS.
Following codepath has the perf issue.
DefaultIndexableLoader--> seekNear() --> initRightLoader() is causing the huge delay. Since
"users_sorted" data is spread across 1900+ small files, FileInputFormat.getSplits() hits the namenode too
frequently.
i.e, (number of machines * number of map slots * 1900+ times). This is the reason why L14 is not scaling up.
Suggestion would be to cache the splitInformation of the small dataset instead of hitting the namenode too frequently.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira