Grokbase Groups Pig dev April 2011
Cache split information details for data with large number of small part files
------------------------------------------------------------------------------

Key: PIG-1972
URL: https://issues.apache.org/jira/browse/PIG-1972
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.8.0
Environment: Pig 0.8 version with PigMix http://wiki.apache.org/pig/PigMix
Reporter: Rajesh Balamohan


While running scalability benchmarks with Pig 0.8 and PigMix, the L14 query listed at http://wiki.apache.org/pig/PigMix did not scale: for a fixed problem size, response time should decrease as the number of nodes increases, but it did not.

Further investigation revealed that the L14 query merge-joins a small dataset with a large one. If the small dataset consists of many part files that each hold very little data, it puts heavy pressure on the NameNode, because the small dataset is read as a side file in every map slot.

In the environment where I ran the experiment, the small dataset was spread across 1900+ part files in HDFS.

The following code path has the performance issue:
DefaultIndexableLoader --> seekNear() --> initRightLoader() is causing the huge delay. Since the
"users_sorted" data is spread across 1900+ small files, FileInputFormat.getSplits() hits the NameNode too
frequently.

That is, the NameNode is hit (number of machines * number of map slots * 1900+) times. This is why L14 does not scale up.
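
To make the formula above concrete, here is a rough sketch of the arithmetic. Only the 1900 part-file figure comes from the report; the class name, the cluster sizes, and the slots-per-machine value are assumptions chosen for illustration. It shows the lookup count growing linearly with the number of machines, which is the opposite of what a scaling benchmark wants.

```java
// Back-of-the-envelope model of the request count described above.
// Only the 1900 part-file figure comes from the report; the cluster
// sizes and the slots-per-machine value are made-up inputs.
final class NameNodeLoadEstimate {

    /** One lookup per part file, per map slot, per machine. */
    static long lookups(long machines, long mapSlotsPerMachine, long partFiles) {
        return Math.multiplyExact(Math.multiplyExact(machines, mapSlotsPerMachine), partFiles);
    }

    public static void main(String[] args) {
        long partFiles = 1900;        // from the report ("1900+")
        long mapSlotsPerMachine = 4;  // assumed value
        for (long machines : new long[] {10, 20, 40}) {
            System.out.println(machines + " machines -> "
                    + lookups(machines, mapSlotsPerMachine, partFiles) + " lookups");
        }
    }
}
```

Under this model, doubling the machine count doubles the NameNode traffic for the same input.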


The suggestion is to cache the split information of the small dataset instead of hitting the NameNode so frequently.
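
A minimal sketch of the caching idea follows. It is not the Pig implementation and does not use Hadoop classes: SplitInfoCache, its loader function, and the string "splits" are all hypothetical stand-ins, with the loader playing the role of an expensive call such as FileInputFormat.getSplits(). Note also that this cache lives inside one JVM, so it only avoids repeated lookups within a single process; sharing split information across map tasks would need some other mechanism.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper, not part of Pig or Hadoop.
// Memoizes the split list per input path so the expensive loader
// (standing in for a NameNode-bound listing) runs once per path.
final class SplitInfoCache<S> {
    private final ConcurrentMap<String, List<S>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<S>> loader;

    SplitInfoCache(Function<String, List<S>> loader) {
        this.loader = loader;
    }

    /** Returns cached splits, invoking the loader only on the first request for a path. */
    List<S> getSplits(String inputPath) {
        return cache.computeIfAbsent(inputPath, loader);
    }
}

final class SplitInfoCacheDemo {
    public static void main(String[] args) {
        AtomicInteger expensiveCalls = new AtomicInteger();
        // Fake loader: counts invocations and returns placeholder split names.
        SplitInfoCache<String> cache = new SplitInfoCache<>(path -> {
            expensiveCalls.incrementAndGet();
            return Arrays.asList(path + "/part-00000", path + "/part-00001");
        });
        for (int i = 0; i < 5; i++) {
            cache.getSplits("users_sorted");
        }
        System.out.println("expensive listings performed: " + expensiveCalls.get());
    }
}
```

ConcurrentHashMap.computeIfAbsent is used so that concurrent callers asking for the same path do not each trigger the loader.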

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


  • Rajesh Balamohan (JIRA) at Apr 29, 2011 at 12:00 am
    [ https://issues.apache.org/jira/browse/PIG-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026786#comment-13026786 ]

    Rajesh Balamohan commented on PIG-1972:
    ---------------------------------------

    If the side data being read is small (e.g., <= one block), it is replicated on only 3 nodes by default. So when every map task tries to read the side data, all of those reads are served by just those 3 nodes, which become a bottleneck. One suggestion is to increase the replication factor of the side data being read. Alternatively, the side data can be loaded into the DistributedCache, as mentioned in this JIRA, to reduce the performance impact.
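
The replication point can be illustrated with a rough estimate. The sketch below assumes readers spread evenly across replicas, which real HDFS replica selection does not guarantee; the class name and the task counts are hypothetical, and only the default replication of 3 comes from the comment above.

```java
// Rough estimate of read contention on a small side file.
// Assumes reads are spread evenly over the replicas (a simplification).
final class ReplicaReadPressure {

    /** Average concurrent readers per replica, rounded up. */
    static long readersPerReplica(long concurrentMapTasks, int replicationFactor) {
        if (replicationFactor <= 0) {
            throw new IllegalArgumentException("replicationFactor must be positive");
        }
        return (concurrentMapTasks + replicationFactor - 1) / replicationFactor;
    }

    public static void main(String[] args) {
        long concurrentMapTasks = 300; // assumed value
        for (int replication : new int[] {3, 10, 30}) {
            System.out.println("replication " + replication + " -> about "
                    + readersPerReplica(concurrentMapTasks, replication)
                    + " readers per replica");
        }
    }
}
```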

Posted: Apr 7, 2011 at 3:07 am
Last active: Apr 29, 2011 at 12:00 am