Alan Gates (JIRA) at Sep 23, 2008 at 9:20 pm
[ https://issues.apache.org/jira/browse/PIG-450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-450:
---------------------------
Attachment: PIG-450.patch
This patch adds a combiner step to distinct that removes duplicate values, so that less data is carried from map to reduce. Here are the resulting time differences (all times in seconds):
|| Num records || Num keys || Num reducers || 1.4 || 2.0 || 2.0 with this patch ||
| 200M | 60 | 1 | 2547 | 1388 | 142 |
| 200M | 16M | 50 | 384 | 227 | 231 |
The main benefit is with a small number of keys, but there does not appear to be a penalty with a larger number of keys.
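To put those timings in relative terms, here is a small sketch that computes the ratios from the table. The figures are copied from the table above; the `speedup` helper and the dictionary keys are names made up for this illustration.

```python
# Timings in seconds, copied from the table above.
# The dictionary keys are illustrative names, not anything defined by Pig.
timings = [
    {"keys": "60", "reducers": 1, "v1_4": 2547, "v2_0": 1388, "patched": 142},
    {"keys": "16M", "reducers": 50, "v1_4": 384, "v2_0": 227, "patched": 231},
]


def speedup(baseline_seconds, patched_seconds):
    """Return how many times faster the patched run is than the baseline."""
    return baseline_seconds / patched_seconds


for row in timings:
    vs_1_4 = speedup(row["v1_4"], row["patched"])
    vs_2_0 = speedup(row["v2_0"], row["patched"])
    print(f"{row['keys']} keys: {vs_1_4:.2f}x vs 1.4, {vs_2_0:.2f}x vs 2.0")
```

With 60 keys the patched run comes out roughly 9.8 times faster than 2.0, while with 16M keys the ratio is about 0.98, a difference of four seconds on these single measurements.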
PERFORMANCE: Distinct should make use of combiner to remove duplicate values from keys.
----------------------------------------------------------------------------------------
Key: PIG-450
URL: https://issues.apache.org/jira/browse/PIG-450
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Alan Gates
Fix For: types_branch
Attachments: PIG-450.patch
In 2.0 distinct was improved by removing values in the map and just passing an empty tuple along with the key. This can be further improved by adding a combiner step that passes along only the first empty tuple instead of all of them.
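The idea can be sketched as a toy, in-memory simulation. This is not Pig's implementation or the content of the patch: `map_phase`, `combine`, `reduce_phase`, and `run_distinct` are hypothetical names, and each inner list stands in for the input of one map task.

```python
EMPTY = ()  # stand-in for the empty tuple passed along with each key


def map_phase(records):
    """Emit (key, empty tuple) pairs: the record itself becomes the key."""
    return [(record, EMPTY) for record in records]


def combine(map_output):
    """Keep only the first empty tuple seen for each key in one map's output."""
    first_seen = {}
    for key, value in map_output:
        first_seen.setdefault(key, value)
    return list(first_seen.items())


def reduce_phase(shuffled):
    """Emit each key once, whatever number of values arrived for it."""
    return sorted({key for key, _ in shuffled})


def run_distinct(splits, use_combiner):
    """Return (distinct keys, number of pairs carried from map to reduce)."""
    shuffled = []
    for split in splits:
        output = map_phase(split)
        if use_combiner:
            output = combine(output)
        shuffled.extend(output)
    return reduce_phase(shuffled), len(shuffled)


splits = [["a", "b", "a", "a"], ["b", "b", "c"]]
print(run_distinct(splits, use_combiner=False))  # (['a', 'b', 'c'], 7)
print(run_distinct(splits, use_combiner=True))   # (['a', 'b', 'c'], 4)
```

Both runs produce the same distinct keys; only the number of pairs crossing from map to reduce changes. The combiner can drop at most the duplicates within a single map's output, which fits the observation that the gain is largest when there are few keys.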
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.