Replicating small tables for joins
----------------------------------

Key: PIG-211
URL: https://issues.apache.org/jira/browse/PIG-211
Project: Pig
Issue Type: New Feature
Components: data
Reporter: John DeTreville
Priority: Minor


Joining a table A with a small table B can be disproportionately expensive if A must be sorted before the join, and the result must be sorted again. This effort can often be reduced or eliminated if table B is replicated in whole to all nodes.
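As a rough illustration of the idea (not Pig's implementation), the sketch below replicates B as an in-memory lookup table and joins each unsorted partition of A against it, so neither A nor the result needs to be sorted. The function names, the tuple layout (join key in position 0), and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def build_lookup(small_table, key_index=0):
    """Index the small table B by join key; each node would hold one such copy."""
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key_index]].append(row)
    return lookup


def replicated_join(partition_of_a, lookup_b, key_index=0):
    """Equi-join one unsorted partition of A against the in-memory copy of B."""
    for row_a in partition_of_a:
        # .get() avoids creating empty entries for keys missing from B.
        for row_b in lookup_b.get(row_a[key_index], ()):
            yield row_a + row_b


# Hypothetical data: A is split across two "nodes", B is small.
a_partitions = [
    [(1, "click"), (2, "view")],
    [(3, "click"), (1, "buy")],
]
b = [(1, "alice"), (2, "bob")]

lookup = build_lookup(b)  # built once per node from the replicated copy
joined = [row for part in a_partitions for row in replicated_join(part, lookup)]
print(joined)
```

The trade-off in this sketch is memory: every node must be able to hold all of B, which is why the approach only makes sense when B is small.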

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Pi Song (JIRA) at Apr 19, 2008 at 3:21 pm
    [ https://issues.apache.org/jira/browse/PIG-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590708#action_12590708 ]

    Pi Song commented on PIG-211:
    -----------------------------

    These might be useful for you:

    1) What really happens in our Pig MapReduce execution engine is that all the records on both sides are separated into a number of buckets based on the sort key. A local sort is then done anyway as part of Reduce (we can do it this way because at the moment we only support equi-joins). Statistically, the amount of data in each bucket will not be too big, though there could be some data skew. If some buckets are still too big, one possible remedy is to use a second bucketing function to slice them further into smaller buckets. A parameterized partitioner could be used as well, but I don't think Hadoop currently supports it :(

    2) One easy way to do what you've suggested is to use a UDF that reads from the small table file. The small table file can be shipped to all the processing nodes using a mechanism similar to what we've got in Pig Streaming (see Pig Streaming SHIP in the Pig Wiki). I'm really starting to think that the SHIP construct should not be limited to Streaming.

    This is part of optimization work that hasn't started yet, though it's good that we've started a discussion. What's your opinion? Please keep giving us your ideas!!
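The bucketing scheme described in point 1 of the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Pig's or Hadoop's code: keys are assumed to be non-negative integers in position 0 of each tuple, and the helper names, bucket counts, and the `max_bucket_rows` threshold are all hypothetical. Note that a second bucketing function only helps when an oversized bucket holds many distinct keys; all rows for a single hot key still end up together.

```python
from operator import itemgetter


def partition(rows, bucket_fn):
    """Group rows by the bucket id computed from their join key (position 0)."""
    buckets = {}
    for row in rows:
        buckets.setdefault(bucket_fn(row[0]), []).append(row)
    return buckets


def sort_merge_join(left, right):
    """Sort both sides of one bucket locally, then merge runs of equal keys."""
    left = sorted(left, key=itemgetter(0))
    right = sorted(right, key=itemgetter(0))
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows with this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend(left[i] + r for r in right[j:j_end])
                i += 1
            j = j_end
    return out


def bucketed_join(a, b, num_buckets=2, sub_buckets=2, max_bucket_rows=4):
    first = lambda k: k % num_buckets                    # primary bucketing
    second = lambda k: (k // num_buckets) % sub_buckets  # second function
    a_parts, b_parts = partition(a, first), partition(b, first)
    result = []
    for bucket_id in sorted(a_parts.keys() & b_parts.keys()):
        left, right = a_parts[bucket_id], b_parts[bucket_id]
        if len(left) + len(right) > max_bucket_rows:
            # Oversized bucket: slice it again with the second function.
            l_sub, r_sub = partition(left, second), partition(right, second)
            for sub_id in sorted(l_sub.keys() & r_sub.keys()):
                result.extend(sort_merge_join(l_sub[sub_id], r_sub[sub_id]))
        else:
            result.extend(sort_merge_join(left, right))
    return result


# Hypothetical sample data.
a = [(4, "a4"), (1, "a1"), (2, "a2"), (6, "a6"), (2, "a2b")]
b = [(2, "b2"), (1, "b1"), (6, "b6"), (5, "b5")]
result = bucketed_join(a, b)
print(result)
```

Because equal keys always land in the same bucket (and the same sub-bucket), each bucket can be joined independently, which is what makes the local sort sufficient for an equi-join.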

Discussion Overview
group: dev
categories: pig, hadoop
posted: Apr 17, '08 at 11:50p
active: Apr 19, '08 at 3:21p
posts: 2
users: 1
website: pig.apache.org
