Pi Song (JIRA)
Apr 19, 2008 at 3:21 pm
Pi Song commented on PIG-211:
These might be useful to you:
1) What really happens in our Pig MapReduce execution engine is that all the records on both sides are partitioned into a number of buckets based on the sort key. A local sort then runs anyway as part of Reduce. (We can do it this way because at the moment we only support equi-joins.) Statistically, the data in each bucket should not be too big, though there could be some data skew. If some buckets are still too big, one possible remedy is a second bucketing function that slices them further into smaller buckets. A parameterized partitioner could be used as well, but I don't think Hadoop currently supports it :(
2) One easy way to do what you've suggested is to use a UDF that reads from the small table file. The small table file can be shipped to all the processing nodes using a mechanism similar to what we've got in Pig Streaming (see Pig Streaming SHIP in the Pig Wiki). I'm really starting to think that the SHIP construct should not be limited to Streaming.
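To make the fallback in point 1 concrete, here is a minimal sketch of two-level bucketing in plain Python. It is not Pig or Hadoop code: the names `stable_hash`, `bucket_records` and `split_oversized`, the MD5-based hash, and the size threshold are all assumptions made for illustration. Records from both sides of the join are assumed to arrive as (key, payload) pairs in one stream, so that the decision to split a bucket applies to both sides at once.

```python
import hashlib
from collections import defaultdict


def stable_hash(key, salt=""):
    """Deterministic hash of a key; the salt yields an independent second function."""
    digest = hashlib.md5((salt + repr(key)).encode("utf-8")).hexdigest()
    return int(digest, 16)


def bucket_records(records, num_buckets, salt=""):
    """First-level bucketing: group (key, payload) pairs by hash of the key."""
    buckets = defaultdict(list)
    for key, payload in records:
        buckets[stable_hash(key, salt) % num_buckets].append((key, payload))
    return buckets


def split_oversized(buckets, max_size, num_sub_buckets):
    """Re-bucket any bucket larger than max_size with a second hash function.

    Returns a dict keyed by (bucket_id, sub_bucket_id). Buckets that are
    small enough are kept whole under sub-bucket 0.
    """
    result = {}
    for bucket_id, rows in buckets.items():
        if len(rows) <= max_size:
            result[(bucket_id, 0)] = rows
            continue
        sub_buckets = bucket_records(rows, num_sub_buckets, salt="second")
        for sub_id, sub_rows in sub_buckets.items():
            result[(bucket_id, sub_id)] = sub_rows
    return result
```

Because the second function also hashes the key, rows with equal keys still land in the same final bucket, which an equi-join needs. The same property means this sketch does not help when the skew comes from a single very frequent key.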
This is part of the optimization work that hasn't started yet, though it's good that we've started a discussion. What's your opinion? Please keep giving us your ideas!
Replicating small tables for joins
Issue Type: New Feature
Reporter: John DeTreville
Joining a table A with a small table B can be disproportionately expensive if A must be sorted before the join, and the result must be sorted again. This effort can often be reduced or eliminated if table B is replicated in whole to all nodes.
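The replication idea in the description can be sketched roughly as follows, in plain Python rather than Pig Latin or the Hadoop API. The function names `build_lookup` and `replicated_join`, the tuple row layout, and the `key_index` parameter are hypothetical. The sketch assumes table B fits in memory on every node: each node builds a hash table from its copy of B once, then streams its share of A through it.

```python
from collections import defaultdict


def build_lookup(small_rows, key_index=0):
    """Load the replicated small table B into an in-memory hash table."""
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key_index]].append(row)
    return lookup


def replicated_join(large_rows, lookup, key_index=0):
    """Stream rows of the large table A and emit one joined row per match in B.

    Rows of A are visited in their existing order and never sorted.
    """
    for row in large_rows:
        for match in lookup.get(row[key_index], ()):
            yield row + match


# Example data (made up): B maps country codes to names, A holds events.
small_b = [("us", "United States"), ("fr", "France")]
large_a = [("us", 1), ("de", 2), ("fr", 3), ("us", 4)]
joined = list(replicated_join(large_a, build_lookup(small_b)))
```

Since the output follows A's input order, an A that is already sorted yields a sorted result, which is where the saving over sorting before and after the join would come from.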