Hi Bharath,
I am not sure what you mean: (A on a, B on b1 and B on b2, C on c)
is not valid Pig syntax.
Note that unlike an SQL join, a Pig join is based strictly on equality
of one (possibly multi-column) key.
That is, where in SQL you would write:
select * from a, b, c where a.id=b.id and a.id2 = c.id2
(leaving the optimizer to figure out if it wants to do ((a join b)
join c), or ((b cross c) join a), or ((a join c) join b), etc)
In Pig you would explicitly state the order of joins:
ab = JOIN a BY id, b BY id;
abc = JOIN ab BY a::id2, c BY id2;
If this is what you are talking about -- yes, ab will be materialized,
as the second join requires a new Map-Reduce stage (there is a new key
that the whole relation needs to be partitioned on).
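
To make the two-stage case concrete, here is a minimal Python sketch (not Pig's actual implementation). The helper equi_join and the sample rows are hypothetical; the point is only that the first join's full output has to exist before the second join can repartition it on a different key.

```python
from collections import defaultdict

def equi_join(left, right, left_key, right_key):
    """Inner equi-join of two lists of dicts; returns a fully built list.

    Each call stands in for one Map-Reduce job: the whole result is
    produced before anything downstream can consume it.
    """
    index = defaultdict(list)
    for row in left:
        index[row[left_key]].append(row)
    return [{**l, **r} for r in right for l in index.get(r[right_key], [])]

# Hypothetical sample data, with Pig-style prefixed field names.
a = [{"a::id": 1, "a::id2": 10}, {"a::id": 2, "a::id2": 20}]
b = [{"b::id": 1, "b::val": "x"}]
c = [{"c::id2": 10, "c::val": "y"}]

# Stage 1: partitioned on id. The result ab is materialized.
ab = equi_join(a, b, "a::id", "b::id")
# Stage 2: ab is repartitioned on a different key (id2) and joined with c.
abc = equi_join(ab, c, "a::id2", "c::id2")
```

In a real cluster the stage-1 output would be written to HDFS rather than held in a Python list; the list here only marks where that boundary falls.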
If, however, you mean simply joining multiple relations on the same
key, as described earlier -- no, nothing is materialized unless you
count the regular IO that needs to happen for a standard Map-Reduce
join, and any spills to disk that occur when in-memory buffers fill
up.
Hope this helps.
Dmitriy
On Wed, Feb 3, 2010 at 4:17 AM, bharath v wrote:
Dmitriy,
Suppose the command is like (A on a, B on b1 and B on b2, C on c). Then
it requires storing the intermediate join of AB to disk, right?
Thanks
On Wed, Feb 3, 2010 at 5:18 PM, Dmitriy Ryaboy wrote:
If you explicitly join 3 or more relations with a single command ("d =
JOIN a BY id, b BY id, c BY id;"), a and b will be buffered for each
key, while c, the rightmost relation, will be streamed.
This is on a per-reducer basis. There is of course a whole lot of IO
going on for getting from the Mappers to Reducers, but none of it is
the intermediate result of joining A to B.
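
As a rough illustration of the per-key behaviour described above, the following Python sketch shows what a reducer conceptually does for one join key. It is a simplification under stated assumptions, not Pig's code: reduce_one_key and the sample rows are hypothetical, and spilling to disk is ignored.

```python
from itertools import product

def reduce_one_key(buffered_groups, streamed_rows):
    """Emit joined rows for a single join key.

    buffered_groups: one in-memory list of rows per left-hand relation
                     (a and b in the example above).
    streamed_rows:   an iterator over the rightmost relation's rows (c),
                     consumed exactly once and never stored.
    """
    for c_row in streamed_rows:
        # Combine the current streamed row with every buffered combination.
        for combo in product(*buffered_groups):
            joined = tuple(field for row in combo for field in row)
            yield joined + tuple(c_row)

# Hypothetical rows that all share join key 1.
a_rows = [(1, "a1"), (1, "a2")]
b_rows = [(1, "b1")]
c_rows = iter([(1, "c1"), (1, "c2")])

result = list(reduce_one_key([a_rows, b_rows], c_rows))
```

Because only the buffered relations are held in memory, it generally pays to list the relation with the most rows per key last, so that it is the one being streamed.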
-Dmitriy
On Tue, Feb 2, 2010 at 10:52 PM, bharath v wrote:
Hi,
I have a small question about how Pig handles queries that join more than
2 tables.
Suppose we have 3 tables A, B, C, and the plan is "((AB)C)".
We can join A and B in a Map-Reduce job and join the resulting table with C.
Is the result of "AB" stored to disk before joining with C, or is it
streamed directly into the join with C (I don't know how, just a
guess)?
Any help is appreciated.
Thanks