Hi,

I have a question about how Pig handles queries that join more than
two tables.

Suppose we have three tables A, B, and C, and the plan is "((AB)C)".
We can join A and B in one Map-Reduce job and then join the resulting table
with C. Is the result of "AB" stored to disk before it is joined with C, or
is it streamed directly into the join with C? (I don't know how that would
work; it's just a guess.)

Any help is appreciated.

Thanks


•  at Feb 3, 2010 at 11:48 am ⇧
If you explicitly join three or more relations with a single command ("d =
JOIN a BY id, b BY id, c BY id;"), a and b will be buffered for each
key, while c, the rightmost relation, will be streamed.

This is on a per-reducer basis. There is of course a lot of I/O
involved in getting data from the mappers to the reducers, but none of it is
the intermediate result of joining A to B.
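
To make the buffering and streaming concrete, here is a minimal Python sketch of what a single reducer does for a three-way join on one key. It is only a model, not Pig's implementation: the names `shuffle` and `reduce_join` are made up for illustration, and it assumes that the rows for each key reach the reducer ordered by input, so that rows from c arrive last.

```python
from collections import defaultdict
from itertools import product


def shuffle(relations):
    """Model the map-to-reduce shuffle: group (input_index, value) pairs by key.

    `relations` is a list of lists of (key, value) rows. Relations are visited
    in order, so within each key the last relation's rows come last (assumed).
    """
    groups = defaultdict(list)
    for index, rows in enumerate(relations):
        for key, value in rows:
            groups[key].append((index, value))
    return groups


def reduce_join(key, tagged_rows, num_inputs):
    """Inner-join all inputs for one key, buffering every input but the last."""
    buffers = [[] for _ in range(num_inputs - 1)]
    for index, value in tagged_rows:
        if index < num_inputs - 1:
            buffers[index].append(value)  # a and b are held in memory
        else:
            # c is streamed: each row is combined with the buffers and emitted
            for combo in product(*buffers):
                yield (key, *combo, value)


a = [(1, "a1"), (2, "a2")]
b = [(1, "b1"), (1, "b2")]
c = [(1, "c1"), (1, "c2"), (2, "c3")]

joined = []
for key, tagged in sorted(shuffle([a, b, c]).items(), key=lambda kv: kv[0]):
    joined.extend(reduce_join(key, tagged, num_inputs=3))
```

Note that no "a joined with b" table is ever built: the only state kept per key is the raw rows of a and b, and each output row is produced as a row of c passes through. Key 2 yields nothing here because b has no row for it.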

-Dmitriy

•  at Feb 3, 2010 at 12:18 pm ⇧
Dmitriy,

Suppose the command is like (A on a, B on b1 and B on b2, C on c). Then
the intermediate join of AB has to be stored to disk, right?

Thanks
•  at Feb 3, 2010 at 2:53 pm ⇧
Hi Bharath,
I am not sure what you mean. (A on a, B on b1 and B on b2, C on c)
is not valid Pig syntax.

Note that unlike an SQL join, a Pig join is based strictly on equality
of one (possibly multi-valued) key.

Meaning, where in SQL you would say:

select * from a, b, c where a.id=b.id and a.id2 = c.id2

(leaving the optimizer to figure out if it wants to do ((a join b)
join c), or ((b cross c) join a), or ((a join c) join b), etc)

In Pig you would explicitly state the order of joins:

ab = JOIN a BY id, b BY id;
abc = JOIN ab BY a::id2, c BY id2;

If this is what you are talking about -- yes, ab will be materialized,
as the second join requires a new Map-Reduce stage (there is a new key
that the whole relation needs to be partitioned on).
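
The two-stage case can be sketched in Python as two separate simulated jobs. This is a rough model under stated assumptions, not Pig's execution engine: `mapreduce_join` is a hypothetical helper, rows are plain dicts, and the list returned by the first call stands in for the output that the first job writes out before the second job can read it.

```python
from collections import defaultdict


def mapreduce_join(left, right, left_key, right_key):
    """One simulated Map-Reduce join job: partition both inputs on a key, then join."""
    partitions = defaultdict(lambda: ([], []))
    for row in left:
        partitions[row[left_key]][0].append(row)
    for row in right:
        partitions[row[right_key]][1].append(row)
    # The returned list stands in for the job's materialized output.
    return [
        {**lrow, **rrow}
        for _, (lrows, rrows) in sorted(partitions.items(), key=lambda kv: kv[0])
        for lrow in lrows
        for rrow in rrows
    ]


# Field names are prefixed per relation purely to keep the merged dicts unambiguous.
a = [{"a_id": 1, "a_id2": 10}, {"a_id": 2, "a_id2": 20}]
b = [{"b_id": 1, "b_val": "x"}, {"b_id": 2, "b_val": "y"}]
c = [{"c_id2": 20, "c_val": "z"}]

ab = mapreduce_join(a, b, "a_id", "b_id")      # job 1: partitioned on id
abc = mapreduce_join(ab, c, "a_id2", "c_id2")  # job 2: re-partitioned on id2
```

The second call cannot start until `ab` exists in full, because every row of `ab` has to be routed to a partition chosen by `a_id2`, a key the first job did not partition on.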

If, however, you mean simply joining multiple relations on the same
key, as described earlier -- no, nothing is materialized unless you
count the regular IO that needs to happen for a standard Map-Reduce
join, and any possible spills to disk required when buffers run out of
memory and such.

Hope this helps.

Dmitriy

•  at Feb 4, 2010 at 4:15 am ⇧
Dmitriy,

Thanks for your reply. That was what I wanted to know. I am new to Pig, so I
couldn't express it clearly. Got it.

Thanks


Discussion Overview
Group: user | Categories: pig, hadoop | Posted: Feb 3, '10 at 6:53a | Active: Feb 4, '10 at 4:15a | Posts: 5 | Users: 2 | Website: pig.apache.org
