FAQ
Hi ,

I have a small doubt in how pig handles queries containing join of more than
2 tables .

Suppose we have 3 tables A,B,C .. and the plan is "((AB)C)" ..
We can join A,B in a map reduce job and join the resultant table with "C". I
have a doubt whether the result of "AB" is stored to disk before joining
with C or is it streamed directly to join with C (I dont know how , just a
guess) .

Any help is appreciated ,

Thanks

Search Discussions

  • Dmitriy Ryaboy at Feb 3, 2010 at 11:48 am
    if you explicitly join 3 or more relations with a single command ("d =
    join a on id, b on id, c on id;"), a and b will be buffered for each
    key, while c, the rightmost relation, will be streamed.

    This is on a per-reducer basis. There is of course a whole lot of IO
    going on for getting from the Mappers to Reducers, but none of it is
    the intermediate result of joining A to B.

    -Dmitriy

    On Tue, Feb 2, 2010 at 10:52 PM, bharath v
    wrote:
    Hi ,

    I have a small doubt in how pig handles queries containing join of more than
    2 tables .

    Suppose we have 3 tables A,B,C .. and the plan is  "((AB)C)" ..
    We can join A,B in a map reduce job and join the resultant table with "C". I
    have a doubt whether the result of "AB" is stored to disk before joining
    with C or is it streamed directly to join with C (I dont know how , just a
    guess) .

    Any help is appreciated ,

    Thanks
  • Bharath v at Feb 3, 2010 at 12:18 pm
    Dimitry,

    Suppose the command is like (A on a , B on b1 and B on b2 , C on c) .. Then
    it requires storing the intermediate join of AB on to disk right?

    Thanks
    On Wed, Feb 3, 2010 at 5:18 PM, Dmitriy Ryaboy wrote:

    if you explicitly join 3 or more relations with a single command ("d =
    join a on id, b on id, c on id;"), a and b will be buffered for each
    key, while c, the rightmost relation, will be streamed.

    This is on a per-reducer basis. There is of course a whole lot of IO
    going on for getting from the Mappers to Reducers, but none of it is
    the intermediate result of joining A to B.

    -Dmitriy

    On Tue, Feb 2, 2010 at 10:52 PM, bharath v
    wrote:
    Hi ,

    I have a small doubt in how pig handles queries containing join of more than
    2 tables .

    Suppose we have 3 tables A,B,C .. and the plan is "((AB)C)" ..
    We can join A,B in a map reduce job and join the resultant table with "C". I
    have a doubt whether the result of "AB" is stored to disk before joining
    with C or is it streamed directly to join with C (I dont know how , just a
    guess) .

    Any help is appreciated ,

    Thanks
  • Dmitriy Ryaboy at Feb 3, 2010 at 2:53 pm
    Hi Bharath,
    I am not sure what you mean. (A on a , B on b1 and B on b2 , C on c)
    is not valid Pig syntax.

    Note that unlike an SQL join, a Pig join is based strictly on equality
    of one (possibly multi-valued) key.

    Meaning, where in sql you say:

    select * from a, b, c where a.id=b.id and a.id2 = c.id2

    (leaving the optimizer to figure out if it wants to do ((a join b)
    join c), or ((b cross c) join a), or ((a join c) join b), etc)

    In Pig you would explicitly state the order of joins:

    ab = JOIN a on id, b on id;
    abc = JOIN ab on a::id2, c on id2;

    If this is what you are talking about -- yes, ab will be materialized,
    as the second join requires a new Map-Reduce stage (there is a new key
    that the whole relation needs to be partitioned on).

    If, however, you mean simply joining multiple relations on the same
    key, as described earlier -- no, nothing is materialized unless you
    count the regular IO that needs to happen for a standard Map-Reduce
    join, and any possible spills to disk required when buffers run out of
    memory and such.

    Hope this helps.

    Dmitriy

    On Wed, Feb 3, 2010 at 4:17 AM, bharath v
    wrote:
    Dimitry,

    Suppose the command is like (A on a , B on b1 and B on b2 , C on c) .. Then
    it requires storing the intermediate join of AB on to disk right?

    Thanks
    On Wed, Feb 3, 2010 at 5:18 PM, Dmitriy Ryaboy wrote:

    if you explicitly join 3 or more relations with a single command ("d =
    join a on id, b on id, c on id;"), a and b will be buffered for each
    key, while c, the rightmost relation, will be streamed.

    This is on a per-reducer basis. There is of course a whole lot of IO
    going on for getting from the Mappers to Reducers, but none of it is
    the intermediate result of joining A to B.

    -Dmitriy

    On Tue, Feb 2, 2010 at 10:52 PM, bharath v
    wrote:
    Hi ,

    I have a small doubt in how pig handles queries containing join of more than
    2 tables .

    Suppose we have 3 tables A,B,C .. and the plan is  "((AB)C)" ..
    We can join A,B in a map reduce job and join the resultant table with "C". I
    have a doubt whether the result of "AB" is stored to disk before joining
    with C or is it streamed directly to join with C (I dont know how , just a
    guess) .

    Any help is appreciated ,

    Thanks
  • Bharath v at Feb 4, 2010 at 4:15 am
    Dimitry ,

    Thanks for your reply . That was what I wanted .. I am new to pig , so i
    couldn't express it . Got it .

    Thanks


    On Wed, Feb 3, 2010 at 8:23 PM, Dmitriy Ryaboy wrote:

    Hi Bharath,
    I am not sure what you mean. (A on a , B on b1 and B on b2 , C on c)
    is not valid Pig syntax.

    Note that unlike an SQL join, a Pig join is based strictly on equality
    of one (possibly multi-valued) key.

    Meaning, where in sql you say:

    select * from a, b, c where a.id=b.id and a.id2 = c.id2

    (leaving the optimizer to figure out if it wants to do ((a join b)
    join c), or ((b cross c) join a), or ((a join c) join b), etc)

    In Pig you would explicitly state the order of joins:

    ab = JOIN a on id, b on id;
    abc = JOIN ab on a::id2, c on id2;

    If this is what you are talking about -- yes, ab will be materialized,
    as the second join requires a new Map-Reduce stage (there is a new key
    that the whole relation needs to be partitioned on).

    If, however, you mean simply joining multiple relations on the same
    key, as described earlier -- no, nothing is materialized unless you
    count the regular IO that needs to happen for a standard Map-Reduce
    join, and any possible spills to disk required when buffers run out of
    memory and such.

    Hope this helps.

    Dmitriy

    On Wed, Feb 3, 2010 at 4:17 AM, bharath v
    wrote:
    Dimitry,

    Suppose the command is like (A on a , B on b1 and B on b2 , C on c) .. Then
    it requires storing the intermediate join of AB on to disk right?

    Thanks
    On Wed, Feb 3, 2010 at 5:18 PM, Dmitriy Ryaboy wrote:

    if you explicitly join 3 or more relations with a single command ("d =
    join a on id, b on id, c on id;"), a and b will be buffered for each
    key, while c, the rightmost relation, will be streamed.

    This is on a per-reducer basis. There is of course a whole lot of IO
    going on for getting from the Mappers to Reducers, but none of it is
    the intermediate result of joining A to B.

    -Dmitriy

    On Tue, Feb 2, 2010 at 10:52 PM, bharath v
    wrote:
    Hi ,

    I have a small doubt in how pig handles queries containing join of
    more
    than
    2 tables .

    Suppose we have 3 tables A,B,C .. and the plan is "((AB)C)" ..
    We can join A,B in a map reduce job and join the resultant table with "C". I
    have a doubt whether the result of "AB" is stored to disk before
    joining
    with C or is it streamed directly to join with C (I dont know how ,
    just
    a
    guess) .

    Any help is appreciated ,

    Thanks

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedFeb 3, '10 at 6:53a
activeFeb 4, '10 at 4:15a
posts5
users2
websitepig.apache.org

2 users in discussion

Bharath v: 3 posts Dmitriy Ryaboy: 2 posts

People

Translate

site design / logo © 2022 Grokbase