Grokbase Groups Pig user August 2009
FAQ
Hello there,

Is it possible to do something like this by using one join? Thanks!

select *
from tbl_a, tbl_b, tbl_c
where tbl_a.b = tbl_b.b and tbl_a.c= tbl_c.c

Search Discussions

  • Dmitriy Ryaboy at Aug 7, 2009 at 6:56 pm
    A join takes an arbitrary number of relations.

    So yes.

    D = Join A on a, B on b, C on c [PARALLEL n]

    you really want to specify the number of reducers using the parallel
    keyword if you are running on a real cluster.

    Order maters! Put your smaller relations first.

    -D

    On Fri, Aug 7, 2009 at 11:44 AM, Yonggang Qiaowrote:
    Hello there,

    Is it possible to do something like this by using one join? Thanks!

    select *
    from tbl_a, tbl_b, tbl_c
    where tbl_a.b = tbl_b.b and tbl_a.c= tbl_c.c
  • Chris Olston at Aug 7, 2009 at 7:51 pm
    No, for this you'd have to use a cascade of two binary join expressions.

    (The predicate for joining tbl_a with tbl_b is not the same as the one for
    joining tbl_a with tbl_c.)

    -Chris

    On 8/7/09 11:56 AM, "Dmitriy Ryaboy" wrote:

    A join takes an arbitrary number of relations.

    So yes.

    D = Join A on a, B on b, C on c [PARALLEL n]

    you really want to specify the number of reducers using the parallel
    keyword if you are running on a real cluster.

    Order maters! Put your smaller relations first.

    -D

    On Fri, Aug 7, 2009 at 11:44 AM, Yonggang Qiaowrote:
    Hello there,

    Is it possible to do something like this by using one join? Thanks!

    select *
    from tbl_a, tbl_b, tbl_c
    where tbl_a.b = tbl_b.b and tbl_a.c= tbl_c.c
    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research
  • Dmitriy Ryaboy at Aug 7, 2009 at 8:25 pm
    Ah, sorry, missed that the predicates were different.

    Would be cool if there was a way to specify something that matches anything.

    So you could join A on b, c, B on b, _ignore_, C on c, _ignore_

    except with cleaner syntax -- maybe even the sql style "A, B, C on (
    A.b=B.b and A.c=C.c)" .

    This would still have to be done with 2 MR jobs, but the optimizer
    would get a crack at optimal ordering.

    On Fri, Aug 7, 2009 at 12:51 PM, Chris Olstonwrote:
    No, for this you'd have to use a cascade of two binary join expressions.

    (The predicate for joining tbl_a with tbl_b is not the same as the one for
    joining tbl_a with tbl_c.)

    -Chris

    On 8/7/09 11:56 AM, "Dmitriy Ryaboy" wrote:

    A join takes an arbitrary number of relations.

    So yes.

    D = Join A on a, B on b, C on c [PARALLEL n]

    you really want to specify the number of reducers using the parallel
    keyword if you are running on a real cluster.

    Order maters! Put your smaller relations first.

    -D

    On Fri, Aug 7, 2009 at 11:44 AM, Yonggang Qiaowrote:
    Hello there,

    Is it possible to do something like this by using one join? Thanks!

    select *
    from tbl_a, tbl_b, tbl_c
    where tbl_a.b = tbl_b.b and tbl_a.c= tbl_c.c
    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research



  • Chris Olston at Aug 7, 2009 at 8:41 pm
    Pig Latin is for cases in which the user has a particular ordering in mind
    (same philosophy as raw map-reduce). Later, a SQL layer will be added to Pig
    for cases in which you want to leave such decisions in the hands of an
    automated query optimizer.

    -Chris

    On 8/7/09 1:24 PM, "Dmitriy Ryaboy" wrote:

    Ah, sorry, missed that the predicates were different.

    Would be cool if there was a way to specify something that matches anything.

    So you could join A on b, c, B on b, _ignore_, C on c, _ignore_

    except with cleaner syntax -- maybe even the sql style "A, B, C on (
    A.b=B.b and A.c=C.c)" .

    This would still have to be done with 2 MR jobs, but the optimizer
    would get a crack at optimal ordering.

    On Fri, Aug 7, 2009 at 12:51 PM, Chris Olstonwrote:
    No, for this you'd have to use a cascade of two binary join expressions.

    (The predicate for joining tbl_a with tbl_b is not the same as the one for
    joining tbl_a with tbl_c.)

    -Chris

    On 8/7/09 11:56 AM, "Dmitriy Ryaboy" wrote:

    A join takes an arbitrary number of relations.

    So yes.

    D = Join A on a, B on b, C on c [PARALLEL n]

    you really want to specify the number of reducers using the parallel
    keyword if you are running on a real cluster.

    Order maters! Put your smaller relations first.

    -D

    On Fri, Aug 7, 2009 at 11:44 AM, Yonggang Qiao<yonggang.qiao@gmail.com>
    wrote:
    Hello there,

    Is it possible to do something like this by using one join? Thanks!

    select *
    from tbl_a, tbl_b, tbl_c
    where tbl_a.b = tbl_b.b and tbl_a.c= tbl_c.c
    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research



    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research
  • Yonggang Qiao at Aug 7, 2009 at 8:54 pm
    So after sql layer being added, this use case will be supported automatically? Cascading is fine, but I have ~10 star tables...

    Sent via BlackBerry by AT&T

    -----Original Message-----
    From: Chris Olston <olston@yahoo-inc.com>

    Date: Fri, 07 Aug 2009 13:41:26
    To: <pig-user@hadoop.apache.org>
    Subject: Re: join multiple dataset on different key sets


    Pig Latin is for cases in which the user has a particular ordering in mind
    (same philosophy as raw map-reduce). Later, a SQL layer will be added to Pig
    for cases in which you want to leave such decisions in the hands of an
    automated query optimizer.

    -Chris

    On 8/7/09 1:24 PM, "Dmitriy Ryaboy" wrote:

    Ah, sorry, missed that the predicates were different.

    Would be cool if there was a way to specify something that matches anything.

    So you could join A on b, c, B on b, _ignore_, C on c, _ignore_

    except with cleaner syntax -- maybe even the sql style "A, B, C on (
    A.b=B.b and A.c=C.c)" .

    This would still have to be done with 2 MR jobs, but the optimizer
    would get a crack at optimal ordering.

    On Fri, Aug 7, 2009 at 12:51 PM, Chris Olstonwrote:
    No, for this you'd have to use a cascade of two binary join expressions.

    (The predicate for joining tbl_a with tbl_b is not the same as the one for
    joining tbl_a with tbl_c.)

    -Chris

    On 8/7/09 11:56 AM, "Dmitriy Ryaboy" wrote:

    A join takes an arbitrary number of relations.

    So yes.

    D = Join A on a, B on b, C on c [PARALLEL n]

    you really want to specify the number of reducers using the parallel
    keyword if you are running on a real cluster.

    Order maters! Put your smaller relations first.

    -D

    On Fri, Aug 7, 2009 at 11:44 AM, Yonggang Qiao<yonggang.qiao@gmail.com>
    wrote:
    Hello there,

    Is it possible to do something like this by using one join? Thanks!

    select *
    from tbl_a, tbl_b, tbl_c
    where tbl_a.b = tbl_b.b and tbl_a.c= tbl_c.c
    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research



    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research
  • Chris Olston at Aug 7, 2009 at 9:00 pm
    Yes.

    On 8/7/09 1:53 PM, "yonggang.qiao@gmail.com" wrote:

    So after sql layer being added, this use case will be supported automatically?
    Cascading is fine, but I have ~10 star tables...

    Sent via BlackBerry by AT&T

    -----Original Message-----
    From: Chris Olston <olston@yahoo-inc.com>

    Date: Fri, 07 Aug 2009 13:41:26
    To: <pig-user@hadoop.apache.org>
    Subject: Re: join multiple dataset on different key sets


    Pig Latin is for cases in which the user has a particular ordering in mind
    (same philosophy as raw map-reduce). Later, a SQL layer will be added to Pig
    for cases in which you want to leave such decisions in the hands of an
    automated query optimizer.

    -Chris

    On 8/7/09 1:24 PM, "Dmitriy Ryaboy" wrote:

    Ah, sorry, missed that the predicates were different.

    Would be cool if there was a way to specify something that matches anything.

    So you could join A on b, c, B on b, _ignore_, C on c, _ignore_

    except with cleaner syntax -- maybe even the sql style "A, B, C on (
    A.b=B.b and A.c=C.c)" .

    This would still have to be done with 2 MR jobs, but the optimizer
    would get a crack at optimal ordering.

    On Fri, Aug 7, 2009 at 12:51 PM, Chris Olstonwrote:
    No, for this you'd have to use a cascade of two binary join expressions.

    (The predicate for joining tbl_a with tbl_b is not the same as the one for
    joining tbl_a with tbl_c.)

    -Chris

    On 8/7/09 11:56 AM, "Dmitriy Ryaboy" wrote:

    A join takes an arbitrary number of relations.

    So yes.

    D = Join A on a, B on b, C on c [PARALLEL n]

    you really want to specify the number of reducers using the parallel
    keyword if you are running on a real cluster.

    Order maters! Put your smaller relations first.

    -D

    On Fri, Aug 7, 2009 at 11:44 AM, Yonggang Qiao<yonggang.qiao@gmail.com>
    wrote:
    Hello there,

    Is it possible to do something like this by using one join? Thanks!

    select *
    from tbl_a, tbl_b, tbl_c
    where tbl_a.b = tbl_b.b and tbl_a.c= tbl_c.c
    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research



    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research


    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research
  • Yonggang Qiao at Aug 7, 2009 at 8:24 pm
    Hello there,

    Is it possible to do something like this by using one join? Thanks!

    select *
    from tbl_a, tbl_b, tbl_c
    where tbl_a.b = tbl_b.b and tbl_a.c= tbl_c.c

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedAug 7, '09 at 6:49p
activeAug 7, '09 at 9:00p
posts8
users3
websitepig.apache.org

People

Translate

site design / logo © 2021 Grokbase