Grokbase Groups Pig user April 2009
First, as an aside, this email really should be on pig-user rather
than pig-dev, as it's a usage question, not a development question.
So I've pushed it onto that list and replied to you directly in case
you're not on that list.

If I understand correctly you want to do a non-equijoin on the data.
That can be done as follows:

Table1 = LOAD 'Table1' AS (userid, ipaddress, date);
Table2 = LOAD 'Table2' AS (startip, endip);
Crossed = CROSS Table1, Table2;
Joined = FILTER Crossed BY ipaddress > startip AND ipaddress < endip;

Note that this will not be very efficient because it has to do a
complete cross product of all tuples. Pig does not currently support
non-equijoins. In general, non-equijoins are hard to do efficiently
in MapReduce because it's hard to get all of the appropriate keys
together in the same reducers.

So, if you're going to do this on very large data, it will be very slow.
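
To make the cost concrete, the CROSS-then-FILTER plan above can be modeled in a few lines of Python. This is only a sketch of the semantics, not Pig code: the sample rows and the helper names `ip_to_int` and `cross_then_filter` are invented for illustration, and it assumes IPv4 addresses stored as dotted-quad strings, which are converted to integers so that the comparisons are numeric rather than character by character.

```python
import ipaddress
from itertools import product


def ip_to_int(dotted):
    """Convert a dotted-quad IPv4 string to an integer for numeric comparison."""
    return int(ipaddress.IPv4Address(dotted))


def cross_then_filter(table1, table2):
    """Model of CROSS followed by FILTER: every pair is built, then tested.

    table1 rows: (userid, ipaddress, date); table2 rows: (startip, endip).
    Returns (matching joined rows, number of pairs examined).
    """
    joined = []
    examined = 0
    for (userid, ip, date), (startip, endip) in product(table1, table2):
        examined += 1
        if ip_to_int(startip) < ip_to_int(ip) < ip_to_int(endip):
            joined.append((userid, ip, date, startip, endip))
    return joined, examined


# Hypothetical sample data, not taken from the thread.
table1 = [
    ("u1", "10.0.0.5", "2009-04-07"),
    ("u2", "192.168.1.20", "2009-04-07"),
    ("u3", "9.255.255.1", "2009-04-08"),
]
table2 = [
    ("10.0.0.0", "10.0.0.255"),
    ("192.168.0.0", "192.168.255.255"),
]

joined, examined = cross_then_filter(table1, table2)
print(examined)     # 3 * 2 = 6 pairs examined
print(len(joined))  # 2 rows survive the filter
```

The number of pairs examined is the product of the two table sizes, regardless of how few rows pass the filter, which is why this plan slows down sharply on large inputs.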

Alan.
On Apr 7, 2009, at 8:01 AM, venkata ramanaiah anneboina wrote:


Hi,
I want to do an operation in Pig.
I have two tables of data:
Table1 contains userid, ipaddress, date
Table2 contains startip, endip

I want the data for the following query:
Table1.ipaddress > Table2.startip and
Table1.ipaddress < Table2.endip

How do I write this join or cogroup in a Pig Latin script?

Can anyone help with this?


thanks
ramana


  • Yiping Han at Apr 8, 2009 at 6:12 pm
    If such a join could be done as a map-side join, it can be done efficiently.

    Actually, I was discussing this with Nathan from Hadoop Table
    yesterday. For those of us who will go with Hadoop Table, such early
    filtering should be pushed down to the Table layer.

    In general, for a query like:

    SELECT f1, f2, ..., fn FROM T1, T2, ..., Tn WHERE T1.f1 = T2.f2 AND T3.f3 = T4.f4

    in a map-side join we should first join the rows of those tables that have a
    filter function (here T1 with T2 and T3 with T4) and filter, then join
    with the rest of the tables (with projection before filtering for each table, of course).



    --Yiping

    ----------------------
    Yiping Han
    2MC 8127
    2811 Mission College Blvd.,
    Santa Clara, CA 95054
    (408)349-4403
    yhan@yahoo-inc.com
  • Alan Gates at Apr 8, 2009 at 9:59 pm
    The issue here is that the odds that all of the keys you want are in
    one map are very low, unless you use only one map for the file. Since
    you aren't doing an equijoin, partitioning your files the same way
    won't guarantee that all the keys you need for the join are in the
    same map.
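
A small Python sketch of why identical partitioning does not help a range predicate. It assumes, purely for illustration, that both files are split by `key % num_partitions` on integer-encoded addresses; the `partition` function and the sample values are made up and do not describe how Pig or Hadoop actually partition data.

```python
def partition(key, num_partitions):
    """Toy partitioner: assign an integer key to a partition by modulo."""
    return key % num_partitions


NUM_PARTITIONS = 4

# Equijoin case: two rows with the same key always share a partition.
same_key_together = partition(1005, NUM_PARTITIONS) == partition(1005, NUM_PARTITIONS)

# Range case: the address 1005 lies inside the range (1000, 1010),
# but the range row is partitioned by its own key, the start value 1000.
ip = 1005
startip, endip = 1000, 1010
ip_partition = partition(ip, NUM_PARTITIONS)
range_partition = partition(startip, NUM_PARTITIONS)

print(same_key_together)              # True
print(ip_partition, range_partition)  # 1 0
```

The matching rows end up in different partitions even though both inputs use the same partitioning function, so a mapper that sees only one partition cannot produce the match.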

    Alan.
  • Yiping Han at Apr 8, 2009 at 10:15 pm
    It does not require that all the keys be in one map; they could be in
    multiple maps. But the join key has to be the key that is used to
    partition the shards.

    I agree this does not solve all WHERE clauses. But it does work for many
    cases, like filtering people of a certain age, or limiting to URLs that
    have a certain number of inlinks, etc.

    As long as such filtering removes a significant portion of the rows, it
    should benefit performance quite a lot.
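
The idea of filtering rows before they reach the join can be sketched in Python as follows. This is a minimal model under stated assumptions, not Hadoop Table or Pig code: the `filtered_join` helper, the `people` and `visits` tables, and the age threshold are all hypothetical, and `userid` is assumed to be unique in `people`.

```python
def filtered_join(people, visits, min_age):
    """Equijoin on userid, applying the per-row age filter before joining.

    people rows: (userid, age); visits rows: (userid, url).
    Returns (joined rows, number of people rows that reached the join).
    """
    # Per-row filter first: rows that fail never reach the join.
    kept = {userid: age for userid, age in people if age >= min_age}
    joined = [
        (userid, kept[userid], url)
        for userid, url in visits
        if userid in kept
    ]
    return joined, len(kept)


# Hypothetical sample data.
people = [("u1", 17), ("u2", 34), ("u3", 52), ("u4", 12)]
visits = [
    ("u1", "a.example"),
    ("u2", "b.example"),
    ("u3", "c.example"),
    ("u2", "d.example"),
]

joined, reached_join = filtered_join(people, visits, min_age=18)
print(reached_join)  # 2 of the 4 people rows reach the join
print(len(joined))   # 3 joined rows
```

Because the filter looks at one row at a time, it can run wherever the row is read, before any data is moved for the join.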


    --Yiping

  • Yiping Han at Apr 8, 2009 at 10:17 pm
    Sorry, I misunderstood "one map". Yes, the keys are required to be in one
    map (or one shard).

    As long as it is a map-side join, all tables join on the same key, and the
    filtering is on a per-row basis, it should be easy to implement.


    --Yiping
  • Chris Olston at Apr 8, 2009 at 11:12 pm
    There are a bunch of papers on doing "band joins" efficiently, which I
    believe is what we're seeing an instance of. And like most join techniques
    they are probably easy to parallelize.
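
As a rough illustration of how sorting can avoid the full cross product for this kind of range predicate, here is a single-machine Python sketch. It is one simple sort-and-binary-search approach and is not claimed to be the method of any particular band-join paper; the function name `sorted_range_join` and the integer sample data are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def sorted_range_join(points, ranges):
    """Return (point, start, end) for every point with start < point < end.

    Sorts the points once, then uses binary search to locate, for each
    range, the contiguous slice of points strictly inside it.
    """
    ordered = sorted(points)
    result = []
    for start, end in ranges:
        lo = bisect_right(ordered, start)  # first index with point > start
        hi = bisect_left(ordered, end)     # first index with point >= end
        for point in ordered[lo:hi]:
            result.append((point, start, end))
    return result


# Hypothetical integer-encoded addresses and ranges.
points = [31, 5, 20, 12, 20]
ranges = [(10, 25), (0, 5), (19, 21)]

matches = sorted_range_join(points, ranges)
print(len(matches))  # 5 matching (point, start, end) rows
```

Each range costs two binary searches plus the size of its own output, instead of a comparison against every point.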

    If this turns out to be a frequent usage pattern for pig applications we can
    consider doing something special for them. If it's rare, I'm afraid users
    will have to make do with cross-product. (The good news is that Pig uses an
    NxM parallel cross-product implementation, so with enough machines/threads
    the data should become sufficiently chopped so that each one only needs to
    deal with a small amount of data. Not sure how the current Pig code chooses
    N and M though; I don't think we ever did much tuning on this.)
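
One way to picture an NxM parallel cross product is as a grid of independent tasks. The Python sketch below is an assumption about the general scheme, not a description of Pig's actual implementation: the helpers `split` and `grid_cross`, the round-robin chunking, and the sample inputs are all invented for illustration.

```python
from itertools import product


def split(rows, parts):
    """Deal rows round-robin into `parts` chunks."""
    return [rows[i::parts] for i in range(parts)]


def grid_cross(left, right, n, m):
    """Build the n*m independent tasks of a grid-partitioned cross product.

    Each task is the list of pairs from one (left chunk, right chunk) cell.
    """
    left_chunks = split(left, n)
    right_chunks = split(right, m)
    return [list(product(lc, rc)) for lc in left_chunks for rc in right_chunks]


left = list(range(6))
right = list("abcd")
tasks = grid_cross(left, right, n=3, m=2)

print(len(tasks))               # 6 tasks
print([len(t) for t in tasks])  # 4 pairs in each task
```

Every pair appears in exactly one task, so the tasks can run on separate workers and each worker handles only its own cell of the grid.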

    -Chris

    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research

Discussion Overview
group: user @
categories: pig, hadoop
posted: Apr 8, '09 at 3:51p
active: Apr 8, '09 at 11:12p
posts: 6
users: 3
website: pig.apache.org
