Grokbase Groups Pig user July 2011
FAQ
Hi all

I have 2 CSV files a shown below:

*File 1: File2:
col1 col2 col1 col2 col3 col4
1234 2 1000 1999
2222 3 2000 2999
3333 5 3000 3999
4444 6 4000 4999*

Now I need to JOIN these 2 files in such a way that:

File1-col1 should lie in between File2-col1 and File2-col2

Can I use JOIN / COGROUP or any other existing operators?

or shud I build a new UDF?

thanks
Narayan.

Search Discussions

  • Jacob Perkins at Jul 15, 2011 at 12:23 pm
    Assuming col1 is numeric, as you've indicated, couldn't you simply
    generate a new column in file 1 by rounding to the nearest 1000? Then
    file 1 would look like:

    *File 1:
    col1 col2 join_key
    1234 2 1000
    2222 3 2000
    3333 5 3000
    4444 6 4000

    Then you could just join by the new key from file 1 and col2 from file
    2.

    This works even if your ranges are smaller, just round to whatever makes
    sense. Eg, nearest 10. What this does not work for is if your ranges are
    variable. Are your ranges variable? :)

    --jacob
    @thedatachef
    On Fri, 2011-07-15 at 01:23 -0700, Lakshminarayana Motamarri wrote:
    Hi all

    I have 2 CSV files a shown below:

    *File 1: File2:
    col1 col2 col1 col2 col3 col4
    1234 2 1000 1999
    2222 3 2000 2999
    3333 5 3000 3999
    4444 6 4000 4999*

    Now I need to JOIN these 2 files in such a way that:

    File1-col1 should lie in between File2-col1 and File2-col2

    Can I use JOIN / COGROUP or any other existing operators?

    or shud I build a new UDF?

    thanks
    Narayan.
  • Lakshminarayana Motamarri at Jul 15, 2011 at 8:13 pm
    Hi Jacob,

    thanks for the response and suggestion. But as u mentioned the range is NOT
    small and NON linear...

    file 2:
    col1 col2 50331648 67108863
    67108864 67109119 67109120 67109631 67109632 67110399 67110400
    67110655
    ....

    so as mentioned, the range is not small and there could be more than one
    match from file 1, which like in between these...

    suggestions? ideas? please...

    thanks
    Narayan.
    On Fri, Jul 15, 2011 at 5:23 AM, Jacob Perkins wrote:

    Assuming col1 is numeric, as you've indicated, couldn't you simply
    generate a new column in file 1 by rounding to the nearest 1000? Then
    file 1 would look like:

    *File 1:
    col1 col2 join_key
    1234 2 1000
    2222 3 2000
    3333 5 3000
    4444 6 4000

    Then you could just join by the new key from file 1 and col2 from file
    2.

    This works even if your ranges are smaller, just round to whatever makes
    sense. Eg, nearest 10. What this does not work for is if your ranges are
    variable. Are your ranges variable? :)

    --jacob
    @thedatachef
    On Fri, 2011-07-15 at 01:23 -0700, Lakshminarayana Motamarri wrote:
    Hi all

    I have 2 CSV files a shown below:

    *File 1: File2:
    col1 col2 col1 col2 col3 col4
    1234 2 1000 1999
    2222 3 2000 2999
    3333 5 3000 3999
    4444 6 4000 4999*

    Now I need to JOIN these 2 files in such a way that:

    File1-col1 should lie in between File2-col1 and File2-col2

    Can I use JOIN / COGROUP or any other existing operators?

    or shud I build a new UDF?

    thanks
    Narayan.

    --
    Thanking you,
    ---
    Thanks & Regards,
    Mr. Lakshminarayana. Motamarri, {Call-out Name: Narayan}
    Graduate Student (M.S in Computer Science),
    Arizona State University,
    School of Computing and Informatics,
    Mobile: (480) 280 0179.
    http://www.public.asu.edu/~lmotamar/
  • Xiaomeng Wan at Jul 15, 2011 at 8:56 pm
    you can always use a cross followed by filter

    x = cross file1, file2;
    y = filter x by file1::col1>file2::col1 and file1::col1<file2::col2;

    Shawn

    On Fri, Jul 15, 2011 at 2:12 PM, Lakshminarayana Motamarri
    wrote:
    Hi Jacob,

    thanks for the response and suggestion. But as u mentioned the range is NOT
    small and NON linear...

    file 2:
    col1               col2            50331648    67108863
    67108864 67109119  67109120 67109631  67109632 67110399  67110400
    67110655
    ....

    so as mentioned, the range is not small and there could be more than one
    match from file 1, which like in between these...

    suggestions? ideas? please...

    thanks
    Narayan.
    On Fri, Jul 15, 2011 at 5:23 AM, Jacob Perkins wrote:

    Assuming col1 is numeric, as you've indicated, couldn't you simply
    generate a new column in file 1 by rounding to the nearest 1000? Then
    file 1 would look like:

    *File 1:
    col1  col2 join_key
    1234  2    1000
    2222  3    2000
    3333  5    3000
    4444  6    4000

    Then you could just join by the new key from file 1 and col2 from file
    2.

    This works even if your ranges are smaller, just round to whatever makes
    sense. Eg, nearest 10. What this does not work for is if your ranges are
    variable. Are your ranges variable? :)

    --jacob
    @thedatachef
    On Fri, 2011-07-15 at 01:23 -0700, Lakshminarayana Motamarri wrote:
    Hi all

    I have 2 CSV files a shown below:

    *File 1:                     File2:
    col1  col2             col1    col2   col3   col4
    1234    2                1000   1999
    2222    3                2000   2999
    3333    5                3000   3999
    4444    6                4000   4999*

    Now I need to JOIN these 2 files in such a way that:

    File1-col1 should lie in between File2-col1 and File2-col2

    Can I use JOIN / COGROUP or any other existing operators?

    or shud I build a new UDF?

    thanks
    Narayan.

    --
    Thanking you,
    ---
    Thanks & Regards,
    Mr. Lakshminarayana. Motamarri, {Call-out Name: Narayan}
    Graduate Student (M.S in Computer Science),
    Arizona State University,
    School of Computing and Informatics,
    Mobile: (480) 280 0179.
    http://www.public.asu.edu/~lmotamar/
  • Thejas Nair at Jul 17, 2011 at 7:55 pm
    Pig supports only equi join. So you will need to use cross+ filter as shawn
    said. But if either file is small enough to fit into memory, you can use
    simulate cross using replicated join, using same constant on.both sides as
    the join key
    On Jul 15, 2011 1:13 PM, "Lakshminarayana Motamarri" wrote:

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedJul 15, '11 at 8:23a
activeJul 17, '11 at 7:55p
posts5
users4
websitepig.apache.org

People

Translate

site design / logo © 2021 Grokbase