Grokbase Groups Pig user August 2012
Hello!

Considering the following two relations...

grunt> querys = load 'query' as (id:int, token:chararray);
grunt> dump querys;
(11,foo)
(12,bar)
(13,frog)

and

grunt> documents = load 'document' as (id:int, text:chararray);
grunt> dump documents;
(21,foo bar frog)
(22,hello frog)

Is it possible to do a join where the query:token is not equal to, but
contained in, documents:text?

e.g.
(11,foo,21,foo bar frog)
(12,bar,21,foo bar frog)
(13,frog,21,foo bar frog)
(13,frog,22,hello frog)

I can certainly do this in Java map/reduce (as we all had to in the
dark days before Pig), but is there a way to hack this together
with a custom UDF or some other weird join backdoor (a custom
partitioner for a group, or something wacky)?

It's been a long day, maybe I'm just missing something super obvious.

Cheers!
Mat


  • Russell Jurney at Aug 30, 2012 at 12:05 am
    Join on a dummy key or CROSS, then plug the token into a UDF.

    Russell Jurney
    twitter.com/rjurney
    russell.jurney@gmail.com
    datasyndrome.com
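
A minimal sketch of the CROSS-then-filter idea above, written in Python rather than Pig Latin so it runs standalone. The function names `token_in_text` and `cross_then_filter` are hypothetical stand-ins for the UDF and the Pig statements, and "contained in" is assumed to mean "is one of the whitespace-separated words of the text".

```python
from itertools import product

# Sample rows copied from the original post.
querys = [(11, "foo"), (12, "bar"), (13, "frog")]
documents = [(21, "foo bar frog"), (22, "hello frog")]


def token_in_text(token, text):
    """Stand-in for the UDF: true when token is a whole word of text."""
    return token in text.split()


def cross_then_filter(querys, documents):
    # CROSS pairs every query row with every document row,
    # so the intermediate size is len(querys) * len(documents).
    pairs = product(querys, documents)
    return [
        (qid, token, did, text)
        for (qid, token), (did, text) in pairs
        if token_in_text(token, text)
    ]


result = cross_then_filter(querys, documents)
for row in result:
    print(row)
```

The quadratic intermediate result is the cost of this approach, so it probably only suits cases where at least one side is small.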
  • Jonathan Coveney at Aug 30, 2012 at 12:06 am
    You're not missing anything obvious... what you're trying to do, at face
    value, is not an easy thing to do. In M/R, joining is done based on
    partitioning to the same reducer...how can you do that if you have a case

    foo
    bar

    foo bar

    and foo is sent to reducer 1, bar to reducer 2? There's no way to know
    where keys should be sent.

    That said, there are options.

    Option 1: a cross. Undesirable because of data explosion.
    Option 2: If one of the data sets is small enough to fit in memory, you can
    make a UDF that brings it in, and does the join for you. This is
    essentially option 1.
    Option 3: Less generically, exploit the join you're actually doing. In the
    dummy example, it looks like you're checking if a token is contained in
    another string. You could convert this into a join by tokenizing,
    flattening, doing the join, etc. I don't know how close your real use case
    is to what you posted.

    Jon
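
Option 3 can be sketched as follows, again in standalone Python rather than Pig Latin. In Pig the same shape would presumably use the built-in TOKENIZE with FLATTEN followed by an ordinary JOIN ... BY; the helper names `tokenize_and_flatten` and `equi_join` here are hypothetical, and whitespace tokenisation is an assumption.

```python
# Sample rows copied from the original post.
querys = [(11, "foo"), (12, "bar"), (13, "frog")]
documents = [(21, "foo bar frog"), (22, "hello frog")]


def tokenize_and_flatten(documents):
    """Emit one (token, doc_id, text) row per distinct word in each document."""
    rows = []
    for doc_id, text in documents:
        # set() avoids duplicate output rows when a word repeats in a document.
        for token in sorted(set(text.split())):
            rows.append((token, doc_id, text))
    return rows


def equi_join(querys, doc_tokens):
    """Plain equality join on the token, which can be partitioned by key."""
    by_token = {}
    for token, doc_id, text in doc_tokens:
        by_token.setdefault(token, []).append((doc_id, text))
    return [
        (qid, token, doc_id, text)
        for qid, token in querys
        for doc_id, text in by_token.get(token, [])
    ]


result = equi_join(querys, tokenize_and_flatten(documents))
for row in result:
    print(row)
```

Because the join key is now an exact token, every matching pair hashes to the same reducer, which is what the containment test could not guarantee.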


  • Mat Kelcey at Aug 30, 2012 at 12:14 am
    Unfortunately neither side is small enough to support either a cross or a
    replicated-join, in-memory approach.

    But option 3 does make sense, I think I'm overthinking things. I can use a
    UDF to do the equivalent of tokenisation and do, like you say, just a join.

    In terms of the multiple joins, I can just do all three, count the matches,
    and only allow the cases where all three match.

    Thanks!
    Mat
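
A rough Python sketch of the "do all three joins, count the matches" idea, applied to the three-token variant Mat describes in his 12:08 am post in this thread (token1 and token2 in text1, token3 in text2). The sample rows, field layout, and helper names are all hypothetical; only the matching rule comes from the thread.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (id, token1, token2, token3) and (id, text1, text2).
querys = [(1, "foo", "bar", "frog"), (2, "foo", "hello", "frog")]
documents = [(21, "foo bar", "green frog"), (22, "foo bar", "toad")]


def index_tokens(documents, field):
    """Map each word of the given text field to the set of document ids."""
    index = defaultdict(set)
    for row in documents:
        for token in row[field].split():
            index[token].add(row[0])
    return index


def join_on(querys, q_field, index):
    """Equality join of one query token field against one token index."""
    return [
        (row[0], doc_id)
        for row in querys
        for doc_id in index.get(row[q_field], ())
    ]


def match_all_three(querys, documents):
    text1 = index_tokens(documents, 1)
    text2 = index_tokens(documents, 2)
    # Each join emits at most one hit per (query id, document id) pair.
    hits = (
        join_on(querys, 1, text1)
        + join_on(querys, 2, text1)
        + join_on(querys, 3, text2)
    )
    counts = Counter(hits)
    # Keep only the pairs that were hit by all three joins.
    return sorted(pair for pair, n in counts.items() if n == 3)


matches = match_all_three(querys, documents)
print(matches)
```

The count only works because each individual join contributes at most one row per pair, which is why the token index stores document ids in a set.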
  • Mat Kelcey at Aug 30, 2012 at 12:29 am
    Actually, given the nature of my Query data, I might just pack a few Bloom
    filters and stream Document through a UDF. I've got plenty of data and can
    guard against mistakes downstream.
    It's wonderful what leaving the office and getting on the bus does for your
    thought process...
    Mat
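
A small Python sketch of the Bloom filter idea, under the assumption that the filter is built from the query tokens and each document is kept if any of its words might be in the filter. The `BloomFilter` class, its sizes, and the third sample document are illustrative choices, not anything from the thread. Note that this narrows the documents rather than joining them to queries, and that a Bloom filter can report false positives but never false negatives, which fits the remark about guarding against mistakes downstream.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter; sizes are arbitrary assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def build_filter(tokens):
    bloom = BloomFilter()
    for token in tokens:
        bloom.add(token)
    return bloom


def candidate_documents(documents, bloom):
    """Keep documents with at least one word that may be a query token."""
    return [
        (doc_id, text)
        for doc_id, text in documents
        if any(word in bloom for word in text.split())
    ]


querys = [(11, "foo"), (12, "bar"), (13, "frog")]
documents = [(21, "foo bar frog"), (22, "hello frog"), (23, "nothing here")]

bloom = build_filter(token for _, token in querys)
candidates = candidate_documents(documents, bloom)
print(candidates)
```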
  • Mat Kelcey at Aug 30, 2012 at 12:49 am
    And I just realised this last statement makes no sense in the context
    of my original contrived example (I originally asked about a join, not
    a filter).
    Don't mind me! :)
  • Mat Kelcey at Aug 30, 2012 at 12:08 am
    For the sake of discussion I actually simplified things but perhaps in a
    critical way...

    Query actually has 3 token fields and Document has 2 text fields, and I
    really require token1 to be in text1, token2 to also be in text1, and token3
    to be in text2. (Damn bizarre NLP)

    These additional complexities might change things...

Discussion Overview
group: user @
categories: pig, hadoop
posted: Aug 29, '12 at 11:56p
active: Aug 30, '12 at 12:49a
posts: 7
users: 3
website: pig.apache.org
