Grokbase Groups Pig user July 2011
I'm trying to join together several different sources of synonyms using Pig.
For example:

A = LOAD '/tmp/synonyms.txt' USING PigStorage() AS (id:chararray, label:chararray);
DUMP A;
(12,synonym1)
(12,alternative_name)
(45,synonym1 full name and description)
(45,synonym1)
(45,synonym1_expanded)
(78,alternative_name)
(67,synonym1)

I've managed to group things together by the label...

C = GROUP A BY label;
DUMP C;
(synonym1,{(12,synonym1),(45,synonym1),(67,synonym1)})
(alternative_name,{(12,alternative_name),(78,alternative_name)})
(synonym1_expanded,{(45,synonym1_expanded)})
(synonym1 full name and description,{(45,synonym1 full name and description)})

And then flatten them out a little bit:

D = FOREACH C GENERATE $0, $1.id;
DUMP D;
(synonym1,{(12),(45),(67)})
(alternative_name,{(12),(78)})
(synonym1_expanded,{(45)})
(synonym1 full name and description,{(45)})


If you look closely at the data, it turns out that this example test data
set is really all the same - the synonyms all overlap. The final output I'd
like to get to is something like this (the arbitrary_id could be anything, I
really just need a set of the overlapping IDs):

(arbitrary_id, {12, 45, 67, 78})

How can I join on the bag of IDs in 'D' to find other labels that have at
least one of the same IDs? Or am I approaching this the wrong way?

Thanks,

Mike


  • John Conwell at Jul 13, 2011 at 3:13 pm
    If I understand you correctly, what you want in the end is a bag with all
    distinct ids from the original dataset, regardless of the row label. The
    following will get you that (if that's what you're looking for). Note that in
    the LOAD statement, I specified a comma as the delimiter.

    a = LOAD 'synonyms.txt' USING PigStorage(',') AS (id:chararray, label:chararray);

    b = FOREACH a GENERATE id;

    c = GROUP b BY id;

    d = FOREACH c GENERATE group;

    e = GROUP d ALL;

    DUMP e;

    (all,{(12),(45),(67),(78)})





    --

    Thanks,
    John C
  • Mike Hugo at Jul 13, 2011 at 3:35 pm
    Thanks so much for the input, John! That's not quite what I'm looking for -
    I realize now that my example is incomplete. There may be different
    sets of synonyms in the input file. For example:

    12 synonym1
    12 alternative_name
    45 synonym1 full name and description
    45 synonym1
    45 synonym1_expanded
    78 alternative_name
    67 synonym1
    34 synonym2
    34 synonym2_expanded
    56 synonym2
    89 synonym2_expanded

    Then the desired output would be:

    (arbitrary_id_1, {12, 45, 67, 78})
    (arbitrary_id_2, {34, 56, 89})

    (34 has a synonym that matches 56, and 34 has a synonym that matches 89,
    therefore the set of IDs for synonym2 is 34, 56, 89)

    The arbitrary ID could be a row label, but it doesn't really matter; what
    I'm really interested in is the bag of ids.

    Mike
  • Jonathan Coveney at Jul 13, 2011 at 4:01 pm
    I would group on the label column, and then just take the distinct values in
    the id column. You may need to write a UDF or just do some processing to turn
    synonym2_expanded into synonym2, but it sounds like that's what you want to
    do. I guess I'm not sure how alternative_name fits into this?
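
    A minimal sketch of the grouping step described above, written in plain
    Python rather than Pig so it can be run locally. The names PAIRS and
    ids_by_label are assumptions made for illustration; the rows are the
    11-row sample posted earlier in the thread.

```python
from collections import defaultdict

# (id, label) rows copied from the sample data in the thread.
PAIRS = [
    ("12", "synonym1"),
    ("12", "alternative_name"),
    ("45", "synonym1 full name and description"),
    ("45", "synonym1"),
    ("45", "synonym1_expanded"),
    ("78", "alternative_name"),
    ("67", "synonym1"),
    ("34", "synonym2"),
    ("34", "synonym2_expanded"),
    ("56", "synonym2"),
    ("89", "synonym2_expanded"),
]


def ids_by_label(pairs):
    """Group rows by label and keep the distinct ids seen under each label."""
    groups = defaultdict(set)
    for id_, label in pairs:
        groups[label].add(id_)
    return dict(groups)


if __name__ == "__main__":
    for label, ids in sorted(ids_by_label(PAIRS).items()):
        print(label, sorted(ids))
```

    On that sample, synonym1 maps to {12, 45, 67} and alternative_name maps to
    {12, 78}, so label groups that share an id would still have to be merged
    in a later step.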

  • Mike Hugo at Jul 13, 2011 at 4:11 pm
    Great, thanks John! I think I'm down the right path then.

    To answer your final question about the alternative name - basically you can
    consider each id as a distinct datasource of synonyms. I'm trying to join
    them all together in a single repository. Looking at the example again,

    12 synonym1
    12 alternative_name
    45 synonym1 full name and description
    45 synonym1
    45 synonym1_expanded
    78 alternative_name
    67 synonym1
    34 synonym2
    34 synonym2_expanded
    56 synonym2
    89 synonym2_expanded

    12 has two "labels": synonym1 and alternative_name. synonym1 is found in
    45, 12, and 67, so we now know 45, 12, and 67 are the same thing.
    alternative_name is found in 12 and 78, so we now know that 12 and 78 are
    the same thing. 12 is found in both the first set (45, 12, and 67) and the
    second set (12, 78), so we now know those two sets are the same thing,
    resulting in the desired output of (12, 45, 67, 78). The same logic can be
    applied to the next set of data: synonym2 is found in 34 and 56, so they
    are the same thing. synonym2_expanded is found in 34 and 89, so they are
    the same thing. 34 is found in both sets, so the final output for that
    chunk of data is (34, 56, 89).
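
    The merging logic above amounts to finding connected groups of ids, where
    two ids are linked whenever they share a label. One possible way to
    express that is a union-find structure, sketched here in plain Python
    rather than Pig (it could serve as a starting point for UDF logic, but
    that is an assumption, not something tested against Pig). The names PAIRS
    and merge_ids are hypothetical.

```python
from collections import defaultdict

# (id, label) rows copied from the sample data in the thread.
PAIRS = [
    ("12", "synonym1"),
    ("12", "alternative_name"),
    ("45", "synonym1 full name and description"),
    ("45", "synonym1"),
    ("45", "synonym1_expanded"),
    ("78", "alternative_name"),
    ("67", "synonym1"),
    ("34", "synonym2"),
    ("34", "synonym2_expanded"),
    ("56", "synonym2"),
    ("89", "synonym2_expanded"),
]


def merge_ids(pairs):
    """Return sorted lists of ids linked, directly or transitively, by a shared label."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_id_for_label = {}
    for id_, label in pairs:
        find(id_)  # make sure ids with only unique labels still appear
        if label in first_id_for_label:
            # Same label seen before: the two ids belong to the same set.
            union(first_id_for_label[label], id_)
        else:
            first_id_for_label[label] = id_

    components = defaultdict(set)
    for id_ in parent:
        components[find(id_)].add(id_)
    return sorted(sorted(ids) for ids in components.values())


if __name__ == "__main__":
    for ids in merge_ids(PAIRS):
        print(ids)
```

    This keeps everything in memory, so it only illustrates the logic; a very
    large input would need a distributed or iterative approach instead.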

    Thanks for the help, I'll keep playing around with this and take a look at
    building a UDF.

    Mike

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jul 12, '11 at 7:46p
active: Jul 13, '11 at 4:11p
posts: 5
users: 3
website: pig.apache.org
