I'm writing a document similarity script. I created an inverted index
and am trying to create a pairwise comparison list of docs that share
words.

I have reduced my data to:
({(233),(534)})
({(21),(233),(534)})
({(21),(534)})

Each row contains doc_ids with words in common.

I want to cross join within each row only (not across the whole data set):
(233),(534)
(21),(233)
(21),(534)
(233),(534)
(21),(534)

and then just make a unique list out of this.

Any ideas how to get to the cross joined set?


thanks,
Tommy

  • Mridul Muralidharan at Sep 4, 2009 at 2:57 am
    If I understand correctly, each tuple has bag1 and bag2, each holding
    the set of word ids that form that document bag?

    Then you could just flatten both to do the cross, no?

    A = load 'file' using BinStorage() AS (b1:{t1:(w1:long)},
    b2:{t2:(w2:long)});
    B = FOREACH A generate FLATTEN(b1), FLATTEN(b2);



    Or were you asking how to generate A, given A0 = load b1, A1 =
    load b2?

    Regards,
    Mridul




  • Tommy Chheng at Sep 4, 2009 at 3:20 am
    Hi Mridul,
    The data is already in Pig as a set of tuples. Each row's bag can hold
    more than 2 doc ids, as in the second row: {(21),(233),(534)}
    dump docs_with_shared_words; <= starting input
    ({(233),(534)})
    ({(21),(233),(534)})
    ({(21),(534)})

    If I flatten, it gives me a list of all the doc ids separately and
    loses the relationship between them.
    grunt> c = foreach docs_with_shared_words generate FLATTEN($0);
    grunt> dump c;
    (233)
    (534)
    (21)
    (233)
    (534)
    (21)
    (534)

    I want the output to look like the set below, but I'm not sure how to
    get there:
    (233),(534)
    (21),(233)
    (21), (534)
    (233), (534)
    (21),(534)
    thanks,
    tommy
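To make the problem concrete, here is a minimal Python sketch of what the single FLATTEN in the dump above does. It is only a model of the behavior, not Pig code: each bag is represented as a plain list, and the names `rows` and `flatten_once` are made up for illustration.

```python
# Each row of docs_with_shared_words holds one bag of doc ids
# (sample data copied from the post above).
rows = [
    [233, 534],
    [21, 233, 534],
    [21, 534],
]


def flatten_once(rows):
    """Model of FOREACH ... GENERATE FLATTEN($0).

    Every element of every bag becomes its own one-field output tuple,
    so the information about which ids shared a row is gone.
    """
    return [(doc_id,) for bag in rows for doc_id in bag]


flat = flatten_once(rows)
for t in flat:
    print(t)
```

The seven one-field tuples come out in the same order as the `dump c` listing, and nothing in them records which ids came from the same row, which is the relationship the pairwise list needs.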

  • Nikhil Gupta at Sep 4, 2009 at 3:36 am
    One way to do this, using a hack-ish cross product, would be:
    -- A is your relation, with a single field A_bag holding the bag of doc ids
    RESULT_1 = foreach A generate
        FLATTEN(A_bag) as part_1,
        FLATTEN(A_bag) as part_2;

    So if your input was:
    A = ({(21),(534)})

    This will give you
    RESULT_1 with
    21, 534
    21, 21
    534, 21
    534, 534

    You can then filter out the pairs where both doc ids are the same:
    RESULT_2 = filter RESULT_1 BY part_1 != part_2;

    RESULT_2 will be
    21, 534
    534, 21

    Hope that helps!

    -Nikhil
    http://stanford.edu/~nikgupta
  • Mridul Muralidharan at Sep 4, 2009 at 3:44 am
    Hi,


    I misread your schema: I assumed it was bag1, bag2, and not a single
    bag with both document ids in it!

    You could do:

    A = LOAD 'input';
    B0 = FOREACH A GENERATE FLATTEN($0), FLATTEN($0);
    B1 = FILTER B0 BY $0 != $1;
    B = DISTINCT B1;


    if I understood the problem correctly.

    Hope this helps.
    Regards,
    Mridul
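As a rough check of what the four statements above produce on the sample rows from this thread, here is a small Python model. It is a sketch, not Pig: bags are plain lists, `itertools.product` stands in for the double FLATTEN, and the variable names only mirror the aliases in the script. The `a < b` variation at the end is an assumption added for illustration and is not something proposed in the thread.

```python
from itertools import product

# Sample rows from the thread: one bag of doc ids per row.
rows = [
    [233, 534],
    [21, 233, 534],
    [21, 534],
]

# B0: two FLATTENs of the same bag give the cross product within each row.
b0 = [pair for bag in rows for pair in product(bag, bag)]

# B1: FILTER B0 BY $0 != $1 drops the self-pairs such as (21, 21).
b1 = [(a, b) for (a, b) in b0 if a != b]

# B: DISTINCT B1 removes pairs repeated across rows
# (sorted here only to make the printout stable).
b = sorted(set(b1))
print(b)

# Variation (assumption): keeping only a < b retains one orientation per pair.
unordered = sorted({(a, b) for (a, b) in b0 if a < b})
print(unordered)
```

On this data `b` ends up with six ordered pairs, because both (21, 534) and (534, 21) pass the `!=` filter. `unordered` holds the three pairs from the original question once each. The analogous change in the script would presumably be to filter on `$0 < $1`, assuming the ids are loaded as a numeric type.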

  • Tommy Chheng at Sep 4, 2009 at 6:49 pm
    Great, thanks. This line was key for me:
    B0 = FOREACH A GENERATE FLATTEN($0), FLATTEN($0);

    Can you explain how this works? How does it generate pairwise
    groupings from the bags?

    -
    tommy
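On the question of how the double FLATTEN produces pairs: when one GENERATE contains several FLATTENs of bags, Pig emits the cross product of those bags for each input row. The Python sketch below models that as two nested loops over a single row. The function name `flatten_twice` is hypothetical and the list stands in for a Pig bag.

```python
def flatten_twice(bag):
    """Model of GENERATE FLATTEN($0), FLATTEN($0) for ONE input row.

    The first FLATTEN walks the bag once; for each of its elements the
    second FLATTEN walks the same bag again, so every element is paired
    with every element (including itself).
    """
    out = []
    for left in bag:       # first FLATTEN($0)
        for right in bag:  # second FLATTEN($0)
            out.append((left, right))
    return out


row = [21, 233, 534]  # the second sample row, {(21),(233),(534)}
pairs = flatten_twice(row)
for p in pairs:
    print(p)
```

Because the loops run over one row's bag at a time, ids from different rows are never paired with each other, which is the "cross join within each row" behavior asked for. A bag of n ids yields n * n tuples, so the self-pairs and mirrored pairs still have to be removed by the later FILTER and DISTINCT steps.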


Discussion Overview
group: user @
categories: pig, hadoop
posted: Sep 4, '09 at 1:48a
active: Sep 4, '09 at 6:49p
posts: 6
users: 3
website: pig.apache.org
