FAQ
I'm having trouble thinking about how to use Pig to do pairwise document
similarity.

If I have a huge list of word counts (using dummy names to make this easier to explain):
doc_id, word, count
doc1, testword1, doc1_testword1_count
doc1, testword2, doc1_testword2_count
doc2, testword1, doc2_testword1_count
doc2, testword2, doc2_testword2_count
doc3, anotherword, doc3_anotherword_count
...

I want to be able to compare doc1 to doc2 pairwise, like:
similarity_rating = doc1_testword1_count * doc2_testword1_count +
doc1_testword2_count * doc2_testword2_count

so the output can be a file like
doc1, doc2, similarity_rating
...

In MATLAB/Octave, I could just do a matrix multiply. In a scripting
language like Ruby, I could just do a for loop with the words stored in
a hash.
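The hash-based loop described above can be sketched in Python (an illustrative sketch, not from the thread; the concrete counts are made-up numbers standing in for the placeholder names above):

```python
# Illustrative sketch: the Ruby-style "for loop with a hash" in Python.
def similarity(doc_a, doc_b):
    """Sum of count products over the words both documents contain."""
    return sum(count * doc_b[word]
               for word, count in doc_a.items()
               if word in doc_b)

doc1 = {"testword1": 2, "testword2": 2}  # hypothetical counts
doc2 = {"testword1": 3, "testword2": 1}
print(similarity(doc1, doc2))  # 2*3 + 2*1 = 8
```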

I'm not sure how to approach this with Pig commands. Any ideas?

Thanks,
Tommy


  • Ted Dunning at Sep 9, 2009 at 8:35 pm
    Group the word count table against itself on the word.
    Aggregate the products of the counts.

    --
    Ted Dunning, CTO
    DeepDyve
  • Tommy Chheng at Sep 10, 2009 at 7:26 am
    consider this dataset:
    doc1 a 2
    doc1 b 2
    doc2 b 2
    doc2 c 1
    doc3 a 1
    doc3 b 1
    doc3 c 1
    doc4 a 1
    Group the word count table against itself on the word.
    (a,{(doc1,a,2),(doc3,a,1),(doc4,a,1)})
    (b,{(doc1,b,2),(doc2,b,2),(doc3,b,1)})
    (c,{(doc2,c,1),(doc3,c,1)})

OK, this is good: group by word to see which pairwise docs to run
against.
    Aggregate the products of the counts.
    This is the troubling part for me:
    In the first grouping:
    (a,{(doc1,a,2),(doc3,a,1),(doc4,a,1)})
    it needs to compare doc1's word counts with doc3's word counts, doc1
    with doc4 and doc3 with doc4.

I'm confused about how to select the right terms for the product.

In this statement, the $1.$2 needs to select the pairwise documents'
corresponding word counts.

    x = foreach grouped_docs generate $1.$0,SUM($1.$2 * $1.$2);


    Thanks for the help,
    tommy

  • Ted Dunning at Sep 10, 2009 at 7:37 am

    From here:
    (a,{(doc1,a,2),(doc3,a,1),(doc4,a,1)})
    (b,{(doc1,b,2),(doc2,b,2),(doc3,b,1)})
    (c,{(doc2,c,1),(doc3,c,1)})
    You should have this:
    From the first line:
    doc1, doc1, 4
    doc1, doc3, 2
    doc1, doc4, 2
    doc3, doc1, 2
    doc3, doc3, 1
    doc3, doc4, 1
    doc4, doc1, 2
    doc4, doc3, 1
    doc4, doc4, 1

From the second line:
doc1, doc1, 4
    doc1, doc2, 4
    doc1, doc3, 2
    doc2, doc1, 4
    doc2, doc2, 4
    doc2, doc3, 2

    ... and so, laboriously on ...

Group by the first two fields, add up the products, and you should be
done.
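The two-step plan above (group on word, emit a count product for every document pair in each group, then sum per pair) can be simulated in plain Python on the toy dataset from this exchange. This is an illustrative sketch, not Pig; in Pig one way to express it would be a self-join of the relation on word, a FOREACH generating the count product, then a GROUP on the two doc ids with SUM.

```python
from collections import defaultdict

# Toy dataset from the thread: (doc_id, word, count) rows.
rows = [("doc1", "a", 2), ("doc1", "b", 2), ("doc2", "b", 2),
        ("doc2", "c", 1), ("doc3", "a", 1), ("doc3", "b", 1),
        ("doc3", "c", 1), ("doc4", "a", 1)]

# Step 1: group postings by word (the GROUP ... BY word stage).
by_word = defaultdict(list)
for doc, word, count in rows:
    by_word[word].append((doc, count))

# Step 2: within each group, pair every posting with every posting and
# multiply the counts; accumulate per (doc_a, doc_b) key (the final
# group-and-SUM stage).
sim = defaultdict(int)
for postings in by_word.values():
    for doc_a, count_a in postings:
        for doc_b, count_b in postings:
            sim[(doc_a, doc_b)] += count_a * count_b

print(sim[("doc1", "doc3")])  # group a: 2*1, group b: 2*1 -> 4
```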

    --
    Ted Dunning, CTO
    DeepDyve
  • Christopher Olston at Sep 9, 2009 at 9:38 pm
    You probably want to use the "CROSS" command, which links up all pairs of
    records, e.g.:

    A = load '/mydocdata.txt';
    B = cross A, A;
    C = foreach B generate compute_similarity(*);

where compute_similarity() is a UDF (user-defined function) that
multiplies the counts, or whatever your similarity computation is.
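The CROSS-based plan can be sketched in Python (an illustrative sketch, not from the thread: dicts stand in for the Pig relations, and compute_similarity plays the role of the hypothetical UDF named above):

```python
from itertools import product

# Toy document vectors (word -> count), matching the dataset used
# elsewhere in the thread.
docs = {
    "doc1": {"a": 2, "b": 2},
    "doc2": {"b": 2, "c": 1},
    "doc3": {"a": 1, "b": 1, "c": 1},
    "doc4": {"a": 1},
}

def compute_similarity(vec_a, vec_b):
    """Dot product over shared words, as the UDF above would compute."""
    return sum(c * vec_b[w] for w, c in vec_a.items() if w in vec_b)

# B = cross A, A;  C = foreach B generate compute_similarity(*);
sim = {(a, b): compute_similarity(docs[a], docs[b])
       for a, b in product(docs, docs)}
print(sim[("doc1", "doc2")])  # shared word b: 2*2 = 4
```

Note this pairs every document with every document, including pairs that share no words, which is the cost concern raised in the replies below.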

    -Chris
    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research
  • Ted Dunning at Sep 9, 2009 at 9:44 pm
    I think not.

    Many documents will not share words (you should typically use a stop list to
    ensure that this is so).

    Crossing documents to documents is N^2 n where N is the number of documents
    and n the average number of words in each document. The number of
    cooccurrences is more like N n^2. Putting the superscript on the smaller
    number is a really good idea.
    --
    Ted Dunning, CTO
    DeepDyve
  • Ted Dunning at Sep 9, 2009 at 9:48 pm
    Post in haste, repent at leisure.

    The thrust of my comment is correct. The details are not.

    The cost of the correct solution is more like sum_w (DF(w)^2) where you sum
    over all words. If you use a stop list, you eliminate all words with large
    DF.
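On the toy dataset used earlier in the thread, this cost can be checked directly (an illustrative sketch, not from the thread): each word with document frequency DF(w) contributes DF(w)^2 count products under the group-on-word plan, so stop-listing high-DF words removes the largest terms.

```python
from collections import Counter

# (doc_id, word) postings from the toy dataset.
rows = [("doc1", "a"), ("doc1", "b"), ("doc2", "b"), ("doc2", "c"),
        ("doc3", "a"), ("doc3", "b"), ("doc3", "c"), ("doc4", "a")]

# Document frequency of each word: DF(a)=3, DF(b)=3, DF(c)=2.
df = Counter(word for _doc, word in rows)

# Total products emitted by the group-on-word plan: sum_w DF(w)^2.
cost = sum(n * n for n in df.values())
print(cost)  # 3^2 + 3^2 + 2^2 = 22 products
```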

    --
    Ted Dunning, CTO
    DeepDyve
  • Christopher Olston at Sep 9, 2009 at 10:49 pm
    Sorry, read too hastily. Usually when people talk about pairwise document
    similarity they mean pairs of documents, not pairs of words. Cross gives you
    pairs of documents.

    Looks like you got it figured out.

    Cheers,

    Chris

    --
    Christopher Olston, Ph.D.
    Sr. Research Scientist
    Yahoo! Research
  • Paolo D'alberto at Sep 9, 2009 at 10:19 pm
Interesting: how is matrix multiply used for the 1-1 comparison?
I knew you could use matrix multiply for all-pairs shortest paths (N^3), but all 1-1 comparisons should be N^2 ...
Would you mind sharing?

    Thank you
    Paolo


    -----Original Message-----
    From: Tommy Chheng
    Sent: Wed 9/9/2009 1:21 PM
    To: pig-user@hadoop.apache.org
    Subject: computing pairwise document similarity

  • Ted Dunning at Sep 10, 2009 at 1:09 am
    These asymptotics only apply if the values are dense. With document-term
    matrices, we have massive sparsity.

    This changes the asymptotic behavior dramatically with the result that doing
    the cross of documents against documents winds up massively slower than
    grouping on words.

    --
    Ted Dunning, CTO
    DeepDyve

Discussion Overview
group: user
categories: pig, hadoop
posted: Sep 9, '09 at 8:22p
active: Sep 10, '09 at 7:37a
posts: 10
users: 4
website: pig.apache.org
