Hi,

I am working on a problem to compare two different datasets and rank each
record of the first against the records of the second, in terms of how
similar they are. The records are dimensional, but do not have many
dimensions. Some of the fields will be compared for exact matches, some for
similar sound, some for closest match, etc. One of the datasets is large,
and the other is much smaller. The final goal is to compute a rank between
each record of the first dataset and each record of the second. The rank is
based on weighted scores of each dimension's comparison.

I was wondering if people in the community have any advice, suggested
patterns, or thoughts about cross joining two datasets in MapReduce. Do let
me know if you have any suggestions.

Thanks and Regards,
Sonal
Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>


  • Marcos Ortiz at Mar 7, 2011 at 7:54 pm

    Regards, Sonal. Can you give us more information about a basic workflow
    of your idea?

    Some questions:
    - How do you know that two records are identical? By ID?
    - Can you give an example of the ranking that you want to achieve for a
    match in each case:
      - two records that are identical
      - two records that are similar
      - two records with the closest match

    For MapReduce algorithm design, I recommend this excellent article from
    Ricky Ho:
    http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

    For the join of the two datasets, you can use Pig. Here is a basic Pig
    example from Milind Bhandarkar's (milindb@yahoo-inc.com) talk "Practical
    Problem Solving with Hadoop and Pig":

    Users = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Joined = join Filtered by name, Pages by user;
    Grouped = group Joined by url;
    Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted = order Summed by clicks desc;
    Top5 = limit Sorted 5;
    store Top5 into 'top5sites';
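
    (For the full cross join itself, Pig also offers a CROSS operator that
    produces the complete cross product of two relations, though by its
    nature it is expensive on a large input.)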


    --
    Marcos Luís Ortíz Valmaseda
    Software Engineer
    Centro de Tecnologías de Gestión de Datos (DATEC)
    Universidad de las Ciencias Informáticas
    http://uncubanitolinuxero.blogspot.com
    http://www.linkedin.com/in/marcosluis2186
  • Sonal Goyal at Mar 8, 2011 at 5:22 am
    Hi Marcos,

    Thanks for replying. I think I was not very clear in my last post. Let me
    describe my use case in detail.

    I have two datasets coming from different sources; let's call them dataset1
    and dataset2. Both contain records for entities, say Person. A single
    record looks like:

    First Name, Last Name, Street, City, State, Zip

    We want to compare each record of dataset1 with each record of dataset2, in
    effect a cross join.

    We know that, given the way the data is collected, names will not match
    exactly, but we want to find close-enough matches. So we have a rule which
    says: create bigrams and find the matching bigrams. If 0 to 5 match, give a
    score of 10; if 5 to 15 match, give a score of 20; and so on.
    For Zip, we have a rule saying that an exact match, or being within 5 km of
    each other (through a lookup), gives a score of 50, and so on.
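
    To make the name rule concrete, here is a minimal Java sketch (the class
    name, the normalization, and the band past 15 matches are illustrative
    assumptions, not a fixed design):

    BigramScore.java
    ----------------
    import java.util.HashSet;
    import java.util.Set;

    public class BigramScore {
        // Character bigrams of a normalized (lowercased, letters-only) string.
        static Set<String> bigrams(String s) {
            Set<String> grams = new HashSet<String>();
            String t = s.toLowerCase().replaceAll("[^a-z]", "");
            for (int i = 0; i + 1 < t.length(); i++) {
                grams.add(t.substring(i, i + 2));
            }
            return grams;
        }

        // Banded score on the number of matching bigrams, per the rule above;
        // the last band is a guess at the "and so on".
        static int nameScore(String a, String b) {
            Set<String> common = bigrams(a);
            common.retainAll(bigrams(b)); // intersection = matching bigrams
            int matches = common.size();
            if (matches <= 5) return 10;
            if (matches <= 15) return 20;
            return 30;
        }

        public static void main(String[] args) {
            // These two spellings share 7 bigrams, so this prints 20.
            System.out.println(nameScore("Jonathan Smith", "Jonathon Smyth"));
        }
    }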

    Once we have compared each person of dataset1 with those of dataset2, we
    find the overall rank, which is a weighted average of the scores of the
    name, address, etc. comparisons.

    One approach is to use the DistributedCache for the smaller dataset and do a
    nested loop join in the mapper. The second approach is to use multiple MR
    flows, comparing the fields and then reducing/collating the results.
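
    For concreteness, a minimal sketch of the first approach against the
    Hadoop 0.20-era API (the class name, record layout, and score() are
    placeholders for the weighted comparison described above):

    CrossJoinMapper.java
    --------------------
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CrossJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final List<String> smallSide = new ArrayList<String>();

        @Override
        protected void setup(Context context) throws IOException {
            // Load the whole small dataset once per mapper from the cache.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = in.readLine()) != null) {
                smallSide.add(line);
            }
            in.close();
        }

        @Override
        protected void map(LongWritable offset, Text bigRecord, Context context)
                throws IOException, InterruptedException {
            // Nested loop join: score this record of the large dataset
            // against every record of the small one.
            for (String small : smallSide) {
                double score = score(bigRecord.toString(), small);
                context.write(bigRecord, new Text(small + "\t" + score));
            }
        }

        // Placeholder for the weighted average of the per-field scores
        // (bigram overlap on names, zip proximity lookup, and so on).
        private double score(String big, String small) {
            return 0.0;
        }
    }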

    I am curious to know whether people have implemented other approaches, and
    what efficiencies they have built in.

    Thanks and Regards,
    Sonal
    Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
    Nube Technologies <http://www.nubetech.co>

    <http://in.linkedin.com/in/sonalgoyal>





  • Chase Bradford at Mar 8, 2011 at 6:09 pm
    How much smaller is the smaller dataset? If you can use the DistributedCache, precompute bigrams, locations, etc., and hold all the results in memory during setup before mapping over the large dataset, then I would suggest that approach.

    Another trick I've seen for similar problems, where the final score is a product of feature scores, is to cluster in a way that eliminates obvious zeros. For example, if distance > 50 km means a score of zero, then choose enough anchor coordinates to cover the map with overlapping circles of radius 25 km. Your mapper would then emit a (coord, record) pair for every anchor region the record falls in. That way, only records known to be similar in some way are compared.
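
    A hedged sketch of that idea, substituting a simple lat/lon grid for the anchor circles (the grid math, field layout, and names are assumptions): any two records within one 25 km cell width of each other always share at least one emitted key.

    AnchorMapper.java
    -----------------
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class AnchorMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final double CELL_KM = 25.0;        // anchor region size
        private static final double KM_PER_DEGREE = 111.0; // rough; ignores longitude shrink

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Assume each record starts with "lat,lon,...".
            String[] fields = record.toString().split(",", 3);
            long cellX = (long) Math.floor(Double.parseDouble(fields[0]) * KM_PER_DEGREE / CELL_KM);
            long cellY = (long) Math.floor(Double.parseDouble(fields[1]) * KM_PER_DEGREE / CELL_KM);
            // Emit under the record's own cell and its 8 neighbors, so nearby
            // records always meet in at least one reducer group, while
            // far-apart records (an obvious zero) never do.
            for (long dx = -1; dx <= 1; dx++) {
                for (long dy = -1; dy <= 1; dy++) {
                    context.write(new Text((cellX + dx) + ":" + (cellY + dy)), record);
                }
            }
        }
    }

    Pairs that meet in more than one cell get scored more than once, so a later pass (or the reducer itself) should deduplicate by record-pair key.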

  • Lance Norskog at Mar 10, 2011 at 6:39 am
    The Mahout project has several tools for this class of problem.
    http://mahout.apache.org

    --
    Lance Norskog
    goksron@gmail.com
  • Marcos Ortiz at Mar 8, 2011 at 6:12 pm

    Well, an approach to this problem is given by Milind Bhandarkar in his
    presentation "Practical Problem Solving with Hadoop and Pig". He talks
    about a solution for bigrams, giving an example with word matching.
    Bigrams
    =======
    • Input: a large text corpus
    • Output: List(word1, Top K(word2))
    • Two stages:
      • Generate all possible bigrams
      • Find the most frequent K bigrams for each word

    Bigrams: Stage 1 (Map)
    ======================
    • Generate all possible bigrams
    • Map input: the large text corpus
    • Map computation: in each sentence, for each adjacent pair "word1 word2",
      output (word1, word2) and (word2, word1)
    • Partition & sort by (word1, word2)

    pairs.pl
    --------
    # Emit each adjacent word pair in both orders, one "w1:w2" per line.
    while (<STDIN>) {
        chomp;
        $_ =~ s/[^a-zA-Z]+/ /g;   # keep letters only
        $_ =~ s/^\s+//g;
        $_ =~ s/\s+$//g;
        $_ =~ tr/A-Z/a-z/;        # lowercase
        my @words = split(/\s+/, $_);
        for (my $i = 0; $i < $#words; ++$i) {   # every adjacent pair
            print "$words[$i]:$words[$i+1]\n";
            print "$words[$i+1]:$words[$i]\n";
        }
    }

    Bigrams: Stage 1 (Reduce)
    =========================
    • Input: List(word1, word2), sorted and partitioned
    • Output: List(word1, [freq, word2])
    • Counting, similar to the Unigrams example

    count.pl
    --------
    # Count consecutive identical "w1:w2" pairs in the sorted stream,
    # emitting "w1:count:w2".
    $_ = <STDIN>; chomp;
    my ($pw1, $pw2) = split(/:/, $_);
    $count = 1;
    while (<STDIN>) {
        chomp;
        my ($w1, $w2) = split(/:/, $_);
        if ($w1 eq $pw1 && $w2 eq $pw2) {
            $count++;
        } else {
            print "$pw1:$count:$pw2\n";
            $pw1 = $w1;
            $pw2 = $w2;
            $count = 1;
        }
    }
    print "$pw1:$count:$pw2\n";

    Bigrams: Stage 2 (Map)
    ======================
    • Input: List(word1, [freq, word2])
    • Output: List(word1, [freq, word2])
    • Identity mapper (/bin/cat)
    • Partition by word1
    • Sort descending by (word1, freq)

    Bigrams: Stage 2 (Reduce)
    =========================
    • Input: List(word1, [freq, word2]), partitioned by word1 and sorted
      descending by (word1, freq)
    • Output: Top K(List(word1, [freq, word2]))
    • For each word, throw away everything after the first K records

    firstN.pl
    ---------
    # For each word1, keep only the first $N (highest-frequency) word2
    # entries from the sorted "w1:count:w2" stream.
    $N = 5;
    $_ = <STDIN>; chomp;
    my ($pw1, $count, $pw2) = split(/:/, $_);
    $idx = 1;
    $out = "$pw1\t$pw2,$count;";
    while (<STDIN>) {
        chomp;
        my ($w1, $c, $w2) = split(/:/, $_);
        if ($w1 eq $pw1) {
            if ($idx < $N) {     # keep at most $N entries per word
                $out .= "$w2,$c;";
                $idx++;
            }
        } else {
            print "$out\n";      # flush the previous word's line
            $pw1 = $w1;
            $idx = 1;
            $out = "$pw1\t$w2,$c;";
        }
    }
    print "$out\n";


    You can translate this approach to your specific problem. I recommend
    discussing it with him, because he has vast experience with all of this,
    much more than I do.
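
    For instance, one possible translation, sketched under assumptions (the
    tag mechanism, field layout, and class name are illustrative, not from
    the talk): key the records of both datasets by their name bigrams, so
    that a reducer only scores pairs that share at least one bigram instead
    of the full cross product.

    BigramBlockMapper.java
    ----------------------
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BigramBlockMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String tag; // "A" for dataset1, "B" for dataset2

        @Override
        protected void setup(Context context) {
            // Assume the driver tells each job which dataset it is reading.
            tag = context.getConfiguration().get("dataset.tag", "A");
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Assume the name is the first comma-separated field.
            String name = record.toString().split(",", 2)[0].toLowerCase();
            for (int i = 0; i + 1 < name.length(); i++) {
                // Key by bigram; the reducer then compares only A-B pairs
                // grouped under a shared bigram.
                context.write(new Text(name.substring(i, i + 2)),
                              new Text(tag + "\t" + record.toString()));
            }
        }
    }

    A pair sharing several bigrams meets in several groups, so the per-pair
    scores need deduplication before the weighted ranking.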

    Regards

    --
    Marcos Luís Ortíz Valmaseda
    Software Engineer
    Centro de Tecnologías de Gestión de Datos (DATEC)
    Universidad de las Ciencias Informáticas
    http://uncubanitolinuxero.blogspot.com
    http://www.linkedin.com/in/marcosluis2186
