On Tue, 2011-03-08 at 10:51 +0530, Sonal Goyal wrote:
Hi Marcos,
Thanks for replying. I think I was not very clear in my last post. Let
me describe my use case in detail.
I have two datasets coming from different sources, let's call them
dataset1 and dataset2. Both of them contain records for entities, say
Person. A single record looks like:
First Name, Last Name, Street, City, State, Zip
We want to compare each record of dataset1 with each record of
dataset2, in effect a cross join.
We know that, the way the data is collected, names will not match
exactly, but we want to find close-enough matches. So we have a rule
that says: create bigrams and find the matching bigrams. If 0 to 5
bigrams match, give a score of 10; if 5 to 15 match, give a score of
20; and so on.
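For concreteness, a minimal Perl sketch of this bigram rule; the
letter-bigram normalization, the helper names, and the bucket above 15
matches are assumptions, since the mail only states the first two
buckets:
#!/usr/bin/perl
use strict;
use warnings;

# build a multiset of letter bigrams from a name
sub bigrams {
    my ($s) = @_;
    $s = lc $s;
    $s =~ s/[^a-z]//g;
    my %b;
    $b{ substr($s, $_, 2) }++ for 0 .. length($s) - 2;
    return \%b;
}

# count overlapping bigrams, then bucket the count into a score
sub name_score {
    my ($ba, $bb) = (bigrams($_[0]), bigrams($_[1]));
    my $matches = 0;
    for my $g (keys %$ba) {
        next unless exists $bb->{$g};
        $matches += $ba->{$g} < $bb->{$g} ? $ba->{$g} : $bb->{$g};
    }
    return 10 if $matches <= 5;
    return 20 if $matches <= 15;
    return 30;    # assumed next bucket; the mail says "and so on"
}

print name_score("Jonathan Smith", "Jonathon Smith"), "\n";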
Well, an approach to this problem is given by Milind Bhandarkar in his
presentation "Practical Problem Solving with Hadoop and Pig".
He talks through a solution for bigrams, using word matching as an
example.
Bigrams
=======
• Input: A large text corpus
• Output: List(word1, TopK(word2))
• Two stages:
  • Generate all possible bigrams
  • Find the most frequent K bigrams for each word
Bigrams: Stage 1
Map
===
• Generate all possible bigrams
• Map input: large text corpus
• Map computation:
  • In each sentence, for each adjacent pair "word1 word2"
  • Output (word1, word2) and (word2, word1)
• Partition & sort by (word1, word2)
pairs.pl
--------
#!/usr/bin/perl
use strict;
use warnings;

while (<STDIN>) {
    chomp;
    s/[^a-zA-Z]+/ /g;    # keep letters only
    s/^\s+//;            # trim leading whitespace
    s/\s+$//;            # trim trailing whitespace
    tr/A-Z/a-z/;         # lowercase
    my @words = split /\s+/;
    # emit every adjacent pair, in both orders
    for (my $i = 0; $i < $#words; ++$i) {
        print "$words[$i]:$words[$i+1]\n";
        print "$words[$i+1]:$words[$i]\n";
    }
}
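For example, the input line "The quick fox" yields the:quick,
quick:the, quick:fox and fox:quick, so each bigram is counted under
both of its words.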
Bigrams: Stage 1
Reduce
======
• Input: List(word1, word2), sorted and partitioned
• Output: List(word1, [freq, word2])
• Counting similar to the Unigrams example
count.pl
--------
#!/usr/bin/perl
use strict;
use warnings;

# input is sorted, so identical pairs arrive consecutively
my $line = <STDIN>;
exit 0 unless defined $line;
chomp $line;
my ($pw1, $pw2) = split /:/, $line;
my $count = 1;
while (<STDIN>) {
    chomp;
    my ($w1, $w2) = split /:/;
    if ($w1 eq $pw1 && $w2 eq $pw2) {
        $count++;                      # same pair: keep counting
    } else {
        print "$pw1:$count:$pw2\n";    # pair changed: flush the previous one
        ($pw1, $pw2) = ($w1, $w2);
        $count = 1;
    }
}
print "$pw1:$count:$pw2\n";            # flush the final pair
Bigrams: Stage 2
Map
===
• Input: List(word1, [freq, word2])
• Output: List(word1, [freq, word2])
• Identity mapper (/bin/cat)
• Partition by word1
• Sort descending by (word1, freq)
Bigrams: Stage 2
Reduce
======
• Input: List(word1, [freq, word2])
  • partitioned by word1
  • sorted descending by (word1, freq)
• Output: TopK(List(word1, [freq, word2]))
• For each word, throw away everything after K records
firstN.pl
---------
#!/usr/bin/perl
use strict;
use warnings;

my $N = 5;                             # keep the top N bigrams per word
my $line = <STDIN>;
exit 0 unless defined $line;
chomp $line;
my ($pw1, $count, $pw2) = split /:/, $line;
my $idx = 1;
my $out = "$pw1\t$pw2,$count;";
while (<STDIN>) {
    chomp;
    my ($w1, $c, $w2) = split /:/;
    if ($w1 eq $pw1) {
        if ($idx < $N) {               # input is sorted by freq, so the
            $out .= "$w2,$c;";         # first N seen are the top N
            $idx++;
        }
    } else {
        print "$out\n";                # new word: flush the previous one
        $pw1 = $w1;
        $idx = 1;
        $out = "$pw1\t$w2,$c;";
    }
}
print "$out\n";
You can translate this approach to your specific problem.
I recommend that you discuss this with him, because he has vast
experience with all of this, much more than I do.
Regards
For Zip, our rule says: if the zips match exactly, or are within 5 km
of each other (through a lookup), give a score of 50; and so on.
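A minimal sketch of this zip rule, assuming a hypothetical
zip-to-centroid lookup (%zip_centroid below is made up; in practice it
would be loaded from the lookup file) and a standard haversine
distance:
#!/usr/bin/perl
use strict;
use warnings;

my %zip_centroid = (    # zip => [lat, lon]; sample values, not real data
    '10001' => [40.7506, -73.9972],
    '10011' => [40.7420, -74.0000],
);

# great-circle distance between two lat/lon points, in km
sub haversine_km {
    my ($lat1, $lon1, $lat2, $lon2) = @_;
    my $pi = 3.14159265358979;
    my ($r1, $r2) = map { $_ * $pi / 180 } ($lat1, $lat2);
    my $dlat = $r2 - $r1;
    my $dlon = ($lon2 - $lon1) * $pi / 180;
    my $a = sin($dlat / 2) ** 2 + cos($r1) * cos($r2) * sin($dlon / 2) ** 2;
    return 6371 * 2 * atan2(sqrt($a), sqrt(1 - $a));
}

sub zip_score {
    my ($z1, $z2) = @_;
    return 50 if $z1 eq $z2;                       # exact match
    my ($c1, $c2) = @zip_centroid{$z1, $z2};
    return 0 unless $c1 && $c2;                    # unknown zip
    return haversine_km(@$c1, @$c2) <= 5 ? 50 : 0; # within 5 km
}

print zip_score('10001', '10011'), "\n";   # about 1 km apart -> 50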
Once each person of dataset1 has been compared with each person of
dataset2, we find the overall rank, which is a weighted average of the
scores of the name, address, etc. comparisons.
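And a minimal sketch of that weighted average; the field weights below
are placeholders, since the mail does not give them:
use strict;
use warnings;

my %weights = (name => 0.5, address => 0.3, zip => 0.2);   # assumed weights

sub overall_rank {
    my (%scores) = @_;          # e.g. (name => 20, address => 30, zip => 50)
    my ($sum, $wsum) = (0, 0);
    for my $field (keys %weights) {
        next unless exists $scores{$field};
        $sum  += $weights{$field} * $scores{$field};
        $wsum += $weights{$field};
    }
    return $wsum ? $sum / $wsum : 0;   # normalize over the fields present
}

print overall_rank(name => 20, address => 30, zip => 50), "\n";   # 29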
One approach is to use the DistributedCache for the smaller dataset
and do a nested loop join in the mapper. The second approach is to use
multiple MR flows, and compare the fields and reduce/collate the
results.
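A minimal sketch of the first approach as a Hadoop Streaming mapper in
Perl, assuming the smaller dataset is shipped to every node via the
DistributedCache (for streaming jobs, the -files option) under the
hypothetical name dataset2.txt, and that score() stands in for the
field rules above:
#!/usr/bin/perl
use strict;
use warnings;

# Load the small dataset once per mapper; 'dataset2.txt' is a
# hypothetical name for the file shipped via DistributedCache/-files.
my @small;
open my $fh, '<', 'dataset2.txt' or die "cannot open dataset2.txt: $!";
while (<$fh>) { chomp; push @small, $_; }
close $fh;

# Nested-loop join: every record of the large dataset arriving on
# stdin is compared against every record of the small dataset.
while (my $big = <STDIN>) {
    chomp $big;
    for my $rec (@small) {
        my $rank = score($big, $rec);    # placeholder for the scoring rules
        print "$big\t$rec\t$rank\n";
    }
}

sub score { return 0 }    # stub; plug in the name/zip/address rules here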
I am curious to know if people have implemented other approaches, and
what efficiencies they have built in.
Thanks and Regards,
Sonal
Hadoop ETL and Data Integration
Nube Technologies
On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz wrote:
On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
Hi,
I am working on a problem to compare two different datasets, and rank
each record of the first with respect to the other, in terms of how
similar they are. The records are dimensional, but do not have a lot
of dimensions. Some of the fields will be compared for exact matches,
some for similar sound, some for closest match, etc. One of the
datasets is large, and the other is much smaller. The final goal is
to compute a rank between each record of the first dataset and each
record of the second. The rank is based on weighted scores of each
dimension comparison.
I was wondering if people in the community have any advice, suggested
patterns or thoughts about cross joining two datasets in MapReduce.
Do let me know if you have any suggestions.
Thanks and Regards,
Sonal
Hadoop ETL and Data Integration
Nube Technologies
Regards, Sonal. Can you give us more information about the basic
workflow of your idea?
Some questions:
- How do you know that two records are identical? By id?
- Can you give an example of the ranking that you want to achieve
for each case:
  - two records that are identical
  - two records that are similar
  - two records with the closest match
For MapReduce algorithm design, I recommend to you this excellent
article from Ricky Ho:
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html
For the join of the two datasets, you can use Pig.
Here you have a basic Pig example from Milind Bhandarkar
(milindb@yahoo-inc.com)'s talk "Practical Problem Solving with Hadoop
and Pig":
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
--
Marcos Luís Ortíz Valmaseda
Software Engineer
Centro de Tecnologías de Gestión de Datos (DATEC)
Universidad de las Ciencias Informáticas
http://uncubanitolinuxero.blogspot.com
http://www.linkedin.com/in/marcosluis2186