Joe,
This is what I use for a related problem, using pure HDFS [no HBase]:
1. Run a one-time map-reduce job where the input is your current historical file of hashes [say it is of the format <hash-key, hash-value> in some kind of flat file], using IdentityMapper, and have your custom reducer output <hash-key, hash-value>, or maybe even <hash-key, dummy value> to save space. The important thing is to use MapFileOutputFormat for the reducer output instead of the typical SequenceFileOutputFormat. Now you have a single look-up table which you can use for efficient lookups by hash key (see the sketch after step 4).
Note down the HDFS path of where you stored this mapfile, call it dedupMapFile.
2. In your incremental data update job, pass the HDFS path of dedupMapFile in your conf, then open the mapfile in your reducer's configure(), keep a reference to it as a field of the reducer class, and close it in close().
Inside your reduce(), use that mapfile reference to look up your hash key; if there is a hit, it is a dup.
3. Also, for your reducer in 2. above, you can use a custom multiple-output format, in which one of the outputs is your current output, and the other is a new dedup output sequencefile in the same key-value format as the dedupMapFile. So in reduce(), if the current key is a dup, discard it; otherwise write it to both your regular output and the new dedup output.
4. After each incremental update job, run a new map reduce job [IdentityMapper and IdentityReducer] to merge the new dedup file with your old dedupMapFile, resulting in the updated dedupMapFile.
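Rough (untested) sketch of steps 1 and 2 against the old mapred API; the class names, the "dedup.mapfile.path" conf key and the single-reducer assumption are just for illustration, and step 3's second output is left out:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class DedupMapFileSketch {

  // Step 1: one-time job that turns the historical <hash-key, hash-value>
  // file into a MapFile we can do random lookups against.
  public static void buildDedupMapFile(Path historicalHashes, Path dedupMapFile)
      throws IOException {
    JobConf job = new JobConf(DedupMapFileSketch.class);
    job.setJobName("build-dedup-mapfile");
    job.setInputFormat(SequenceFileInputFormat.class); // or whatever holds the hashes
    FileInputFormat.setInputPaths(job, historicalHashes);
    job.setMapperClass(IdentityMapper.class);
    job.setReducerClass(IdentityReducer.class);        // a custom reducer works too
    job.setOutputKeyClass(Text.class);                  // hash-key
    job.setOutputValueClass(Text.class);                // hash-value (or a dummy)
    job.setOutputFormat(MapFileOutputFormat.class);     // the key point: MapFile, not SequenceFile
    job.setNumReduceTasks(1);                           // one reducer -> one MapFile, keeps the lookup simple
    FileOutputFormat.setOutputPath(job, dedupMapFile);
    JobClient.runJob(job);
  }

  // Step 2: reducer of the incremental job; opens the MapFile in configure(),
  // looks up each hash-key in reduce(), closes it in close().
  public static class DedupLookupReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    private MapFile.Reader dedupReader;

    public void configure(JobConf conf) {
      try {
        String dir = conf.get("dedup.mapfile.path");     // set by the driver
        FileSystem fs = FileSystem.get(conf);
        // part-00000 because the MapFile above was written by a single reducer
        dedupReader = new MapFile.Reader(fs, dir + "/part-00000", conf);
      } catch (IOException e) {
        throw new RuntimeException("cannot open dedup mapfile", e);
      }
    }

    public void reduce(Text hashKey, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Text existing = new Text();
      if (dedupReader.get(hashKey, existing) != null) {
        return;                                          // hit in the MapFile -> duplicate, drop it
      }
      while (values.hasNext()) {
        out.collect(hashKey, values.next());             // not seen before -> keep
      }
    }

    public void close() throws IOException {
      dedupReader.close();
    }
  }
}

If you do use more than one reducer in step 1, open every part-* MapFile (MapFileOutputFormat.getReaders) and route each lookup through the job's partitioner instead of hard-coding part-00000.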
Some comments:
* I didn't read your approach too closely, so I suspect you might be doing something essentially like this already.
* All this stuff is basically what HBase does for free: your dedupMapFile is now an HBase table, and you don't have to run Step 4, since you can just write new [non-duplicate] hash-keys to the HBase table in Step 3; in Step 2 you just use table.exists(hash-key) to check whether it is a dup. You still need Step 1 to populate the table with your historical data.
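For what it's worth, that check/write looks roughly like this against the 0.20-era HBase client API (the HTable would be opened once in configure(); the column family/qualifier "h"/"v" and the helper name isDuplicate are made up here):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDedupSketch {
  // Returns true if hashKey has been seen before; otherwise records it so
  // the next incremental run will see it (this replaces steps 3 and 4).
  public static boolean isDuplicate(HTable table, byte[] hashKey) throws IOException {
    if (table.exists(new Get(hashKey))) {
      return true;                                       // already in the table -> dup
    }
    Put put = new Put(hashKey);
    put.add(Bytes.toBytes("h"), Bytes.toBytes("v"), Bytes.toBytes("1"));
    table.put(put);                                      // remember the new hash
    return false;
  }
}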
Hope this helps....
Cheers,
jp
-----Original Message-----
From: Joseph Stein
Sent: Thursday, March 25, 2010 11:35 AM
To: [email protected]
Subject: Re: DeDuplication Techniques
The thing is I have to check historic data (meaning data I have
already aggregated against) so I basically need to hold and read from
a file of hashes.
So within the current data set, yes, this would work, but I then have
to open a file, loop through the values, and see whether it is there.
If it is there then throw it out; if not, add it to the end.
To me, opening a file and checking for dups is a map/reduce task in itself.
What I was thinking is having my mapper take the data I want to
validate as unique. I then loop through the filter files. Each data
point has a key that lets me find the file holding its data, e.g.
part of the data partitions the hash, so each file holds one slice of
the hashes. So my map job takes the data and breaks it into key/value
pairs (the key allows me to look up my filter file).
When it gets to the reducer... the key tells me which file to open. I
open the file, loop through it... if the hash is there, throw the data
away; if it is not there, add the hash of my data to the filter file
and then output (as the reduce output) the unique value.
This output of the uniques is then the data I aggregate on, which also
updates my historic filter so the next job (5 minutes later) sees it,
etc.
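Roughly, the map side would be something like this (untested; assumes the
first two hex chars of an MD5 pick one of 256 filter files and that the
input records are plain Text lines, old mapred API):

import java.io.IOException;

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HashPartitionMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text record,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String hash = DigestUtils.md5Hex(record.toString());
    // The first two hex chars pick one of 256 filter files; every record
    // with the same prefix lands in the same reducer, which owns that
    // filter file and can scan/append it as described above.
    out.collect(new Text(hash.substring(0, 2)), new Text(hash + "\t" + record));
  }
}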
On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner wrote:
Joe,
what about this approach:
using the hash values as your keys in the MR maps. Since they are
sorted by key, in the reducer you will get all duplicates together, so
you can loop through them. As the simplest solution, you just take the
first one.
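Something like this (untested sketch, old mapred API; MD5 of the whole
record as the key is just one choice):

import java.io.IOException;
import java.util.Iterator;

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class HashKeyDedup {

  public static class HashMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text record,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      // Key each record by its hash; identical records share a key.
      out.collect(new Text(DigestUtils.md5Hex(record.toString())), record);
    }
  }

  public static class FirstValueReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text hash, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      // All duplicates arrive together; keep only the first copy.
      out.collect(hash, values.next());
    }
  }
}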
Sincerely,
Mark
On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein wrote:
I have been researching ways to handle de-dupping data while running a
map/reduce program (so as to not re-calculate/re-aggregate data that
we have seen before[possibly months before]).
The data sets we have are littered with repeats of data from mobile
devices which continue to come in over time (so we may see duplicates
of data re-posted months after it was originally posted...)
I have 2 ways so far I can go about it (one way I do in production
without Hadoop) and am interested to see if others have faced/solved
this in Hadoop/HDFS and what their experience might be.
1) handle my own hash filter (where I continually store and look up a
hash (MD5, Bloom, whatever) of the data I am aggregating on, to check
whether it already exists). We do this now without Hadoop; perhaps a
variant can be ported into HDFS as a map task, reducing the results to
files and restoring the hash table (maybe in Hive or something, dunno
yet)
2) push the data into Cassandra (our NoSQL solution of choice) and let
that hash/map system do it for us. As I get more into Hadoop, looking
at HBase is tempting, but it is just one more thing to learn.
I would really like not to have to reinvent the wheel here, and would
even contribute if something is already going on, as it is a use case
in our work.
Thanx in advance =8^) Apologies, I posted this on common-dev yesterday
by accident (so this is not repost spam but appropriate for this
list)
Cheers.
/*
Joe Stein
http://www.linkedin.com/in/charmalloc*/
--
/*
Joe Stein
http://www.linkedin.com/in/charmalloc*/