FAQ
Hi All,
In databases you can define primary keys to ensure that no duplicate data gets loaded into the system. Say I have around 1 billion records flowing into my system every day, and some of them are repeats (identical records). I can use 2-3 columns in each record to match and look for duplicates. What is the best de-duplication strategy? Duplicates should only appear within the last 2 weeks. I want a fast way to get the data into the system without much delay. Can HBase or Hive help?

Thanks!
Jonathan



  • Michael Segel at Jul 14, 2011 at 4:18 pm
    You don't have dupes because the HBase row key has to be unique.



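    A minimal sketch of what Michael is describing, written against the older (0.90-era) HBase Java client; the table name "events", family "d", and key layout are made up for illustration. Two Puts that carry the same row key end up in one logical row; the second write just becomes the newest version of the cell rather than a duplicate record.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DedupByRowKey {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "events");   // hypothetical table name

            // The same logical record arrives twice; both Puts use the same row key.
            byte[] rowKey = Bytes.toBytes("custId|txnId|2011-07-14");

            Put first = new Put(rowKey);
            first.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("record"));
            table.put(first);

            Put second = new Put(rowKey);
            second.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("record"));
            table.put(second);   // lands on the same row: no duplicate, just a newer cell version

            table.close();
        }
    }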
  • C.V.Krishnakumar Iyer at Jul 14, 2011 at 7:12 pm
    Hi,

    I guess by "system" you meant HDFS.

    In that case HBase might help. HBase row keys need to be unique, and they are just bytes, so I guess you can concatenate the columns of your primary key (if it spans more than one column) to form the HBase row key, so that duplicates don't exist.

    So the data can be stored in HBase rather than in files, and everything else stays the same.

    I don't know about Hive, though.

    Thanks,
    Krishnakumar.


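    A minimal sketch of Krishnakumar's idea, again against the older HBase Java client and with hypothetical names (table "events", family "d"): the 2-3 de-dup columns are concatenated into the row key so repeated records collapse onto one row, and a 14-day TTL on the column family lets rows older than two weeks expire automatically, which lines up with Jonathan's two-week window.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CompositeKeyLoader {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // One-time setup: a single column family whose cells expire after
            // 14 days, so rows older than two weeks disappear on their own.
            HBaseAdmin admin = new HBaseAdmin(conf);
            if (!admin.tableExists("events")) {
                HColumnDescriptor fam = new HColumnDescriptor("d");
                fam.setTimeToLive(14 * 24 * 60 * 60);   // TTL in seconds = 2 weeks
                HTableDescriptor desc = new HTableDescriptor("events");
                desc.addFamily(fam);
                admin.createTable(desc);
            }

            // Build the row key from the 2-3 columns used to spot duplicates.
            String custId = "C1001", txnId = "T42", eventDate = "2011-07-14";
            byte[] rowKey = Bytes.toBytes(custId + "|" + txnId + "|" + eventDate);

            HTable table = new HTable(conf, "events");
            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("d"), Bytes.toBytes("rest"), Bytes.toBytes("remaining fields"));
            table.put(put);   // a re-sent record maps to the same row key: no duplicate row
            table.close();
        }
    }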


Discussion Overview
group: common-user
categories: hadoop
posted: Jul 14, '11 at 4:01p
active: Jul 14, '11 at 7:12p
posts: 3
users: 3
website: hadoop.apache.org...
irc: #hadoop
