Normally in Hive, a table or partition is loaded by a single job or process in one go. Once it is loaded, you can't append or insert any more data into that table (unless you do it manually by moving data into its directory). So it is most probably easier to enforce the constraints in that loading process. This solution is not as nifty as in an RDBMS, but the probability of inserting duplicates is much higher in an RDBMS anyway.
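As a rough illustration of enforcing uniqueness in the loading process itself, here is a minimal Python sketch that drops repeated keys from the input before it is handed to Hive. The column names (`id`, `name`), the sample rows, and the helper `dedupe_rows` are hypothetical and only for illustration. It keeps all seen keys in memory, so it assumes the key set of one load fits on a single machine.

```python
import csv
import io


def dedupe_rows(rows, key_columns):
    """Yield only the first row seen for each key, preserving input order."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            continue  # duplicate key: skip it instead of loading it
        seen.add(key)
        yield row


# Hypothetical input that would otherwise be loaded as-is.
raw = "id,name\n1,alice\n2,bob\n1,alice\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Rows that are safe to write out to the file used for the load.
unique_rows = list(dedupe_rows(rows, ["id"]))
```

For inputs too large for one machine, the same idea would have to run as a distributed job before the load rather than in a single process.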
From: Jeff Hammerbacher <email@example.com>
Date: Thu, 29 Jan 2009 11:49:29 -0800
Cc: Zheng Shao <firstname.lastname@example.org>
Subject: Re: data integrity
One possibility would be to run a MapReduce/Hive job after the load that checks that your integrity constraints are met.
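A post-load check of that kind could look roughly like the sketch below. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere. The table `events`, the column `id`, and the function `find_duplicate_keys` are made up for illustration. The query avoids HAVING and filters an aggregated subquery instead, but whether your Hive version accepts it exactly as written is an assumption you should verify.

```python
import sqlite3

# Count rows per key in a subquery, then keep only keys that occur more than once.
DUPLICATE_CHECK = """
SELECT k.id, k.cnt
FROM (SELECT id, COUNT(1) AS cnt FROM events GROUP BY id) k
WHERE k.cnt > 1
"""


def find_duplicate_keys(conn):
    """Return (key, count) pairs for every key that appears more than once."""
    return conn.execute(DUPLICATE_CHECK).fetchall()


# In-memory stand-in for a freshly loaded table (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "a"), (2, "b"), (1, "a")],
)

duplicates = find_duplicate_keys(conn)
```

An empty result means the uniqueness constraint holds for that load. A non-empty result tells you which keys to investigate before the data is used downstream.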
On Thu, Jan 29, 2009 at 10:03 AM, Zheng Shao wrote:
IF was just added last evening.
I will add it to wiki today.
We don't have case, decode etc yet.
On 1/29/09, Shane Brady wrote:
I'm rather new to Hive and have been playing with it the last couple weeks
to see if it is appropriate to use for a particular project inside where I
work. My essential question is, how to maintain data integrity inside the
tables so that we don't accidentally load duplicate data. Normally we rely
on indexes or unique keys to enforce this. Is there a general strategy for
this in Hive?
As a second question, I haven't seen anything like it in the docs, but is
there any equivalent to CASE, DECODE, or IF-THEN-ELSE allowed in the query?
-Shane P. Brady