Grokbase Groups Hive user June 2011
The easiest way to achieve a level of robustness is probably to load into a
partition and then truncate the partition in the event of failure.
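
The partition-per-load idea above can be sketched roughly as follows. This is a minimal sketch, not something from the thread: `run_hql` is a hypothetical callable that submits one statement to Hive and raises on failure, and the table, path, and partition names are made up. It also assumes the cleanup is done with `ALTER TABLE ... DROP IF EXISTS PARTITION` rather than a literal truncate.

```python
from typing import Callable


def load_statement(table: str, path: str, key: str, value: str) -> str:
    """Build a LOAD DATA statement that targets a single partition."""
    return (
        f"LOAD DATA INPATH '{path}' INTO TABLE {table} "
        f"PARTITION ({key}='{value}')"
    )


def drop_partition_statement(table: str, key: str, value: str) -> str:
    """Build the cleanup statement that discards that one partition."""
    return f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({key}='{value}')"


def load_with_cleanup(
    run_hql: Callable[[str], None],  # hypothetical: runs one HiveQL statement
    table: str,
    path: str,
    key: str,
    value: str,
) -> bool:
    """Load into one partition; if the load fails, drop that partition.

    Returns True when the load succeeded and False when it failed and the
    cleanup statement was issued instead.
    """
    try:
        run_hql(load_statement(table, path, key, value))
        return True
    except Exception:
        # Only the partition for this load is discarded, so earlier
        # loads in other partitions are left alone.
        run_hql(drop_partition_statement(table, key, value))
        return False
```

This only covers a failure that is reported back to a client that is still running. Since `LOAD DATA INPATH` moves the source files on HDFS rather than copying them, it is probably worth keeping a copy of the input so the load can be repeated after the cleanup.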

Cleaning up after an incomplete load is a problem in many traditional
RDBMSs; you cannot always rely on rollback functionality.

There are no explicit deletes in Hive, though, so whatever you need to do to
massage and clean the data file is best done prior to inserting it into its
final destination.

Many of the things you bring up are more ETL best practices than properties
of an RDBMS implementation, though.
Guy
On Tue, Jun 14, 2011 at 8:57 AM, W S Chung wrote:

My question is a "what if" question, not a production issue. It seems
natural, when replacing a traditional database with Hive, to ask
how much robustness is sacrificed for scalability. My concern is that if a
file is partially loaded, there might not be an easy way to clean up the
already loaded data before re-loading it. The lack of a unique index
also makes it hard to avoid duplicate data, although
duplicated data can perhaps be deleted after the load.
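
With no unique index to lean on, one option is to remove duplicates from the file before it is handed to Hive. The following is a minimal sketch under stated assumptions: the input is a tab-delimited text file, the uniqueness key is a set of column positions, and `dedupe_rows` is a hypothetical helper name, not part of Hive.

```python
from typing import Iterable, Iterator, Sequence


def dedupe_rows(
    lines: Iterable[str],
    key_columns: Sequence[int],
    delimiter: str = "\t",  # assumption: tab-delimited input
) -> Iterator[str]:
    """Yield each line whose key has not been seen before.

    The first row for a given key is kept and later rows with the same
    key are dropped. All seen keys are held in memory.
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        key = tuple(fields[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            yield line
```

Because the seen keys live in memory, this only suits files whose key set fits on one machine; for larger inputs the same idea would have to be done as a distributed job.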

On Mon, Jun 13, 2011 at 7:12 PM, Martin Konicek wrote:

Hi,

I think this is a problem with open source in general and sometimes it can
be very frustrating.
However, your question is more of a "what if" question - you're not in the
position of having found a horrible bug after deploying to production, am I
right?

Regarding your question, I would guess that if LOAD DATA INPATH crashes
while moving files into the Hive warehouse, the data that was already moved
will appear as legitimate loaded data. Or the files will be moved but the
metadata will not be updated. In any case, you should detect the crash and
redo the operation. The easiest answer might actually be to look into the
source code - sometimes the answer is easier to find there than one would
expect.
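
One way to "detect the crash and redo the operation" is to keep a small record, outside Hive, of which loads were started and which finished. The sketch below is only an illustration: `LoadJournal`, its method names, and the JSON file it writes are all hypothetical, and nothing here is part of Hive itself.

```python
import json
import os


class LoadJournal:
    """Tiny local journal of which loads were started and which finished."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _read(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as handle:
            return json.load(handle)

    def _write(self, state: dict) -> None:
        # Write to a temporary file and rename it, so a crash while
        # writing does not leave a half-written journal behind.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(state, handle)
        os.replace(tmp_path, self.path)

    def mark_started(self, load_id: str) -> None:
        """Record that a load is about to run."""
        state = self._read()
        state[load_id] = "started"
        self._write(state)

    def mark_done(self, load_id: str) -> None:
        """Record that a load returned successfully."""
        state = self._read()
        state[load_id] = "done"
        self._write(state)

    def incomplete(self) -> list:
        """Return the ids of loads that started but never finished."""
        return sorted(k for k, v in self._read().items() if v == "started")
```

A client would call `mark_started` before issuing the load and `mark_done` after it returns. On restart, anything listed by `incomplete()` is a candidate for cleanup and a repeat of the load.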

Not a complete answer, but hope this helps a bit.

Martin

On 14/06/2011 00:47, W S Chung wrote:

I submitted a question like this before, but somehow that question was never
delivered. I can even find my question in Google. I cannot find any
admin e-mail or feedback form on the Hive website where I can ask why the last
question was not delivered, so there is not much option other than to post the
question again and hope that it gets through this time. Sorry for
the double posting if you have seen my last e-mail.

What is the behaviour if a Hive client crashes in the middle of
running a "load data inpath" for either a local file or a file on HDFS? Will
the file be partially loaded into the database? Thanks.

Discussion Overview
group: user @
categories: hive, hadoop
posted: Jun 13, '11 at 10:47p
active: Jun 15, '11 at 7:47p
posts: 7
users: 3
website: hive.apache.org
