Grokbase Groups Hive user June 2011
I think if you load a file, validate it, and then *alter table add partition*
to the final table at the end, then in the event of a crash you only have a
partially loaded ETL file that no one will be querying anyway.

That should work, though I am not speaking from personal experience, at
least not with HIVE.
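The staged-load pattern described above could be sketched in HiveQL roughly as follows; the table name, partition column, and paths are all hypothetical:

```sql
-- Hypothetical sketch: stage, validate, then publish via ADD PARTITION.
-- 1. Put the raw file in a staging directory outside the final table
--    (e.g. with "hadoop fs -put" or a LOAD DATA into a staging table).

-- 2. Validate the staged data (row counts, malformed records, etc.).

-- 3. Only after validation passes, point a new partition of the final
--    table at the validated directory. Until this metadata-only step,
--    no query against the final table can see the data.
ALTER TABLE events ADD PARTITION (dt = '2011-06-15')
LOCATION '/etl/validated/events/dt=2011-06-15';
```

If the process crashes before step 3, the final table's metadata is untouched, so the half-loaded file stays invisible to queries.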
On Wed, Jun 15, 2011 at 12:11 PM, W S Chung wrote:

If the failure of the load is severe enough, like the whole machine
crashing, there might not be an opportunity to catch the exception and
clean up the partition right away. The best I can think of is to clean up
the partition in a background job that runs reasonably regularly. In that
case, before the cleanup, is there any way I can prevent any query from
seeing the data in the partition that should not be there?

Or will this really happen? If the metadata is only updated after a
successful load, the partition may not exist unless the load runs to
completion.
On Tue, Jun 14, 2011 at 12:21 PM, Guy Bayes wrote:

The easiest way to achieve a level of robustness is probably to load into a
partition and then truncate the partition in the event of failure.

Cleaning up after an incomplete load is a problem in many traditional
RDBMSs; you cannot always rely on rollback functionality.

There are no explicit deletes in HIVE though, so whatever you need to do to
massage and clean the data file is best done prior to inserting it into its
final location.

Many of the things you bring up are more ETL best practices than
properties of an RDBMS implementation though.
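Since HIVE has no row-level DELETE, "truncating" a failed partition is usually done by rewriting it; a hypothetical sketch (table and column names are illustrative):

```sql
-- Overwrite the partition with an empty result set, which removes its
-- rows while keeping the partition defined. Alternatively, drop the
-- partition entirely with ALTER TABLE ... DROP PARTITION.
INSERT OVERWRITE TABLE events PARTITION (dt = '2011-06-14')
SELECT user_id, action, ts
FROM events
WHERE dt = '2011-06-14' AND 1 = 0;
```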

On Tue, Jun 14, 2011 at 8:57 AM, W S Chung wrote:

My question is a "what if" question, not a production issue. It seems
natural, when replacing a traditional database with Hive, to ask
how much robustness is sacrificed for scalability. My concern is that if
a file is partially loaded, there might not be an easy way to clean up the
already loaded data before re-loading it. The lack of a unique index also
makes it hard to avoid duplicate data, although duplicated data can
perhaps be deleted after the load.
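Deleting duplicates after the load, as suggested above, could be done by rewriting the affected partition with only the distinct rows; a sketch with illustrative table and column names:

```sql
-- Hypothetical post-load deduplication: since there is no unique index
-- to reject duplicates at load time, rewrite the partition keeping
-- only the distinct rows.
INSERT OVERWRITE TABLE events PARTITION (dt = '2011-06-14')
SELECT DISTINCT user_id, action, ts
FROM events
WHERE dt = '2011-06-14';
```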

On Mon, Jun 13, 2011 at 7:12 PM, Martin Konicek <
[email protected]> wrote:

I think this is a problem with open source in general and sometimes it
can be very frustrating.
However, your question is more of a "what if" question - you are not in
the trouble of having found a horrible bug after deploying to production,
am I right?
Regarding your question, I would guess that if LOAD DATA INPATH crashes
while moving files into the Hive warehouse, the data that was moved will
appear as legitimately loaded data. Or the files will be moved but the
metadata will not be updated. In any case, you should detect the crash and
redo the operation. The easiest answer might actually be to look in the
source code - sometimes the answer is easier to find there than one would
expect.

Not a complete answer, but hope this helps a bit.
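The detect-and-redo approach might start with a metadata check: if the crashed load never updated the metastore, the expected partition simply will not be listed, and the load can be re-run from the source file (table name hypothetical):

```sql
-- Hypothetical recovery check: if the expected partition is absent
-- from the output, the load did not complete and can be redone.
SHOW PARTITIONS events;
```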


On 14/06/2011 00:47, W S Chung wrote:

I submitted a question like this before, but somehow that question was
never delivered; I cannot even find my question on Google. Since I cannot
find any admin e-mail or feedback form on the Hive website where I could
ask why the last question was not delivered, there is not much option other
than to post the question again and hope that it gets through this time.
Sorry for the double posting if you have seen my last e-mail.

What is the behaviour if a client of hive crashes in the middle of
running a "load data inpath" for either a local file or a file on HDFS? Will
the file be partially loaded in the db? Thanks.
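For reference, the statement in question looks like this; paths and names are illustrative. LOAD DATA INPATH moves files into the table's warehouse directory, while the LOCAL variant copies them from the client's local filesystem; neither parses the rows:

```sql
-- Moves the HDFS file into the warehouse directory for the partition.
LOAD DATA INPATH '/user/etl/events_20110613'
INTO TABLE events PARTITION (dt = '2011-06-13');

-- Copies a file from the client's local filesystem instead.
LOAD DATA LOCAL INPATH '/tmp/events_20110613'
INTO TABLE events PARTITION (dt = '2011-06-13');
```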

group: user@hive
categories: hive, hadoop
posted: Jun 13, '11 at 10:47p
active: Jun 15, '11 at 7:47p


