Grokbase Groups Hive user June 2011
I submitted a question like this before, but somehow that question was never
delivered. I can even find my question in Google. Since I cannot find any
admin e-mail or feedback form on the Hive website where I could ask why the
last question was not delivered, there is not much option other than to post
the question again and hope that it gets through this time. Sorry for
the double posting if you have seen my last e-mail.

What is the behaviour if a Hive client crashes in the middle of running
a "load data inpath" for either a local file or a file on HDFS? Will the
file be partially loaded into the db? Thanks.

  • Martin Konicek at Jun 13, 2011 at 11:12 pm
    Hi,

    I think this is a problem with open source in general and sometimes it
    can be very frustrating.
    However, your question is more of a "what if" question - you're not in
    the trouble of finding a horrible bug after you deployed to production,
    am I right?

    Regarding your question, I would guess that if LOAD DATA INPATH crashes
    while moving files into the Hive warehouse, the data which was moved
    will appear as legitimate loaded data. Or the files will be moved but
    the metadata will not be updated. In any case, you should detect the
    crash and redo the operation. The easiest answer might actually be to
    look into the source code - sometimes it can be easier to find than one
    would expect.

    Not a complete answer, but hope this helps a bit.

    Martin
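
A minimal sketch of "detect the crash and redo", assuming the load targets a partition and that OVERWRITE is acceptable so that a redo replaces whatever an earlier attempt left behind. The names `load_statement`, `load_with_redo` and the `run` callable (anything that submits one HiveQL statement) are hypothetical, and paths are not escaped. Note that without LOCAL the source files are moved rather than copied, so a redo may find part of the source already gone.

```python
def load_statement(path, table, partition=None, local=False, overwrite=True):
    """Build a HiveQL LOAD DATA statement as a string (illustrative helper)."""
    parts = ["LOAD DATA"]
    if local:
        parts.append("LOCAL")  # LOCAL reads from the client's file system
    parts.append(f"INPATH '{path}'")
    if overwrite:
        parts.append("OVERWRITE")  # a redo replaces the target's contents
    parts.append(f"INTO TABLE {table}")
    if partition:
        spec = ", ".join(f"{k}='{v}'" for k, v in sorted(partition.items()))
        parts.append(f"PARTITION ({spec})")
    return " ".join(parts)


def load_with_redo(run, statement, attempts=3):
    """Run the statement, redoing it when `run` raises.

    Returns the attempt number that succeeded; re-raises after the last try.
    """
    for attempt in range(1, attempts + 1):
        try:
            run(statement)
            return attempt
        except Exception:
            if attempt == attempts:
                raise
```

An in-process retry only covers failures the client survives. If the client itself dies, the redo would have to come from whatever restarts the job.
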
  • W S Chung at Jun 14, 2011 at 3:57 pm
    My question is a "what if" question, not a production issue. It seems
    natural, when replacing a traditional database with Hive, to ask
    how much robustness is sacrificed for scalability. My concern is that if a
    file is partially loaded, there might not be an easy way to clean up the
    already loaded data before re-loading the data. The lack of a unique index
    also makes it hard to avoid duplicate data, although
    duplicated data can perhaps be deleted after the load.
  • Guy Bayes at Jun 14, 2011 at 4:21 pm
    The easiest way to achieve a level of robustness is probably to load into a
    partition and then truncate the partition in the event of failure.

    Cleaning up after an incomplete load is a problem in many traditional
    RDBMSs; you cannot always rely on rollback functionality.

    There are no explicit deletes in Hive, though, so whatever you need to do to
    massage and clean the data file is best done prior to inserting it into its
    final destination.

    Many of the things you bring up are more ETL best practices than properties
    of an RDBMS implementation, though.
    Guy
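
The "load into a partition, clear it on failure" idea could look roughly like the sketch below. It uses `ALTER TABLE ... DROP PARTITION` as a stand-in for "truncate", which is an assumption about what was meant. `load_or_drop` and the `run` callable (anything that submits one HiveQL statement) are hypothetical names. `IF EXISTS` is there because it is unclear whether the partition is registered in the metastore after a failed load; check that your Hive version accepts it.

```python
def partition_spec(partition):
    """Render {'dt': '2011-06-14'} as dt='2011-06-14' (illustrative helper)."""
    return ", ".join(f"{k}='{v}'" for k, v in sorted(partition.items()))


def load_or_drop(run, path, table, partition):
    """Load a file into one partition; drop that partition if the load fails."""
    spec = partition_spec(partition)
    load = f"LOAD DATA INPATH '{path}' INTO TABLE {table} PARTITION ({spec})"
    drop = f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({spec})"
    try:
        run(load)
    except Exception:
        # Cleanup is scoped to this one partition; the rest of the table is untouched.
        run(drop)
        raise
```

This only helps when the client survives long enough to see the failure, and it assumes the partition holds nothing but the data from this load.
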
  • W S Chung at Jun 15, 2011 at 7:11 pm
    If the failure of the load is severe enough, like the whole machine
    crashing, there might not be an opportunity to catch the exception and
    clean up the partition right away. The best I can think of is to clean up
    the partition in a background job reasonably regularly. In that case, before
    the cleanup, is there any way I can prevent queries from seeing the data in
    the partition that should not be there?

    Or will this really happen? If the metadata is only updated after a
    successful load, the partition may not exist unless the load runs to its
    end.
  • Guy Bayes at Jun 15, 2011 at 7:29 pm
    I think if you load a file, validate it, and then ALTER TABLE ADD PARTITION
    to the final table at the end, in the event of a crash you only have a
    partially loaded ETL file that no one will be querying anyway.

    That should work, though I am not speaking from personal experience, at
    least not with Hive.
    Guy
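
A sketch of the "load, validate, then add partition" ordering, assuming the data is first written to a staging directory that no table points at, and that the final step is an `ALTER TABLE ... ADD PARTITION ... LOCATION` statement. The function names and the `copy_to_staging`, `validate` and `run` callables are hypothetical placeholders for whatever does the copying, checking and statement submission in your setup.

```python
def add_partition_statement(table, partition, location):
    """Build the statement that makes the staged data visible to queries."""
    spec = ", ".join(f"{k}='{v}'" for k, v in sorted(partition.items()))
    return f"ALTER TABLE {table} ADD PARTITION ({spec}) LOCATION '{location}'"


def stage_validate_publish(copy_to_staging, validate, run,
                           table, partition, staging_dir):
    """Publish a partition only after its data is fully staged and validated.

    Returns True if the partition was added, False if validation failed.
    A crash before the last step leaves only unreferenced staging files.
    """
    copy_to_staging(staging_dir)      # step 1: not yet visible to queries
    if not validate(staging_dir):     # step 2: e.g. row counts or checksums
        return False
    run(add_partition_statement(table, partition, staging_dir))  # step 3
    return True
```

The point of the ordering is that the metadata change comes last, so a crash at any earlier step leaves nothing for queries to see.
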
  • W S Chung at Jun 15, 2011 at 7:47 pm
    If that is the case, I'll just need to clean up the partially loaded HDFS
    file in a background job. That should do.
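
The selection step of such a background cleanup might be sketched as follows. It assumes the job can obtain a listing of staging paths with modification times and the set of locations already registered as partitions; how those are fetched from HDFS and the metastore is left out. `stale_staging_paths` and its parameters are hypothetical names, and the age threshold is there so that a load still in progress is not deleted.

```python
def stale_staging_paths(listing, registered, now, max_age_seconds):
    """Pick staging paths that look like leftovers from a crashed load.

    listing:    iterable of (path, mtime_in_seconds) pairs
    registered: set of paths already attached to a table as partitions
    now:        current time in seconds, same clock as the mtimes
    """
    return sorted(
        path
        for path, mtime in listing
        if path not in registered and now - mtime > max_age_seconds
    )
```

For example, with a ten-minute threshold, a path that was last modified fifteen minutes ago and is not registered would be selected, while a registered path or one modified a minute ago would be left alone.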

Discussion Overview
group: user @
categories: hive, hadoop
posted: Jun 13, '11 at 10:47p
active: Jun 15, '11 at 7:47p
posts: 7
users: 3
website: hive.apache.org