Grokbase Groups: Hive user, May 2010
job level output committer in storage handler
Hi,

I am implementing my own SerDe and storage handler. Is there any
method in one of these interfaces (or any other) that gives me a
handle to do some operation after all the records have been written by
all the reducers, something very similar to a job-level output committer?
I want to update some state in an external system once I know the job
has completed successfully. Ideally, I would do this kind of thing in a
job-level output committer, but since Hive is on the old MR API, I don't
have access to that. There is Hive's RecordWriter#close(); I tried
that, but it looks like it is a task-level handle, so every reducer will
try to update the state of my external system, which is not what I want.
Any pointers on how to achieve this would be much appreciated. If it's
unclear what I am asking for, let me know and I will provide more
details.

Thanks,
Ashutosh

  • Kortni Smith at May 25, 2010 at 7:26 pm
    Hi Ashutosh ,

    I'm not sure how to accomplish that on the Hive side of things, but in case
    it helps: it sounds like you want to know when your job is done so you can
    update something externally, and my company will also be implementing this
    in the near future. Our plan is to have the process that kicks off our Hive
    jobs in the cloud monitor each job's status periodically using Amazon's EMR
    Java library, and when their state changes to complete, update our external
    systems accordingly.
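
    For reference, a rough sketch of the polling loop described above, assuming
    the 2010-era EMR Java SDK (the DescribeJobFlows API); the credentials are
    placeholders and updateExternalSystem is a hypothetical callback:

        import com.amazonaws.auth.BasicAWSCredentials;
        import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
        import com.amazonaws.services.elasticmapreduce.model.DescribeJobFlowsRequest;
        import com.amazonaws.services.elasticmapreduce.model.JobFlowDetail;

        public class JobFlowMonitor {
            public static void main(String[] args) throws InterruptedException {
                AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
                String jobFlowId = args[0];
                while (true) {
                    // Fetch the current state of the job flow.
                    JobFlowDetail detail = emr.describeJobFlows(
                        new DescribeJobFlowsRequest().withJobFlowIds(jobFlowId))
                        .getJobFlows().get(0);
                    String state = detail.getExecutionStatusDetail().getState();
                    if ("COMPLETED".equals(state)) {
                        updateExternalSystem(jobFlowId);  // hypothetical callback
                        break;
                    }
                    if ("FAILED".equals(state) || "TERMINATED".equals(state)) {
                        break;  // handle failure/termination as appropriate
                    }
                    Thread.sleep(60000);  // poll once a minute
                }
            }

            private static void updateExternalSystem(String jobFlowId) {
                // record successful completion in the external system
            }
        }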


  • Ashutosh Chauhan at May 26, 2010 at 4:17 pm
    Hi Kortni,

    Thanks for your suggestion, but we can't use it in our setup. We are
    not spinning off Hive jobs in a separate process that we can monitor;
    rather, I want to get a handle on when the job finishes from within my
    storage handler / SerDe.

    Ashutosh
  • Ashish Thusoo at May 26, 2010 at 5:16 pm
    Hive supports PostExecute hooks that can support this. Look at

    org.apache.hadoop.hive.ql.hooks.PostExecute

    After you implement this hook, you can register it through the

    hive.exec.post.hooks

    variable, which is a comma-separated list of hook implementations.

    The caveat is that this hook gets called at the end of the query (which may
    comprise a number of Hadoop jobs). If you need a per-job hook, there is an
    effort underway to add PreTask and PostTask hooks:

    https://issues.apache.org/jira/browse/HIVE-1347

    Would those help for your use case, or is the post-execute hook enough? We
    use the post-execute hook to drive replication and to collect a lot of
    usage stats.
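
    For illustration, a minimal hook might look like the sketch below. The
    run() signature shown matches the PostExecute interface of this era (check
    your Hive version, as the interface has changed over time), and the
    package name and notifyExternalSystem helper are hypothetical:

        package com.example.hooks;  // hypothetical

        import java.util.Set;

        import org.apache.hadoop.hive.ql.hooks.PostExecute;
        import org.apache.hadoop.hive.ql.hooks.ReadEntity;
        import org.apache.hadoop.hive.ql.hooks.WriteEntity;
        import org.apache.hadoop.hive.ql.session.SessionState;
        import org.apache.hadoop.security.UserGroupInformation;

        public class ExternalSystemHook implements PostExecute {
            // Invoked once per query, after all of its Hadoop jobs have finished.
            public void run(SessionState sess, Set<ReadEntity> inputs,
                            Set<WriteEntity> outputs, UserGroupInformation ugi)
                    throws Exception {
                notifyExternalSystem(sess.getCmd());  // hypothetical helper
            }

            private void notifyExternalSystem(String query) {
                // update the external system here
            }
        }

    It would then be registered by adding com.example.hooks.ExternalSystemHook
    to the hive.exec.post.hooks list, either in hive-site.xml or with a set
    command in the session.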

    Ashish

  • Ning Zhang at May 26, 2010 at 5:22 pm
    Hi Ashutosh,

    Hive doesn't use OutputCommitter explicitly because it handles commit and
    abort by itself.

    If you are looking for a task-level committer, where you want to do
    something after a task has finished successfully, take a look at
    FileSinkOperator.closeOp(). It renames the temp file to the final file
    name, which implements the commit semantics.

    If you are looking for a job-level committer, where you want to do
    something after the job (including all tasks) has finished successfully,
    take a look at the MoveTask implementation. A MoveTask is generated as a
    follow-up task after an MR job for each INSERT OVERWRITE statement. It
    moves the directory that contains the results from all finished tasks to
    its destination path (e.g., a directory specified in the insert statement
    or inferred from the table's storage location property). The MoveTask
    implements the commit semantics of the whole job.
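
    Both commit levels described above come down to a rename idiom. A rough
    sketch of the pattern (not Hive's actual code; paths and error handling
    are illustrative only):

        import java.io.IOException;

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class CommitByRename {
            // Task-level commit (cf. FileSinkOperator.closeOp): each task
            // writes to a temp file and renames it to its final name on success.
            public static void commitTask(FileSystem fs, Path tmpFile, Path finalFile)
                    throws IOException {
                if (!fs.rename(tmpFile, finalFile)) {
                    throw new IOException("task commit failed for " + tmpFile);
                }
            }

            // Job-level commit (cf. MoveTask): once all tasks have finished,
            // publish the whole results directory in a single move.
            public static void commitJob(FileSystem fs, Path stagingDir, Path destDir)
                    throws IOException {
                fs.delete(destDir, true);  // INSERT OVERWRITE replaces the old data
                if (!fs.rename(stagingDir, destDir)) {
                    throw new IOException("job commit failed for " + stagingDir);
                }
            }
        }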

    Ning
  • John Sichi at May 26, 2010 at 5:36 pm
    I think we'll need to extend the StorageHandler interface so that it can
    participate in the commit semantics (separate from the handler-independent
    hooks Ashish mentioned). That was the intent of this follow-up JIRA issue,
    which I logged as part of the HBase integration work:

    https://issues.apache.org/jira/browse/HIVE-1225

    To add this, we need to determine what information needs to be passed
    along to the storage handler now (and how to make it easy to pass along
    more information as needed without having to change the interface in the
    future).

    JVS

  • Ashutosh Chauhan at May 27, 2010 at 1:36 am
    Thanks, everyone, for the replies. I think HIVE-1225 is really what I want.
    At this point I can implement PostExecute, as I need to call the hook
    only at the end of the query and not at the end of each job or task of
    the query. If I register it through hive-site.xml, then I guess it will
    get executed for every query, which is where the complication starts. I
    want to execute this hook only for insert queries and not for all
    queries. One workaround is to get the command string from the session and
    then parse it to find out whether it actually is an insert query, and
    only if it is, execute the remainder of the code (see the sketch below).
    But that looks hacky, so I look forward to HIVE-1225.
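
    The command-string workaround would be something like this sketch (crude
    prefix matching; SessionState.getCmd() is the accessor referred to above):

        import org.apache.hadoop.hive.ql.session.SessionState;

        public class CmdStringCheck {
            // Hacky insert detection: inspect the query text from the session.
            public static boolean looksLikeInsert(SessionState sess) {
                String cmd = sess.getCmd();
                return cmd != null && cmd.trim().toLowerCase().startsWith("insert");
            }
        }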

    Thanks,
    Ashutosh
  • Ashish Thusoo at May 27, 2010 at 1:46 am
    Actually, if you want to do that, then I believe you can check in the
    post-execute hook that you have a valid write entity of type table or
    partition. You should have that only in the case of an insert or a CTAS.
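
    In code, that check might look like the following sketch (the
    WriteEntity.Type values are taken from the Hive source of this era;
    verify them against your version):

        import java.util.Set;

        import org.apache.hadoop.hive.ql.hooks.WriteEntity;

        public class HookFilters {
            // True if any output is a table or partition write, which should
            // hold only for INSERT and CTAS queries.
            public static boolean writesTableOrPartition(Set<WriteEntity> outputs) {
                for (WriteEntity we : outputs) {
                    WriteEntity.Type t = we.getType();
                    if (t == WriteEntity.Type.TABLE
                            || t == WriteEntity.Type.PARTITION) {
                        return true;
                    }
                }
                return false;
            }
        }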

    Ashish

  • Ashutosh Chauhan at May 27, 2010 at 1:51 am
    Oh, cool. I will try that out.

    Thanks,
    Ashutosh
  • Ashutosh Chauhan at May 28, 2010 at 6:34 am
    Hi Ashish,

    Thanks for your suggestion. I am getting closer. A few questions came up
    while I was trying it: I presume this hook is executed by the Hive client.
    What will happen if the Hive client goes away after the job is submitted
    to the cluster? I guess the hook won't be executed. If so, this will hurt
    my use case, as I can't afford not to update my external system on
    successful completion of the job. Note that this may not be a problem if
    a similar hook is exposed via the StorageHandler interface (via a
    job-level output committer), as the hook would then become part of the
    job itself and be invoked by the MR framework in the cleanup task at the
    end.

    Also, the write entity is set to type table for a CREATE TABLE statement,
    in addition to INSERT OVERWRITE and CTAS queries. I want to invoke the
    hook only in the case of INSERT OVERWRITE and CTAS, and not for a CREATE
    TABLE statement. Is there a way to know this in the hook?

    Lastly, if my external system throws an exception or I can't update it, I
    want to tell Hive to consider the query a failure. How can I tell Hive
    this from the hook?

    Thanks,
    Ashutosh

  • Ashish Thusoo at May 28, 2010 at 7:11 pm
    Hi Ashutosh,

    Yes, that hook will not actually fire if the client goes away in any way.
    One way around that (apart from the StorageHandler approach) is to have a
    PreExecute hook generate a start-of-query entry as well; the queries that
    have a start entry but no end-of-query entry are the ones that failed.

    To distinguish between CTAS and plain CREATE TABLE: one simple method is
    that the latter would not have any ReadEntity associated with it.

    For the exception: if you throw any exception in the hook, the driver
    should be able to catch it and report back a failure.

    I hope this helps.
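
    Putting these three answers together, the run() method of the earlier
    ExternalSystemHook sketch would become something like this (same
    assumptions as before; notifyExternalSystem and writesTableOrPartition
    are the hypothetical helpers from the previous sketches):

        public void run(SessionState sess, Set<ReadEntity> inputs,
                        Set<WriteEntity> outputs, UserGroupInformation ugi)
                throws Exception {
            // A plain CREATE TABLE has no ReadEntity; INSERT OVERWRITE and
            // CTAS do, so an empty input set means there is nothing to commit.
            if (inputs == null || inputs.isEmpty()
                    || !HookFilters.writesTableOrPartition(outputs)) {
                return;
            }
            // Any exception thrown here propagates to the driver, which
            // catches it and reports the query as failed.
            notifyExternalSystem(sess.getCmd());
        }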

    Ashish

