How to simplify our development flow under the means of using Hive?

Hi list,

I'm going to take Hive into production to analyze our web logs, which
amount to hundreds of gigabytes per day. Previously we did this job with
Apache Hadoop, running our own raw MapReduce code. It worked, but it also
hurt our productivity: we kept writing code with similar logic again and
again, and it got even worse whenever the format of our logs changed. For
example, when we wanted to insert one more field into each line of the log,
the previous work became useless and we had to redo it. Hence we are
thinking about using Hive as a persistence layer, so that we can store and
retrieve the schemas of the data easily. But we found that Hive sometimes
cannot handle certain kinds of complex analysis, because of the limited
expressiveness of SQL. We have to write our own UDFs, and even then some
problems remain beyond Hive's reach, so we also need to write raw MapReduce
code. That raises another problem: one side is a set of SQL scripts, the
other is Java or hybrid code. How should we coordinate Hive and raw
MapReduce code, and how should we schedule them? How does Facebook use
Hive? And what is your solution when you come across similar problems?

In the end, we are considering using Hive as our data warehouse.
Any suggestions?

Thanks in advance!
Min

--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

http://coderplay.javaeye.com
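
To make the schema-evolution point above concrete: once the log schema lives
in the Hive metastore, adding a field to the logs becomes a one-statement
change to the table definition rather than a rewrite of every MapReduce job.
The sketch below is illustrative only; the web_logs table, its columns, and
the new referrer field are hypothetical, not taken from this thread.

  -- Hypothetical table capturing the raw web logs; names and types are
  -- illustrative only.
  CREATE TABLE web_logs (
    ip      STRING,
    request STRING,
    ts      BIGINT
  )
  PARTITIONED BY (dt STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

  -- When the log format grows a new field, the change is declared once in
  -- the metastore instead of being re-implemented in every job.
  ALTER TABLE web_logs ADD COLUMNS (referrer STRING);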


  • Prasad Chakka at Feb 23, 2009 at 3:48 am
    You can use custom mapper and reducer scripts via the TRANSFORM/MAP/REDUCE facilities. Check the wiki for how to use them. Or do you want something different?
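
    For example, rows can be piped through a user-supplied script instead of
    expressing the logic purely in SQL. A minimal sketch of the TRANSFORM
    facility, assuming a hypothetical web_logs table and a hypothetical
    parser script my_parser.py:

      ADD FILE my_parser.py;

      -- Stream three columns through the script; its tab-separated output
      -- comes back as new (string-typed) columns.
      SELECT TRANSFORM (ip, request, ts)
             USING 'python my_parser.py'
             AS host, page, hour
      FROM web_logs;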




  • Min Zhou at Feb 23, 2009 at 4:00 am
    Hi Prasad,

    That is just streaming, a technique for complementing what Hive SQL can
    express. Sometimes that trick is still not enough. For example, if I want
    to do jobs like a secondary sort, will that approach work?
    My main question is how to schedule those two things, Hive and raw
    MapReduce.

    Regards,
    Min
  • Joydeep Sen Sarma at Feb 23, 2009 at 4:24 am
    Hi Min,

    One possibility is to have your data sets stored in Hive, but for your
    map-reduce programs to use the Hive Java APIs (to find input files for a
    table, to extract rows from a table, etc.). That way at least the
    metadata about all data is standardized in Hive. If you want to go down
    this route, we can write up an example use of these APIs (which
    admittedly are not well documented).

    The other option is for Hive to allow application code to take over after
    a small SQL fragment (selects and a where clause, perhaps with some
    UDFs). As a crude example:

    <some-harness> -hiveinput "select a.x, a.y+a.z from a where a.country='HK'" -mapper <urmapper> -reducer <urreducer> and so on.

    Before we released Hive as open source, we had something like this
    available inside Facebook for the prototype version of Hive, and I think
    it would be fairly easy to resurrect it. For some reason we haven't had
    much reason to use it internally. Do you think this would be useful, and
    is it what you are looking for?

    BTW, the example you mention (secondary sort) is supported by Hive
    (distribute by ... sort by ...), where the partitioning and sorting keys
    are different. There may be some inefficiencies in the Hive
    implementation compared to hand-written code (we might have to duplicate
    data in keys and values for this). Also, we haven't allowed aggregate
    functions on this stream yet (this is something we want to do in the
    future).

    Joydeep
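
    A minimal sketch of the secondary-sort pattern described above, combining
    DISTRIBUTE BY/SORT BY with a custom reducer script; the web_logs table
    and the sessionize.py script are hypothetical, invented for illustration:

      ADD FILE sessionize.py;

      -- Rows are partitioned across reducers by uid, and each reducer sees
      -- its rows ordered by (uid, ts); the script consumes that ordered
      -- stream, e.g. to build per-user sessions.
      FROM (
        SELECT uid, ts, url
        FROM web_logs
        DISTRIBUTE BY uid
        SORT BY uid, ts
      ) t
      REDUCE t.uid, t.ts, t.url
        USING 'python sessionize.py'
        AS uid, session_id, url;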


  • Min Zhou at Feb 24, 2009 at 1:43 am
    Hi Joydeep,

    Thanks for your reply, which helps me a lot. Suppose I choose your first
    suggestion: how can I schedule my MapReduce code and my Hive SQL code?
    Because our jobs are batch processing, we often come across a scenario
    like this: first we run a raw MapReduce job, then a second Hive job takes
    its output as input, and finally a third MapReduce job handles the
    remaining work. What is your solution then?

    BTW, does Hive only run as a Thrift service at Facebook?

  • Joydeep Sen Sarma at Feb 24, 2009 at 1:56 am
    The scenario below is pretty simple, actually.

    You can make the first job write to an HDFS directory, then create a
    temporary table over that directory (using CREATE EXTERNAL TABLE). Run
    the Hive SQL over it and write the results to another HDFS directory
    (using INSERT OVERWRITE DIRECTORY), and then run your last map-reduce job
    on that directory.

    So this doesn't require the use of any Java APIs.

    That said, we can write up a simple demo program and build environment
    for writing map-reduce on Hive tables. The Javadocs may need a lot of
    work though.
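
    A minimal sketch of the Hive step in that pipeline; every path, table
    name, and column here is hypothetical, chosen only to illustrate the flow
    described above:

      -- The first MapReduce job is assumed to have written tab-separated
      -- output under /user/min/step1_out (a made-up path).
      CREATE EXTERNAL TABLE step1_out (uid STRING, url STRING, hits BIGINT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/user/min/step1_out';

      -- Run the Hive part of the pipeline and leave the result where the
      -- final MapReduce job expects to find its input.
      INSERT OVERWRITE DIRECTORY '/user/min/step2_out'
      SELECT uid, SUM(hits)
      FROM step1_out
      GROUP BY uid;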

  • Min Zhou at Feb 24, 2009 at 2:15 am
    Hi Joydeep,

    What drives your batch-processing jobs: the arrival of data, a crontab
    entry, or a shell script?

    Thanks,
    Min

  • Joydeep Sen Sarma at Feb 24, 2009 at 6:47 am
    OK, as someone else said, this is a pretty different topic from Hive.

    It's a mix of cron and a workflow engine developed in house (which is not
    open source right now).

  • Edward Capriolo at Feb 24, 2009 at 12:55 pm
    I would take a look at how Hive could possibly integrate with ZooKeeper
    and/or Cascading. I know Cascading just added the capability of working
    with HBase to its model.
  • Min Zhou at Feb 24, 2009 at 3:19 pm
    Hi Joydeep,

    How can I use Hive's Java API to operate on data sets stored in Hive?
    Could you give me a simple example?


    Thanks
    Min
  • Edward Capriolo at Feb 24, 2009 at 4:06 pm
    One way to interact with Hive is by starting Hive's Thrift server. If you
    want to use the raw Java API, I stole some code from the command line
    interface and made a simple driver program (attached).

    You could also take a look at the source hwi folder. That is how we have
    multiple Hive clients started inside a web application.

    I would suggest the Thrift service, as that should present your client
    program with a stable API. If you just make a work-alike program like
    TestHive, upstream changes may be an issue down the line.
  • Min Zhou at Feb 25, 2009 at 1:12 am
    My bad. I meant using Hive's Java API to do raw MapReduce work, not to
    drive a SQL statement.
    Sorry!
  • Zheng Shao at Feb 25, 2009 at 1:38 am
    Please take a look at:
    http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform

    You will have to create a string command and pass it to Hive. There is no
    way of doing that directly (without creating a string) using the Java
    API.

    Zheng
  • Joydeep Sen Sarma at Feb 25, 2009 at 2:15 am
    We can write a small example program to get the files for a
    table/partition, and to open a table using its deserializer and get rows
    from it, etc.

    This would help people write Java map-reduce on Hive tables.
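
    Until such an example program exists, one stopgap is to look up a table's
    storage location from the Hive CLI and point a hand-written MapReduce job
    at it. A sketch, with the web_logs table and the partition spec invented
    for illustration:

      -- The "location:" field in the detailed table information is the HDFS
      -- directory backing the table.
      DESCRIBE EXTENDED web_logs;

      -- A partitioned table has a separate location per partition.
      DESCRIBE EXTENDED web_logs PARTITION (dt='2009-02-22');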

  • Min Zhou at Feb 25, 2009 at 2:23 am
    Thank you!
  • Qing Yan at Feb 23, 2009 at 5:34 am
    Theoretically, the streaming facility is as powerful as raw Java M/R
    programs: less efficient, but easier to use (arguably). It does support
    secondary sort (DISTRIBUTE BY ... SORT BY ...). I do agree with you that
    not everything can be easily expressed in SQL, but that's why the
    TRANSFORM/streaming facility is included in Hive.

    Regarding the scheduling part, my guess is that it is beyond the scope of
    Hive; can you just use a shell script?

    My 2 cents,

    Qing
