Grokbase Groups Hive user May 2011
FAQ
Hi,

My end goal is to load the daily Apache logs. I wish to partition it by date
and the group has helped me in giving some advices, but it seems that I am
still stuck.

My "daily Apache logs" can contain dates for 2 days ago, yesterday, and
today. So, what I did was I created a staging_weblog table and used hive
(using SerDe) to load the logs without any partition. Then I ran a select
distinct to get all the unique dates. And I converted this date into
YYYY-MM-DD format.

After this I created a multi-insert statement like this:
from staging_weblog
insert overwrite table real_weblog
partition ( logdate = '<yyyy-mm-dd>')
select * where regexp_extract( logtime, '([^:]*):.*', 1) = 'dd/MMM/yyyy'

The above works fine if there is no existing partition. But if there is an
existing partition, then it 'overwrites' and replace the old partition with
the new one. The 'overwrite' keyword is mandatory based on the
documentation. But I wish to just append the data to the existing partition.

Has anyone solved this problem before?

Or let me ask a general question, how do you load your daily Apache logs
into Hadoop so that you can use Hive to process the data?

Search Discussions

  • Hadoopman at May 10, 2011 at 10:29 pm
    You're pretty much going to overwrite the partition every time you want
    to add data to it. I wish there was an append but there isn't. We're
    basically doing it the same way you are (looking at your insert statement).

    The challenge I'm running into (doing the same thing you are) is when
    I'm bulk loading historical logs into hadoop/hive I see heap memory
    errors appear. However when I load smaller batches I rarely see those
    same out of memory errors go away. I'm curious if you (or anyone) has a
    better way of loading historicals.


    On 05/10/2011 11:56 AM, bichonfrise74 wrote:
    Hi,

    My end goal is to load the daily Apache logs. I wish to partition it
    by date and the group has helped me in giving some advices, but it
    seems that I am still stuck.

    My "daily Apache logs" can contain dates for 2 days ago, yesterday,
    and today. So, what I did was I created a staging_weblog table and
    used hive (using SerDe) to load the logs without any partition. Then I
    ran a select distinct to get all the unique dates. And I converted
    this date into YYYY-MM-DD format.

    After this I created a multi-insert statement like this:
    from staging_weblog
    insert overwrite table real_weblog
    partition ( logdate = '<yyyy-mm-dd>')
    select * where regexp_extract( logtime, '([^:]*):.*', 1) = 'dd/MMM/yyyy'

    The above works fine if there is no existing partition. But if there
    is an existing partition, then it 'overwrites' and replace the old
    partition with the new one. The 'overwrite' keyword is mandatory based
    on the documentation. But I wish to just append the data to the
    existing partition.

    Has anyone solved this problem before?

    Or let me ask a general question, how do you load your daily Apache
    logs into Hadoop so that you can use Hive to process the data?
  • Ashish Thusoo at May 11, 2011 at 7:57 pm
    you could always have another sub partition under the daily partition. This sub partition could be the timestamp on when you did the load. So when you run the statement you would create a new sub partition within the date partition and in effect you end up doing an append to the Hive partition. The downside is of course is increase in the number of files, but you could always compact the sub partitions away after a week or so.

    Ashish
    On May 10, 2011, at 3:28 PM, hadoopman wrote:

    You're pretty much going to overwrite the partition every time you want to add data to it. I wish there was an append but there isn't. We're basically doing it the same way you are (looking at your insert statement).

    The challenge I'm running into (doing the same thing you are) is when I'm bulk loading historical logs into hadoop/hive I see heap memory errors appear. However when I load smaller batches I rarely see those same out of memory errors go away. I'm curious if you (or anyone) has a better way of loading historicals.


    On 05/10/2011 11:56 AM, bichonfrise74 wrote:
    Hi,

    My end goal is to load the daily Apache logs. I wish to partition it by date and the group has helped me in giving some advices, but it seems that I am still stuck.

    My "daily Apache logs" can contain dates for 2 days ago, yesterday, and today. So, what I did was I created a staging_weblog table and used hive (using SerDe) to load the logs without any partition. Then I ran a select distinct to get all the unique dates. And I converted this date into YYYY-MM-DD format.

    After this I created a multi-insert statement like this:
    from staging_weblog
    insert overwrite table real_weblog
    partition ( logdate = '<yyyy-mm-dd>')
    select * where regexp_extract( logtime, '([^:]*):.*', 1) = 'dd/MMM/yyyy'

    The above works fine if there is no existing partition. But if there is an existing partition, then it 'overwrites' and replace the old partition with the new one. The 'overwrite' keyword is mandatory based on the documentation. But I wish to just append the data to the existing partition.

    Has anyone solved this problem before?

    Or let me ask a general question, how do you load your daily Apache logs into Hadoop so that you can use Hive to process the data?

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedMay 10, '11 at 5:56p
activeMay 11, '11 at 7:57p
posts3
users3
websitehive.apache.org

People

Translate

site design / logo © 2022 Grokbase