FAQ
Hello,

I'd like to store my log file data that's imported into Hive in compressed format. I was following some steps outlined by Zheng on how to do that, where he says:

CREATE TABLE texttable (...) STORED AS TEXTFILE;
LOAD DATA ... OVERWRITE INTO texttable;
CREATE TABLE seqtable (...) STORED AS SEQUENCEFILE;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type=BLOCK;
INSERT OVERWRITE TABLE seqtable SELECT * FROM texttable;

but I get stuck on the last step.

I can't write to my new SYSLOG_SEQUENCE table because the tables are partitioned:

hive> INSERT OVERWRITE TABLE SYSLOG_SEQUENCE SELECT * FROM SYSLOG;
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.

What syntax can I use to get the data in the new table? It has DS and TYPE as partition columns:

hive> describe syslog;
OK
month string from deserializer
day string from deserializer
time string from deserializer
host string from deserializer
logline string from deserializer
ds string
type string

I took a stab at it like below, but that only gives me two partitions in total, and of course what I want is the same partitions as exist in the original SYSLOG table.

Any pointers welcome-
Thanks
Ken

hive> INSERT OVERWRITE TABLE SYSLOG_SEQUENCE PARTITION(ds='*',type='*') select month, day, time, host, logline from syslog;
OK
Time taken: 94.885 seconds

Search Discussions

  • Ken Barclay at Dec 17, 2009 at 2:37 am
    Correct me if I'm wrong, but it looks like I'll need an INSERT statement for each (DS, TYPE) partition:

    hive> INSERT OVERWRITE TABLE SYSLOG_SEQUENCE PARTITION(ds='2009-12-07',type='user') select month, day, time, host, logline from syslog where syslog.ds='2009-12-07' and syslog.type='user';
    Loading data to table syslog_sequence partition {ds=2009-12-07, type=user}
    OK

    Since I'm programmatically loading the data from files into the syslog table in the first place, I just need to add another statement to that program that adds the data to syslog_sequence every time it adds a new partition to syslog.

    From: Ken.Barclay@wellsfargo.com
    Sent: Wednesday, December 16, 2009 6:16 PM
    To: hive-user@hadoop.apache.org
    Subject: How do I INSERT OVERWRITE into a new table if it's partitioned?

    Hello,

    I'd like to store my log file data that's imported into Hive in compressed format. I was following some steps outlined by Zheng on how to do that, where he says:

    CREATE TABLE texttable (...) STORED AS TEXTFILE;
    LOAD DATA ... OVERWRITE INTO texttable;
    CREATE TABLE seqtable (...) STORED AS SEQUENCEFILE;
    set hive.exec.compress.output=true;
    set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    set mapred.output.compression.type=BLOCK;
    INSERT OVERWRITE TABLE seqtable SELECT * FROM texttable;

    but I get stuck on the last step.

    I can't write to my new SYSLOG_SEQUENCE table because the tables are partitioned:

    hive> INSERT OVERWRITE TABLE SYSLOG_SEQUENCE SELECT * FROM SYSLOG;
    FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.

    What syntax can I use to get the data in the new table? It has DS and TYPE as partition columns:

    hive> describe syslog;
    OK
    month string from deserializer
    day string from deserializer
    time string from deserializer
    host string from deserializer
    logline string from deserializer
    ds string
    type string

    I took a stab at it like below, but that only gives me two partitions in total, and of course what I want is the same partitions as exist in the original SYSLOG table.

    Any pointers welcome-
    Thanks
    Ken

    hive> INSERT OVERWRITE TABLE SYSLOG_SEQUENCE PARTITION(ds='*',type='*') select month, day, time, host, logline from syslog;
    OK
    Time taken: 94.885 seconds
  • Zheng Shao at Dec 21, 2009 at 8:47 am
    You are correct.

    Just opened https://issues.apache.org/jira/browse/HIVE-1002
    This is a highly wanted feature from a lot of users.

    Please comment on the JIRA. Let's figure out how we want to do it.

    Zheng
    On Wed, Dec 16, 2009 at 6:36 PM, wrote:
    Correct me if I’m wrong, but it looks like I’ll need an INSERT statement for
    each (DS, TYPE) partition:



    hive> INSERT OVERWRITE TABLE SYSLOG_SEQUENCE
    PARTITION(ds='2009-12-07',type='user') select month, day, time, host,
    logline from syslog where syslog.ds='2009-12-07' and syslog.type='user';

    Loading data to table syslog_sequence partition {ds=2009-12-07, type=user}

    OK



    Since I’m programmatically loading the data from files into the syslog table
    in the first place, I just need to add another statement to that program
    that adds the data to syslog_sequence every time it  adds a new partition to
    syslog.



    From: Ken.Barclay@wellsfargo.com
    Sent: Wednesday, December 16, 2009 6:16 PM
    To: hive-user@hadoop.apache.org
    Subject: How do I INSERT OVERWRITE into a new table if it's partitioned?



    Hello,



    I’d like to store my log file data that’s imported into Hive in compressed
    format. I was following some steps outlined by Zheng on how to do that,
    where he says:



    CREATE TABLE texttable (...) STORED AS TEXTFILE;

    LOAD DATA ... OVERWRITE INTO texttable;

    CREATE TABLE seqtable (...) STORED AS SEQUENCEFILE;

    set hive.exec.compress.output=true;

    set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

    set mapred.output.compression.type=BLOCK;

    INSERT OVERWRITE TABLE seqtable SELECT * FROM texttable;



    but I get stuck on the last step.



    I can’t write to my new SYSLOG_SEQUENCE  table because the tables are
    partitioned:



    hive> INSERT OVERWRITE TABLE SYSLOG_SEQUENCE  SELECT * FROM SYSLOG;

    FAILED: Error in semantic analysis: need to specify partition columns
    because the destination table is partitioned.



    What syntax can I use to get the data in the new table? It has DS and TYPE
    as partition columns:



    hive> describe syslog;

    OK

    month   string  from deserializer

    day     string  from deserializer

    time    string  from deserializer

    host    string  from deserializer

    logline string  from deserializer

    ds      string

    type    string



    I took a stab at it like below, but that only gives me two partitions in
    total, and of course what I want is the same partitions as exist in the
    original SYSLOG table.



    Any pointers welcome-

    Thanks

    Ken



    hive> INSERT OVERWRITE TABLE SYSLOG_SEQUENCE PARTITION(ds='*',type='*')
    select month, day, time, host, logline from syslog;

    OK

    Time taken: 94.885 seconds


    --
    Yours,
    Zheng

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedDec 17, '09 at 2:16a
activeDec 21, '09 at 8:47a
posts3
users2
websitehive.apache.org

2 users in discussion

Ken Barclay: 2 posts Zheng Shao: 1 post

People

Translate

site design / logo © 2021 Grokbase