getting all null values
For one of my Hive tables, I switched from TextFile to SequenceFile format.
This is how I created the new table:

CREATE EXTERNAL TABLE IMPRESSIONS ( A STRING, B STRING)
PARTITIONED BY(DATA_DATE STRING COMMENT 'yyyyMMdd (e.g. 20090801) on which
log records are collected')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION '/user/hadoop/warehouse/facts/impressions/';

This external table is sourced by our custom ETL job, which writes data using
MultipleSequenceFileOutputFormat.
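
As far as I understand, Hive's SequenceFile reader deserializes only the value
of each record and ignores the key, so the whole tab-delimited row has to end
up in the value. A rough sketch of what a conforming reduce side might look
like (old org.apache.hadoop.mapred API; the Text types, the class name, and the
a/b column split are assumptions, not our actual job):

// Rough sketch of the write side (old org.apache.hadoop.mapred API). Hive's
// SequenceFile reader parses only the value, so both columns are written,
// tab-delimited, into the value and the key is left empty.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ImpressionRowReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private static final Text EMPTY_KEY = new Text(); // never read by Hive
  private final Text row = new Text();

  public void reduce(Text a, Iterator<Text> bValues,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    while (bValues.hasNext()) {
      // "<A>\t<B>" matches FIELDS TERMINATED BY '\t' in the table definition.
      row.set(a.toString() + "\t" + bValues.next().toString());
      out.collect(EMPTY_KEY, row);
    }
  }
}

If a job instead splits the columns between the key and the value, Hive will
only ever see the value and will return NULL for the missing fields.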

When I issue a simple query like: SELECT * FROM IMPRESSIONS;
this is what I get for all the records:
NULL NULL 20090715
NULL NULL 20090715
NULL NULL 20090715
....

But if I do: hadoop dfs -text
/user/hadoop/warehouse/facts/impressions/data_date=20090715/* | less
I get the expected output.

Previously I was using MultipleTextFileOutputFormat to feed the TextFile
version of this table, and it worked well.

Any hints?

Thanks,
Abhi


  • Avram Aelony at Sep 18, 2009 at 10:17 pm
    Check your field delimiter?
    -A
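
    One way to check is to dump a few raw records and see whether both columns
    and the '\t' delimiter actually sit inside the value, which is the only part
    Hive reads. A minimal sketch, with generic Writable types and the path passed
    as an argument:

    // Diagnostic sketch: print raw key/value pairs from a SequenceFile so you
    // can see what Hive will actually get to parse.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class DumpSeqFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. one part file under data_date=20090715/
        FileSystem fs = path.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        int shown = 0;
        while (reader.next(key, value) && shown++ < 5) { // first few records are enough
          System.out.println("key=[" + key + "]  value=[" + value + "]");
        }
        reader.close();
      }
    }

    If the value holds only one column, or uses a different delimiter than the
    table declares, that would explain the NULLs.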


  • Bobby Rullo at Oct 5, 2009 at 6:45 pm
    I'm actually having a similar problem with using sequence files.

    I create a table like so:

    CREATE TABLE txns (
    tid STRING,
    txn STRING
    )
    PARTITIONED BY(dt DATETIME)
    STORED AS SEQUENCEFILE;

    Then I point it at one of our existing transaction log files, which are
    gzip-compressed sequence files where 'tid' is the key and 'txn' is the
    value:

    ALTER TABLE txns ADD PARTITION (dt="2009_04_01") LOCATION '/user/br/2009_04_01';

    Then I go to inspect the table like:

    SELECT txn from txns limit 3;

    and it's just NULLs. So then I try to see what is in tid:

    SELECT tid from txns limit 3;

    And it's the txn field!

    So it seems that Hive is ignoring the 'key' in sequence files... Is
    there any way to make that not happen?
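
    Since Hive only parses the value, one possible workaround is a one-pass,
    map-only job that rewrites the files so that "tid<TAB>txn" becomes the value
    and the key is left empty. A rough sketch with the old mapred API (assuming
    the keys and values are Text; the class name is made up):

    // Rough sketch: a map-only pass that folds the key into the value so a
    // two-column Hive table can see both fields. Assumes Text keys and values.
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FoldKeyIntoValue extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

      private static final Text EMPTY_KEY = new Text(); // Hive ignores the key anyway
      private final Text row = new Text();

      public void map(Text tid, Text txn, OutputCollector<Text, Text> out,
          Reporter reporter) throws IOException {
        // tid becomes column 1, txn column 2, tab-delimited inside the value.
        row.set(tid.toString() + "\t" + txn.toString());
        out.collect(EMPTY_KEY, row);
      }
    }

    Configure the job with SequenceFileInputFormat and SequenceFileOutputFormat,
    zero reduce tasks, and compressed output, then point the partition's
    LOCATION at the rewritten directory.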

    On a related note, I noticed that when I dropped the table (DROP TABLE
    txns;), the files were not deleted. Why is that? Is it because I added
    them via "ALTER TABLE/ADD PARTITION"? If so, can I rely on this
    behavior in the future? I like being able to drop a table and not
    worry about all my files disappearing!

    Thanks

    Bobby
