Grokbase Groups Hive user June 2011
FAQ
Hello,
We're using hive on amazon elastic mapreduce to process logs on s3, and I
had a couple basic questions. Apologies if they've been answered already-- I
gathered most info from the hive tutorial on amazon (
http://aws.amazon.com/articles/2855), as well as from skimming the hive wiki
pages, but I'm still very new to all of this. So, questions:

1) Is it possible to partition on directories that do not have the "key="
prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2
and so ideally we could partition on that structure instead of adding "dt="
to every directory name. I found an old thread discussing this (
http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded<http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)>)
but couldnt find the actual syntax.

2) How does hive handle tab-delimited files where rows sometimes have
different column counts? For instance, if we are parsing an event log that
contains multiple events, some of which have more columns associated with
them:

event_a user_id apple 300
event_b user_id cat

If i define my hive table to have 4 columns, how will hive react to the
event_b row?

Thanks!

Search Discussions

  • Christopher, Pat at Jun 28, 2011 at 6:45 pm
    allo,
    1 dunno. I generate my EMR scripts in a separate script so generating a stack of 'alter table...' queries is easy for me
    2 event_b will have a null value in column 4.
    2 b ( you didn't ask) what happens with this row:

    event_c user_id france 500 afifthcolumn

    afifthcolumn will be truncated and you'll have only event_c through 500 in the row

    Pat

    From: Kennon Lee
    Sent: Monday, June 27, 2011 5:50 PM
    To: user@hive.apache.org
    Subject: loading datafiles in s3

    Hello,
    We're using hive on amazon elastic mapreduce to process logs on s3, and I had a couple basic questions. Apologies if they've been answered already-- I gathered most info from the hive tutorial on amazon (http://aws.amazon.com/articles/2855), as well as from skimming the hive wiki pages, but I'm still very new to all of this. So, questions:

    1) Is it possible to partition on directories that do not have the "key=" prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2 and so ideally we could partition on that structure instead of adding "dt=" to every directory name. I found an old thread discussing this (http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded<http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)>) but couldnt find the actual syntax.

    2) How does hive handle tab-delimited files where rows sometimes have different column counts? For instance, if we are parsing an event log that contains multiple events, some of which have more columns associated with them:

    event_a user_id apple 300
    event_b user_id cat

    If i define my hive table to have 4 columns, how will hive react to the event_b row?

    Thanks!
  • Igor Tatarinov at Jun 28, 2011 at 7:19 pm
    I think the answer to 1 is No but you can confirm on the AWS EMR forum.

    The problem I've been having is that if you have x=foo in the prefix of your
    S3 path, EMR will try to use it as part of your partitioning key even if you
    don't want it.
    Say, x=foo/y=bar/data and you want to partition on y only, EMR Hive can get
    confused. Sometimes it works, other times it complains that x is not part of
    your INSERT .. PARTITION(y) clause. I haven't quite figured out when and
    why.

    On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat wrote:

    allo,****

    1 dunno. I generate my EMR scripts in a separate script so generating a
    stack of ‘alter table…’ queries is easy for me****

    2 event_b will have a null value in column 4.****

    2 b ( you didn’t ask) what happens with this row:****

    ** **

    event_c user_id france 500 afifthcolumn****

    ** **

    afifthcolumn will be truncated and you’ll have only event_c through 500 in
    the row****

    ** **

    Pat****

    ** **

    *From:* Kennon Lee
    *Sent:* Monday, June 27, 2011 5:50 PM
    *To:* user@hive.apache.org
    *Subject:* loading datafiles in s3****

    ** **

    Hello,****

    We're using hive on amazon elastic mapreduce to process logs on s3, and I
    had a couple basic questions. Apologies if they've been answered already-- I
    gathered most info from the hive tutorial on amazon (
    http://aws.amazon.com/articles/2855), as well as from skimming the hive
    wiki pages, but I'm still very new to all of this. So, questions:****

    ** **

    1) Is it possible to partition on directories that do not have the "key="
    prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2
    and so ideally we could partition on that structure instead of adding "dt="
    to every directory name. I found an old thread discussing this (
    http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded<http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)>)
    but couldnt find the actual syntax.****

    ** **

    2) How does hive handle tab-delimited files where rows sometimes have
    different column counts? For instance, if we are parsing an event log that
    contains multiple events, some of which have more columns associated with
    them:****

    ** **

    event_a user_id apple 300****

    event_b user_id cat****

    ** **

    If i define my hive table to have 4 columns, how will hive react to the
    event_b row?****

    ** **

    Thanks!****

    ** **
  • Kennon Lee at Jun 29, 2011 at 8:28 pm
    Thanks for the responses. Regarding the first question, I wasnt sure what
    you meant by using ALTER TABLE statements to allow for non-prefixed
    directory names. Don't you still have to name the directories with the
    'blah=' part? For instance, if we do:

    ALTER TABLE foo ADD PARTITION (dt='2011-06-29');

    Doesnt this look for a directory called "dt=2011-06-29"?
    On Tue, Jun 28, 2011 at 12:18 PM, Igor Tatarinov wrote:

    I think the answer to 1 is No but you can confirm on the AWS EMR forum.

    The problem I've been having is that if you have x=foo in the prefix of
    your S3 path, EMR will try to use it as part of your partitioning key even
    if you don't want it.
    Say, x=foo/y=bar/data and you want to partition on y only, EMR Hive can get
    confused. Sometimes it works, other times it complains that x is not part of
    your INSERT .. PARTITION(y) clause. I haven't quite figured out when and
    why.


    On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat <
    patrick.christopher@hp.com> wrote:
    allo,****

    1 dunno. I generate my EMR scripts in a separate script so generating a
    stack of ‘alter table…’ queries is easy for me****

    2 event_b will have a null value in column 4.****

    2 b ( you didn’t ask) what happens with this row:****

    ** **

    event_c user_id france 500 afifthcolumn****

    ** **

    afifthcolumn will be truncated and you’ll have only event_c through 500 in
    the row****

    ** **

    Pat****

    ** **

    *From:* Kennon Lee
    *Sent:* Monday, June 27, 2011 5:50 PM
    *To:* user@hive.apache.org
    *Subject:* loading datafiles in s3****

    ** **

    Hello,****

    We're using hive on amazon elastic mapreduce to process logs on s3, and I
    had a couple basic questions. Apologies if they've been answered already-- I
    gathered most info from the hive tutorial on amazon (
    http://aws.amazon.com/articles/2855), as well as from skimming the hive
    wiki pages, but I'm still very new to all of this. So, questions:****

    ** **

    1) Is it possible to partition on directories that do not have the "key="
    prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2
    and so ideally we could partition on that structure instead of adding "dt="
    to every directory name. I found an old thread discussing this (
    http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded<http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)>)
    but couldnt find the actual syntax.****

    ** **

    2) How does hive handle tab-delimited files where rows sometimes have
    different column counts? For instance, if we are parsing an event log that
    contains multiple events, some of which have more columns associated with
    them:****

    ** **

    event_a user_id apple 300****

    event_b user_id cat****

    ** **

    If i define my hive table to have 4 columns, how will hive react to the
    event_b row?****

    ** **

    Thanks!****

    ** **
  • Christopher, Pat at Jun 30, 2011 at 10:31 pm
    no. you can specify the location of a partition as you go. Hive will use the value provided in the partition value clause as the value of the partition column but it doesn't have to be in the file name. Ex:

    ALTER TABLE foo ADD PARITION (dt='2011-02-01') SET LOCATION 's3n://bucket/key/subkey/';

    then under subkey you can have as many keys/files as you want and all of those files will be added as part of that partition. Now on the table side of things, lets say foo is:

    CREATE TABLE foo (
    allofthedata string
    )
    partitioned by (dt string)

    This creates a table with this structure:

    allofthedata string
    dt string

    Yes! dt is a column. Its kind of a pain in the ass. But, hive will then set the 'dt' column to the value supplied in the alter table statement for all rows that are imported as part of that partition load.

    tl;dr?

    If you specifiy the full location of the directory you wish to add, your data will be added correctly.

    pat

    From: Kennon Lee
    Sent: Wednesday, June 29, 2011 1:27 PM
    To: user@hive.apache.org
    Subject: Re: loading datafiles in s3

    Thanks for the responses. Regarding the first question, I wasnt sure what you meant by using ALTER TABLE statements to allow for non-prefixed directory names. Don't you still have to name the directories with the 'blah=' part? For instance, if we do:

    ALTER TABLE foo ADD PARTITION (dt='2011-06-29');

    Doesnt this look for a directory called "dt=2011-06-29"?
    On Tue, Jun 28, 2011 at 12:18 PM, Igor Tatarinov wrote:
    I think the answer to 1 is No but you can confirm on the AWS EMR forum.

    The problem I've been having is that if you have x=foo in the prefix of your S3 path, EMR will try to use it as part of your partitioning key even if you don't want it.
    Say, x=foo/y=bar/data and you want to partition on y only, EMR Hive can get confused. Sometimes it works, other times it complains that x is not part of your INSERT .. PARTITION(y) clause. I haven't quite figured out when and why.

    On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat wrote:
    allo,
    1 dunno. I generate my EMR scripts in a separate script so generating a stack of 'alter table...' queries is easy for me
    2 event_b will have a null value in column 4.
    2 b ( you didn't ask) what happens with this row:

    event_c user_id france 500 afifthcolumn

    afifthcolumn will be truncated and you'll have only event_c through 500 in the row

    Pat

    From: Kennon Lee
    Sent: Monday, June 27, 2011 5:50 PM
    To: user@hive.apache.org
    Subject: loading datafiles in s3

    Hello,
    We're using hive on amazon elastic mapreduce to process logs on s3, and I had a couple basic questions. Apologies if they've been answered already-- I gathered most info from the hive tutorial on amazon (http://aws.amazon.com/articles/2855), as well as from skimming the hive wiki pages, but I'm still very new to all of this. So, questions:

    1) Is it possible to partition on directories that do not have the "key=" prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2 and so ideally we could partition on that structure instead of adding "dt=" to every directory name. I found an old thread discussing this (http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded<http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)>) but couldnt find the actual syntax.

    2) How does hive handle tab-delimited files where rows sometimes have different column counts? For instance, if we are parsing an event log that contains multiple events, some of which have more columns associated with them:

    event_a user_id apple 300
    event_b user_id cat

    If i define my hive table to have 4 columns, how will hive react to the event_b row?

    Thanks!
  • Kennon Lee at Jun 30, 2011 at 10:49 pm
    Sweet, i think we can work with that. Beats renaming a bunch of directories
    to adhere to a weird hive-specific naming convention, at least.

    Thanks for the help!
    On Thu, Jun 30, 2011 at 3:28 PM, Christopher, Pat wrote:

    no. you can specify the location of a partition as you go. Hive will use
    the value provided in the partition value clause as the value of the
    partition column but it doesn’t have to be in the file name. Ex:****

    ** **

    ALTER TABLE foo ADD PARITION (dt=’2011-02-01’) SET LOCATION
    ‘s3n://bucket/key/subkey/’;****

    ** **

    then under subkey you can have as many keys/files as you want and all of
    those files will be added as part of that partition. Now on the table side
    of things, lets say foo is:****

    ** **

    CREATE TABLE foo (****

    allofthedata string****

    ) ****

    partitioned by (dt string)****

    ** **

    This creates a table with this structure:****

    ** **

    allofthedata string****

    dt string****

    ** **

    Yes! dt is a column. Its kind of a pain in the ass. But, hive will then
    set the ‘dt’ column to the value supplied in the alter table statement for
    all rows that are imported as part of that partition load.****

    ** **

    tl;dr?****

    ** **

    If you specifiy the full location of the directory you wish to add, your
    data will be added correctly.****

    ** **

    pat****

    ** **

    *From:* Kennon Lee
    *Sent:* Wednesday, June 29, 2011 1:27 PM

    *To:* user@hive.apache.org
    *Subject:* Re: loading datafiles in s3****

    ** **

    Thanks for the responses. Regarding the first question, I wasnt sure what
    you meant by using ALTER TABLE statements to allow for non-prefixed
    directory names. Don't you still have to name the directories with the
    'blah=' part? For instance, if we do:****

    ** **

    ALTER TABLE foo ADD PARTITION (dt='2011-06-29');****

    ** **

    Doesnt this look for a directory called "dt=2011-06-29"?****

    On Tue, Jun 28, 2011 at 12:18 PM, Igor Tatarinov wrote:*
    ***

    I think the answer to 1 is No but you can confirm on the AWS EMR forum.***
    *

    ** **

    The problem I've been having is that if you have x=foo in the prefix of
    your S3 path, EMR will try to use it as part of your partitioning key even
    if you don't want it.****

    Say, x=foo/y=bar/data and you want to partition on y only, EMR Hive can get
    confused. Sometimes it works, other times it complains that x is not part of
    your INSERT .. PARTITION(y) clause. I haven't quite figured out when and
    why.****

    ** **

    On Tue, Jun 28, 2011 at 11:42 AM, Christopher, Pat <
    patrick.christopher@hp.com> wrote:****

    allo,****

    1 dunno. I generate my EMR scripts in a separate script so generating a
    stack of ‘alter table…’ queries is easy for me****

    2 event_b will have a null value in column 4.****

    2 b ( you didn’t ask) what happens with this row:****

    ****

    event_c user_id france 500 afifthcolumn****

    ****

    afifthcolumn will be truncated and you’ll have only event_c through 500 in
    the row****

    ****

    Pat****

    ****

    *From:* Kennon Lee
    *Sent:* Monday, June 27, 2011 5:50 PM
    *To:* user@hive.apache.org
    *Subject:* loading datafiles in s3****

    ****

    Hello,****

    We're using hive on amazon elastic mapreduce to process logs on s3, and I
    had a couple basic questions. Apologies if they've been answered already-- I
    gathered most info from the hive tutorial on amazon (
    http://aws.amazon.com/articles/2855), as well as from skimming the hive
    wiki pages, but I'm still very new to all of this. So, questions:****

    ****

    1) Is it possible to partition on directories that do not have the "key="
    prefix? Our logs are organized like s3://bucketname/dir/YYYY/MM/DD/HH/*.bz2
    and so ideally we could partition on that structure instead of adding "dt="
    to every directory name. I found an old thread discussing this (
    http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded<http://search-hadoop.com/m/SGTqLox5Il/partition+directory/v=threaded)>)
    but couldnt find the actual syntax.****

    ****

    2) How does hive handle tab-delimited files where rows sometimes have
    different column counts? For instance, if we are parsing an event log that
    contains multiple events, some of which have more columns associated with
    them:****

    ****

    event_a user_id apple 300****

    event_b user_id cat****

    ****

    If i define my hive table to have 4 columns, how will hive react to the
    event_b row?****

    ****

    Thanks!****

    ****

    ** **

    ** **

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedJun 28, '11 at 12:50a
activeJun 30, '11 at 10:49p
posts6
users4
websitehive.apache.org

People

Translate

site design / logo © 2022 Grokbase