Grokbase Groups Hive user August 2011
FAQ
I found a set of slides from Facebook online about Hive that claims you can
have a schema per partition in the table, this is exciting to us, because we
have a table like so:

id int
name string
level int
date string

And it's broken up into partitions by date. However, on a particular date
last year, the table dramatically changed its schema to:

id int
level int
date string
name_id int

So now if I do "select * from table" in hive, the data is completely garbled
for whichever portion of data doesn't fit the Hive schema. We are
considering re-writing the datafiles so they're the same before/after that
date, but if Hive supports having two entirely different schemas depending
on the partition, that'd be really convenient, since these datafiles are
hundreds of gigabytes in size (and we do sort of like the idea of knowing
how the datafile looked back then...).

This page:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't
seem to have an appropriate example, so I'm left wondering.

Has anyone done anything like this?

--
Tim Ellis
Data Architect, Riot Games

Search Discussions

  • Ashutosh Chauhan at Aug 25, 2011 at 1:15 am
    Hey Tim,

    Hive does support different schema's for different partitions. If your data
    comes out garbled, that seems to be a bug then. In your case, is the
    following sequence of steps resemble what you did:

    a) create table tbl (id: int, name: string, level: int) partitioned by date;
    b) -- add partitions
    c) alter table tbl replace columns (id: int, level: int, name_id: int)
    d) -- add more partitions.

    If you do select * from tbl, then this should work. You need not to rewrite
    any of your data. Can you provide more info about what output you were
    expecting and what you got. Are there any error logs?

    Ashutosh

    On Mon, Aug 22, 2011 at 14:34, Time Less wrote:

    I found a set of slides from Facebook online about Hive that claims you can
    have a schema per partition in the table, this is exciting to us, because we
    have a table like so:

    id int
    name string
    level int
    date string

    And it's broken up into partitions by date. However, on a particular date
    last year, the table dramatically changed its schema to:

    id int
    level int
    date string
    name_id int

    So now if I do "select * from table" in hive, the data is completely
    garbled for whichever portion of data doesn't fit the Hive schema. We are
    considering re-writing the datafiles so they're the same before/after that
    date, but if Hive supports having two entirely different schemas depending
    on the partition, that'd be really convenient, since these datafiles are
    hundreds of gigabytes in size (and we do sort of like the idea of knowing
    how the datafile looked back then...).

    This page:
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't seem to have an appropriate example, so I'm left wondering.

    Has anyone done anything like this?

    --
    Tim Ellis
    Data Architect, Riot Games
  • Time Less at Aug 29, 2011 at 10:26 pm
    Hello, Ashutosh,

    I did nothing like that... :)

    It seems the problem here is I didn't RTFM. Perchance, could you say where
    you figured this out? I am going from the Hive DDL page on confluence[1],
    and although it mentions partitions and it mentions the "replace columns"
    you've mentioned here, it doesn't mention them together that I see. I would
    like to document this for future generations. Is that the proper page where
    I'd document this?

    I would probably explicitly create a section titled "Different Schemas per
    Partition" and basically give the syntax you give (from quoted, assuming
    after I test it, it works).


    [1]
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements
    On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan wrote:

    Hey Tim,

    Hive does support different schema's for different partitions. If your data
    comes out garbled, that seems to be a bug then. In your case, is the
    following sequence of steps resemble what you did:

    a) create table tbl (id: int, name: string, level: int) partitioned by
    date;
    b) -- add partitions
    c) alter table tbl replace columns (id: int, level: int, name_id: int)
    d) -- add more partitions.

    If you do select * from tbl, then this should work. You need not to rewrite
    any of your data. Can you provide more info about what output you were
    expecting and what you got. Are there any error logs?

    Ashutosh

    On Mon, Aug 22, 2011 at 14:34, Time Less wrote:

    I found a set of slides from Facebook online about Hive that claims you
    can have a schema per partition in the table, this is exciting to us,
    because we have a table like so:

    id int
    name string
    level int
    date string

    And it's broken up into partitions by date. However, on a particular date
    last year, the table dramatically changed its schema to:

    id int
    level int
    date string
    name_id int

    So now if I do "select * from table" in hive, the data is completely
    garbled for whichever portion of data doesn't fit the Hive schema. We are
    considering re-writing the datafiles so they're the same before/after that
    date, but if Hive supports having two entirely different schemas depending
    on the partition, that'd be really convenient, since these datafiles are
    hundreds of gigabytes in size (and we do sort of like the idea of knowing
    how the datafile looked back then...).

    This page:
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't seem to have an appropriate example, so I'm left wondering.

    Has anyone done anything like this?

    --
    Tim Ellis
    Data Architect, Riot Games

    --
    Tim
  • Ashutosh Chauhan at Aug 30, 2011 at 5:47 am
    Hi Tim,

    I figured that both reading the code and manual. I don't think
    its explicitly documented anywhere, so it will be great if you document
    this. This page looks right place where this place of information can live.
    Thanks for the help in making Hive better.

    Ashutosh
    On Mon, Aug 29, 2011 at 15:26, Time Less wrote:

    Hello, Ashutosh,

    I did nothing like that... :)

    It seems the problem here is I didn't RTFM. Perchance, could you say where
    you figured this out? I am going from the Hive DDL page on confluence[1],
    and although it mentions partitions and it mentions the "replace columns"
    you've mentioned here, it doesn't mention them together that I see. I would
    like to document this for future generations. Is that the proper page where
    I'd document this?

    I would probably explicitly create a section titled "Different Schemas per
    Partition" and basically give the syntax you give (from quoted, assuming
    after I test it, it works).


    [1]
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements

    On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan wrote:

    Hey Tim,

    Hive does support different schema's for different partitions. If your
    data comes out garbled, that seems to be a bug then. In your case, is the
    following sequence of steps resemble what you did:

    a) create table tbl (id: int, name: string, level: int) partitioned by
    date;
    b) -- add partitions
    c) alter table tbl replace columns (id: int, level: int, name_id: int)
    d) -- add more partitions.

    If you do select * from tbl, then this should work. You need not to
    rewrite any of your data. Can you provide more info about what output you
    were expecting and what you got. Are there any error logs?

    Ashutosh

    On Mon, Aug 22, 2011 at 14:34, Time Less wrote:

    I found a set of slides from Facebook online about Hive that claims you
    can have a schema per partition in the table, this is exciting to us,
    because we have a table like so:

    id int
    name string
    level int
    date string

    And it's broken up into partitions by date. However, on a particular date
    last year, the table dramatically changed its schema to:

    id int
    level int
    date string
    name_id int

    So now if I do "select * from table" in hive, the data is completely
    garbled for whichever portion of data doesn't fit the Hive schema. We are
    considering re-writing the datafiles so they're the same before/after that
    date, but if Hive supports having two entirely different schemas depending
    on the partition, that'd be really convenient, since these datafiles are
    hundreds of gigabytes in size (and we do sort of like the idea of knowing
    how the datafile looked back then...).

    This page:
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't seem to have an appropriate example, so I'm left wondering.

    Has anyone done anything like this?

    --
    Tim Ellis
    Data Architect, Riot Games

    --
    Tim
  • Time Less at Oct 6, 2011 at 6:14 pm
    I have finally gotten around to testing this functionality, and it would
    doesn't work. The ALTER table change columns command just changes the
    metadata for the table, not for the partition. Follows is exactly what I did
    to test this, and the (broken) result:

    hive> create table multischema_test (
    id int,
    crdt string,
    name string,
    age int
    )
    partitioned by (dtp string)
    row format delimited
    fields terminated by '\t'
    lines terminated by '\n'
    stored as textfile;
    OK
    Time taken: 0.345 seconds
    hive> alter table multischema_test add partition (dtp=20110101) location
    '/user/hive/warehouse/test/multischema_test/20110101';
    OK
    Time taken: 0.662 seconds
    hive> alter table multischema_test replace columns (id int, name string,
    gender string, age int, crdt string) ;
    OK
    Time taken: 0.119 seconds
    hive> alter table multischema_test add partition (dtp=20110102) location
    '/user/hive/warehouse/test/multischema_test/20110102';
    OK
    Time taken: 0.186 seconds
    hive> select * from multischema_test ;
    OK
    1 2010-07-01 Jeff 32 NULL 20110101
    2 2010-07-01 Lisa 33 NULL 20110101
    3 2010-07-01 Bob 22 NULL 20110101
    4 2010-07-01 Fred 27 NULL 20110101
    100 Gregory Male 45 2010-08-01 20110102
    101 Horus Male 14 2010-08-01 20110102
    102 Verdann Male 33 2010-08-01 20110102
    103 Gennefer Female 32 2010-08-01 20110102
    Time taken: 0.348 seconds
    hive> select name,gender from multischema_test ;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201108291505_28322, Tracking URL =
    http://laxhadoop1-001:50030/jobdetails.jsp?jobid=job_201108291505_28322
    Kill Command = /usr/lib/hadoop/bin/hadoop job
    -Dmapred.job.tracker=laxhadoop1-001:54311 -kill job_201108291505_28322
    2011-10-06 11:02:27,099 Stage-1 map = 0%, reduce = 0%
    2011-10-06 11:02:31,129 Stage-1 map = 100%, reduce = 0%
    2011-10-06 11:02:33,144 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_201108291505_28322
    OK
    Gregory Male
    Horus Male
    Verdann Male
    Gennefer Female
    2010-07-01 Jeff
    2010-07-01 Lisa
    2010-07-01 Bob
    2010-07-01 Fred
    Time taken: 8.977 seconds
    hive>

    The two text files that make up the data in the table are like this:

    [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110101 | hadoop fs
    -put - /user/hive/warehouse/test/multischema_test/20110101/datafile
    [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110102 | hadoop fs
    -put - /user/hive/warehouse/test/multischema_test/20110102/datafile
    [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110101
    1 2010-07-01 Jeff 32
    2 2010-07-01 Lisa 33
    3 2010-07-01 Bob 22
    4 2010-07-01 Fred 27
    [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110102
    100 Gregory Male 45 2010-08-01
    101 Horus Male 14 2010-08-01
    102 Verdann Male 33 2010-08-01
    103 Gennefer Female 32 2010-08-01

    Did I do something wrong?

    On Mon, Aug 29, 2011 at 10:46 PM, Ashutosh Chauhan wrote:

    Hi Tim,

    I figured that both reading the code and manual. I don't think
    its explicitly documented anywhere, so it will be great if you document
    this. This page looks right place where this place of information can live.
    Thanks for the help in making Hive better.

    Ashutosh
    On Mon, Aug 29, 2011 at 15:26, Time Less wrote:

    Hello, Ashutosh,

    I did nothing like that... :)

    It seems the problem here is I didn't RTFM. Perchance, could you say where
    you figured this out? I am going from the Hive DDL page on confluence[1],
    and although it mentions partitions and it mentions the "replace columns"
    you've mentioned here, it doesn't mention them together that I see. I would
    like to document this for future generations. Is that the proper page where
    I'd document this?

    I would probably explicitly create a section titled "Different Schemas per
    Partition" and basically give the syntax you give (from quoted, assuming
    after I test it, it works).


    [1]
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements

    On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan wrote:

    Hey Tim,

    Hive does support different schema's for different partitions. If your
    data comes out garbled, that seems to be a bug then. In your case, is the
    following sequence of steps resemble what you did:

    a) create table tbl (id: int, name: string, level: int) partitioned by
    date;
    b) -- add partitions
    c) alter table tbl replace columns (id: int, level: int, name_id: int)
    d) -- add more partitions.

    If you do select * from tbl, then this should work. You need not to
    rewrite any of your data. Can you provide more info about what output you
    were expecting and what you got. Are there any error logs?

    Ashutosh

    On Mon, Aug 22, 2011 at 14:34, Time Less wrote:

    I found a set of slides from Facebook online about Hive that claims you
    can have a schema per partition in the table, this is exciting to us,
    because we have a table like so:

    id int
    name string
    level int
    date string

    And it's broken up into partitions by date. However, on a particular
    date last year, the table dramatically changed its schema to:

    id int
    level int
    date string
    name_id int

    So now if I do "select * from table" in hive, the data is completely
    garbled for whichever portion of data doesn't fit the Hive schema. We are
    considering re-writing the datafiles so they're the same before/after that
    date, but if Hive supports having two entirely different schemas depending
    on the partition, that'd be really convenient, since these datafiles are
    hundreds of gigabytes in size (and we do sort of like the idea of knowing
    how the datafile looked back then...).

    This page:
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't seem to have an appropriate example, so I'm left wondering.

    Has anyone done anything like this?

    --
    Tim Ellis
    Data Architect, Riot Games

    --
    Tim

    --
    Tim
  • Ashutosh Chauhan at Oct 7, 2011 at 3:05 am
    How is it broken? What was the result you were expecting?

    Ashutosh
    On Thu, Oct 6, 2011 at 11:13, Time Less wrote:

    I have finally gotten around to testing this functionality, and it would
    doesn't work. The ALTER table change columns command just changes the
    metadata for the table, not for the partition. Follows is exactly what I did
    to test this, and the (broken) result:

    hive> create table multischema_test (
    id int,
    crdt string,
    name string,
    age int
    )
    partitioned by (dtp string)
    row format delimited
    fields terminated by '\t'
    lines terminated by '\n'
    stored as textfile;
    OK
    Time taken: 0.345 seconds
    hive> alter table multischema_test add partition (dtp=20110101) location
    '/user/hive/warehouse/test/multischema_test/20110101';
    OK
    Time taken: 0.662 seconds
    hive> alter table multischema_test replace columns (id int, name string,
    gender string, age int, crdt string) ;
    OK
    Time taken: 0.119 seconds
    hive> alter table multischema_test add partition (dtp=20110102) location
    '/user/hive/warehouse/test/multischema_test/20110102';
    OK
    Time taken: 0.186 seconds
    hive> select * from multischema_test ;
    OK
    1 2010-07-01 Jeff 32 NULL 20110101
    2 2010-07-01 Lisa 33 NULL 20110101
    3 2010-07-01 Bob 22 NULL 20110101
    4 2010-07-01 Fred 27 NULL 20110101
    100 Gregory Male 45 2010-08-01 20110102
    101 Horus Male 14 2010-08-01 20110102
    102 Verdann Male 33 2010-08-01 20110102
    103 Gennefer Female 32 2010-08-01 20110102
    Time taken: 0.348 seconds
    hive> select name,gender from multischema_test ;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201108291505_28322, Tracking URL =
    http://laxhadoop1-001:50030/jobdetails.jsp?jobid=job_201108291505_28322
    Kill Command = /usr/lib/hadoop/bin/hadoop job
    -Dmapred.job.tracker=laxhadoop1-001:54311 -kill job_201108291505_28322
    2011-10-06 11:02:27,099 Stage-1 map = 0%, reduce = 0%
    2011-10-06 11:02:31,129 Stage-1 map = 100%, reduce = 0%
    2011-10-06 11:02:33,144 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_201108291505_28322
    OK
    Gregory Male
    Horus Male
    Verdann Male
    Gennefer Female
    2010-07-01 Jeff
    2010-07-01 Lisa
    2010-07-01 Bob
    2010-07-01 Fred
    Time taken: 8.977 seconds
    hive>

    The two text files that make up the data in the table are like this:

    [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110101 | hadoop
    fs -put - /user/hive/warehouse/test/multischema_test/20110101/datafile
    [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110102 | hadoop
    fs -put - /user/hive/warehouse/test/multischema_test/20110102/datafile
    [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110101
    1 2010-07-01 Jeff 32
    2 2010-07-01 Lisa 33
    3 2010-07-01 Bob 22
    4 2010-07-01 Fred 27
    [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110102
    100 Gregory Male 45 2010-08-01
    101 Horus Male 14 2010-08-01
    102 Verdann Male 33 2010-08-01
    103 Gennefer Female 32 2010-08-01

    Did I do something wrong?


    On Mon, Aug 29, 2011 at 10:46 PM, Ashutosh Chauhan wrote:

    Hi Tim,

    I figured that both reading the code and manual. I don't think
    its explicitly documented anywhere, so it will be great if you document
    this. This page looks right place where this place of information can live.
    Thanks for the help in making Hive better.

    Ashutosh
    On Mon, Aug 29, 2011 at 15:26, Time Less wrote:

    Hello, Ashutosh,

    I did nothing like that... :)

    It seems the problem here is I didn't RTFM. Perchance, could you say
    where you figured this out? I am going from the Hive DDL page on
    confluence[1], and although it mentions partitions and it mentions the
    "replace columns" you've mentioned here, it doesn't mention them together
    that I see. I would like to document this for future generations. Is that
    the proper page where I'd document this?

    I would probably explicitly create a section titled "Different Schemas
    per Partition" and basically give the syntax you give (from quoted, assuming
    after I test it, it works).


    [1]
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements

    On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan wrote:

    Hey Tim,

    Hive does support different schema's for different partitions. If your
    data comes out garbled, that seems to be a bug then. In your case, is the
    following sequence of steps resemble what you did:

    a) create table tbl (id: int, name: string, level: int) partitioned by
    date;
    b) -- add partitions
    c) alter table tbl replace columns (id: int, level: int, name_id: int)
    d) -- add more partitions.

    If you do select * from tbl, then this should work. You need not to
    rewrite any of your data. Can you provide more info about what output you
    were expecting and what you got. Are there any error logs?

    Ashutosh

    On Mon, Aug 22, 2011 at 14:34, Time Less wrote:

    I found a set of slides from Facebook online about Hive that claims you
    can have a schema per partition in the table, this is exciting to us,
    because we have a table like so:

    id int
    name string
    level int
    date string

    And it's broken up into partitions by date. However, on a particular
    date last year, the table dramatically changed its schema to:

    id int
    level int
    date string
    name_id int

    So now if I do "select * from table" in hive, the data is completely
    garbled for whichever portion of data doesn't fit the Hive schema. We are
    considering re-writing the datafiles so they're the same before/after that
    date, but if Hive supports having two entirely different schemas depending
    on the partition, that'd be really convenient, since these datafiles are
    hundreds of gigabytes in size (and we do sort of like the idea of knowing
    how the datafile looked back then...).

    This page:
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatementsdoesn't seem to have an appropriate example, so I'm left wondering.

    Has anyone done anything like this?

    --
    Tim Ellis
    Data Architect, Riot Games

    --
    Tim

    --
    Tim

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedAug 22, '11 at 9:35p
activeOct 7, '11 at 3:05a
posts6
users2
websitehive.apache.org

2 users in discussion

Time Less: 3 posts Ashutosh Chauhan: 3 posts

People

Translate

site design / logo © 2022 Grokbase