Grokbase Groups Hive user April 2011
FAQ

[Hive-user] External tables pros & cons

Shantian Purkad
Apr 26, 2011 at 4:19 pm
Hi,

What is the difference between external table and managed tables (apart from
data being stored outside hive warehouse and not deleted when table is dropped)

Are there any drawbacks of external tables? Or lack of support for certain
features (such as indexes, joins, subqueries etc etc )?

When and why should one not use external tables?

Thanks and Regards,
Shantian
reply

Search Discussions

3 responses

  • Raghunath, Ranjith at Apr 26, 2011 at 4:43 pm
    From my understanding of external tables they are great if we need multiple schemas associated with the same base data. A place where this could come handy is if we using custom SerDe implementation. The drawback that I have noticed has to do with binary file formats. One will have to use a custom map reduce job to generate the base data if you plan to use that as an external table. Please feel free to correct me here if I am wrong.


    Thanks,
    Ranjith

    ________________________________
    From: Shantian Purkad <shantian_purkad@yahoo.com>
    To: user@hive.apache.org <user@hive.apache.org>
    Sent: Tue Apr 26 11:18:46 2011
    Subject: External tables pros & cons

    Hi,

    What is the difference between external table and managed tables (apart from data being stored outside hive warehouse and not deleted when table is dropped)

    Are there any drawbacks of external tables? Or lack of support for certain features (such as indexes, joins, subqueries etc etc )?

    When and why should one not use external tables?

    Thanks and Regards,
    Shantian
  • Soren Flexner at Apr 26, 2011 at 6:18 pm
    The only real difference between a managed table and an external table is
    the location of the data file on HDFS.

    - Managed table:

    Say you have some data you want to analyze in Hive, with fields (foo,
    bar).

    So you CREATE a table "table_name", with fields (foo,bar) and LOAD the
    data into that table using the Hive cli client. That's a 'managed' table,
    and hive will decide where the data gets stored on HDFS ( probably
    /hive/warehouse/table_name/...)

    - External table:

    Say that instead, you have some data already on HDFS, maybe in
    /user/login/data/..., also of the form (foo,bar). Maybe you use this data
    in some MapReduce processes, so you wouldn't want to move it out of the
    directory it's in. But maybe you DO want to analyze it using Hive.

    So you CREATE EXTERNAL a table "table_name", and include the path to the
    data. In this case, you don't LOAD the data using the Hive client, because
    the data is already on HDFS. You just point Hive to the location of the
    data.

    You can find out where Hive thinks the data for a table is by issuing the
    command: describe extended table_name; The path is in the (somewhat messy)
    result.

    To convince yourself that there's really NO DIFFERENCE between a managed
    table and an external table, create a managed table and then hadoop fs -mv
    the data for it from an 'external' location into
    /hive/warehouse/table_name/... as long as the schema matches up with the
    file, hive will magically find the data.

    One important realization is that Hive doesn't "do" anything to your data
    in either case. The data is just stored as a set of files on HDFS (say, tab
    delimited), the contents of which look roughly like:

    foo1\tbar1
    foo2\tbar2
    foo3\tbar3
    foo4\tbar4
    etc...

    http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL

    -s

    P.S. As an aside, it should be clear by now that Ranjith's comment is
    correct: if you want multiple schemas for the same data, you'll need to
    make N-1 of them EXTERNAL tables, since Hive will create a copy of your data
    under /hive/warehouse/tableN/... for each table you LOAD to. You could
    create the first table as a managed table, and then point all the others to
    the directory which Hive creates.
    On Tue, Apr 26, 2011 at 9:42 AM, Raghunath, Ranjith wrote:

    From my understanding of external tables they are great if we need
    multiple schemas associated with the same base data. A place where this
    could come handy is if we using custom SerDe implementation. The drawback
    that I have noticed has to do with binary file formats. One will have to use
    a custom map reduce job to generate the base data if you plan to use that as
    an external table. Please feel free to correct me here if I am wrong.


    Thanks,
    Ranjith

    ------------------------------
    *From*: Shantian Purkad <shantian_purkad@yahoo.com>
    *To*: user@hive.apache.org <user@hive.apache.org>
    *Sent*: Tue Apr 26 11:18:46 2011
    *Subject*: External tables pros & cons

    Hi,

    What is the difference between external table and managed tables (apart
    from data being stored outside hive warehouse and not deleted when table is
    dropped)

    Are there any drawbacks of external tables? Or lack of support for
    certain features (such as indexes, joins, subqueries etc etc )?

    When and why should one not use external tables?

    Thanks and Regards,
    Shantian
  • Christopher, Pat at Apr 26, 2011 at 6:30 pm
    External tables do not all support index creation. If you are using a non-HDFS data store, say S3, you can not create an index against that table.

    I am not sure about other non-HDFS data stores.

    Pat

    From: Soren Flexner
    Sent: Tuesday, April 26, 2011 11:18 AM
    To: user@hive.apache.org
    Subject: Re: External tables pros & cons


    The only real difference between a managed table and an external table is the location of the data file on HDFS.

    - Managed table:

    Say you have some data you want to analyze in Hive, with fields (foo, bar).

    So you CREATE a table "table_name", with fields (foo,bar) and LOAD the data into that table using the Hive cli client. That's a 'managed' table, and hive will decide where the data gets stored on HDFS ( probably /hive/warehouse/table_name/...)

    - External table:

    Say that instead, you have some data already on HDFS, maybe in /user/login/data/..., also of the form (foo,bar). Maybe you use this data in some MapReduce processes, so you wouldn't want to move it out of the directory it's in. But maybe you DO want to analyze it using Hive.

    So you CREATE EXTERNAL a table "table_name", and include the path to the data. In this case, you don't LOAD the data using the Hive client, because the data is already on HDFS. You just point Hive to the location of the data.

    You can find out where Hive thinks the data for a table is by issuing the command: describe extended table_name; The path is in the (somewhat messy) result.

    To convince yourself that there's really NO DIFFERENCE between a managed table and an external table, create a managed table and then hadoop fs -mv the data for it from an 'external' location into /hive/warehouse/table_name/... as long as the schema matches up with the file, hive will magically find the data.

    One important realization is that Hive doesn't "do" anything to your data in either case. The data is just stored as a set of files on HDFS (say, tab delimited), the contents of which look roughly like:

    foo1\tbar1
    foo2\tbar2
    foo3\tbar3
    foo4\tbar4
    etc...

    http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL

    -s

    P.S. As an aside, it should be clear by now that Ranjith's comment is correct: if you want multiple schemas for the same data, you'll need to make N-1 of them EXTERNAL tables, since Hive will create a copy of your data under /hive/warehouse/tableN/... for each table you LOAD to. You could create the first table as a managed table, and then point all the others to the directory which Hive creates.
    On Tue, Apr 26, 2011 at 9:42 AM, Raghunath, Ranjith wrote:
    From my understanding of external tables they are great if we need multiple schemas associated with the same base data. A place where this could come handy is if we using custom SerDe implementation. The drawback that I have noticed has to do with binary file formats. One will have to use a custom map reduce job to generate the base data if you plan to use that as an external table. Please feel free to correct me here if I am wrong.

    Thanks,
    Ranjith

    ________________________________
    From: Shantian Purkad <shantian_purkad@yahoo.com
    To: user@hive.apache.org <user@hive.apache.org
    Sent: Tue Apr 26 11:18:46 2011
    Subject: External tables pros & cons
    Hi,

    What is the difference between external table and managed tables (apart from data being stored outside hive warehouse and not deleted when table is dropped)

    Are there any drawbacks of external tables? Or lack of support for certain features (such as indexes, joins, subqueries etc etc )?

    When and why should one not use external tables?

    Thanks and Regards,
    Shantian

Related Discussions

Discussion Navigation
viewthread | post