The only real difference between a managed table and an external table is
the location of the data file on HDFS.
- Managed table:
Say you have some data you want to analyze in Hive, with fields (foo,
bar).
So you CREATE a table "table_name", with fields (foo,bar) and LOAD the
data into that table using the Hive cli client. That's a 'managed' table,
and hive will decide where the data gets stored on HDFS ( probably
/hive/warehouse/table_name/...)
- External table:
Say that instead, you have some data already on HDFS, maybe in
/user/login/data/..., also of the form (foo,bar). Maybe you use this data
in some MapReduce processes, so you wouldn't want to move it out of the
directory it's in. But maybe you DO want to analyze it using Hive.
So you CREATE EXTERNAL a table "table_name", and include the path to the
data. In this case, you don't LOAD the data using the Hive client, because
the data is already on HDFS. You just point Hive to the location of the
data.
You can find out where Hive thinks the data for a table is by issuing the
command: describe extended table_name; The path is in the (somewhat messy)
result.
To convince yourself that there's really NO DIFFERENCE between a managed
table and an external table, create a managed table and then hadoop fs -mv
the data for it from an 'external' location into
/hive/warehouse/table_name/... as long as the schema matches up with the
file, hive will magically find the data.
One important realization is that Hive doesn't "do" anything to your data
in either case. The data is just stored as a set of files on HDFS (say, tab
delimited), the contents of which look roughly like:
foo1\tbar1
foo2\tbar2
foo3\tbar3
foo4\tbar4
etc...
http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL-s
P.S. As an aside, it should be clear by now that Ranjith's comment is
correct: if you want multiple schemas for the same data, you'll need to
make N-1 of them EXTERNAL tables, since Hive will create a copy of your data
under /hive/warehouse/tableN/... for each table you LOAD to. You could
create the first table as a managed table, and then point all the others to
the directory which Hive creates.
On Tue, Apr 26, 2011 at 9:42 AM, Raghunath, Ranjith wrote:
From my understanding of external tables they are great if we need
multiple schemas associated with the same base data. A place where this
could come handy is if we using custom SerDe implementation. The drawback
that I have noticed has to do with binary file formats. One will have to use
a custom map reduce job to generate the base data if you plan to use that as
an external table. Please feel free to correct me here if I am wrong.
Thanks,
Ranjith
------------------------------
*From*: Shantian Purkad <shantian_purkad@yahoo.com>
*To*: user@hive.apache.org <user@hive.apache.org>
*Sent*: Tue Apr 26 11:18:46 2011
*Subject*: External tables pros & cons
Hi,
What is the difference between external table and managed tables (apart
from data being stored outside hive warehouse and not deleted when table is
dropped)
Are there any drawbacks of external tables? Or lack of support for
certain features (such as indexes, joins, subqueries etc etc )?
When and why should one not use external tables?
Thanks and Regards,
Shantian