Mar 11, 2013 at 3:34 pm
To add to what Marcel said:
Hive does not currently make use of the block replica location metadata, so
it does not need to load/cache this information. This is why the initial
DESCRIBE takes longer in Impala than Hive. As Marcel mentioned, the
performance will be improved once we move to using a single call per-table
(rather than per-partition) to gather this information.
Software Engineer - Cloudera
On Mon, Mar 11, 2013 at 7:01 AM, Marcel Kornacker wrote:
On Sun, Mar 10, 2013 at 10:57 PM, Lake Chang wrote:
Thanks for the reply.
the impalad process needs to load the metadata
It's unexpected that loading the metadata takes so much time (several
minutes), and the time varies with the number of partitions. Does it
mean that the first time the impalad loads the metadata, it scans all of the
partitions? And why?
It doesn't scan the partitions, but it gets all of the relevant
partition data, which also includes locations of block replicas and
volume ids. This data is cached in order to avoid having to do this
for every single query.
Right now, this is done per-partition, but we're going to coalesce
that into a single call per table.
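To make the cost difference concrete, here is a minimal sketch of the caching behaviour described above. It is not Impala's code: `FakeMetadataSource`, `MetadataCache`, and `demo` are hypothetical names, and the "round trip" is just a counter standing in for a remote metadata call. It only illustrates why the first "describe" scales with the number of partitions today, why later ones are cheap, and what a single per-table call would change.

```python
from dataclasses import dataclass


@dataclass
class FakeMetadataSource:
    """Hypothetical stand-in for whatever serves partition/block metadata."""
    partitions: dict  # table name -> list of partition names
    calls: int = 0    # number of simulated round trips so far

    def fetch_partition(self, table, partition):
        # One round trip per partition (the current behaviour in the thread).
        self.calls += 1
        return {"partition": partition, "block_locations": []}

    def fetch_table(self, table):
        # One round trip for the whole table (the planned behaviour).
        self.calls += 1
        return [{"partition": p, "block_locations": []}
                for p in self.partitions[table]]


class MetadataCache:
    """Loads a table's metadata on first use and reuses it afterwards."""

    def __init__(self, source, per_table=False):
        self.source = source
        self.per_table = per_table
        self._cache = {}

    def describe(self, table):
        if table not in self._cache:  # only the first request pays the cost
            if self.per_table:
                self._cache[table] = self.source.fetch_table(table)
            else:
                self._cache[table] = [
                    self.source.fetch_partition(table, p)
                    for p in self.source.partitions[table]
                ]
        return self._cache[table]


def demo(num_partitions=1000):
    """Return {per_table: (calls on first describe, calls on second)}."""
    parts = {"tbl_some_table": [f"p{i}" for i in range(num_partitions)]}
    results = {}
    for per_table in (False, True):
        source = FakeMetadataSource(partitions=dict(parts))
        cache = MetadataCache(source, per_table=per_table)
        cache.describe("tbl_some_table")
        first = source.calls
        cache.describe("tbl_some_table")
        results[per_table] = (first, source.calls - first)
    return results
```

In this toy model, the per-partition loader makes one simulated call per partition on the first "describe" and none on the second, while the per-table loader makes a single call regardless of partition count.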
On Monday, March 11, 2013 11:58:58 AM UTC+8, Marcel Kornacker wrote:
On Sun, Mar 10, 2013 at 7:47 PM, Lake Chang wrote:
Hi Impala Users,
I'm very glad to join this group and to talk with all of you.
Impala is new to me, and I encountered a problem when I tried to use it
on an existing Hive table which had many partitions. Let's name the table
"tbl_some_table". The problem is that, when I queried "describe
tbl_some_table", it took a very long time to respond. From the log I saw
that it seemed to scan all the partitions of the table.
Does anyone know why it did this? How can I avoid the problem and make
"impala describe" as fast as it is in Hive?
The first time you run "describe" (or any query, for that matter)
after startup, the impalad process needs to load the metadata.
Subsequent "describe" commands should run much faster.