Hi Impala Users,

I'm very glad to join this group and to talk with all of you.
Impala is new to me, and I ran into a problem when I tried to use Impala
on an existing Hive table that has many partitions. Let's call the table
"tbl_some_table". The problem is that when I ran "describe
tbl_some_table", it took a very long time to respond. From the log, it
seemed to scan all the partitions of the table.
Does anyone know why it does this? How can I avoid the problem and make
Impala's "describe" as fast as Hive's?

Thanks,
- Aaron

  • Marcel Kornacker at Mar 11, 2013 at 3:59 am

    The first time after startup you run "describe" (or any query, for
    that matter), the impalad process needs to load the metadata.
    Subsequent "describe" commands should run much faster.

  • Marcel Kornacker at Mar 11, 2013 at 2:01 pm

    On Sun, Mar 10, 2013 at 10:57 PM, Lake Chang wrote:
    > Thanks for the reply.
    >
    >> the impalad process needs to load the metadata
    >
    > It's unexpected that loading the metadata takes so long (several
    > minutes), and that the time varies with the number of partitions. Does
    > that mean that the first time impalad loads the metadata, it scans all
    > of the partitions? And why?

    It doesn't scan the partitions, but it does fetch all of the relevant
    partition data, which includes the locations of block replicas and
    volume IDs. This data is cached so that it doesn't have to be fetched
    again for every single query.

    Right now, this is done per partition, but we're going to change that
    and coalesce it into a single call per table.
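
As a rough illustration of the behaviour described above, here is a toy Python model. It is not Impala's code: `ToyBlockLocationService`, `ToyMetadataCache`, and the `day=N` partition names are all hypothetical, and the model only counts round trips. It aims to show why the first "describe" scales with the number of partitions when locations are fetched per partition, why a repeat "describe" is cheap once the result is cached, and how a single per-table call would change the count.

```python
from typing import Dict, List


class ToyBlockLocationService:
    """Hypothetical stand-in for whatever serves block replica locations.

    It only counts round trips; it is not an HDFS or Impala client.
    """

    def __init__(self, partitions: List[str]) -> None:
        self.partitions = partitions
        self.round_trips = 0

    def locations_for_partition(self, partition: str) -> List[str]:
        # One round trip per partition (the "right now" behaviour).
        self.round_trips += 1
        return [f"{partition}/block0@host1", f"{partition}/block0@host2"]

    def locations_for_table(self) -> Dict[str, List[str]]:
        # One round trip for the whole table (the planned, coalesced call).
        self.round_trips += 1
        return {
            p: [f"{p}/block0@host1", f"{p}/block0@host2"]
            for p in self.partitions
        }


class ToyMetadataCache:
    """Loads table metadata on first use, then serves it from memory."""

    def __init__(self, service: ToyBlockLocationService, coalesced: bool) -> None:
        self.service = service
        self.coalesced = coalesced
        self._cache: Dict[str, Dict[str, List[str]]] = {}

    def describe(self, table: str) -> Dict[str, List[str]]:
        if table not in self._cache:  # only the first call pays the cost
            if self.coalesced:
                self._cache[table] = self.service.locations_for_table()
            else:
                self._cache[table] = {
                    p: self.service.locations_for_partition(p)
                    for p in self.service.partitions
                }
        return self._cache[table]


def round_trips_for_two_describes(num_partitions: int, coalesced: bool) -> int:
    """Run a cold and a warm describe; return the total round trips made."""
    partitions = [f"day={i}" for i in range(num_partitions)]
    service = ToyBlockLocationService(partitions)
    cache = ToyMetadataCache(service, coalesced)
    cache.describe("tbl_some_table")  # cold: loads and caches metadata
    cache.describe("tbl_some_table")  # warm: served from the cache
    return service.round_trips


if __name__ == "__main__":
    print(round_trips_for_two_describes(1000, coalesced=False))  # 1000
    print(round_trips_for_two_describes(1000, coalesced=True))   # 1
```

In this model the second "describe" adds no round trips in either mode; the only difference is how much the first one costs.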
  • Lenni Kuff at Mar 11, 2013 at 3:34 pm
    To add to what Marcel said:

    Hive does not currently make use of the block replica location metadata, so
    it does not need to load/cache this information. This is why the initial
    DESCRIBE takes longer in Impala than in Hive. As Marcel mentioned,
    performance will improve once we move to a single call per table
    (rather than per partition) to gather this information.

    Thanks,
    Lenni
    Software Engineer - Cloudera
  • Marcel Kornacker at Mar 13, 2013 at 4:09 am
    On Tue, Mar 12, 2013 at 7:26 PM, Lake Chang wrote:
    > Thanks to Marcel and Lenni for the replies!
    > I still have some questions.
    >
    > 1.
    >> but it gets all of the relevant partition data, which also includes
    >> locations of block replicas and volume ids.
    > I don't think the locations of block replicas need to be collected
    > before answering a describe request. Can we separate fetching the
    > table metadata from fetching the locations of block replicas?

    That would be very inconvenient, given how the metadata is organized internally.

    > 2.
    >> will be improved once we move to using a single call per-table (rather
    >> than per-partition) to gather this information.
    > I don't know how the block replica locations are stored. I just wonder
    > whether a single call per table can get the block replica locations
    > for all of the partitions.
    >
    > Thanks,
    > - Aaron

  • Darren Lo at Mar 13, 2013 at 3:25 pm
    You can also use Hive CLI or Hue for a faster describe if this is a big
    issue.
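
For anyone who wants to script this workaround, a minimal Python sketch follows. It assumes the `hive` command-line client is installed and on the PATH, and it relies on the Hive CLI's `-e` option, which runs a quoted statement and exits. The function names and the table-name check are illustrative choices, not part of Hive or Impala.

```python
import re
import subprocess
from typing import List

# Conservative pattern for an optionally database-qualified table name.
# This check is an assumption for the sketch, not Hive's identifier rules.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def hive_describe_command(table: str) -> List[str]:
    """Build the argv for running DESCRIBE through the Hive CLI."""
    if not _TABLE_NAME.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    # `hive -e` executes the given statement and then exits.
    return ["hive", "-e", f"DESCRIBE {table};"]


def describe_with_hive(table: str) -> str:
    """Run the command and return its output (requires `hive` on PATH)."""
    result = subprocess.run(
        hive_describe_command(table),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call describe_with_hive() on a real cluster.
    print(hive_describe_command("tbl_some_table"))
```

Passing the command as a list rather than a shell string avoids shell quoting issues, and the name check rejects anything that does not look like a plain table name.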

    --
    Thanks,
    Darren
  • Lake Chang at Mar 13, 2013 at 3:29 pm
    That's OK. I'm just used to having a look at the schema before composing
    SQL.

Discussion Overview
group: impala-user @
categories: hadoop
posted: Mar 11, '13 at 2:47a
active: Mar 13, '13 at 3:29p
posts: 7
users: 4
website: cloudera.com
irc: #hadoop
