Will the 0.7 release have a CREATE TABLE t1 AS SELECT * FROM t2 command?
If it does, and t1 is created, will we still have to call every
Impala node to refresh it individually?

I am looking into an API that creates a new temporary table for each
request to Impala to store the results. The tables can then be requeried
as needed. But if every newly created table requires a refresh on all
nodes before it can be used, I will have to build in a delay before the
temporary tables can be requeried, until the cluster has been synced.

I'm hoping that the internal CREATE TABLE call with AS SELECT * may handle
this. I don't have high hopes, since the GA release doesn't yet support
running "refresh" for the whole cluster from a single node.

Thanks,
Scott Ruffing


  • Scott Ruffing at Apr 12, 2013 at 4:03 am
    That ticket shows the create table is to be added in 0.7, but it does not specify if the refresh is handled in that process.
  • Alex Breshears at Apr 12, 2013 at 8:10 am
    Thanks! To clarify, one needs to refresh a node only when
    connecting to that node to query, right? It doesn't actually change the
    query results?

    I guess another way of asking is: is the metadata used only by the querying
    node, or does it need to be on all of the daemons for a query to run?

    On Thursday, April 11, 2013, Lenni Kuff wrote:

    Hi Scott and Alex,
    The Impala v0.7 beta refresh will include support for CREATE/DROP
    DATABASE/TABLE and ALTER TABLE/PARTITION. The operations will automatically
    update the table metadata for the Impala daemon you are connected to -
    meaning you will not be required to execute a 'refresh' command to query
    the new table. Each Impala daemon caches the table metadata separately, so
    you will need to issue a "refresh" command to see the table when connecting
    to an impalad on a different node.
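
The caching behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Metastore` and `Daemon` classes and their methods are hypothetical stand-ins invented for illustration, not Impala APIs. It shows a table created through one daemon being visible there immediately, while a second daemon sees it only after its own refresh.

```python
class Metastore:
    """Hypothetical shared source of truth for table definitions."""

    def __init__(self):
        self.tables = {}


class Daemon:
    """Toy model of one impalad holding its own private metadata cache."""

    def __init__(self, name, metastore):
        self.name = name
        self.metastore = metastore
        self.cache = dict(metastore.tables)  # snapshot taken at startup

    def create_table(self, table, columns):
        # DDL writes to the shared store and updates only THIS daemon's cache.
        self.metastore.tables[table] = list(columns)
        self.cache[table] = list(columns)

    def refresh(self):
        # Reload this daemon's cache from the shared store.
        self.cache = dict(self.metastore.tables)

    def can_plan(self, table):
        # Planning consults only the cache of the daemon that receives the query.
        return table in self.cache


metastore = Metastore()
node_a = Daemon("node_a", metastore)
node_b = Daemon("node_b", metastore)

node_a.create_table("temp_table", ["id", "value"])
visible_before = (node_a.can_plan("temp_table"), node_b.can_plan("temp_table"))

node_b.refresh()
visible_after = (node_a.can_plan("temp_table"), node_b.can_plan("temp_table"))

print(visible_before, visible_after)
```

In this model, an application that always sends its follow-up queries to the daemon that created the table never needs the refresh step.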

    With this release we will support CREATE TABLE and CREATE TABLE LIKE, but
    are still working on CREATE TABLE AS SELECT (see IMPALA-161, https://issues.cloudera.org/browse/IMPALA-161).
    Until that is supported you can do what you want in two steps:
    CREATE TABLE temp_table (...);
    INSERT INTO temp_table SELECT ...;
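
For an API that creates one result table per request, the two statements above could be generated by a small helper. The following is a minimal sketch, assuming the caller already knows the result schema; the function name `ctas_workaround`, the column list, and the source query are all hypothetical, and the helper only builds strings (it does not connect to Impala or validate identifiers).

```python
def ctas_workaround(table, column_defs, select_sql):
    """Return the two statements standing in for CREATE TABLE ... AS SELECT.

    table       -- name of the table to create (trusted input, not escaped)
    column_defs -- list of (column_name, column_type) pairs
    select_sql  -- the SELECT statement whose rows fill the new table
    """
    columns = ", ".join(f"{name} {col_type}" for name, col_type in column_defs)
    create_stmt = f"CREATE TABLE {table} ({columns})"
    insert_stmt = f"INSERT INTO {table} {select_sql}"
    return [create_stmt, insert_stmt]


statements = ctas_workaround(
    "temp_table",
    [("id", "INT"), ("name", "STRING")],  # hypothetical schema
    "SELECT id, name FROM t2",            # hypothetical source query
)
for stmt in statements:
    print(stmt + ";")
```

Both statements would need to be sent to the same daemon, in order, so that the INSERT is planned against a cache that already contains the new table.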

    Keep an eye out for the 0.7 release; it should be coming very soon.

    Thanks,
    Lenni
    Software Engineer - Cloudera

    --
    Alex Breshears
    e: apbresh@gmail.com
    l: www.linkedin.com/in/alexbreshears/ <http://p-k.co/BxbZcI>
    c: 479-685-3368
  • Henry Robinson at Apr 12, 2013 at 8:33 am

    On 12 April 2013 01:10, Alex Breshears wrote:

    Thanks! To clarify, one needs to refresh a node only when
    connecting to that node to query, right? It doesn't actually change the
    query results?

    I guess another way of asking is: is the metadata used only by the querying
    node, or does it need to be on all of the daemons for a query to run?

    It's used only by the querying node, during the planning process. If all
    your queries go through one node, you need to refresh that node only when
    the metadata changes.

    Henry


    --
    Henry Robinson
    Software Engineer
    Cloudera
    415-994-6679
  • Scott Ruffing at Apr 12, 2013 at 2:56 pm
    Alex, I am familiar with the top N optimization. Your team kindly answered
    that issue for me in another profiling question.

    One consideration there is that as the limit increases, so does the
    network effort. The optimization for application code using Impala
    would be to restrict users to a top N on every query. There are
    some instances, however, where server code may want to pull all ordered data,
    in which case the optimization has no net effect unless an aggregation or
    filter significantly reduces the result size.

    You also mentioned merge aggregation, which I was not familiar with
    from previous posts here. Is that part of the top N, or is it another
    optimization? Is there an article or post on this already? It's a bit off
    topic to go into the details in this post.

    I appreciate all the answers. Impala is a great project and community!


    On Friday, April 12, 2013 9:48:18 AM UTC-5, Marcel Kornacker wrote:
    On Fri, Apr 12, 2013 at 6:42 AM, Scott Ruffing wrote:
    Thanks Alex for following up with the question I was trying to get to. I'm
    glad I do not have to refresh the whole cluster after each CREATE TABLE and
    eventually CREATE TABLE AS SELECT. The problem is still there, though, if you
    use multiple query nodes, which I would recommend based on our current
    testing, since you get higher throughput than with only one query
    node. The reason I've seen for that recommendation is that the query node is
    the consolidation node: the reduce node that takes in all the
    data from the other nodes. That node's network throughput can be a
    bottleneck (especially on a 100Mb connection). I don't think it is as likely
    on a 1000Mb connection (such as an EC2 Cluster Compute Instance).

    While you are right in principle, note that from the upcoming 0.7
    release on, Impala will distribute top-n and merge aggregation across
    all nodes. In other words, the amount of data that will need to go to
    the coordinator node for final consolidation will be far lower.
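
To make the reduction in coordinator traffic concrete, here is a small sketch of one common way to distribute a top-n: each node keeps only its own n best rows, and the coordinator merges those short lists. The thread does not describe Impala's actual implementation, so the function names and the sample data below are assumptions made purely for illustration.

```python
import heapq


def local_top_n(rows, n):
    """Each node keeps only its n largest rows before sending anything."""
    return heapq.nlargest(n, rows)


def coordinator_top_n(partials, n):
    """The coordinator merges the per-node candidates into the final top n."""
    merged = [row for partial in partials for row in partial]
    return heapq.nlargest(n, merged)


# Hypothetical per-node inputs (one list of sort keys per node).
node_rows = [
    [5, 91, 12, 40],
    [77, 3, 64],
    [88, 15, 29, 50, 2],
]
n = 2

partials = [local_top_n(rows, n) for rows in node_rows]
result = coordinator_top_n(partials, n)

rows_sent = sum(len(p) for p in partials)    # rows shipped to the coordinator
rows_total = sum(len(r) for r in node_rows)  # rows shipped without the optimization
print(result, rows_sent, rows_total)
```

In this scheme the coordinator receives at most n rows per node, so the traffic grows with the limit and the node count rather than with the table size.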

    So the problem still stands of updating the query nodes' tables with a refresh,
    but at least that is probably only 25% of the nodes running Impala.

    Thanks again Alex and Henry.



  • Marcel Kornacker at Apr 17, 2013 at 4:35 am

    On Fri, Apr 12, 2013 at 7:56 AM, Scott Ruffing wrote:
    Alex, I am familiar with the top N optimization. Your team kindly answered
    that issue for me in another profiling question.

    One consideration there is that as the limit increases, so does the
    network effort. The optimization for application code using Impala
    would be to restrict users to a top N on every query. There are
    some instances, however, where server code may want to pull all ordered data,
    in which case the optimization has no net effect unless an aggregation or
    filter significantly reduces the result size.

    You also mentioned merge aggregation, which I was not familiar with
    from previous posts here. Is that part of the top N, or is it another
    optimization? Is there an article or post on this already? It's a bit off
    topic to go into the details in this post.

    There is no article or description of this, and it is separate from
    the Top-N optimization.

    Aggregation used to be distributed in the following way:
    - pre-aggregation takes place at each node that produces input to the
    aggregation operation
    - the output of that is sent to the coordinator, which does the final
    merge aggregation

    In 0.7, the second step is now:
    - the pre-aggregation output is hash-partitioned on the grouping
    expressions and distributed among all nodes (that do the
    pre-aggregation)
    - those nodes also do the merge aggregation (restricted to their
    particular slice of the key space)
    - the output of all those merge aggregation operations is sent to the
    coordinator to pass it back to the client (without doing any more
    aggregation)
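
The 0.7 flow described above can be sketched in a few lines for a simple COUNT(*) ... GROUP BY key. This is an illustrative simulation only: the function names are hypothetical, nodes are modelled as plain Python lists, and `zlib.crc32` stands in for whatever hash function Impala actually uses.

```python
import zlib
from collections import Counter, defaultdict


def pre_aggregate(rows):
    """Step 1: each node counts its own input rows per grouping key."""
    return Counter(rows)


def partition_of(key, num_nodes):
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def exchange(partials, num_nodes):
    """Route every (key, partial count) to the node that owns the key's hash partition."""
    inboxes = [defaultdict(list) for _ in range(num_nodes)]
    for partial in partials:
        for key, count in partial.items():
            inboxes[partition_of(key, num_nodes)][key].append(count)
    return inboxes


def merge_aggregate(inbox):
    """Each node merges the partial counts for its own slice of the key space."""
    return {key: sum(counts) for key, counts in inbox.items()}


# Hypothetical grouping-key values scanned by three nodes.
node_inputs = [
    ["a", "b", "a", "c"],
    ["b", "b", "d"],
    ["a", "d", "d", "c"],
]

partials = [pre_aggregate(rows) for rows in node_inputs]
inboxes = exchange(partials, len(node_inputs))
merged_slices = [merge_aggregate(inbox) for inbox in inboxes]

# The coordinator only concatenates the slices; it does no further aggregation.
final = {}
for slice_result in merged_slices:
    final.update(slice_result)

print(final)
```

Because each key hashes to exactly one node, the slices never overlap, which is why the coordinator can pass them through without merging.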

    Marcel

Discussion Overview
group: impala-user @
categories: hadoop
posted: Apr 11, '13 at 11:09p
active: Apr 17, '13 at 4:35a
posts: 6
users: 4
website: cloudera.com
irc: #hadoop
