Alex, I am familiar with the top N optimization. Your team kindly answered
that issue for me in another profiling question.
One consideration there is as the limit provided increases, the network
effort does so as well. The optimization for application code using Impala
would be to force users only to see a top N for all queries. There are
some instances however where server code may want to pull all ordered data,
in which case the optimization has no net effect without an aggregation or
filter level reducing the row size down significantly.
You did include merge aggregation as well which I was not familiar with
based on previous posts here. Is that part of the top N, or is it another
optimization. Is there an article or post on this already? It's a bit off
topic to go into the details of this post.
I appreciate all the answers. Impala is a great project and community!
On Friday, April 12, 2013 9:48:18 AM UTC-5, Marcel Kornacker wrote:
On Fri, Apr 12, 2013 at 6:42 AM, Scott Ruffing wrote:
Thanks Alex for following up with the question I was trying to get to. I'm
glad I do not have to refresh the whole cluster after each CREATE TABLE and
eventually CREATE TABLE AS SELECT. The problem is still there though if you
use multiple query nodes, which I would recommend based on our current
testing such that you get higher throughput than if you had only one query
node. The reason I've seen for that recommendation is the query node is the
consolidation node, and that node is the reduce node that takes in all the
data from other nodes. That nodes' network throughput can be a
(especially on a 100Mb connection). I don't think it is as likely on a
1000Mb connection (such as an EC2 Cluster Compute Instance).
While you are right in principle, note that from the upcoming 0.7
release on, Impala will distribute top-n and merge aggregation across
all nodes. In other words, the amount of data that will need to go to
the coordinator node for final consolidation will be far lower.
So the problem still stands to update the query nodes tables with a refresh,
but at least that is only probably 25% of the nodes running Impala.
Thanks again Alex and Henry.
On Fri, Apr 12, 2013 at 3:33 AM, Henry Robinson wrote:
Thanks! To clarify, one only needs to refresh on a node only when
connecting to that node to query right? It doesn't actually change the
I guess another way of asking is: is the metadata only used by the
querying node, or does it need to be on all for the daemons to query?
It's used only by the querying node, during the planning process. If
your queries go through one node, you need to refresh that node only
the metadata changes.
On Thursday, April 11, 2013, Lenni Kuff wrote:
Hi Scott and Alex,
The Impala v0.7 beta refresh will include support for CREATE/DROP
DATABASE/TABLE and ALTER TABLE/PARTITION. The operations will
update the table metadata for the Impala daemon you are connected to
meaning you will not be required to execute a 'refresh' command to
new table. Each impala daemon caches the table metadata separately,
will need to issue a "refresh" command to see the table when
an impalad on a different node.
With this release we will support CREATE TABLE and CREATE TABLE LIKE,
but are still working on CREATE TABLE AS SELECT (see IMPALA-161).
that is supported you can do what you want in two steps:
CREATE TABLE temp_table (...);
INSERT INTO temp_table SELECT ...;
Keep your eye out for the .7 release, it should be coming very soon.
Software Engineer - Cloudera
That ticket shows the create table is to be added in 0.7, but it
not specify if the refresh is handled in that process.