Thanks for your quick response.
I think COMPUTE STATISTICS FOR COLUMNS command maybe only used for cost
based optimization (cbo). That means it is used to compute their execution
cost. One of them is choosed because of its smaller cost.
But I have used the hint [shuffle] to enforce using Partitioned joins to
test.
When debugging, I found both joins used the same hash join execution
method. This method scan left side once. Partitioned joins can use right
side to make two or more hash tables as you said. In my test, only one hash
table is built in Partitioned joins. How to do a test to make partitioned
joins produce more than one hash table?
The last question is that what is the meaning of hash_partitioned outputs
formats on scan nodes in execution plan from web debugserver.
Thank you,
Xiaomei Song
2013/9/9 Marcel Kornacker <
marcel@cloudera.com>
On Sun, Sep 8, 2013 at 2:44 AM, xiaomei song wrote:
As far as I know, broadcast joins also partition both the left and the right
input accoring to the hdfs blocks number
respectively. I found execution procedure of both (partitioned broadcast
joins) are very similar, according to the execution plans file in the annex.
I think, the only differences between both joins maybe that,
1、 partitioned joins use exechange node for left input and,
2、 outputs formats of both side scans are hash_partitioned althought I think
that they maybe have not been implemented.
If it is true as above, partitioned joins should take more cost on shuffing
data between backends than broadcast joins.
What is the advantage of partitioned joins relative to broadcast joins?
Partitioned joins partition both inputs (on the join expressions),
broadcast joins send the entire right input to each node that has data
for the left input (so the left input doesn't need to be sent over the
network). Whether one is better than the other depends on: a) the size
of the right input, b) the number of nodes. The former is determined
based on column stats (ANALYZE TABLE ... COMPUTE STATISTICS FOR
COLUMNS ... command); if column stats aren't available, it defaults to
a broadcast join (so it's in your interest to compute column stats at
least once).
2013/9/7 Marcel Kornacker <
marcel@cloudera.com>
On Thu, Sep 5, 2013 at 11:12 PM, wrote:
hi, dear Kornacker
How do partitioned joins partition the right input? how to determine
the
size of partiton?
In source codes(version 1.0.1),I cannot find out how to hash the scan
data
before sending it in scan nodes whose output indicates
hash_partitioned
format. I also cannot find out the priciple of partitioned hash joins
when
debug impala. What I missed?
If impala partitions right input into many partitions and processes
them
in
one backend, left input needs rescan?
Partitioned joins partition both the left and the right input on their
respective join expressions. The number of partitions is equal to the
number of nodes that execute the query (one partition per node).
Look forward for your response. thanks in advance.
song xiaomei
在 2013年4月20日星期六UTC+8上午12时31分41秒,Marcel Kornacker写道:
On Thu, Apr 18, 2013 at 11:00 PM, Jung-Yup Lee <ljy...@gmail.com>
wrote:
Thank you for your quick response.
Yes, the query would exceed a memory limit (either per-process or
per-query) and get aborted.
Is there any plan to resolve this limitation by materializing
intermediate
results on disk or some other methods?
Yes, that is on the roadmap, although we don't have a date for it at
this
point.
To unsubscribe from this group and stop receiving emails from it, send
an
email to impala-user+unsubscribe@cloudera.org.
To unsubscribe from this group and stop receiving emails from it, send
an
email to impala-user+unsubscribe@cloudera.org.
To unsubscribe from this group and stop receiving emails from it, send an
email to impala-user+unsubscribe@cloudera.org.
To unsubscribe from this group and stop receiving emails from it, send an
email to impala-user+unsubscribe@cloudera.org.
To unsubscribe from this group and stop receiving emails from it, send an email to impala-user+unsubscribe@cloudera.org.