One of the main problems with this query/plan is an inverted join order. For
the innermost block you have:
from part p join lineitem l on p.p_partkey = l.l_partkey and p.p_brand =
'Brand#23' and p.p_container = 'MED BOX'
The corresponding plan fragment is:
join op: INNER JOIN (BROADCAST)
  p.p_partkey = l.l_partkey
  tuple ids: 1 2

    tuple ids: 2

  table=default.pq_part #partitions=1 size=891.57MB compact
  predicates: p.p_brand = 'Brand#23', p.p_container = 'MED BOX'
  tuple ids: 1
and the execution details for that join are here:
HASH_JOIN_NODE (id=3):(Active: 2m41s, % non-child: 75.81%)
- BuildBuckets: 16.78M (16777216)
- BuildRows: 600.04M (600037902)
- BuildTime: 1m56s
- LoadFactor: 0.60
- MemoryUsed: 0.00
- ProbeRows: 4.01K (4006)
- ProbeTime: 92.515ms
- RowsReturned: 120.20K (120196)
- RowsReturnedRate: 744.00 /sec
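(As a rough sanity check on these counters, RowsReturnedRate appears to be RowsReturned divided by the node's active time: 120,196 rows / ~161 s is roughly 747 rows/sec, which lines up with the reported 744.00 /sec.)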
We see from this that the right side of the join, the build side (lineitem),
returns 600M rows, and the left (probe) side (part) returns about 4,000 rows.
We can tell this join is inverted because BuildRows is far larger than
ProbeRows. We want the most selective relation to be the build side of the
hash join.
It's also noteworthy that since the join is inverted, not only are we building
the hash table from the "big" relation; because this is a broadcast join, we
are also broadcasting that larger relation to every node, which further
contributes to the poor execution.
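Assuming the planner here joins tables in the order they appear in the FROM
clause (as early Impala releases did), one way to address this is simply to
list the large table first so the small, filtered relation lands on the build
(right-hand) side. A minimal sketch of the reordered inner block, reusing the
tables and predicates from the query below (an illustration of the reordering
idea, not a tested rewrite):

  -- sketch: large table (lineitem) listed first so that the filtered part
  -- rows become the build side of the hash join
  from lineitem l join part p
    on p.p_partkey = l.l_partkey
    and p.p_brand = 'Brand#23'
    and p.p_container = 'MED BOX'

With part on the right, the hash table is built from only the few thousand
part rows that survive the brand/container predicates, and broadcasting that
small relation is cheap.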
On Tue, Jul 23, 2013 at 11:59 PM, Gautam wrote:
The query below was run both with the query data in kernel memory and without.
Please find attached query profiles for both. *Caveat: DISABLE_CODEGEN is
set to True due to IMPALA-477.*
Still, the difference between on-disk and in-memory query performance ought to
be similar to this even with code generation enabled.
insert overwrite table lineitem_tmp
select
  l_partkey as t_partkey, 0.2 * avg(l_quantity) as t_avg_quantity
from
  lineitem
group by l_partkey;

insert overwrite table q17_small_quantity_order_revenue
select
  sum(l_extendedprice) / 7.0 as avg_yearly
from
  (select l_quantity, l_extendedprice, t_avg_quantity from
     lineitem_tmp t join
       (select
          l_quantity, l_partkey, l_extendedprice
        from
          part p join lineitem l
          on
            p.p_partkey = l.l_partkey
            and p.p_brand = 'Brand#23'
            and p.p_container = 'MED BOX'
       ) l1 on l1.l_partkey = t.t_partkey
  ) a
where l_quantity < t_avg_quantity;
On Tue, Jul 23, 2013 at 11:40 PM, Greg Rahn wrote:
IMPALA-219 was resolved in 0.7, so it must be something else. I'd also
recommend keeping speculation to a minimum until some data (a query
profile) is provided to do analysis on.
WRT understanding profiles, it's easiest to learn by example, so attach
(don't paste it into the email body) a query execution profile, or upload
it to gist.github.com, and we can take a look at it and offer some insights.
On Tue, Jul 23, 2013 at 10:11 PM, Anty Rao wrote:
What version are you using? In Impala 1.0.1, the hash-table-based
aggregation and join are inefficient due to defects in the hash function they
use. Check out this JIRA: IMPALA-219 <https://issues.cloudera.org/browse/IMPALA-219>.
I have a feeling that you have run into the same problem as me; you can
search the previous thread on aggregation performance.
On Wed, Jul 24, 2013 at 12:12 PM, Gautam wrote:
I've been trying to grok the query profiles of some of the queries
we're running while benchmarking Impala, to better understand what's really
causing the bottlenecks. There's a lot of good info in the query profiles, and
I'd like to understand what the individual "nodes" mean, viz.: DataStreamSender,
AGGREGATION_NODE, HDFS_SCAN_NODE, etc. under each plan fragment. For
most join queries we see that HDFS_SCAN_NODE (essentially disk reads) isn't the
bottleneck at all (when run after clearing the kernel cache). Most of the time
is instead spent between DataStreamSender, AggregationNode and ExchangeNode,
depending on the query's complexity and nature. Some more light on these
components would be valuable for us.
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."