On Sat, Dec 29, 2012 at 6:00 PM, lyrebird1999 wrote:
In this case, Hive takes 5 minutes, less than Impala.
In the log file, the HDFS SCAN on one datanode is much faster than on the others.
Could anyone tell me why?
It's impossible to tell what's going on, given only the profile. What
else were you running on the machine? In particular, since you were
running this in a VM, were there other VMs accessing the same disk
while you were running this?
One scan node sees this:
- PerDiskReadThroughput: 56.53 MB/sec
The others see this:
- PerDiskReadThroughput: 21.62 MB/sec
While the second figure is much lower than the first, even the faster
node's ~56 MB/sec is low (~50% of what we'd expect). I suggest re-running
this on the hardware directly, without any VMs.
My test environment:
node1 datanode(impalad) VM 4CPU 4G mem
node2 datanode(impalad) VM 4CPU 4G mem
node3 datanode(impalad) VM 4CPU 8G mem
My SQL looks like this:
select avg(ss_quantity) agg1,
The table store_sales is a text file, 39GB in size.
The log shows that node1 takes 3m54s to finish the execution, but node2
takes 10m2s and node3 takes 10m8s.
I have pasted the log from node3 (the coordinator). Could anyone tell me
why it takes such a long time?
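
A rough sanity check on the numbers in this thread: if the scans are purely disk-bound, the reported PerDiskReadThroughput values should roughly predict the per-node execution times. The sketch below is only a back-of-envelope estimate, not anything taken from Impala itself. It assumes (none of this is stated in the thread) that the 39GB table is spread evenly over the three nodes, that each node reads from a single disk, that 1 GB = 1024 MB, and that the faster throughput figure belongs to node1. The function names are made up for illustration.

```python
# Back-of-envelope check: is the scan time explained by disk throughput alone?
# Assumptions (not from the thread): the 39 GB table is split evenly across
# the three nodes, each node reads from one disk, and 1 GB = 1024 MB.

TABLE_SIZE_GB = 39
NUM_NODES = 3
MB_PER_GB = 1024


def estimated_scan_seconds(throughput_mb_per_sec: float) -> float:
    """Seconds for one node to read its share of the table at the given rate."""
    mb_per_node = TABLE_SIZE_GB * MB_PER_GB / NUM_NODES
    return mb_per_node / throughput_mb_per_sec


def fmt(seconds: float) -> str:
    """Format a duration in the same 'XmYs' style the log uses."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m{secs}s"


if __name__ == "__main__":
    # Throughput figures quoted from the profile in the reply above.
    for label, rate in [("56.53 MB/sec", 56.53), ("21.62 MB/sec", 21.62)]:
        print(label, "->", fmt(estimated_scan_seconds(rate)))
```

Under those assumptions the estimates come out at a little under 4 minutes for the fast rate and a little over 10 minutes for the slow one, which is close to the 3m54s and 10m2s/10m8s in the log. That would be consistent with the query being limited by disk read speed on each node rather than by anything in the query plan.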