tracking to the hdfs-site.xml of Hadoop. Now the query takes only 3-4 s,
very impressive!
On Thursday, January 24, 2013 at 11:54:09 AM UTC+8, Marcel Kornacker wrote:
Did you follow the setup instructions on
https://ccp.cloudera.com/display/IMPALA10BETADOC/Configuring+Impala+for+Performance
?
If so, what does this line in the info log
I0123 09:26:30.049484 29410 simple-scheduler.cc:174] SimpleScheduler
locality percentage 100% (148 out of 148)
look like for you? My first guess would be that you're doing remote scans.
Marcel
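
To make the check Marcel asks for easier, here is a minimal sketch that pulls the locality numbers out of an impalad info log. The pattern is derived only from the single sample line quoted above, the helper name parse_locality is hypothetical, and reading the two counts as "local out of total" assignments is an assumption.

```python
import re

# Pattern built from the one sample line in this thread; other Impala
# versions may log this differently (assumption).
LOCALITY_RE = re.compile(
    r"SimpleScheduler\s+locality percentage\s+(\d+)%\s+\((\d+) out of (\d+)\)"
)

def parse_locality(log_text):
    """Return a list of (local, total) counts, one per scheduler line found."""
    # The line is wrapped in the email, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [(int(m.group(2)), int(m.group(3))) for m in LOCALITY_RE.finditer(flat)]

sample = """I0123 09:26:30.049484 29410 simple-scheduler.cc:174] SimpleScheduler
locality percentage 100% (148 out of 148)"""

for local, total in parse_locality(sample):
    # Anything below total here would point at remote scans.
    print(f"{local}/{total} local, {total - local} remote")
```

A result where the first count is well below the second would fit the remote-scan guess; 148 out of 148 is what a fully local plan looks like in the sample.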
On Wed, Jan 23, 2013 at 7:07 PM, Feng Xu <wind...@gmail.com>
wrote:
We ran a select count(*) query on a table with 15M records and 3.2 GB. It
takes 220 s, while Hive runs the same query in 100 s or less. We found that
when we run the query with Impala, every impalad node's network bandwidth is
used up (100 Mbit/s). With Hive, the same query uses about 1/4 of each node's
network bandwidth.
We also ran the count query on a table with 10M records and 400 MB. It takes
25 s, which is a little faster than Hive. That query uses 1/4 of each node's
network bandwidth.
I think the high network bandwidth usage does not make sense for a select
count. The data each processing node returns to the coordinator node should
be small, so what is using the network bandwidth?
Environment:
5 nodes connected by a 100 Mbit/s switch; each node has an AMD A6-3650,
16 GB RAM, and 4 x 1 TB disks.
node1: NameNode,ResourceManager,SecondaryNameNode,state_store
node2: DataNode,NodeManager,HMaster,HRegionServer,impalad
node3: DataNode,NodeManager,HRegionServer,impalad
node4: DataNode,NodeManager,HRegionServer,impalad
node5: DataNode,NodeManager,HRegionServer,hive,zookeeper,impalad
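
As a rough sanity check on the numbers in this post, the sketch below estimates how long it would take just to move the larger table over the network. It assumes "3.2g bytes" means 3.2 GB, that the "100m" switch is 100 Mbit/s, and that protocol overhead is zero; the function name is hypothetical and this is arithmetic, not a measurement.

```python
def transfer_seconds(num_bytes, link_mbit_per_s):
    """Seconds to move num_bytes over a link of the given speed, ignoring overhead."""
    return num_bytes * 8 / (link_mbit_per_s * 1_000_000)

TABLE_BYTES = 3.2e9   # "3.2g bytes", read as 3.2 GB (assumption)
LINK_MBIT = 100       # "100m" switch, read as 100 Mbit/s (assumption)
IMPALAD_NODES = 4     # node2 through node5 run impalad

# Case 1: every byte has to cross one 100 Mbit/s link.
one_link = transfer_seconds(TABLE_BYTES, LINK_MBIT)

# Case 2: the table is split evenly and each impalad reads its share remotely,
# all four links working in parallel.
per_node = transfer_seconds(TABLE_BYTES / IMPALAD_NODES, LINK_MBIT)

print(f"single link: {one_link:.0f} s, even split over {IMPALAD_NODES} nodes: {per_node:.0f} s")
# prints "single link: 256 s, even split over 4 nodes: 64 s"
```

The observed 220 s falls between these two figures, which would be consistent with the table data itself travelling over the network rather than only the small count results.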