The log shows that the scan operation was performed against only one DN (192.168.1.126:50020).

I think the data associated with table gb_xdr is not evenly distributed across all the DNs.
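One way to check this (a sketch, not from the thread itself; the warehouse path below is a hypothetical guess based on Hive's default layout) is to ask HDFS where the table's block replicas actually live:

```shell
# -files -blocks -locations prints, for every file of the table, each
# block and the DataNode(s) holding its replicas, so you can see
# whether the blocks are spread across all DNs or piled on one.
hdfs fsck /user/hive/warehouse/mpos_gb.db/gb_xdr -files -blocks -locations
```

If most replicas report the same DataNode address, the scan will be bottlenecked on that machine.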

This could be addressed by running the HDFS balancer, a tool that evens out disk space usage across the DataNodes of an HDFS cluster.
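A minimal invocation could look like this (the threshold value is just an illustrative default, not something from this thread):

```shell
# Move blocks until every DataNode's disk utilization is within 10
# percentage points of the cluster-wide average; a smaller threshold
# gives a more even spread but takes longer to converge.
hdfs balancer -threshold 10
```

Note that the balancer evens out overall disk usage per DataNode; it does not guarantee that the blocks of one particular table end up evenly spread.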


Thanks,

Jung-Yup

On Tuesday, March 12, 2013 11:41:35 AM UTC+9, zjp... wrote:

Thanks, Marcel!

My create table statement is:
CREATE TABLE gb_xdr( Online_ID bigint, Session_ID string, Start_Date_Time
string, End_Date_Time string, SGSN_IP bigint, SGSN string, BSC_IP bigint,
BSC string, LAC int, CI int, RAC int, BVCI int, IMSI bigint, IMEI string,
APN string, Protocol string, Event_Type int, IP_LEN_UL bigint, IP_LEN_DL
bigint, Count_Packet_UL int, Count_Packet_DL int, Duration int, SOURCE_IP
bigint, DEST_IP bigint, SOURCE_PORT int, DEST_PORT int, URI string,
URI_Main string, Content_Type_Part1 string, Content_Type_Part2 string,
User_Agent_Main string, Service_Type int, Sub_Service_Type int, De_PDP_Acc
int, De_PDP_Acc_Direction string, Detach_Acc int, Detach_Acc_Direction
string, Abnormal_reason int, Request int, Repeat int, Reject int,
Reject_Cause int, Accept int, Accept_DelayFirst int, Complete int,
Complete_DelayFirst int, ACK int, ACK_Delay bigint, Last_ACK_Delay bigint,
BVCI_OLD int, BVCI_NEW int, LLC_frame_Disc int, OPP_LAC int, OPP_RAC int,
Result int, Result_Delayfirst bigint, Disconnect int, Abort int, Query_Name
string, Query_Type int, Query_OP int, Resp int, Resp_delayFirst bigint,
Resp_Cause int, FromX string, ToX string, Subject string, MMS_Cause int,
ErrorPacketWired int, ErrorPacketWireless int, GPRS_multislot_E int,
EDGE_multislot_E int, Umts_128TDD_E int, Umts_FDD_E int,
GPRS_multislot_1800 int, EDGE_multislot_1800 int, Umts_128TDD_1800 int,
Umts_FDD_1800 int, Reset int, Reset_Direction int, LLC_Frame_Count int,
Abort_Reason_User int, Abort_Reason_Provider int, Accept_Repeat int,
Abnormal_Type int, De_Cause int, RAU_Type int, RAU_Attr int,
Event_Direction int, Prefix_IMEI string, IsReassemble int,
IsPrivate_Service int, User_IP_LEN_UL bigint, User_IP_LEN_DL bigint,
Bssgp_Status_Cause int, Weight_Rate_UL bigint, Weight_Rate_DL bigint,
Weight_Rate_Total bigint, Brand string, Type string, ICMP_Type int,
ICMP_Code int, Gob_Src_IP int, Gob_Dst_IP int, Gob_IP_LEN int, Gob_Src_PORT
int, Gob_Dst_PORT int, Gob_UDP_LEN int) row format delimited fields
terminated by '\t';


The INFO logs dumped by impalad when I issue the SELECT queries are:

--------------------------------------------------------------------------------------------------------------------------
......
13/03/10 22:56:21 INFO metastore.HiveMetaStore: 0: get_all_databases
13/03/10 22:56:21 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr
cmd=get_all_databases
13/03/11 10:02:19 INFO service.Frontend: analyze query select count(*)
from mpos_gb.gb_xdr
13/03/11 10:02:19 INFO metastore.HiveMetaStore: 1: get_all_tables:
db=mpos_gb
13/03/11 10:02:19 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr
cmd=get_all_tables: db=mpos_gb
13/03/11 10:02:19 INFO metastore.HiveMetaStore: 1: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
13/03/11 10:02:19 INFO metastore.ObjectStore: ObjectStore, initialize
called
13/03/11 10:02:19 INFO metastore.ObjectStore: Initialized ObjectStore
13/03/11 10:02:19 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Class : org.apache.hadoop.hive.metastore.model.MColumnDescriptor [Table
: `CDS`, InheritanceStrategy : new-table]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Class : org.apache.hadoop.hive.metastore.model.MSerDeInfo [Table : `SER
DES`, InheritanceStrategy : new-table]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Class : org.apache.hadoop.hive.metastore.model.MStringList [Table : `SK
EWED_STRING_LIST`, InheritanceStrategy : new-table]
13/03/11 10:02:19 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
"embedded-only" so does not have its own datastore table.
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Class : org.apache.hadoop.hive.metastore.model.MStorageDescriptor [Tabl
e : `SDS`, InheritanceStrategy : new-table]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Class : org.apache.hadoop.hive.metastore.model.MTable [Table : `TBLS`,
InheritanceStrategy : new-table]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MSerDeInfo.parameters [T
able : `SERDE_PARAMS`]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MStringList.internalList
[Table : `SKEWED_STRING_LIST_VALUES`]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MTable.parameters [Table
: `TABLE_PARAMS`]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MTable.partitionKeys [Ta
ble : `PARTITION_KEYS`]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MStorageDescriptor.bucke
tCols [Table : `BUCKETING_COLS`]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MStorageDescriptor.param
eters [Table : `SD_PARAMS`]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MStorageDescriptor.skewe
dColNames [Table : `SKEWED_COL_NAMES`]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MStorageDescriptor.skewe
dColValueLocationMaps [Table : `SKEWED_COL_VALUE_LOC_MAP`]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MStorageDescriptor.skewe
dColValues [Table : `SKEWED_VALUES`]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MStorageDescriptor.sortC
ols [Table : `SORT_COLS`]
13/03/11 10:02:19 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MColumnDescriptor.cols [
Table : `COLUMNS_V2`]
13/03/11 10:02:20 INFO metastore.HiveMetaStore: 1: get_table : db=mpos_gb
tbl=gb_xdr
13/03/11 10:02:20 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr
cmd=get_table : db=mpos_gb tbl=gb_xdr
13/03/11 10:02:20 INFO metastore.HiveMetaStore: 1: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
13/03/11 10:02:20 INFO metastore.ObjectStore: ObjectStore, initialize
called
13/03/11 10:02:20 INFO metastore.ObjectStore: Initialized ObjectStore
13/03/11 10:02:20 INFO DataNucleus.MetaData: Listener found initialisation
for persistable class org.apache.hadoop.hive.metastore.model.MColumn
Descriptor
13/03/11 10:02:20 INFO DataNucleus.MetaData: Listener found initialisation
for persistable class org.apache.hadoop.hive.metastore.model.MSerDeI
nfo
13/03/11 10:02:20 INFO DataNucleus.MetaData: Listener found initialisation
for persistable class org.apache.hadoop.hive.metastore.model.MStorag
eDescriptor
13/03/11 10:02:20 INFO DataNucleus.MetaData: Listener found initialisation
for persistable class org.apache.hadoop.hive.metastore.model.MTable
13/03/11 10:02:20 INFO DataNucleus.MetaData: Listener found initialisation
for persistable class org.apache.hadoop.hive.metastore.model.MFieldS
chema
13/03/11 10:02:20 INFO metastore.HiveMetaStore: 1: get_config_value: name=
hive.exec.default.partition.name defaultValue=__HIVE_DEFAULT_PARTITIO
N__
13/03/11 10:02:20 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr
cmd=get_config_value: name=hive.exec.default.partition.name def
aultValue=__HIVE_DEFAULT_PARTITION__
13/03/11 10:02:20 INFO metastore.HiveMetaStore: 1: get_fields:
db=mpos_gbtbl=gb_xdr
13/03/11 10:02:20 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr
cmd=get_fields: db=mpos_gbtbl=gb_xdr
13/03/11 10:02:20 INFO metastore.HiveMetaStore: 1: get_table : db=mpos_gb
tbl=gb_xdr
13/03/11 10:02:20 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr
cmd=get_table : db=mpos_gb tbl=gb_xdr
13/03/11 10:02:21 INFO metastore.HiveMetaStore: 1: get_partitions :
db=mpos_gb tbl=gb_xdr
13/03/11 10:02:21 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr
cmd=get_partitions : db=mpos_gb tbl=gb_xdr
13/03/11 10:02:21 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
13/03/11 10:02:21 INFO DataNucleus.Datastore: The class
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
"embedded-only" so does not have its own datastore table.
13/03/11 10:02:21 INFO DataNucleus.Persistence: Managing Persistence of
Class : org.apache.hadoop.hive.metastore.model.MPartition [Table : `PAR
TITIONS`, InheritanceStrategy : new-table]
13/03/11 10:02:21 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MPartition.parameters [T
able : `PARTITION_PARAMS`]
13/03/11 10:02:21 INFO DataNucleus.Persistence: Managing Persistence of
Field : org.apache.hadoop.hive.metastore.model.MPartition.values [Table
: `PARTITION_KEY_VALS`]
13/03/11 10:02:21 INFO analysis.AggregateInfo: agg info:
AggregateInfo{grouping_exprs=, aggregate_exprs=(AggregateExpr{op=COUNT,
isStar=true, isDistinct=false, }), agg_tuple=TupleDescriptor{id=1, tbl=
null, byte_size=0, is_materialized=true,
slots=[SlotDescriptor{id=0, col=null, type=BIGINT, materialized=false,
byteSize=0, byteOffset=-1, nullIndicatorByte=0,
nullIndicatorBit=0, slotIdx=0}]}, smap=smap(AggregateExpr{op=COUNT,
isStar=true, isDistinct=false, }:SlotRef{tblName=null, col
=null, id=0})}
mergeAggInfo:
AggregateInfo{grouping_exprs=, aggregate_exprs=(AggregateExpr{op=SUM,
isStar=false, isDistinct=false, (SlotRef{tblName=null, col=null, id=0})})
, agg_tuple=TupleDescriptor{id=1, tbl=null, byte_size=0,
is_materialized=true, slots=[SlotDescriptor{id=0, col=null, type=BIGINT,
materialized=false, byteSize=0, byteOffset=-1,
nullIndicatorByte=0, nullIndicatorBit=0, slotIdx=0}]},
smap=smap(AggregateExpr{op=COUNT, isStar=true, isDisti
nct=false, }:SlotRef{tblName=null, col=null, id=0})}
13/03/11 10:02:21 INFO service.Frontend: create plan
13/03/11 10:02:21 INFO planner.Planner: create single-node plan
13/03/11 10:02:21 INFO planner.Planner: create plan fragments
13/03/11 10:02:22 INFO planner.Planner: finalize plan fragments
13/03/11 10:02:22 INFO planner.HdfsScanNode: collecting partitions for
table gb_xdr
13/03/11 10:02:22 INFO service.Frontend: get scan range locations
13/03/11 10:02:26 INFO catalog.HdfsTable: loaded partiton
PartitionBlockMetadata{#blocks=864, #filenames=118, totalStringLen=11266}
13/03/11 10:02:27 INFO hdfs.BlockStorageLocationUtil: Failed to connect to
datanode 192.168.1.126:50020
13/03/11 10:02:27 INFO catalog.HdfsTable: loaded disk ids for
PartitionBlockMetadata{#blocks=864, #filenames=118, totalStringLen=11266}
13/03/11 10:02:27 INFO catalog.HdfsTable: block metadata cache:
CacheStats{hitCount=0, missCount=1, loadSuccessCount=1,
loadExceptionCount=0, totalLoadTime=5080514950,
evictionCount=0}
13/03/11 10:02:27 INFO service.Frontend: create result set metadata
13/03/11 10:02:27 INFO service.JniFrontend: Plan Fragment 0
UNPARTITIONED
AGGREGATE
OUTPUT: SUM(<slot 0>)
GROUP BY:
TUPLE IDS: 1
EXCHANGE (2)
TUPLE IDS: 1

Plan Fragment 1
RANDOM
STREAM DATA SINK
EXCHANGE ID: 2
UNPARTITIONED

AGGREGATE
OUTPUT: COUNT(*)
GROUP BY:
TUPLE IDS: 1
SCAN HDFS table=mpos_gb.gb_xdr #partitions=1 size=50.50GB (0)
TUPLE IDS: 0

13/03/12 10:13:43 INFO service.Frontend: analyze query select * from
mpos_gb.gb_xdr where Start_Date_Time > '2012-09-05 13:00:00.000000' and
Start_Date_Time < '2012-09-05 15:30:10.000000' limit 10
13/03/12 10:13:43 INFO service.Frontend: create plan
13/03/12 10:13:43 INFO planner.Planner: create single-node plan
13/03/12 10:13:43 INFO planner.Planner: create plan fragments
13/03/12 10:13:43 INFO planner.Planner: finalize plan fragments
13/03/12 10:13:43 INFO planner.HdfsScanNode: collecting partitions for
table gb_xdr
13/03/12 10:13:43 INFO service.Frontend: get scan range locations
13/03/12 10:13:43 INFO catalog.HdfsTable: loaded partiton
PartitionBlockMetadata{#blocks=864, #filenames=118, totalStringLen=11210}
13/03/12 10:13:43 INFO catalog.HdfsTable: loaded disk ids for
PartitionBlockMetadata{#blocks=864, #filenames=118, totalStringLen=11210}
13/03/12 10:13:43 INFO catalog.HdfsTable: block metadata cache:
CacheStats{hitCount=0, missCount=2, loadSuccessCount=2,
loadExceptionCount=0, totalLoadTime=5617207689, evictionCount=1}
13/03/12 10:13:43 INFO service.Frontend: create result set metadata
13/03/12 10:13:43 INFO service.JniFrontend: Plan Fragment 0
UNPARTITIONED
EXCHANGE (1)
LIMIT: 10
TUPLE IDS: 0

Plan Fragment 1
RANDOM
STREAM DATA SINK
EXCHANGE ID: 1
UNPARTITIONED

SCAN HDFS table=mpos_gb.gb_xdr #partitions=1 size=50.50GB (0)
PREDICATES: Start_Date_Time > '2012-09-05 13:00:00.000000',
Start_Date_Time < '2012-09-05 15:30:10.000000'
LIMIT: 10
TUPLE IDS: 0

---------------------------------------------------------------------------------------------------------------------------------

And here is the output from impala-shell:
[cloudera-host2:21000] > select count(*) from mpos_gb.gb_xdr;
Query: select count(*) from mpos_gb.gb_xdr
Query finished, fetching results ...
152193112
Returned 1 row(s) in 468.47s
[cloudera-host2:21000] >

[cloudera-host2:21000] > select * from mpos_gb.gb_xdr where
Start_Date_Time > '2012-09-05 13:00:00.000000' and Start_Date_Time <
'2012-09-05 15:30:10.000000' limit 10;
Query: select * from mpos_gb.gb_xdr where Start_Date_Time > '2012-09-05
13:00:00.000000' and Start_Date_Time < '2012-09-05 15:30:10.000000' limit 10
Query finished, fetching results ...

......
101509376 20906357 2012-09-05 13:00:46.841000 2012-09-05
13:00:47.202000 2796454154 SGSN206 52788746 JMSM8B2 9858
218821 3 460020502304239 2 20 NULL
NULL 2 1 361 1142202122 2885681162 48154
80 00 NULL NULL 0 1 NULL
NULL NULL NULL NULL NULL NULL 1 361 NULL
NULL NULL NULL NULL NULL 1 361 NULL NULL
NULL NULL 1 6 NULL NULL
NULL NULL 32 32 01 32 32 0 1 NULL
NULL 1 NULL NULL NULL 0 NULL NULL NULL
2 NULL 0 NULL NULL NULL 0 0 0
NULL NULL NULL NULL NULL NULL NULL
NULL
......

Returned 10 row(s) in 45.37s
[cloudera-host2:21000] >

PS: When I finish typing my statement and press the Enter key, the message
"Query finished, fetching results ..." appears immediately, but the results
come back very slowly.
----------------------------------------------------

Thanks!

On Tuesday, March 12, 2013 at 1:04:18 AM UTC, Marcel Kornacker wrote:
On Mon, Mar 11, 2013 at 1:13 AM, zjp... wrote:
Hi, All!
I have installed Impala (Impala 0.6 on CDH 4.2) without Cloudera Manager,
following the instructions at
https://ccp.cloudera.com/display/IMPALA10BETADOC/Installing+and+Using+Cloudera+Impala.
My cluster consists of three data nodes and one name node. Each machine has
8GB of RAM, a 2-core/4-thread CPU, and a 1TB disk.
The OS is CentOS 6.3 64-bit.

I have installed Impala on each data node and run the statestore process on
one of the data nodes. Here is how I start the statestored and impalad daemons:
GLOG_v=1 nohup statestored -state_store_port=24000 &  # note: run on cloudera-host3
GLOG_v=1 nohup impalad -state_store_host=cloudera-host3 -nn=cloudera-host1 -nn_port=8020 -hostname=cloudera-host2 -ipaddress=192.168.1.nn &  # note: nn stands for the last octet of each datanode's IP address

I have created a database and a table, and have loaded about 150 million
records (about 52GB of data, roughly 1.5 hours of data in my business) into
the table with Hive. When I run a query such as "select count(*) from
dbname.tbname" in Hive, it takes approximately 600~689 seconds to come back;
when I run the same query in impala-shell, it takes approximately 230~380
seconds. When I run a query such as "select * from dbname.tbname where
string_column>'xxxx' and string_column<'yyyy' limit 10", Hive takes
approximately 28~35 seconds and impala-shell takes approximately 44~50
seconds. Is this expected? I have also run the queries mentioned above
against an Infobright data warehouse, where they take only about 10 seconds!

Can anyone advise me on whether there is a problem with my setup?

Thanks!
Could you post the exact statements you use to create and populate the
table, the SELECT statements, and the query profile from the Impala
INFO log?

Marcel

Discussion Overview
group: impala-user @ hadoop
posted: Mar 12, '13 at 10:22p
active: Mar 12, '13 at 10:22p
posts: 1
users: 1
website: cloudera.com
irc: #hadoop

1 user in discussion

Jung-Yup Lee: 1 post


site design / logo © 2022 Grokbase