FAQ
thanks for your comment.

1) In my test , each query returns very small number of rows.
in fact, matching results is only one. ( 1 row is returned)
so i thinks scan & decompression is biggest performance factor.

2) in details , I have some more question.
in case of query - select col_2 from rcfile_test where col_1 in
('2002-01-01T00:00:00' ) , profile shows that ByteRead is

BytesRead: 5.85 GB
BytesRead: 6.00 GB
BytesRead: 5.95 GB
BytesRead: 5.83 GB
BytesRead: 5.68 GB
BytesRead: 5.79 GB.
but in case of select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9, col_10,
col_11, col_12, col_13, col_14, col_15, col_16, col_17, col_18, col_19,
col_20, col_21 from rcfile_test where col_1 in ('2002-01-01T00:00:00' ) , it also shows that ByteRead is
BytesRead: 5.85 GB
BytesRead: 6.00 GB
BytesRead: 5.95 GB
BytesRead: 5.83 GB
BytesRead: 5.68 GB
BytesRead: 5.79 GB.


*no matter how many field i select , byte read is always same on RCFile. is this right?*

in other test on parquet , increasing number of selecting field, ByteReads is also linearly increased.

thanks.


2013년 7월 8일 월요일 오전 9시 26분 24초 UTC+9, Nong 님의 말:
There are a couple reasons why you could be seeing the results you are.

1) It looks to me like the query you are running is returning a lot of
rows. This means
Impala will spend a noticeable amount of time sending results back to the
client (shell),
and having the client process it. You will be able to focus more on the
scan performance
by changing your query. One option is to do "select <cols> from table
order by col1 limit 1".
This forces Impala to scan all the rows but the client does not become a
bottleneck.

2) RCFiles are not structured in a way that makes it easy to save IO. The
data needed
to process a single column is spread too sparsely making it inefficient
(excessive seeks)
to skip around and just read the column data. This means that Impala does
not get IO
savings from RCFiles. In our test machines which are not IO bound (10
disks, 16 cores), we
observe better query performance scanning fewer columns and RCFile to out
perform the row
based Hadoop formats (i.e. sequence file). This limitation in RCFiles is
one of the major reasons why we developed Parquet (
https://github.com/Parquet/parquet-format).

Nong



2013/7/2 여인재 <yin...@gmail.com <javascript:>>
*hi.*
*
*
*I tested impala with rcfile and result is strange.*
*
*
*I sent the query after flush the os cache ( run "sync ; echo 3 >
/proc/sys/vm/drop_caches" command in all datanode) and I did same job 6
times to get average. *
*
*
*It's my test result.*

query - select col_2 from rcfile_test where col_1 in
('2002-01-01T00:00:00' )
[0] - 30.61s
[1] - 32.92s
[2] - 32.97s
[3] - 36.30s
[4] - 30.47s
[5] - 37.71s
query - select col_2, col_3 from rcfile_test where col_1 in
('2002-01-01T00:00:00' )
[0] - 35.00s
[1] - 31.25s
[2] - 32.90s
[3] - 31.09s
[4] - 31.18s
[5] - 31.97s
query - select col_2, col_3, col_4 from rcfile_test where col_1 in
('2002-01-01T00:00:00' )
[0] - 50.85s
[1] - 33.30s
[2] - 34.96s
[3] - 34.65s
[4] - 32.25s
[5] - 30.69s
query - select col_2, col_3, col_4, col_5 from rcfile_test where col_1 in
('2002-01-01T00:00:00' )
[0] - 36.00s
[1] - 32.04s
[2] - 31.12s
[3] - 32.00s
[4] - 31.87s
[5] - 35.49s
query - select col_2, col_3, col_4, col_5, col_6 from rcfile_test where
col_1 in ('2002-01-01T00:00:00' )
[0] - 32.62s
[1] - 40.22s
[2] - 33.57s
[3] - 32.74s
[4] - 32.36s
[5] - 31.80s
query - select col_2, col_3, col_4, col_5, col_6, col_7 from rcfile_test
where col_1 in ('2002-01-01T00:00:00' )
[0] - 34.07s
[1] - 36.17s
[2] - 30.76s
[3] - 35.43s
[4] - 34.48s
[5] - 31.52s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8 from
rcfile_test where col_1 in ('2002-01-01T00:00:00' )
[0] - 34.62s
[1] - 31.97s
[2] - 32.46s
[3] - 32.22s
[4] - 33.15s
[5] - 31.93s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9
from rcfile_test where col_1 in ('2002-01-01T00:00:00' )
[0] - 33.02s
[1] - 32.11s
[2] - 35.23s
[3] - 31.54s
[4] - 33.34s
[5] - 33.82s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10 from rcfile_test where col_1 in ('2002-01-01T00:00:00' )
[0] - 33.58s
[1] - 37.11s
[2] - 32.62s
[3] - 31.95s
[4] - 33.36s
[5] - 32.43s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11 from rcfile_test where col_1 in ('2002-01-01T00:00:00' )
[0] - 33.13s
[1] - 32.51s
[2] - 35.70s
[3] - 59.26s
[4] - 35.38s
[5] - 31.85s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11, col_12 from rcfile_test where col_1 in
('2002-01-01T00:00:00' )
[0] - 32.57s
[1] - 34.11s
[2] - 31.37s
[3] - 33.68s
[4] - 35.59s
[5] - 36.61s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11, col_12, col_13 from rcfile_test where col_1 in
('2002-01-01T00:00:00' )
[0] - 32.56s
[1] - 33.60s
[2] - 33.57s
[3] - 32.88s
[4] - 34.87s
[5] - 35.98s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11, col_12, col_13, col_14 from rcfile_test where col_1 in
('2002-01-01T00:00:00' )
[0] - 31.26s
[1] - 35.84s
[2] - 33.69s
[3] - 31.35s
[4] - 39.56s
[5] - 33.29s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11, col_12, col_13, col_14, col_15 from rcfile_test where col_1
in ('2002-01-01T00:00:00' )
[0] - 33.75s
[1] - 32.72s
[2] - 32.55s
[3] - 33.18s
[4] - 33.29s
[5] - 34.40s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11, col_12, col_13, col_14, col_15, col_16 from rcfile_test
where col_1 in ('2002-01-01T00:00:00' )
[0] - 34.15s
[1] - 40.15s
[2] - 33.60s
[3] - 37.38s
[4] - 32.34s
[5] - 36.15s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17 from
rcfile_test where col_1 in ('2002-01-01T00:00:00' )
[0] - 33.53s
[1] - 31.17s
[2] - 35.39s
[3] - 32.55s
[4] - 38.41s
[5] - 37.55s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17, col_18 from
rcfile_test where col_1 in ('2002-01-01T00:00:00' )
[0] - 33.76s
[1] - 32.84s
[2] - 35.02s
[3] - 36.26s
[4] - 30.96s
[5] - 34.63s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17, col_18,
col_19 from rcfile_test where col_1 in ('2002-01-01T00:00:00' )
[0] - 36.82s
[1] - 35.93s
[2] - 33.83s
[3] - 35.59s
[4] - 31.68s
[5] - 34.84s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17, col_18,
col_19, col_20 from rcfile_test where col_1 in ('2002-01-01T00:00:00' )
[0] - 34.19s
[1] - 32.30s
[2] - 35.99s
[3] - 33.33s
[4] - 35.99s
[5] - 34.96s
query - select col_2, col_3, col_4, col_5, col_6, col_7, col_8, col_9,
col_10, col_11, col_12, col_13, col_14, col_15, col_16, col_17, col_18,
col_19, col_20, col_21 from rcfile_test where col_1 in
('2002-01-01T00:00:00' )
[0] - 44.53s
[1] - 34.51s
[2] - 37.36s
[3] - 36.15s
[4] - 35.12s
[5] - 32.35s

result shows that no matter how many fields I selected, computation time
is same.

my expectation is that computation time would be linearly increased.

what did I do wrong ? or impala could not use the feature of columnar
storage in rcfile?

p.s. rcfile was generate using hive.
p.s.2 here is sample log.

+---------------------+--------+----------+------------+-------------+-------+-------+------------+------------+------------+--------+--------+--------+--------+--------+--------+-----------------+--------+--------+--------+--------+--------+--------+--------+
col_1 | col_2 | col_3 | col_4 | col_5 |
col_6 | col_7 | col_8 | col_9 | col_10 | col_11 | col_12 |
col_13 | col_14 | col_15 | col_16 | col_17 | col_18 | col_19 |
col_20 | col_21 | col_22 | col_23 | col_24 |

+---------------------+--------+----------+------------+-------------+-------+-------+------------+------------+------------+--------+--------+--------+--------+--------+--------+-----------------+--------+--------+--------+--------+--------+--------+--------+
2002-02-10T05:38:29 | GXG504 | 23086251 | 2535810071 | 0109483332 | 1
S | 3822187912 | 8754023859 | OA00387474 | Y | PB8006 |
PK2089 | PT5797 | 1 | 3000 | 224.226.243.154 | A | N | N
N | | D | 9 |
2002-02-10T05:38:30 | GXG612 | 13876500 | 1958035090 | 01091001193 | 1
S | 3962706790 | 6667184225 | OA00166125 | Y | PB2177 |
PK8604 | PT4278 | 1 | 3000 | 125.170.184.100 | A | N | N
N | | D | 1 |
2002-02-10T05:38:31 | GXG898 | 40343227 | 2184294475 | 01038271283 | 0
S | 4150571126 | 2931530703 | OA00328656 | Y | PB5258 |
PK5369 | PT834 | 1 | 3000 | 204.136.230.173 | A | N | N
N | | D | 9 |
2002-02-10T05:38:32 | GXG341 | 35258814 | 1852804248 | 01063887069 | 1
S | 2809627662 | 9387041220 | OA00394953 | Y | PB5565 |
PK5731 | PT6249 | 1 | 3000 | 211.170.145.119 | A | N | N
N | | D | 3 |
2002-02-10T05:38:33 | GXG110 | 1332847 | 9137304412 | 01022591692 | 1
S | 5149479520 | 5909357290 | OA00867226 | Y | PB1306 |
PK5176 | PT1244 | 1 | 3000 | 146.145.142.160 | A | N | N
N | | D | 3 |
2002-02-10T05:38:34 | GXG786 | 1357754 | 7797368284 | 01085732060 | 0
S | 8534441736 | 4251199061 | OA00474327 | Y | PB8045 |
PK7414 | PT7118 | 1 | 3000 | 137.127.129.203 | A | N | N
N | | D | 6 |
2002-02-10T05:38:35 | GXG622 | 19436903 | 6155564104 | 01068988331 | 1
S | 6984015922 | 1847944579 | OA00871160 | Y | PB6985 | PK57
PT7379 | 1 | 3000 | 203.196.138.252 | A | N | N |
N | | D | 1 |
2002-02-10T05:38:36 | GXG737 | 31590936 | 6268125615 | 01074834717 | 1
S | 8381360698 | 4218653526 | OA00281083 | Y | PB1995 |
PK8267 | PT6729 | 1 | 3000 | 167.254.184.127 | A | N | N
N | | D | 9 |
2002-02-10T05:38:37 | GXG904 | 8117989 | 6611804920 | 01054277716 | 0
S | 8585871301 | 6588284899 | OA00632480 | Y | PB3047 |
PK8291 | PT5983 | 1 | 3000 | 200.222.167.228 | A | N | N
N | | D | 7 |
2002-02-10T05:38:38 | GXG432 | 4088504 | 3799298972 | 01030694151 | 1
S | 2923002263 | 4708863755 | OA00933396 | Y | PB1222 |
PK560 | PT1787 | 1 | 3000 | 211.171.200.194 | A | N | N
N | | D | 7 |
+---------------------+--------+----------+------------+-------------+-------+-------+------------+------------+------------+--------+--------+--------+--------+--------+--------+-----------------+--------+--------+--------+--------+--------+--------+--------+

Search Discussions

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupimpala-user @
categorieshadoop
postedJul 9, '13 at 4:17a
activeJul 9, '13 at 4:17a
posts1
users1
websitecloudera.com
irc#hadoop

1 user in discussion

여인재: 1 post

People

Translate

site design / logo © 2022 Grokbase