Hi Ben,

Impala has not been optimized for HBase scan performance, so I really don't
recommend using HBase to scan that much data. Is there any reason you
can't put the data in Parquet? Is it because you need updates or frequent
inserts?

Thanks,
Alan

On Sat, Jan 18, 2014 at 3:26 PM, Benjamin Kim wrote:

I would like to know why querying an HBase table from Impala takes so long.
If I run the same query in Hive, it takes far less time. We are trying to
read 1 day's worth of event log data. The dataset has 502M rows and takes up
690GB of storage. I ran some tests using a simple COUNT DISTINCT query on
one of the columns. Here are the results.

AVRO
Size: >205GB
Mem per Host: 3.76GB
Durations:
– single Hive run: 22min 15.89s
– single Impala runs: 9min 51.60s, 6min 59.22s, 4min 49.99s

PARQUET
Size: >187GB
Mem per Host: 98MB
Durations:
– single Hive run: 20min 10.42s
– single Impala runs: 59.47s, 58.03s, 1min 6.48s

HBASE
Size: >690GB
Mem per Host: 1.01GB
Durations:
– single Hive run: 15min 42.26s
– single Impala run: 1hr 32min 43.30s
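
To put the timings above on a common scale, here is a minimal Python sketch that converts the reported durations to seconds and divides the fastest Impala run by the Hive run for each format. The helper name `to_seconds` and the dictionary layout are illustrative only and are not part of Hive or Impala; the numbers are copied from the results above.

```python
import re

def to_seconds(text):
    """Convert a duration such as '1hr 32min 43.30s' to seconds."""
    units = {"hr": 3600.0, "min": 60.0, "s": 1.0}
    parts = re.findall(r"([\d.]+)\s*(hr|min|s)", text)
    return sum(float(value) * units[unit] for value, unit in parts)

# Durations as reported in the results above.
results = {
    "AVRO": {"hive": "22min 15.89s",
             "impala": ["9min 51.60s", "6min 59.22s", "4min 49.99s"]},
    "PARQUET": {"hive": "20min 10.42s",
                "impala": ["59.47s", "58.03s", "1min 6.48s"]},
    "HBASE": {"hive": "15min 42.26s",
              "impala": ["1hr 32min 43.30s"]},
}

# Ratio of the fastest Impala run to the Hive run.
# A value above 1 means Impala was slower than Hive for that format.
ratios = {}
for fmt, runs in results.items():
    hive = to_seconds(runs["hive"])
    best_impala = min(to_seconds(r) for r in runs["impala"])
    ratios[fmt] = best_impala / hive
    print(f"{fmt}: Hive {hive:.2f}s, best Impala {best_impala:.2f}s, "
          f"Impala/Hive = {ratios[fmt]:.2f}")
```

On these figures, HBase is the only format where the ratio exceeds 1: the single Impala run took roughly 5.9 times as long as the Hive run.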

Can someone help? I tuned the Hive run by setting
hbase.client.scanner.caching = 5000. I tried to tune Impala by setting
hbase_caching = 1000, then 5000, 10000, etc. I could not get any
improvement.

Thanks,
Ben

To unsubscribe from this group and stop receiving emails from it, send an
email to impala-user+unsubscribe@cloudera.org.

Discussion Overview
group: impala-user @
categories: hadoop
posted: Jan 18, '14 at 11:27p
active: Jan 20, '14 at 7:55p
posts: 5
users: 3
website: cloudera.com
irc: #hadoop