Hi Nong and Alan,

All issues are resolved.

#3. I didn't change the config (dfs.datanode.hdfs-blocks-metadata.enabled
was already set to true in hdfs-site.xml), but it was probably a permission
issue. I changed the ownership of the files from "endomine:supergroup" to
"endomine:endomine" and ran the Impala services as the impala user. Now
the "Unknown disk id" warning no longer appears in the log.

#2. I changed the partition scheme to use the first letter of the test ID, and
now Impala is faster than Hive again.
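A hypothetical sketch of that repartitioning in Hive; the column list is made up (the real DDL was attached earlier in the thread), and "testgroup" is an invented name for the first-letter partition column:

hive -e "
CREATE TABLE stay_order_results_v2 (
  testid STRING,        -- hypothetical data columns
  result STRING
)
PARTITIONED BY (yrmonth INT, testgroup STRING)
STORED AS SEQUENCEFILE;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE stay_order_results_v2
PARTITION (yrmonth, testgroup)
SELECT testid, result, yrmonth, substr(testid, 1, 1)  -- data columns, then partition values
FROM stay_order_results;
"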

Thanks for your support!!
On Monday, December 10, 2012 1:25:47 PM UTC-5, Alan wrote:

Hi David,

For (3), you need to set dfs.datanode.hdfs-blocks-metadata.enabled to
true. See this page for details:


https://ccp.cloudera.com/display/IMPALA10BETADOC/Configuring+Impala+for+Performance+without+Cloudera+Manager
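For reference, the property named above goes in hdfs-site.xml on each DataNode; a minimal excerpt in standard Hadoop config syntax:

<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property>

As Nong asks further down the thread, the DataNodes and impalad need a restart for the change to take effect.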

Thanks,
Alan


On Sat, Dec 8, 2012 at 1:38 PM, David Lauzon <davido...@gmail.com> wrote:
Hi Nong,

1. SSE4. Unfortunately, I can't do anything about this for the demo app,
but it's interesting to know. I'll keep it in mind for future projects.

2. Large number of files and small blocks. Yes, I was aware this was
pretty bad; I set up the partitions quickly just to try it out. The distribution of
data is very uneven across test IDs, leaving a few tests with most of
the data and the others with just a few records. I didn't think it could
affect Impala's performance that much for a single test. I've been really
busy this past week, but I'll definitely change the partition scheme.
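A quick way to see how uneven that distribution is (a sketch using the table from this thread):

hive -e 'SELECT testid, COUNT(*) AS cnt
FROM stay_order_results
GROUP BY testid
ORDER BY cnt DESC
LIMIT 20;'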

3. Block locations. Any pointers on this? What is happening exactly: does
this mean Impala has to query the NN first instead of fetching the data
directly from the DN?

4. Block Compression. Yes, I'm planning to use Snappy.
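A sketch of rewriting the data as block-compressed sequence files from Hive; the property names are the CDH4-era ones, and the target table is a hypothetical unpartitioned copy:

hive -e "
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- 'stay_order_results_snappy' is a hypothetical, unpartitioned copy of the table
INSERT OVERWRITE TABLE stay_order_results_snappy
SELECT * FROM stay_order_results;
"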

Thanks,
David
On Friday, December 7, 2012 3:25:50 PM UTC-5, Nong wrote:

David,

I finally got a chance to look through the logs and noticed a few
things. The performance is certainly not what we expect.

- This machine does not have SSE4 support. Both HDFS and Impala
use this for significant performance benefits. HDFS uses it
for check-summing, which, from my quick tests, improves read throughput by
just under 2x. Impala uses it for a lot of the text parsing and string
processing. The IO subsystem is tuned (number of reading threads, number of IO
buffers, etc.) assuming fast check-sums are present. (A quick check for this,
and for the block sizes in the next item, is sketched after this list.)
- Your table has a very large number of files and very small blocks.
Your dataset is ~5GB with ~25000 blocks. This is roughly 210KB per block.
In general, we're expecting blocks to be in the 64MB+ range. Having such
small blocks is also bad for IO throughput. There is much less sequential
IO.
- Block locations are not enabled. I'm not sure if you were able to
resolve this. Did you restart the DNs/impalad after changing the config?
- Sequence files without block compression are not a format we've
spent much time optimizing. It's much more common for us to see sequence
files with Snappy/gzip block compression. You might see some benefit if
you switch.
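Two quick checks for the first two items above (a sketch; the table path appears later in this thread):

# 1. Does this CPU report SSE4 support? (prints nothing if absent)
grep -o 'sse4[^ ]*' /proc/cpuinfo | sort -u

# 2. How many blocks back the table, and what is the average block size?
hdfs fsck /user/hive/warehouse/stay_order_results -files -blocks | tail -20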

The throughput for Hive is also very low (~5MB/s on this node). The first
two items will likely improve Hive performance as well.

Hope that helps,
Nong
On Tuesday, December 4, 2012 1:16:00 PM UTC-8, David Lauzon wrote:

Hi Nong,

I've attached the v2 log results to the previous message. The format is
BZIP2, but the forum wouldn't allow me to upload it (error 340).

[endomine@008920a ~]$ date; impala-shell --verbose -i localhost -q
'select COUNT(*) FROM stay_order_results'; date; echo "DONE"
Tue Dec 4 14:52:39 EST 2012
Connected to localhost:21000
Query: select COUNT(*) FROM stay_order_results
Unknown Exception : [Errno 104] Connection reset by peer
Query aborted, unable to fetch data
Could not execute command: select COUNT(*) FROM stay_order_results
Tue Dec 4 15:07:29 EST 2012
DONE


And this crashed the daemon; this time there is a warning: "Unknown disk
id. This will negatively affect performance. Check your hdfs settings to
enable block location metadata."

Hmm, I've got dfs.datanode.hdfs-blocks-metadata.enabled set to true,
and dfs.block.local-path-access.user set to "impala,hdfs,mapred,endomine".
The data is owned by endomine:supergroup, and the daemon is running under
the endomine user as well.
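One way to double-check what configuration is actually in effect (hdfs getconf reads the client-side config files; a sketch):

hdfs getconf -confKey dfs.datanode.hdfs-blocks-metadata.enabled
hdfs getconf -confKey dfs.block.local-path-access.user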

-D

On Tuesday, December 4, 2012 4:01:22 PM UTC-5, David Lauzon wrote:


On Tuesday, December 4, 2012 1:49:08 PM UTC-5, Nong wrote:
It looks like you've hit a perf issue with our partition pruning.
Looking at the logs, we spend 1min26sec on planning and 500ms
executing the query.

I'll see if I can repro this and get back to you.

This doesn't explain why the count(*) over the whole table is also
behaving oddly. Can you turn up the logging (GLOG_v=2) and
run the query for a few minutes and attach the logs?
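A minimal sketch of doing that, assuming impalad is started by hand (for a packaged install the variable would instead go in the daemon's environment; that detail is an assumption):

# GLOG_v controls glog verbosity; Nong asks for level 2 here
export GLOG_v=2
impalad &    # restart the daemon so the new verbosity takes effect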

Thanks
Nong
On Mon, Dec 3, 2012 at 10:00 PM, David Lauzon wrote:

I'll send you the log files offline (Google's servers seem busy).
Thanks for looking into this, Nong!
Let me know if you need anything else.

BTW, is there a simple /etc/init.d service for Impala? (I've
installed CDH4 + Impala with the RPM.)
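For what it's worth, the CDH4-era Impala packages shipped init scripts; the service names below are my assumption from that packaging:

sudo service impala-state-store start   # state store daemon
sudo service impala-server start        # impalad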

-D

On Tuesday, December 4, 2012 12:32:00 AM UTC-5, Nong wrote:

Thanks David. It looks like the partition pruning has kicked in.
You don't see the predicates in the plan indicating
we've handled them by pruning. The two things you highlighted are
internal to how we do distributed computation
and not related to partition pruning.

Can you send the log of the slow query that does finish? Please
run impalad with GLOG_v=1 and send us the log.

Thanks
Nong

On Mon, Dec 3, 2012 at 8:01 PM, David Lauzon wrote:

Sure Alan,

Here is the plan:

Explain query: SELECT COUNT(*) FROM stay_order_results WHERE yrmonth = 201202 AND testid="HGH"
Plan Fragment 0
  UNPARTITIONED
  AGGREGATE
    OUTPUT: SUM(<slot 2>)
    GROUP BY:
    TUPLE IDS: 1
  EXCHANGE (2)
    TUPLE IDS: 1

Plan Fragment 1
  RANDOM
  STREAM DATA SINK
    EXCHANGE ID: 2
    UNPARTITIONED

  AGGREGATE
    OUTPUT: COUNT(*)
    GROUP BY:
    TUPLE IDS: 1
  SCAN HDFS table=default.stay_order_results (0)
    TUPLE IDS: 0

Table DDL is attached.

On Monday, December 3, 2012 8:42:35 PM UTC-5, Alan wrote:
Hi David,

Impala does take advantage of Hive's partitions. If you don't
mind sharing your explain plan and table definition, we can verify whether
Impala does partition pruning correctly.
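For reference, a plan like the one David posts above can be pulled from impala-shell directly (a sketch using the query from this thread):

impala-shell -i localhost -q 'EXPLAIN SELECT COUNT(*) FROM stay_order_results WHERE yrmonth = 201202 AND testid="HGH";'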

Thanks,
Alan


On Mon, Dec 3, 2012 at 5:19 PM, David Lauzon <davido...@gmail.com> wrote:
Hello! I am doing a very simple query on a table partitioned by
(yrmonth int, testid string), but it looks like Impala is reading the whole
table regardless of whether I specify the partition fields or not.

The table is about 5 GB, but I am only looking up a small partition:

hadoop fs -du -s
/user/hive/warehouse/stay_order_results/yrmonth=201202/testid=HGH
13638
/user/hive/warehouse/stay_order_results/yrmonth=201202/testid=HGH



hadoop fs -du -s /user/hive/warehouse/stay_order_results
5562785204 /user/hive/warehouse/stay_order_results


impala-shell --verbose -i localhost -q 'select COUNT(*) FROM
stay_order_results WHERE yrmonth = 201202 AND testid="HGH";'
17
Returned 1 row(s) in 90.87s

# For the same query, Hive is much faster
hive -e 'select COUNT(*) FROM stay_order_results WHERE yrmonth =
201202 AND testid="HGH";'
Total MapReduce CPU Time Spent: 4 seconds 40 msec
OK
17
Time taken: 18.185 seconds


If I try to count(*) over the whole table, Hive returns a count of 7M
rows in 18 minutes, while Impala seems to hang indefinitely. Looking at top,
the Impala process takes up to 3GB of RAM and 386% CPU (there are two
dual-core Xeons). Eventually the process abrt-hook-ccpp appears in top, and
Impala's CPU % decreases, but it never returns an answer.

The Impala daemon shows this error:

terminate called after throwing an instance of
'boost::thread_resource_error'
what(): boost::thread_resource_error
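boost::thread_resource_error is typically thrown when thread creation fails, often because the per-user process limit is too low; one way to check (my suggestion, not something confirmed in this thread):

ulimit -u                           # max user processes; threads count against this on Linux
ps -o nlwp= -p "$(pgrep impalad)"   # number of threads impalad currently has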


The output of top shows this:

top - 19:23:29 up 9:38, 5 users, load average: 173.24, 65.97, 24.68
Tasks: 215 total, 2 running, 213 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.6%us, 19.6%sy, 0.0%ni, 39.7%id, 28.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16331600k total, 13850712k used, 2480888k free, 218228k buffers
Swap: 3906552k total, 0k used, 3906552k free, 8124528k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
 7439 root     20  0 13384   928  792 R 99.3  0.0  0:32.73 abrt-hook-ccpp
 6524 endomine 20  0  324g  3.0g  26m D 37.9 19.5 11:37.95 impalad
 6732 root     20  0 15196  1308  940 R  0.7  0.0  0:00.80 top
 1176 root     20  0     0     0    0 S  0.3  0.0  0:01.40 kjournald


I am using CDH4 with CentOS 6.2 on a single node.

Are you able to tell if this is a bug or a misconfiguration?

David
