On Fri, Jul 1, 2011 at 1:07 PM, Shuja Rehman wrote:
I have a job where I need to read from one HBase table, perform aggregations,
and write back to another HBase table. For it, I am using
TableMapReduceUtil.initTableReducerJob. In the reducer, if I use
put.setWriteToWAL(false), the job completes within seconds, but with the WAL
enabled it takes approximately 30 minutes. Why is there such a huge difference
in performance? I would like to complete the same job within seconds while
using put.setWriteToWAL(true) to prevent data loss. So kindly let me know what
other optimizations I can do?
Don't disable WAL. You are just going to shoot yourself in the foot
if you leave it off.
The difference in perf is that with the WAL enabled you are writing every
edit to the filesystem first, before anything else is done.
Try playing with deferred sync'ing of writes. You need to set your
table to do deferred flushes by setting the DEFERRED_LOG_FLUSH table
attribute on your table. Once set, rather than sync every write,
we'll sync on a period. The default is to sync every second. Here is
the setting in hbase-default.xml:
<property>
  <name>hbase.regionserver.optionallogflushinterval</name>
  <value>1000</value>
  <description>Sync the HLog to the HDFS after this interval if it has not
  accumulated enough entries to trigger a sync. Default 1 second. Units:
  milliseconds.</description>
</property>
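As a sketch of turning the attribute on from Java (against the 0.90-era
client API; the table name "target_table" is a stand-in, not from this
thread), it looks something like:

```java
// Sketch: enable deferred WAL flush on an existing table.
// Assumes an HBase 0.90-era client; "target_table" is a placeholder name.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableDeferredFlush {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] table = Bytes.toBytes("target_table");

    HTableDescriptor desc = admin.getTableDescriptor(table);
    desc.setDeferredLogFlush(true);  // sets the DEFERRED_LOG_FLUSH attribute

    // The table must be disabled before its descriptor can be modified.
    admin.disableTable(table);
    admin.modifyTable(table, desc);
    admin.enableTable(table);
  }
}
```

The same change can be made from the hbase shell with alter on a disabled
table; either way it is a per-table attribute, not a site-wide setting.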
Now if you crash, instead of losing massive chunks of your job, you
will lose at most the last second's worth of writes, and in
compensation you should see faster writing.
Also, what is slow? The writes or the reads? How many reducers? If
you up the number does that help?
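To try upping the reducer count, the job setup would change along these
lines (a hedged sketch: the table name is a placeholder, and
IdentityTableReducer stands in for whatever aggregating reducer the job
actually uses):

```java
// Sketch: configure the table-reducer job with more reduce tasks.
// "target_table" and IdentityTableReducer are stand-ins for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class AggregationJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "aggregate");
    // ... mapper setup via TableMapReduceUtil.initTableMapperJob(...) ...
    TableMapReduceUtil.initTableReducerJob("target_table",
        IdentityTableReducer.class, job);
    // More reducers means more parallel writers into the target table.
    job.setNumReduceTasks(8);
    job.waitForCompletion(true);
  }
}
```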
Try playing with hbase.regionserver.optionallogflushinterval. If you
set your table so it does deferred flushes, this interval controls how
often the log is synced.
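For instance, the interval could be shortened in hbase-site.xml to trade a
little write throughput for a smaller loss window (the 500ms value here is
illustrative, not a recommendation from this thread):

```xml
<!-- hbase-site.xml: sync deferred-flush WAL edits every 500ms
     instead of the default 1000ms -->
<property>
  <name>hbase.regionserver.optionallogflushinterval</name>
  <value>500</value>
</property>
```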