Hi folks. I'd like to run the following data loss scenario by you to see if
we are doing something obviously wrong with our setup here.
- Hadoop 0.20.1
- HBase 0.20.3
- 1 master node running the NameNode, SecondaryNameNode, JobTracker,
HMaster, and 1 ZooKeeper instance (no ZooKeeper quorum right now)
- 4 child nodes, each running a DataNode, TaskTracker, and RegionServer
- dfs.replication is set to 2
- Host: Amazon EC2
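For reference, the replication factor mentioned above is set in hdfs-site.xml on every node; a minimal fragment looks like this (the property name is standard in Hadoop 0.20):

```xml
<!-- hdfs-site.xml: number of copies HDFS keeps of each block -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```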
Up until yesterday, we were frequently experiencing an issue that kept
bringing our RegionServers down. What we realized, though, is that we were
losing data (a few hours' worth) with just one out of four RegionServers
going down. This is problematic since dfs.replication is 2 across our 4
nodes, so at least one other node should theoretically be able to serve the
data that the downed RegionServer can't.
- When a RegionServer goes down unexpectedly, the only data that
theoretically gets lost is whatever didn't make it to the WAL, right? Or is
our understanding off here?
- We ran a hadoop fsck on our cluster and verified the replication factor,
as well as that there were no under-replicated blocks. So why was our data
not available from another node?
- If the log gets rolled every 60 minutes by default (we haven't touched
the defaults), how can we lose data from up to 24 hours ago?
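For context, the 60-minute interval we're referring to is, as far as we can tell, controlled by hbase.regionserver.logroll.period (in milliseconds) in hbase-site.xml; the shipped default would look like:

```xml
<!-- hbase-site.xml: how often the WAL is rolled, in milliseconds -->
<property>
  <name>hbase.regionserver.logroll.period</name>
  <value>3600000</value> <!-- 60 minutes -->
</property>
```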
- When the downed regionserver comes back up, shouldn't that data be
available again? Ours wasn't.
- In such scenarios, is there a recommended approach for restoring a
RegionServer that goes down? At first we just brought them back up by
logging on to the node itself and manually restarting them. Now we have
automated cron jobs that listen on their ports and restart them within two
minutes if they go down.
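The port check our cron does is roughly the following sketch (Python). The host, the restart-script path, and the use of port 60020 (the RegionServer default in 0.20) are assumptions, not our exact setup:

```python
import socket
import subprocess

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_and_restart(host="localhost", port=60020):
    """If the RegionServer port isn't answering, restart the daemon.

    The script path below is a placeholder for wherever HBase is
    installed on the node; hbase-daemon.sh ships with HBase.
    """
    if not port_open(host, port):
        subprocess.call(["/usr/local/hbase/bin/hbase-daemon.sh",
                         "start", "regionserver"])
```

Running check_and_restart from cron every couple of minutes gives the behavior described above, though it obviously can't distinguish a crashed process from a temporarily unresponsive one.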
- Is there any way to recover such lost data?
- Are versions 0.89 / 0.90 addressing any of these issues?
- Curiosity question: when a RegionServer goes down, does the master try
to replicate that node's data on another node to satisfy the dfs.replication
factor?
For now, we have upgraded our HBase to 0.20.6, which is supposed to contain
the HBASE-2077 <https://issues.apache.org/jira/browse/HBASE-2077> fix (but
we haven't verified that yet). Lars' blog also suggests that Hadoop 0.21.0
is the way to go to avoid the file-append issues, but it's not
production-ready yet. Should we stick with Hadoop 0.20.1, or upgrade to
0.20.2?
Any tips here are definitely appreciated. I'll be happy to provide more
information as well.