Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 0.20-append
Reporter: Thanh Do
- Component: data node
- Version: 0.20-append
- Summary: when a block is truncated during updateBlock, the length recorded in
ongoingCreates is not updated, which leads to a failed append.
# disks / datanode = 3
# failures = 2
failure type = crash
When/where failure happens = (see below)
1) Client writes to dn1-dn2-dn3. The write succeeds.
2) Now the client tries to append. It first calls dn1.recoverBlock(). This recoverBlock succeeds.
3) Suppose the pipeline is dn3-dn2-dn1. Client sends packet to dn3.
dn3 forwards the packet to dn2 and writes to its disk (i.e., dn3's disk).
Now, *dn2 crashes*, so dn1 has not received this packet.
4) Client calls dn1.recoverBlock() again, this time with dn3-dn1 in the pipeline.
dn1 then calls dn3.startBlockRecovery() to terminate the writer thread in dn3,
get the *in-memory* metadata info of the block, and verify that info against
the real file on disk.
dn3 maintains an in-memory data structure called *ongoingCreates* to record
information about blocks that are currently being created. Once a block is finalized,
its info is removed from *ongoingCreates*.
Now suppose that at the time dn3 receives the startBlockRecovery() request from dn1, it has
+ finished writing data to disk (hence, the block length on disk is 1024)
+ set the visible in-memory length (hence, the in-memory length is also 1024)
but it *has not* finalized the block, hence the block info is still in *ongoingCreates*.
(Note: because the writer thread is interrupted, the finalization never happens.)
Because of all of the above, dn3 gives dn1 info about the block with length 1024.
5. Now dn1 calls its own startBlockRecovery() successfully (because the on-disk
file length and in-memory file length match; both are 512 bytes).
6. Now dn1 has a sync list (block_X_GS1 at dn1 with length 512, block_X_GS1 at dn3 with length 1024).
It needs to make sure all dns in the pipeline agree on the new GS and length.
dn1 calls NN.nextGS() to get the new GS2. It forms the new block_X_GS2 with length 512, and
calls updateBlock on dn3 and itself.
7. dn3, on receiving the updateBlock request from dn1, does:
+ rename the block from block_X_GS1 ==> block_X_GS2
+ truncate the block file length from 1024 to 512
But, here is the key: it *does not update the length of the block kept in ongoingCreates*
+ return to dn1 successfully
8. Now dn1 calls its own updateBlock and *crashes*.
9. From the client's point of view, dn1.recoverBlock fails.
It retries calling dn1.recoverBlock six times, and declares dn1 bad.
10. Client now calls dn3.recoverBlock()
11. dn3 in turn calls its startBlockRecovery() to
+ interrupt block writer threads if any
+ getBlockMetadataInfo (as part of forming the syncList, and updateBlock later)
It first looks into ongoingCreates to see whether the block info is there, and finds it (because the block is not finalized).
Hence, the in-memory length is 1024 (even though truncateBlock was called before).
It then verifies the in-memory length (1024) against the on-disk length (512). Hence, the *un-matched file length exception*.
12. From the client's point of view, recoverBlock fails, because *All data nodes are bad*.
Client retries calling dn3.recoverBlock five more times and gets the same exception.
Hence, the append fails.
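
The core of steps 7 and 11 can be reproduced in a small model. This is only a sketch under stated assumptions: the class StaleLengthSketch, its two maps, and the method bodies are hypothetical stand-ins for the datanode's bookkeeping, not the actual 0.20-append code. It shows an updateBlock that shrinks the on-disk length but leaves the ongoingCreates entry alone, followed by a length check that fails.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of dn3. Names and structure are
// illustrative assumptions and do not mirror the real datanode code.
class StaleLengthSketch {
    // block id -> in-memory (visible) length of a block that is not finalized
    final Map<Long, Long> ongoingCreates = new HashMap<>();
    // block id -> length of the block file on disk
    final Map<Long, Long> onDiskLength = new HashMap<>();

    // Writer thread: data reaches disk and the visible length is set,
    // but the block is never finalized (the writer was interrupted).
    void writeWithoutFinalize(long blockId, long length) {
        onDiskLength.put(blockId, length);
        ongoingCreates.put(blockId, length);
    }

    // Step 7 as reported: the file is truncated, ongoingCreates is untouched.
    void updateBlock(long blockId, long newLength) {
        onDiskLength.put(blockId, newLength);
    }

    // Step 11: recovery finds the ongoingCreates entry and checks it against disk.
    long getBlockMetadataInfo(long blockId) throws IOException {
        long onDisk = onDiskLength.get(blockId);
        Long inMemory = ongoingCreates.get(blockId);
        if (inMemory != null && inMemory != onDisk) {
            throw new IOException("in-memory length " + inMemory
                    + " does not match on-disk length " + onDisk);
        }
        return onDisk;
    }
}
```

In this model, calling writeWithoutFinalize(blockId, 1024) and then updateBlock(blockId, 512) makes every later getBlockMetadataInfo(blockId) throw, which matches the repeated failures the client sees in step 12.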
- To fix it, I think that when truncating the file we need to update ongoingCreates too
(but I am not sure whether fixing it this way would affect any other workload).
- Interestingly, NN.leaseRecovery fails because of the same exception at dn3.
- Until a dead node restarts and NN.leaseRecovery is triggered again, NN is still the lease holder of the file.
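
The proposed fix might look roughly like the following. As before, this is a hedged sketch: TruncateFixSketch and its maps are hypothetical names invented for illustration, and whether the change is safe for other workloads is exactly the open question raised above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the proposed fix. Not the real datanode code.
class TruncateFixSketch {
    // block id -> in-memory (visible) length of a block that is not finalized
    final Map<Long, Long> ongoingCreates = new HashMap<>();
    // block id -> length of the block file on disk
    final Map<Long, Long> onDiskLength = new HashMap<>();

    // Data is on disk and visible, but the block was never finalized.
    void writeWithoutFinalize(long blockId, long length) {
        onDiskLength.put(blockId, length);
        ongoingCreates.put(blockId, length);
    }

    void updateBlock(long blockId, long newLength) {
        // Truncate the on-disk file.
        onDiskLength.put(blockId, newLength);
        // Proposed change: keep the in-memory record in step with the file,
        // but only if the block is still tracked in ongoingCreates.
        if (ongoingCreates.containsKey(blockId)) {
            ongoingCreates.put(blockId, newLength);
        }
    }

    // True when the block is untracked or both lengths agree.
    boolean lengthsMatch(long blockId) {
        Long inMemory = ongoingCreates.get(blockId);
        return inMemory == null || inMemory.equals(onDiskLength.get(blockId));
    }
}
```

With this change, a later recovery on dn3 would see 512 both in memory and on disk, so the length check in step 11 would pass in this model.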
This bug was found by our Failure Testing Service framework:
For questions, please email us: Thanh Do (email@example.com) and
Haryadi Gunawi (firstname.lastname@example.org)