[
https://issues.apache.org/jira/browse/HADOOP-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583460#action_12583460 ]
Jim Kellerman commented on HADOOP-3113:
---------------------------------------
dhruba borthakur - 29/Mar/08 11:03 PM
I can make this configurable per file, but maybe it makes more sense to make it apply to the
entire system. The reason is that datanodes do not know much about the name of the HDFS file
that a block belongs to, so making this configurable "per file" would require a lot of protocol changes.
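(A cluster-wide switch would just be a boolean that every datanode reads from its Configuration at startup. The sketch below is illustrative only; the property name "dfs.datanode.write.block.direct" is hypothetical and not necessarily the one used in noTmpFile.patch.)
{code}
// Hypothetical sketch of a cluster-wide (not per-file) switch read by each datanode.
// The property name below is made up for illustration; see the attached patch for
// the actual configuration knob.
import org.apache.hadoop.conf.Configuration;

public class DirectWriteConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();   // picks up hadoop-site.xml
    boolean writeDirect =
        conf.getBoolean("dfs.datanode.write.block.direct", false);
    System.out.println("write directly to the real block file: " + writeDirect);
    // A per-file setting would instead have to travel from the client through the
    // write pipeline to every datanode, which is the protocol change mentioned above.
  }
}
{code}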
I don't think it needs to be per file. Aside from our redo log, other files are written and then immediately
closed and re-opened for read.
If a client dies while writing to the last block of that file, that block is not yet part of the blocksMap in the
namenode. (A block gets inserted into the blocksMap when a complete block is received by the datanode
and it sends a blockReceived message to the namenode.) If the lease for this file on the namenode
expires before the block report from the datanode arrives, then the namenode will erroneously think
that no datanodes have a copy of that block. As part of lease recovery, the namenode will delete the
last block of the file because it has no entry in the blocksMap. To prevent this from occurring, the block
report periodicity should be set to 30 minutes.
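(Shortening the block report interval is a one-line configuration change. The sketch below assumes the dfs.blockreport.intervalMsec key and the roughly one-hour hard lease limit of current trunk; double-check both against your hadoop-default.xml.)
{code}
// Sketch: set the block report interval to 30 minutes so that a report for a
// partially written block reaches the namenode well before the hard lease limit
// (about one hour) can expire. Key name and default are assumed, not verified here.
import org.apache.hadoop.conf.Configuration;

public class BlockReportIntervalSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    long thirtyMinutesMsec = 30L * 60 * 1000;
    conf.setLong("dfs.blockreport.intervalMsec", thirtyMinutesMsec);
    System.out.println("block report interval (msec): "
        + conf.getLong("dfs.blockreport.intervalMsec", 60L * 60 * 1000));
  }
}
{code}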
I think this is ok, but let me give a scenario to verify that my understanding is correct.
We open our redo log and flush it either every N seconds or after M records have been written.
If the process writing the log crashes, we will notice much sooner than the file lease timeout.
At that point another process should be able to open the file for read; all flushed data
will be visible, and unflushed data will not be. Since the amount of unflushed data should be small,
the amount of data lost should be minimal. Once the redo log has been read and processed,
the file will be deleted by the process reading it.
If this is how this patch works, +1.
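(To make the scenario above concrete, here is a rough sketch of that write/flush/recover pattern against the FileSystem API. The path, record format, and flush cadence are placeholders, not HBase's actual log code.)
{code}
// Sketch of the redo-log pattern described above: one process writes and flushes
// every M records; after a crash, a second process reads whatever was flushed and
// then deletes the log. Paths and record format are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RedoLogSketch {

  // Writer process: flush every 100 records so at most ~100 unflushed records
  // can be lost if this process crashes before the file is closed.
  static void writeLog(FileSystem fs, Path log) throws Exception {
    FSDataOutputStream out = fs.create(log);
    for (int i = 0; i < 1000; i++) {
      out.writeBytes("edit-" + i + "\n");
      if (i % 100 == 99) {
        out.flush();   // with this patch, flushed bytes reach the real block file
      }
    }
    // Deliberately no close(): the interesting case is the writer dying here.
  }

  // Recovery process: open the same file for read, replay what was flushed,
  // then delete the log once it has been processed.
  static void recoverLog(FileSystem fs, Path log) throws Exception {
    FSDataInputStream in = fs.open(log);
    byte[] buf = new byte[4096];
    for (int n = in.read(buf); n > 0; n = in.read(buf)) {
      System.out.write(buf, 0, n);   // replay of the flushed edits would go here
    }
    System.out.flush();
    in.close();
    fs.delete(log, false);
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path log = new Path("/tmp/redo.log");   // placeholder path
    if (args.length > 0 && "recover".equals(args[0])) {
      recoverLog(fs, log);
    } else {
      writeLog(fs, log);
    }
  }
}
{code}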
Provide a configurable way for DFSOutputStream.flush() to flush data to the real block file on the DataNode.
---------------------------------------------------------------------------------------------------
Key: HADOOP-3113
URL: https://issues.apache.org/jira/browse/HADOOP-3113
Project: Hadoop Core
Issue Type: Bug
Components: dfs
Reporter: dhruba borthakur
Assignee: dhruba borthakur
Attachments: noTmpFile.patch
DFSOutputStream has a method called flush() that persists block locations on the namenode and sends all outstanding data to all datanodes in the pipeline. However, this data goes to a tmp file on the datanode(s). When the block is closed, the tmp file is renamed to be the real block file. If the datanode(s) die before the block is complete, the entire block is lost. This behaviour will be fixed in HADOOP-1700.
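(As a toy illustration of why the tmp file loses flushed data, here is a sketch of the rename-on-close behaviour described above; the file names and layout are made up, and this is not the datanode's actual code.)
{code}
// Toy illustration of the current behaviour: flushed bytes land in a tmp file that
// only becomes the real block file when the block is closed, so a crash before the
// rename discards them. File names are made up; this is not the datanode code.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class TmpBlockFileSketch {
  public static void main(String[] args) throws IOException {
    File tmp = new File("blk_1234.tmp");   // where flushed bytes currently go
    File real = new File("blk_1234");      // the real block file
    FileOutputStream out = new FileOutputStream(tmp);
    out.write("flushed-by-the-client".getBytes());
    out.flush();
    // If the datanode dies here, blk_1234 does not exist yet and the tmp file is
    // cleaned up, so the flushed bytes are lost. The proposed configuration
    // parameter makes the datanode write to blk_1234 directly instead.
    out.close();
    if (!tmp.renameTo(real)) {             // rename happens only on block close
      throw new IOException("could not rename " + tmp + " to " + real);
    }
  }
}
{code}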
However, in the short term, a configuration parameter can be used to allow datanodes to write to the real block file directly, thereby avoiding writing to the tmp file. This means that data that is flushed successfully by a client does not get lost even if the datanode(s) or client dies.
The Namenode already has code to pick the largest replica (if multiple datanodes have different sizes of this block). Also, the namenode has code to avoid triggering a replication request if the file is still being written to.
The only caveat that I can think of is that the block report periodicity should be much smaller than the lease timeout period. A block report adds the blocks being written to the blocksMap, thereby avoiding any cleanup that lease expiry processing might otherwise have done.
Not all requirements specified by HADOOP-1700 are supported by this approach, but it could still be helpful (in the short term) for a wide range of applications.