inline checksums
The current checksum implementation writes CRC32 values to a parallel
file. Unfortunately these parallel files pollute the namespace. In
particular, this places a heavier burden on the HDFS namenode.

Perhaps we should consider placing checksums inline in file data. For
example, we might write the data as a sequence of fixed-size
<checksum><payload> entries. This could be implemented as a FileSystem
wrapper, ChecksummedFileSystem. The create() method would return a
stream that uses a small buffer that checksums data as it arrives, then
writes the checksums in front of the data as the buffer is flushed. The
open() method could similarly check each buffer as it is read. The
seek() and length() methods would adjust for the interpolated checksums.
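
To make this concrete, the create() stream might look roughly like the sketch
below (the class name, 512-byte buffer size and big-endian 4-byte CRC layout
are only illustrative assumptions, not a settled design):

  import java.io.FilterOutputStream;
  import java.io.IOException;
  import java.io.OutputStream;
  import java.util.zip.CRC32;

  // Wraps the stream from the wrapped FileSystem and emits fixed-size
  // <checksum><payload> entries.
  public class InlineChecksumOutputStream extends FilterOutputStream {
    private static final int BUFFER_SIZE = 512;      // payload bytes per entry
    private final byte[] buffer = new byte[BUFFER_SIZE];
    private final CRC32 crc = new CRC32();
    private int count = 0;                           // bytes currently buffered

    public InlineChecksumOutputStream(OutputStream out) {
      super(out);
    }

    @Override
    public void write(int b) throws IOException {
      buffer[count++] = (byte) b;
      if (count == BUFFER_SIZE) {
        flushEntry();
      }
    }

    // Write one entry: a 4-byte CRC32 of the buffered payload, then the payload.
    private void flushEntry() throws IOException {
      if (count == 0) {
        return;
      }
      crc.reset();
      crc.update(buffer, 0, count);
      int sum = (int) crc.getValue();
      out.write(sum >>> 24);
      out.write(sum >>> 16);
      out.write(sum >>> 8);
      out.write(sum);
      out.write(buffer, 0, count);
      count = 0;
    }

    @Override
    public void close() throws IOException {
      flushEntry();          // final entry may be shorter than BUFFER_SIZE
      super.close();
    }
  }

Writing the checksum before the payload keeps each entry self-describing, so a
reader can verify a buffer before passing any of it along.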

Checksummed files could have their names suffixed internally with
something like ".hcs0". Checksum processing would be skipped for files
without this suffix, for back-compatibility and interoperability.
Directory listings would be modified to remove this suffix.
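
A rough sketch of the suffix handling (the suffix value and helper names are
only placeholders):

  // Internal name for a checksummed file vs. the name shown to users.
  public final class ChecksumNames {
    static final String SUFFIX = ".hcs0";

    static String toInternal(String userPath) {
      return userPath + SUFFIX;
    }

    static boolean isChecksummed(String internalPath) {
      return internalPath.endsWith(SUFFIX);
    }

    // Directory listings would strip the suffix; files without it pass
    // through untouched, which is what preserves back-compatibility.
    static String toUserVisible(String internalPath) {
      return isChecksummed(internalPath)
          ? internalPath.substring(0, internalPath.length() - SUFFIX.length())
          : internalPath;
    }
  }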

Existing checksum code in FileSystem.java could be removed, including
all 'raw' methods.

HDFS would use ChecksummedFileSystem. If block names were modified to
encode the checksum version, then datanodes could validate checksums.
(We could ensure that checksum boundaries are aligned with block
boundaries.)

We could have two versions of the local filesystem: one with checksums
and one without. The DFS shell could use the checksumless version for
exporting files, while MapReduce could use the checksummed version for
intermediate data.

S3 might use this, or might not, if we think that Amazon already
provides sufficient data integrity.

Thoughts?

Doug


  • Hairong Kuang at Jan 24, 2007 at 4:16 am
    Another option is to create a checksum file per block at the data node where
    the block is placed. This approach cleanly separates data from checksums and
    does not require many changes to open(), seek() and length(). For create(),
    when a block is written to a data node, the data node creates a checksum
    file at the same time.

    We could keep the same checksum file naming convention as now: a checksum
    file is named after its block file, with a "." prefix and a ".crc" suffix.

    On upgrade, the name node removes all checksum files from its namespace, and
    data nodes create a checksum file per block wherever one does not already
    exist.

    Hairong

    -----Original Message-----
    From: Doug Cutting
    Sent: Tuesday, January 23, 2007 3:51 PM
    To: hadoop-dev@lucene.apache.org
    Subject: inline checksums

    The current checksum implementation writes CRC32 values to a parallel file.
    Unfortunately these parallel files pollute the namespace. In particular,
    this places a heavier burden on the HDFS namenode.

    Perhaps we should consider placing checksums inline in file data. For
    example, we might write the data as a sequence of fixed-size
    <checksum><payload> entries. This could be implemented as a FileSystem
    wrapper, ChecksummedFileSystem. The create() method would return a stream
    that uses a small buffer that checksums data as it arrives, then writes the
    checksums in front of the data as the buffer is flushed. The
    open() method could similarly check each buffer as it is read. The
    seek() and length() methods would adjust for the interpolated checksums.

    Checksummed files could have their names suffixed internally with something
    like ".hcs0". Checksum processing would be skipped for files without this
    suffix, for back-compatibility and interoperability.
    Directory listings would be modified to remove this suffix.

    Existing checksum code in FileSystem.java could be removed, including all
    'raw' methods.

    HDFS would use ChecksummedFileSystem. If block names were modified to
    encode the checksum version, then datanodes could validate checksums.
    (We could ensure that checksum boundaries are aligned with block
    boundaries.)

    We could have two versions of the local filesystem: one with checksums and
    one without. The DFS shell could use the checksumless version for exporting
    files, while MapReduce could use the checksummed version for intermediate
    data.

    S3 might use this, or might not, if we think that Amazon already provides
    sufficient data integrity.

    Thoughts?

    Doug
  • Doug Cutting at Jan 24, 2007 at 4:25 am

    Hairong Kuang wrote:
    > Another option is to create a checksum file per block at the data node
    > where the block is placed.

    Yes, but then we'd need a separate checksum implementation for
    intermediate data, and for other distributed filesystems that don't
    already guarantee end-to-end data integrity. Also, a checksum per block
    would not permit checksums on randomly accessed data without
    re-checksumming the entire block. Finally, the checksum wouldn't be
    end-to-end. We really want to checksum data as close to its source as
    possible, then validate that checksum as close to its use as possible.

    Doug
  • Jim White at Jan 24, 2007 at 9:26 am

    Doug Cutting wrote:
    > Hairong Kuang wrote:
    > > Another option is to create a checksum file per block at the data
    > > node where the block is placed.
    >
    > Yes, but then we'd need a separate checksum implementation for
    > intermediate data, and for other distributed filesystems that don't
    > already guarantee end-to-end data integrity. Also, a checksum per block
    > would not permit checksums on randomly accessed data without
    > re-checksumming the entire block. Finally, the checksum wouldn't be
    > end-to-end. We really want to checksum data as close to its source as
    > possible, then validate that checksum as close to its use as possible.

    I'm guessing the big impediment is lack of support in Java, but it seems
    like this would be a good application for the extended attributes/alternate
    forks/streams that so many file systems support these days.

    JSR-203 ("NIO.2") adds multiple fork support and was approved more than
    three years ago:

    http://jcp.org/en/jsr/detail?id=203

    At the time it was slated for JDK 1.5, but then got deferred until Java
    7. The story is tortured:

    http://forums.java.net/jive/thread.jspa?threadID=298&messageID=12696
    http://en.wikipedia.org/wiki/New_I/O

    With the open-sourcing of Java, it seems like the code for NIO.2 should
    be available now or soon, though.

    Ahh, here we go, Eclipse File System.

    http://eclipsezone.com/eclipse/forums/t83786.html

    This makes it sound like it might actually work:

    http://wiki.eclipse.org/index.php/EFS#Local_file_system

    Egad, there are more:

    Extended Filesystem API (WebNFS)
    http://docs.sun.com/app/docs/doc/806-1067/6jacl3e6g?a=view

    NetBeans Filesystem API
    http://www.netbeans.org/download/dev/javadoc/org-openide-filesystems/org/openide/filesystems/doc-files/api.html

    Apache Commons VFS
    http://jakarta.apache.org/commons/vfs/index.html

    New I/O: Improved filesystem interface
    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4313887

    JSR 203: More New I/O APIs for the Java Platform ("NIO.2")
    http://jcp.org/en/jsr/detail?id=203

    IBM's AIO4J looks like a partial implementation, but focused on the
    asynchronous portion of the new API.

    http://alphaworks.ibm.com/tech/aio4j

    Somewhere in all that, it seems like there should be a nifty way to handle
    this. But I can see that sorting it out is a big job. What a mess.

    *sigh*

    Jim
  • Hairong Kuang at Jan 24, 2007 at 6:24 pm
    If end-to-end is a concern, we could let the client generate the checksums
    and send them to the data node following the block data.
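
    A bare-bones sketch of that framing (the class, method and wire layout are
    hypothetical, not the actual datanode protocol):

      import java.io.DataOutputStream;
      import java.io.IOException;
      import java.util.zip.CRC32;

      class ClientBlockSender {
        // Send the block bytes first, then the checksum the client computed.
        static void sendBlock(DataOutputStream toDataNode, byte[] block)
            throws IOException {
          CRC32 crc = new CRC32();
          crc.update(block, 0, block.length);
          toDataNode.writeInt(block.length);          // so the receiver knows where data ends
          toDataNode.write(block);                    // block data
          toDataNode.writeInt((int) crc.getValue());  // checksum generated at the client
          toDataNode.flush();
        }
      }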

    Hairong

    -----Original Message-----
    From: Doug Cutting
    Sent: Tuesday, January 23, 2007 8:26 PM
    To: hadoop-dev@lucene.apache.org
    Subject: Re: inline checksums

    Hairong Kuang wrote:
    > Another option is to create a checksum file per block at the data node
    > where the block is placed.

    Yes, but then we'd need a separate checksum implementation for intermediate
    data, and for other distributed filesystems that don't already guarantee
    end-to-end data integrity. Also, a checksum per block would not permit
    checksums on randomly accessed data without re-checksumming the entire
    block. Finally, the checksum wouldn't be end-to-end. We really want to
    checksum data as close to its source as possible, then validate that
    checksum as close to its use as possible.

    Doug
  • Doug Cutting at Jan 25, 2007 at 7:28 pm

    Hairong Kuang wrote:
    > If end-to-end is a concern, we could let the client generate the
    > checksums and send them to the data node following the block data.

    I created a related issue in Jira:

    https://issues.apache.org/jira/browse/HADOOP-928

    The idea there is to first make it possible for FileSystems to opt in or
    out of the current checksum mechanism. Then, if a filesystem wishes to
    implement end-to-end checksums more efficiently than the provided
    generic implementation, it can do so. In HDFS it would be best to keep
    checksums aligned and stored with blocks, so that they can be validated
    on datanodes, and it would also be better if they don't consume names
    and blockids. These goals would be difficult to meet with a generic
    implementation.

    Doug
  • Raghu Angadi at Jan 24, 2007 at 6:29 pm

    Doug Cutting wrote:
    > Hairong Kuang wrote:
    > > Another option is to create a checksum file per block at the data
    > > node where the block is placed.
    >
    > Yes, but then we'd need a separate checksum implementation for
    > intermediate data, and for other distributed filesystems that don't
    > already guarantee end-to-end data integrity. Also, a checksum per block
    > would not permit checksums on randomly accessed data without
    > re-checksumming the entire block. Finally, the checksum wouldn't be
    > end-to-end. We really want to checksum data as close to its source as
    > possible, then validate that checksum as close to its use as possible.

    A DFS checksum need not cover an entire block. Checksums could be
    maintained by clients for every 64k of data (as in the GFS paper), which
    avoids reading the whole block. This ensures that there is no data
    corruption on disk, inside DFS, etc. The client protocol can be extended
    so that the client and datanode exchange and verify checksums, and clients
    like map/reduce can further verify the checksums received from DFSClient.
    This could still be end-to-end, verified hop-to-hop.
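
    A sketch of that per-chunk scheme (the 64 KB constant and class name are
    assumptions for illustration): the client computes one CRC32 per 64 KB of
    block data, so a reader only needs the checksums for the chunks it touches.

      import java.io.IOException;
      import java.io.InputStream;
      import java.util.ArrayList;
      import java.util.List;
      import java.util.zip.CRC32;

      class ChunkChecksummer {
        static final int CHUNK_SIZE = 64 * 1024;

        // One CRC32 per 64 KB chunk of the stream; the last chunk may be short.
        static List<Long> checksumChunks(InputStream in) throws IOException {
          List<Long> sums = new ArrayList<Long>();
          byte[] chunk = new byte[CHUNK_SIZE];
          int filled;
          while ((filled = readChunk(in, chunk)) > 0) {
            CRC32 crc = new CRC32();
            crc.update(chunk, 0, filled);
            sums.add(crc.getValue());
            if (filled < CHUNK_SIZE) {
              break;                      // end of stream
            }
          }
          return sums;
        }

        // Fill the chunk buffer, looping until it is full or the stream ends.
        private static int readChunk(InputStream in, byte[] chunk) throws IOException {
          int total = 0;
          while (total < chunk.length) {
            int n = in.read(chunk, total, chunk.length - total);
            if (n < 0) {
              break;
            }
            total += n;
          }
          return total;
        }
      }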

    How is the above different from your original proposal of inline checksums?
    The only difference I see is that inline checksums move checksum management
    to the client, while block checksums move it to the datanode. With block
    checksums, every file gets the benefit of checksums.

    Raghu.
  • Sameer Paranjpye at Jan 24, 2007 at 6:40 pm
    A checksum file per block would have many CRCs, one per 64k chunk or so
    in the block. So it would still permit random access. The datanode would
    only checksum the data accessed plus on average an extra 32k.
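
    For random access, the datanode would just round the requested byte range
    out to chunk boundaries and verify only those chunks, roughly as in this
    sketch (names and the 64 KB chunk size are assumptions):

      class ChunkRange {
        static final long CHUNK_SIZE = 64 * 1024;

        // First byte of the first chunk that overlaps the request.
        static long alignedStart(long requestOffset) {
          return (requestOffset / CHUNK_SIZE) * CHUNK_SIZE;
        }

        // One past the last byte of the last chunk that overlaps the request.
        static long alignedEnd(long requestOffset, long requestLength) {
          long end = requestOffset + requestLength;
          return ((end + CHUNK_SIZE - 1) / CHUNK_SIZE) * CHUNK_SIZE;
        }
      }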

    Also, if datanodes were to send the checksum after the data on a read,
    the client could validate the checksum and it would be end-to-end. The
    same would apply for writes.

    A checksummed filesystem that embeds checksums into data makes the data
    unapproachable by tools that don't anticipate checksums. In HDFS, data is
    accessible only via the HDFS client, so this is not an issue and the
    checksums can be stripped out before they reach clients. But for Local
    and S3, where data is accessible without going through Hadoop's FileSystem
    implementations, this is a problem.

    I'd much prefer a ChecksummedFile implementation which applications can
    use when working with filesystems that don't do checksums. Even in this
    implementation it's probably better to write a side checksum file as is
    done currently rather than pushing checksums into the data.


    Doug Cutting wrote:
    > Hairong Kuang wrote:
    > > Another option is to create a checksum file per block at the data
    > > node where the block is placed.
    >
    > Yes, but then we'd need a separate checksum implementation for
    > intermediate data, and for other distributed filesystems that don't
    > already guarantee end-to-end data integrity. Also, a checksum per block
    > would not permit checksums on randomly accessed data without
    > re-checksumming the entire block. Finally, the checksum wouldn't be
    > end-to-end. We really want to checksum data as close to its source as
    > possible, then validate that checksum as close to its use as possible.
    >
    > Doug
  • Tom White at Jan 24, 2007 at 9:52 pm

    Sameer Paranjpye wrote:
    > A checksummed filesystem that embeds checksums into data makes the data
    > unapproachable by tools that don't anticipate checksums. In HDFS, data is
    > accessible only via the HDFS client, so this is not an issue and the
    > checksums can be stripped out before they reach clients. But for Local
    > and S3, where data is accessible without going through Hadoop's FileSystem
    > implementations, this is a problem.

    For S3, it strikes me that we could put a checksum in the metadata for
    the block - this would be ignored by tools that aren't aware of it (if
    the data is also not block based - see
    http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg00695.html).
    Blocks are written to temporary files on disk before being sent to S3,
    so it would be straightforward to checksum them before calling S3.
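
    A rough sketch of that step (the metadata key and class name are invented;
    the actual S3 put call is left out): checksum the temporary block file and
    attach the value as user metadata on the object.

      import java.io.File;
      import java.io.FileInputStream;
      import java.io.IOException;
      import java.io.InputStream;
      import java.util.HashMap;
      import java.util.Map;
      import java.util.zip.CRC32;

      class S3BlockMetadata {
        // Compute a CRC32 of the local temp file and return it as S3 user metadata.
        static Map<String, String> checksumMetadata(File tempBlockFile) throws IOException {
          CRC32 crc = new CRC32();
          byte[] buf = new byte[8192];
          InputStream in = new FileInputStream(tempBlockFile);
          try {
            int n;
            while ((n = in.read(buf)) > 0) {
              crc.update(buf, 0, n);
            }
          } finally {
            in.close();
          }
          Map<String, String> metadata = new HashMap<String, String>();
          metadata.put("x-hadoop-block-crc32", Long.toString(crc.getValue())); // hypothetical key
          return metadata;
        }
      }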

    S3 actually provides MD5 hashes of objects, but this isn't guaranteed
    to be supported in the future
    (http://developer.amazonwebservices.com/connect/thread.jspa?messageID=51645),
    so we should use our own checksum metadata.

    Tom
