Reg HDFS checksum
Hi All,

This is a question regarding "HDFS checksum" computation.

I understand that when we read a file from HDFS, the checksum is verified by default and the read will not succeed if the file is corrupted. I also understand that the CRC data is internal to Hadoop.

Here are my questions:
1. How do I use the "hadoop dfs -get [-ignoreCrc] [-crc] <src> <localdst>" command? (Example invocations are sketched after question 3 below.)

2. I ran the "get" command on a .gz file with the -crc option ("hadoop dfs -get -crc input1/test.gz /home/hadoop/test/."). Does this look for the .crc file that Hadoop creates? When I tried it, I got the warning
"-crc option is not valid when source file system does not have crc files. Automatically turn the option off."
Does that mean Hadoop did not create a CRC file for this file? Is this correct?

3. How can I make Hadoop create a CRC file?
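
For concreteness, the two variants I am asking about look roughly like this (the paths are from my test; the -ignoreCrc form is shown only for illustration):

# Copy to the local filesystem, skipping checksum verification entirely:
hadoop dfs -get -ignoreCrc input1/test.gz /home/hadoop/test/
# Copy the file together with its .crc sidecar file, if the source
# filesystem keeps one:
hadoop dfs -get -crc input1/test.gz /home/hadoop/test/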

Regards,

Thamizhannal P


  • Harsh J at Apr 9, 2011 at 9:51 am
    Hello Thamizh,

    Perhaps the discussion in the following link can shed some light on
    this: http://getsatisfaction.com/cloudera/topics/hadoop_fs_crc
    --
    Harsh J
  • Thamizh at Apr 9, 2011 at 2:38 pm
    Hi Harsh ,
    Thanks a lot for your reference.
    I would like to know how Hadoop computes the CRC for a file. If you have a reference, please share it; it would be a great help to me.

    Regards,

    Thamizhannal P

  • Josh Patterson at Apr 11, 2011 at 2:23 pm
    Thamizh,
    For a much older project I wrote a demo tool that computed the hadoop
    style checksum locally:

    https://github.com/jpatanooga/IvoryMonkey

    The checksum generator is a single-threaded replica of Hadoop's internal
    distributed hash-checksum mechanism.

    What it actually does is save the CRC32 of every 512 bytes (per
    block) and then compute an MD5 hash over those CRCs. When the
    "getFileChecksum()" method is called, each block of the file sends its
    MD5 hash to a collector, where the block hashes are gathered together
    and an MD5 hash is computed over all of them.

    My version includes code that can calculate the hash on the client
    side (it breaks things up the same way HDFS does and computes the
    hash the same way).

    During development, we also discovered and filed:

    https://issues.apache.org/jira/browse/HDFS-772

    To invoke this method, use my shell wrapper:

    https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/Shell.java

    Hope this provides some reference information for you.
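
    As a rough illustration of the scheme above, here is a minimal
    single-process sketch: a CRC32 per 512-byte chunk, an MD5 over each
    block's CRCs, then an MD5 over the per-block MD5s. The 64 MB block size
    and the 4-byte big-endian CRC layout are assumptions here, so its output
    will only match HDFS's own getFileChecksum() when those match the
    cluster's settings; see the ExternalHDFSChecksumGenerator code for the
    faithful version.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.util.zip.CRC32;

    public class Md5Md5Crc32Sketch {
        static final int BYTES_PER_CRC = 512;             // CRC32 per 512-byte chunk
        static final long BLOCK_SIZE = 64L * 1024 * 1024; // assumed 64 MB block size

        public static byte[] fileChecksum(String path) throws Exception {
            MessageDigest blockMd5 = MessageDigest.getInstance("MD5");
            MessageDigest fileMd5 = MessageDigest.getInstance("MD5");
            byte[] chunk = new byte[BYTES_PER_CRC];
            long bytesInBlock = 0;
            try (InputStream in = new FileInputStream(path)) {
                int n;
                while ((n = in.read(chunk)) > 0) {
                    CRC32 crc = new CRC32();               // checksum this chunk
                    crc.update(chunk, 0, n);
                    long v = crc.getValue();
                    blockMd5.update(new byte[] {           // 4-byte big-endian CRC (assumption)
                            (byte) (v >>> 24), (byte) (v >>> 16),
                            (byte) (v >>> 8), (byte) v });
                    bytesInBlock += n;
                    if (bytesInBlock >= BLOCK_SIZE) {      // block boundary: MD5 its CRCs
                        fileMd5.update(blockMd5.digest()); // digest() also resets blockMd5
                        bytesInBlock = 0;
                    }
                }
            }
            if (bytesInBlock > 0) {                        // trailing partial block
                fileMd5.update(blockMd5.digest());
            }
            return fileMd5.digest();                       // MD5 of the per-block MD5s
        }

        public static void main(String[] args) throws Exception {
            StringBuilder hex = new StringBuilder();
            for (byte b : fileChecksum(args[0])) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(hex);
        }
    }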
    --
    Twitter: @jpatanooga
    Solution Architect @ Cloudera
    hadoop: http://www.cloudera.com
    blog: http://jpatterson.floe.tv
  • Thamizh at Apr 12, 2011 at 11:33 am
    Thanks a lot, Josh.

    I have been given a .gz file and told that it was downloaded from HDFS.

    When I tried to verify the integrity of that file using "gzip -t", it failed with "invalid compressed data--format violated", and "gzip -d" gave the same result.

    I am a bit worried about Hadoop's CRC checking mechanism, so I am looking to implement an external CRC checker for Hadoop.

    Regards,

    Thamizhannal P

  • Josh Patterson at Apr 12, 2011 at 2:06 pm
    If you take a look at:

    https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java

    you'll see a single-process version of what HDFS does under the hood,
    albeit done there in a highly distributed fashion. What's going on is
    that for every 512 bytes a CRC32 is computed and saved at the local
    datanode for that block. When the checksum is requested, each block's
    CRC32s are pulled together and MD5-hashed, and that hash is sent to the
    client process. The client process then MD5-hashes all of these block
    hashes together to produce the final hash.

    For some context: our reason for this on the openPDC project was that we
    had some legacy software writing to HDFS through an FTP proxy bridge:

    https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/HdfsBridge/

    Since the openPDC data was ultra-critical in that we could not lose
    *any* data, and the team wanted to use a simple FTP client library to
    write to HDFS (the least amount of work for them, standard libs), we
    needed a way to make sure that no corruption occurred during the "hop"
    through the FTP bridge (it acted as an intermediary to DFSClient;
    something could fail and the file might end up slightly truncated, which
    is hard to detect). In the FTP bridge we allowed a custom FTP command to
    call the now-exposed "hdfs-checksum" command, and the sending agent
    could then compute the hash locally (in the case of the openPDC it was
    done in C#) and make sure the file made it there intact. This system
    has been in production for over a year now, storing and maintaining
    smart-grid data, and it has been highly reliable.

    I say all of this to say: after having dug through HDFS's checksumming
    code, I am pretty confident that it is good stuff, although I don't
    claim to be a filesystem expert by any means. Could it simply be some
    error or oversight in your process?
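
    As a sketch of that verification step, the receiving side can ask HDFS
    for its checksum through the public FileSystem.getFileChecksum() API and
    the sender can compare it with a locally computed MD5-of-MD5-of-CRC32
    value (e.g. from IvoryMonkey). The path below is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VerifyUpload {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path remote = new Path(args.length > 0 ? args[0] : "/archive/upload.dat");
            // HDFS gathers the per-block MD5s from the datanodes and returns one
            // MD5-of-MD5-of-CRC32 checksum for the whole file; this may be null
            // on filesystems that do not support checksums.
            FileChecksum remoteSum = fs.getFileChecksum(remote);
            System.out.println(remoteSum.getAlgorithmName() + " " + remoteSum);
        }
    }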
    --
    Twitter: @jpatanooga
    Solution Architect @ Cloudera
    hadoop: http://www.cloudera.com
    blog: http://jpatterson.floe.tv
  • Steve Loughran at Apr 12, 2011 at 3:53 pm

    Assuming it came down over HTTP, it's perfectly conceivable that
    something went wrong on the way, especially if a proxy server got
    involved. All HTTP checks is that the (optional) content length is
    consistent with what arrived; it relies on TCP checksums, which verify
    that the network links work, but not the other parts of the system in
    the path (like any proxy server).
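
    A minimal sanity check along these lines, assuming illustrative paths:
    compare the local copy's size against the size HDFS reports, since a
    truncation somewhere in the transfer path shows up as a length mismatch
    even when the transfer itself appears to succeed.

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LengthCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            long hdfsLen = fs.getFileStatus(new Path("/user/hadoop/input1/test.gz")).getLen();
            long localLen = new File("/home/hadoop/test/test.gz").length();
            System.out.println(hdfsLen == localLen
                    ? "lengths match"
                    : "length mismatch -- possible truncation in transit");
        }
    }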
