If you take a look at:
https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java
you'll see a single-process version of what HDFS does under the hood
(HDFS does it in a highly distributed fashion). What's going on here is
that for every 512 bytes a CRC32 is calculated and saved at each local
datanode for that block. When the "checksum" is requested, these CRC32s
are pulled together and MD5-hashed per block, and that per-block MD5 is
sent to the client process. The client process then MD5-hashes all of
these block hashes together to produce a final hash.
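As a rough illustration of that hierarchy, here is a single-process sketch in plain JDK Java. It assumes 512 bytes per CRC and a 64 MB block size purely for illustration; the generator linked above, like HDFS itself, works from the cluster's actual chunk and block parameters and from the CRCs already stored in each block's metadata.

import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.zip.CRC32;

// Single-process sketch of the scheme described above: a CRC32 per 512-byte
// chunk, an MD5 over each block's CRCs, and a final MD5 over the per-block MD5s.
public class ChecksumSketch {
    static final int BYTES_PER_CRC = 512;
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // assumed 64 MB block size

    public static byte[] fileChecksum(InputStream in) throws Exception {
        MessageDigest fileMd5 = MessageDigest.getInstance("MD5");
        MessageDigest blockMd5 = MessageDigest.getInstance("MD5");
        CRC32 crc = new CRC32();
        byte[] chunk = new byte[BYTES_PER_CRC];
        long bytesInBlock = 0;
        int n;
        while ((n = readChunk(in, chunk)) > 0) {
            crc.reset();
            crc.update(chunk, 0, n);
            // feed this chunk's CRC32 into the current block's MD5
            blockMd5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
            bytesInBlock += n;
            if (bytesInBlock >= BLOCK_SIZE) {
                fileMd5.update(blockMd5.digest()); // digest() also resets blockMd5
                bytesInBlock = 0;
            }
        }
        if (bytesInBlock > 0) {
            fileMd5.update(blockMd5.digest());
        }
        return fileMd5.digest(); // MD5 of the per-block MD5s
    }

    // fill the buffer with up to chunk.length bytes, returning the count read
    private static int readChunk(InputStream in, byte[] chunk) throws Exception {
        int off = 0;
        while (off < chunk.length) {
            int r = in.read(chunk, off, chunk.length - off);
            if (r < 0) break;
            off += r;
        }
        return off;
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new FileInputStream(args[0]);
        byte[] digest = fileChecksum(in);
        in.close();
        for (byte b : digest) System.out.printf("%02x", b);
        System.out.println();
    }
}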
For some context: on the openPDC project we had some legacy software
writing to HDFS through an FTP proxy bridge:
https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/HdfsBridge/
Since the openPDC data was ultra-critical in that we could not lose
*any* data, and the team wanted to use a simple FTP client lib to
write to HDFS (least amount of work for them, standard libs), we
needed a way to make sure that no corruption occurred during the "hop"
through the FTP bridge (the bridge acted as an intermediary to
DFSClient; something could fail and the file might be slightly
truncated, which is hard to detect). In the FTP bridge we allowed a
custom FTP command to call the now-exposed "hdfs-checksum" command,
and the sending agent could then compute the hash locally (in the case
of the openPDC this was done in C#) and make sure the file made it
there intact. This system has been in production for over a year now
storing and maintaining smart grid data and has been highly reliable.
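If you want to reproduce that kind of end-to-end check against a cluster yourself, the comparison boils down to something like the sketch below. computeLocalChecksum() is a hypothetical stand-in for whatever local implementation you use (IvoryMonkey's generator, or a port on the sending side), and it has to produce its bytes in the same layout that FileChecksum.getBytes() reports for your Hadoop version.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: compare the checksum HDFS reports for the stored file against a
// checksum computed locally over the original file, before the "hop".
public class VerifyUpload {

    public static boolean intact(String localFile, String hdfsPath) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // MD5-of-MD5-of-CRC32 checksum as reported by the cluster
        FileChecksum remote = fs.getFileChecksum(new Path(hdfsPath));
        // hypothetical local equivalent; must serialize its result the same way
        byte[] local = computeLocalChecksum(localFile);
        return remote != null && Arrays.equals(local, remote.getBytes());
    }

    // placeholder for your local implementation (IvoryMonkey, a C# port, etc.)
    static byte[] computeLocalChecksum(String file) throws Exception {
        throw new UnsupportedOperationException("plug in your local checksum code");
    }
}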
I say all of this to say: after having dug through HDFS's checksumming
code, I am pretty confident that it's Good Stuff, although I don't
claim to be a filesystem expert by any means. Could the problem be a
simple error or oversight somewhere in your process?
On Tue, Apr 12, 2011 at 7:32 AM, Thamizh wrote:
Thanks a lot Josh.
I have been given a .gz file and been told that it was downloaded from HDFS.
When I tried to check the integrity of that file using "gzip -t", it ended up with "invalid compressed data--format violated", and "gzip -d" gave the same result.
I am a bit worried about Hadoop's CRC checking mechanism, so I am looking to implement an external CRC checker for Hadoop.
Regards,
Thamizhannal P
--- On Mon, 11/4/11, Josh Patterson wrote:
From: Josh Patterson <josh@cloudera.com>
Subject: Re: Reg HDFS checksum
To: common-user@hadoop.apache.org
Cc: "Thamizh" <tcegrid@yahoo.co.in>
Date: Monday, 11 April, 2011, 7:53 PM
Thamizh,
For a much older project I wrote a demo tool that computed the
Hadoop-style checksum locally:
https://github.com/jpatanooga/IvoryMonkey
The checksum generator is a single-threaded replica of Hadoop's
internal distributed hash-checksum mechanism.
What it's actually doing is saving the CRC32 of every 512 bytes (per
block) and then computing an MD5 hash over those CRCs. When the
"getFileChecksum()" method is called, each block of the file sends its
MD5 hash to a collector, where they are gathered together and an MD5
hash is calculated over all of the block hashes.
My version includes code that can calculate the hash on the client
side (it breaks the data up in the same way HDFS does and calculates
the hash the same way).
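For reference, the server-side value that the client-side code gets checked against comes straight from the FileSystem API; a minimal way to fetch and print it (roughly what the shell wrapper mentioned below does) might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: ask the cluster for a file's aggregated checksum and print it.
public class PrintHdfsChecksum {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileChecksum sum = fs.getFileChecksum(new Path(args[0]));
        if (sum == null) {
            System.out.println("no checksum available for " + args[0]);
            return;
        }
        // algorithm name (encodes the bytes-per-CRC and CRCs-per-block
        // parameters) followed by the digest bytes in hex
        System.out.println(sum.getAlgorithmName());
        StringBuilder hex = new StringBuilder();
        for (byte b : sum.getBytes()) hex.append(String.format("%02x", b));
        System.out.println(hex);
    }
}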
During development, we also discovered and filed:
https://issues.apache.org/jira/browse/HDFS-772
To invoke this method, use my shell wrapper:
https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/Shell.java
Hope this provides some reference information for you.
On Sat, Apr 9, 2011 at 10:38 AM, Thamizh wrote:
Hi Harsh,
Thanks a lot for your reference.
I am looking to learn how Hadoop computes the CRC for a file. If you have some reference, please share it with me; it would be a great help.
Regards,
Thamizhannal P
--- On Sat, 9/4/11, Harsh J wrote:
From: Harsh J <harsh@cloudera.com>
Subject: Re: Reg HDFS checksum
To: common-user@hadoop.apache.org
Date: Saturday, 9 April, 2011, 3:20 PM
Hello Thamizh,
Perhaps the discussion in the following link can shed some light on
this:
http://getsatisfaction.com/cloudera/topics/hadoop_fs_crc
On Fri, Apr 8, 2011 at 5:47 PM, Thamizh wrote:
Hi All,
This is a question regarding "HDFS checksum" computation.
--
Harsh J
--
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv
--
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv