[ https://issues.apache.org/jira/browse/HADOOP-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476235 ]

Doug Cutting commented on HADOOP-928:
> the reason that I set the inner buffer very small is to by-pass the inner buffer and hence avoid one more data copy

Yes, that makes sense, thanks for clarifying. But unless I missed something, in ChecksumFileSystem#create(Path, int bufferSize), the inner and outer buffers are both bufferSize.

Also, a competing concern is that data not sit in buffers too long before it is checksummed. Since we use many long-lived multi-megabyte buffers when sorting, this is a real concern. So another strategy might be to use a small outer buffer and a large inner buffer, and assume that the cost of the extra copy is negligible (or at least warranted). That way data would be checksummed sooner, and memory corruption in the client could be more reliably detected, but it does require an extra copy. That was the strategy I assumed when I suggested using large inner buffers and small outer buffers. It's probably worth benchmarking this at some point, although I'd rather not hold up this issue any longer.
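
To make the small-outer/large-inner strategy concrete, here is a minimal sketch in plain java.io. It is not the code in the attached patches: the class name EarlyChecksumStream, its constructor parameters, and the use of one CRC32 per chunk are assumptions made purely for illustration. Each chunk is checksummed as soon as the small outer buffer fills, and only then copied into the large inner buffer (the extra copy discussed above).

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical illustration only; not Hadoop's ChecksumFileSystem code. */
class EarlyChecksumStream extends OutputStream {
    private final OutputStream inner;  // large inner buffer in front of the real sink
    private final byte[] chunk;        // small outer buffer, one checksum chunk long
    private int count = 0;             // bytes currently held in the outer buffer
    private final List<Long> sums = new ArrayList<>();

    EarlyChecksumStream(OutputStream sink, int chunkSize, int innerBufferSize) {
        this.inner = new BufferedOutputStream(sink, innerBufferSize);
        this.chunk = new byte[chunkSize];
    }

    @Override
    public void write(int b) throws IOException {
        chunk[count++] = (byte) b;
        if (count == chunk.length) {
            emitChunk();               // checksum as soon as a chunk is complete
        }
    }

    /** Checksums the pending bytes, then copies them into the inner buffer. */
    private void emitChunk() throws IOException {
        if (count == 0) {
            return;
        }
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, count);
        sums.add(crc.getValue());
        inner.write(chunk, 0, count);  // the extra copy this strategy pays for
        count = 0;
    }

    @Override
    public void flush() throws IOException {
        emitChunk();                   // a partial final chunk is checksummed too
        inner.flush();
    }

    @Override
    public void close() throws IOException {
        flush();
        inner.close();
    }

    /** Checksums computed so far, one per emitted chunk. */
    List<Long> checksums() {
        return sums;
    }
}
```

With this arrangement, unchecksummed data never waits in more than one chunk's worth of buffer, however large the inner buffer is. Whether the extra copy is affordable is exactly what a benchmark would need to show.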

So can you please just check whether my analysis of ChecksumFileSystem#create(Path, int bufferSize) above is correct? Thanks!
make checksums optional per FileSystem

Key: HADOOP-928
URL: https://issues.apache.org/jira/browse/HADOOP-928
Project: Hadoop
Issue Type: Improvement
Components: fs
Reporter: Doug Cutting
Assigned To: Hairong Kuang
Attachments: checksum.patch, checksum1.patch, checksum2.patch

Checksumming is currently built into the base FileSystem class. It should instead be optional, with each FileSystem implementation electing whether to use the Hadoop-provided checksum system, or to disable it, or to implement its own custom checksum system.
To implement this, a ChecksumFileSystem implementation can be provided that wraps another FileSystem implementation, implementing checksums as in Hadoop's current mandatory implementation (i.e., as a separate crc file per file that's elided from directory listings). The 'raw' FileSystem methods would be removed. FSDataInputStream and FSDataOutputStream would be made interfaces.
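
The wrapping design described above can be sketched as follows. This is a simplified, self-contained illustration and not the proposed Hadoop API: SimpleStore, MemoryStore, and ChecksumStore are hypothetical names, the in-memory map stands in for a real FileSystem, and the "." + name + ".crc" naming and whole-file CRC32 are assumptions chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.zip.CRC32;

/** Hypothetical minimal storage interface standing in for a raw FileSystem. */
interface SimpleStore {
    void put(String name, byte[] data);
    byte[] get(String name);
    SortedSet<String> list();
}

/** Raw store with no checksumming at all. */
class MemoryStore implements SimpleStore {
    private final Map<String, byte[]> files = new TreeMap<>();
    public void put(String name, byte[] data) { files.put(name, data.clone()); }
    public byte[] get(String name) { return files.get(name); }
    public SortedSet<String> list() { return new TreeSet<>(files.keySet()); }
}

/** Wrapper that adds checksums on top of any SimpleStore. */
class ChecksumStore implements SimpleStore {
    private final SimpleStore raw;

    ChecksumStore(SimpleStore raw) { this.raw = raw; }

    // Assumed naming scheme for the side file that holds the checksum.
    private static String crcName(String name) { return "." + name + ".crc"; }

    private static boolean isCrcName(String name) {
        return name.startsWith(".") && name.endsWith(".crc");
    }

    private static byte[] crcBytes(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return Long.toString(crc.getValue()).getBytes(StandardCharsets.UTF_8);
    }

    public void put(String name, byte[] data) {
        raw.put(name, data);
        raw.put(crcName(name), crcBytes(data));  // separate crc file per file
    }

    public byte[] get(String name) {
        byte[] data = raw.get(name);
        if (data == null) {
            return null;
        }
        byte[] stored = raw.get(crcName(name));
        if (stored == null || !Arrays.equals(stored, crcBytes(data))) {
            throw new IllegalStateException("Checksum error: " + name);
        }
        return data;
    }

    public SortedSet<String> list() {
        SortedSet<String> visible = new TreeSet<>();
        for (String name : raw.list()) {
            if (!isCrcName(name)) {
                visible.add(name);               // crc files are elided from listings
            }
        }
        return visible;
    }
}
```

A store that wants no checksums is used directly, one that wants the stock behavior is wrapped in ChecksumStore, and one with its own scheme would implement SimpleStore itself. That is the three-way choice the description asks for.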

Discussion Overview
group: common-dev @
posted: Jan 25, '07 at 4:19a
active: May 17, '07 at 11:28a