In Hadoop, whenever possible, we read directly into the user's buffer. For
example, ChecksumFileSystem reads into the user buffer and then computes the
checksum over it, and I do the same in the new Block level CRCs. This is very
useful since it avoids an extra copy in most cases.
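The pattern can be sketched roughly as below. This is an illustrative toy, not Hadoop's actual API; the class and method names are made up, and CRC32 stands in for whatever checksum the real code uses:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical sketch of "read into the user's buffer, then checksum it
// in place" -- the data never passes through an intermediate copy.
public class DirectRead {
    // Read up to len bytes directly into the caller's buffer and return
    // the CRC32 of the bytes actually read.
    static long readAndChecksum(InputStream in, byte[] userBuf,
                                int off, int len) throws IOException {
        int n = in.read(userBuf, off, len);   // data lands in the user buffer
        CRC32 crc = new CRC32();
        if (n > 0) {
            crc.update(userBuf, off, n);      // checksum computed in place
        }
        return crc.getValue();
    }
}
```

The key point is that the checksum is computed over the caller's buffer itself, which is exactly why it matters who else can write into that buffer.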
We don't define skip() in our InputStream subclasses, since we know the
default implementation calls read(). But the problem is that
InputStream.skip() reads into a *static* byte buffer (which makes sense from
its perspective, since the skipped data is discarded). So if two skip() calls
run in parallel on unrelated streams, we are almost guaranteed to get
checksum errors.
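The hazard can be shown with a toy that mimics the shared scratch buffer. This is not Hadoop or JDK code; SHARED stands in for the static skip buffer, and the two "streams" are interleaved sequentially so the failure is deterministic:

```java
import java.util.zip.CRC32;

// Toy illustration of why a buffer shared across unrelated streams
// breaks checksumming: stream B's bytes overwrite the data stream A
// is about to verify.
public class SharedBufferHazard {
    static final byte[] SHARED = new byte[8];  // one buffer for ALL streams

    // Stream A stages its data in the shared buffer...
    static void stageA() { for (int i = 0; i < 8; i++) SHARED[i] = 'a'; }
    // ...and an unrelated stream B does the same before A checksums.
    static void stageB() { for (int i = 0; i < 8; i++) SHARED[i] = 'b'; }

    static long crcOfShared() {
        CRC32 crc = new CRC32();
        crc.update(SHARED, 0, 8);
        return crc.getValue();
    }

    // Returns true only if A's checksum still matches after B interleaves.
    static boolean checksumSurvivesInterleaving() {
        stageA();
        long expected = crcOfShared(); // the checksum A intends to verify
        stageB();                      // unrelated stream reuses the buffer
        return crcOfShared() == expected;
    }
}
```

In the real bug the interleaving comes from two threads, but the effect is the same: the bytes under the checksum change between the read and the verification.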
When this happened with Block level CRCs, I wasted time hunting for a bug in
the new code.
My preferred fix would be to implement skip() at the Hadoop level. Always
copying to the user buffer (rather than reading into it directly) would be a
more defensive fix, at the cost of the extra copy.
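The preferred fix could look something like the following sketch: override skip() so it loops over our own read() with a per-instance scratch buffer instead of the JDK's shared static one. The class name and buffer size are illustrative, not Hadoop's actual code:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hedged sketch of implementing skip() ourselves: skipped bytes go
// through this stream's own read() path (where checksumming happens)
// using a buffer owned by this instance, so parallel skip() calls on
// unrelated streams can no longer stomp on each other's data.
public class SafeSkipStream extends FilterInputStream {
    private final byte[] skipBuf = new byte[512];  // per-stream, not static

    public SafeSkipStream(InputStream in) { super(in); }

    @Override
    public long skip(long n) throws IOException {
        long skipped = 0;
        while (skipped < n) {
            int toRead = (int) Math.min(skipBuf.length, n - skipped);
            int r = read(skipBuf, 0, toRead);  // our checksummed read path
            if (r < 0) break;                  // EOF
            skipped += r;
        }
        return skipped;
    }
}
```

Since the buffer is an instance field, two streams skipping concurrently each use their own scratch space, which is the whole point of the fix.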