In Hadoop, whenever possible, we read directly into the user's buffer.
For example, in ChecksumFileSystem we read into the user's buffer and
then compute the checksum; I do the same in the new block-level CRCs.
This is very useful because it avoids an extra copy in most cases.

We don't define skip() for our extensions of InputStream since we know
the default implementation calls read(). The problem is that
InputStream.skip() passes a *static* byte buffer to read() (from its
perspective that makes sense, since the skipped bytes are discarded
anyway). Because we compute the checksum over that buffer after the
read, two parallel skip() calls on unrelated streams can overwrite each
other's data, and we get checksum errors.
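
The interleaving described above can be replayed deterministically in a single thread. This is only a sketch: `SharedSkipBufferDemo`, `SHARED`, and the two data arrays are hypothetical stand-ins, and the static array merely plays the role of the buffer that the default skip() shares across streams.

```java
import java.util.zip.CRC32;

// Hypothetical demo class; it models the hazard, it is not Hadoop or JDK code.
final class SharedSkipBufferDemo {
    // Stand-in for the single static buffer shared by every default skip() call.
    static final byte[] SHARED = new byte[8];

    static long crcOf(byte[] buf, int len) {
        CRC32 crc = new CRC32();
        crc.update(buf, 0, len);
        return crc.getValue();
    }

    // Returns true if stream A's checksum still matches after the interleaving.
    static boolean checksumSurvivesInterleaving() {
        byte[] dataA = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] dataB = {9, 9, 9, 9, 9, 9, 9, 9};
        long expectedA = crcOf(dataA, dataA.length);

        // Step 1: stream A reads its bytes into the shared buffer.
        System.arraycopy(dataA, 0, SHARED, 0, dataA.length);
        // Step 2: before A checksums, stream B's skip() reuses the same buffer.
        System.arraycopy(dataB, 0, SHARED, 0, dataB.length);
        // Step 3: A checksums what it believes is its own data.
        long actualA = crcOf(SHARED, dataA.length);
        return actualA == expectedA;
    }

    public static void main(String[] args) {
        System.out.println("checksum matches: " + checksumSurvivesInterleaving());
    }
}
```

With real threads the overwrite in step 2 happens only occasionally, which is why the failure looks like a sporadic bug in the checksum code rather than in skip().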

When this happened with Block level CRCs, I wasted time trying to find a
bug in the new code.

My preferred fix would be to implement skip() at the Hadoop level.
Always copying into the user's buffer (instead of reading into it
directly) would be a very defensive fix.
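
One way to picture a skip() implemented at the stream level is the sketch below. It assumes a hypothetical wrapper class, `PrivateBufferSkipStream`, and an arbitrary 4096-byte scratch size; neither comes from Hadoop. The only point is that the discard buffer belongs to the call, so no other stream can write into it.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper, not a Hadoop class.
final class PrivateBufferSkipStream extends InputStream {
    private final InputStream in;

    PrivateBufferSkipStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return in.read(b, off, len);
    }

    // Discards up to n bytes through a buffer allocated for this call only.
    @Override
    public long skip(long n) throws IOException {
        if (n <= 0) {
            return 0;
        }
        byte[] scratch = new byte[(int) Math.min(n, 4096)]; // size is an assumption
        long remaining = n;
        while (remaining > 0) {
            int got = read(scratch, 0, (int) Math.min(scratch.length, remaining));
            if (got < 0) {
                break; // end of stream reached before n bytes were skipped
            }
            remaining -= got;
        }
        return n - remaining;
    }
}
```

Wrapping a `java.io.ByteArrayInputStream` over ten bytes, `skip(4)` returns 4 and a later `skip(100)` returns only the bytes that were actually left.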

Raghu.


  • Raghu Angadi at May 25, 2007 at 10:57 pm
    Also, reading from a block supports a 'real skip', i.e., it does not
    verify the checksum if an entire checksum block (usually 512 bytes)
    falls within the skip range. That is another reason to implement our
    own skip().
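
The arithmetic behind a 'real skip' can be sketched as follows. `SkipRange` and `wholeChunksInSkip` are hypothetical names, and the code only counts the checksum blocks that lie entirely inside the skipped range; it is not the block reader's actual logic.

```java
// Hypothetical helper, not a Hadoop class.
final class SkipRange {
    // Number of checksum blocks that fall entirely inside [pos, pos + len).
    // Only partially covered blocks at either edge would still need verifying.
    static long wholeChunksInSkip(long pos, long len, int bytesPerChecksum) {
        if (len <= 0) {
            return 0;
        }
        long end = pos + len;
        // Index of the first block that starts at or after pos (ceiling division).
        long firstFull = (pos + bytesPerChecksum - 1) / bytesPerChecksum;
        // Index just past the last block that ends at or before end (floor division).
        long endFull = end / bytesPerChecksum;
        return Math.max(0, endFull - firstFull);
    }
}
```

For example, skipping 2000 bytes from position 100 with 512-byte blocks covers bytes 100 to 2099, so blocks 1, 2, and 3 (bytes 512 to 2047) lie fully inside the range and the helper returns 3.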

  • Doug Cutting at May 29, 2007 at 6:13 pm

    Yes, I don't see an alternative to implementing skip ourselves. The
    optimization in InputStream#skip(), of using a static buffer, requires
    this. Hopefully this method, like much of the checksum code, will be
    shared between the generic ChecksumFileSystem and DFS's optimized
    checksum implementation.

    Doug
  • Raghu Angadi at May 29, 2007 at 10:34 pm
    We should force subclasses of FSInputStream to implement skip() if
    skip() is expected to be used. The only way I can think of to achieve
    this is to define

    public long skip(long len) throws IOException {
      throw new IOException(
          "Subclasses of FSInputStream should implement skip");
    }

    This makes sense for all the current FSInputStreams. But I am not sure
    whether there is any way for subclasses to call InputStream.skip(), if
    that makes sense.

    Raghu.
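
A small sketch of how the proposal behaves, using hypothetical classes (`StrictSkipStream`, `ForgetfulStream`, `EmptyStream`) rather than Hadoop's FSInputStream. It also illustrates the open question: Java has no way to call a grandparent method, so once the base class overrides skip(), `super.skip()` in a subclass reaches the throwing version, not InputStream.skip().

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical base class modelled on the proposal above.
abstract class StrictSkipStream extends InputStream {
    @Override
    public long skip(long len) throws IOException {
        throw new IOException("Subclasses should implement skip");
    }
}

// Forgets to override skip(): compiles fine, fails on the first skip() call.
final class ForgetfulStream extends StrictSkipStream {
    @Override
    public int read() {
        return -1; // always at end of stream
    }
}

// Overrides skip(); calling super.skip() here would hit StrictSkipStream.skip().
final class EmptyStream extends StrictSkipStream {
    @Override
    public int read() {
        return -1; // always at end of stream
    }

    @Override
    public long skip(long len) {
        return 0; // nothing to skip in an empty stream
    }
}
```

Redeclaring skip() as `abstract` in the base class would be one alternative that turns the run-time failure into a compile-time error, at the cost of forcing every subclass to implement it.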

  • Raghu Angadi at May 31, 2007 at 7:00 pm
    Better one:

    // There are differences in how skip and seek behave
    // when trying to skip more bytes than available in a file
    // but ...
    public long skip(long len) throws IOException {
      if (len > 0) {
        // Assumes seek() succeeds; the full len is reported as skipped.
        seek(getPos() + len);
        return len;
      }
      return (len < 0) ? -1 : 0;
    }

    This should fix HADOOP-1428, if not very efficiently.

    Raghu.
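
The difference the code comment alludes to can be made concrete with a self-contained sketch. `SeekableBytes` is a hypothetical in-memory stream whose seek() and getPos() only mimic the shape of the methods used above; unlike the posted version, this variant clamps at end of file so it reports the bytes actually skipped, and it returns 0 rather than -1 for a negative length.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical in-memory stream; not a Hadoop class.
final class SeekableBytes extends InputStream {
    private final byte[] data;
    private int pos;

    SeekableBytes(byte[] data) {
        this.data = data;
    }

    long getPos() {
        return pos;
    }

    void seek(long target) throws IOException {
        if (target < 0 || target > data.length) {
            throw new IOException("seek out of range: " + target);
        }
        pos = (int) target;
    }

    @Override
    public int read() {
        return pos < data.length ? (data[pos++] & 0xff) : -1;
    }

    // skip() via seek(), clamped to the bytes that remain in the stream.
    @Override
    public long skip(long len) throws IOException {
        if (len <= 0) {
            return 0;
        }
        long available = data.length - getPos();
        long skipped = Math.min(len, available);
        seek(getPos() + skipped);
        return skipped;
    }
}
```

On a five-byte stream, `skip(2)` returns 2, and a later `skip(10)` with two bytes remaining returns 2 instead of 10.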


Discussion Overview
group: common-dev @
categories: hadoop
posted: May 25, '07 at 10:49p
active: May 31, '07 at 7:00p
posts: 5
users: 2
website: hadoop.apache.org...
irc: #hadoop

2 users in discussion

Raghu Angadi: 4 posts Doug Cutting: 1 post
