Converting byte[] to ByteBuffer
HBase dev mailing list (Grokbase archive), July 2011

  • Jason Rutherglen at Jul 8, 2011 at 11:51 pm
    Is there an open issue for this? How hard will this be? :)

  • Ryan Rawson at Jul 9, 2011 at 12:35 am
    Where? Everywhere? An array is 24 bytes, bb is 56 bytes. Also the API
    is...annoying.
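    The sizes Ryan quotes are JVM-dependent; they can be checked empirically
    with OpenJDK's JOL tool (org.openjdk.jol:jol-core). A minimal sketch,
    assuming jol-core is on the classpath (exact numbers vary with the JVM
    and compressed-oops settings):

        import java.nio.ByteBuffer;
        import org.openjdk.jol.info.ClassLayout;

        public class BufferOverhead {
            public static void main(String[] args) {
                // A bare byte[]: object header plus a 4-byte length, padded.
                System.out.println(ClassLayout.parseInstance(new byte[0]).toPrintable());
                // A heap ByteBuffer adds mark/position/limit/capacity ints, an
                // address long, a backing-array reference and an offset field.
                System.out.println(ClassLayout.parseInstance(ByteBuffer.allocate(0)).toPrintable());
            }
        }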
  • Jason Rutherglen at Jul 9, 2011 at 1:19 am
    I don't think the object pointer overhead is very much given it's
    usually pointing at a full block? Perhaps we can implement a nicer
    class like Lucene's BytesRef [1]. Then we can have our own class that
    may wrap a byte[] or ByteBuffer.

    1. http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/util/BytesRef.html
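    A minimal sketch of the wrapper Jason is describing, modeled loosely on
    Lucene's BytesRef (the class name and shape are hypothetical, not an
    actual HBase API):

        /** Points at a slice of a byte[]; a ByteBuffer-backed variant could share the same interface. */
        public final class ByteSlice {
            public final byte[] bytes;
            public final int offset;
            public final int length;

            public ByteSlice(byte[] bytes, int offset, int length) {
                this.bytes = bytes;
                this.offset = offset;
                this.length = length;
            }

            /** Unsigned lexicographic compare, the operation HBase keys need most. */
            public int compareTo(ByteSlice other) {
                int n = Math.min(length, other.length);
                for (int i = 0; i < n; i++) {
                    int a = bytes[offset + i] & 0xff;
                    int b = other.bytes[other.offset + i] & 0xff;
                    if (a != b) return a - b;
                }
                return length - other.length;
            }
        }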
  • Jason Rutherglen at Jul 9, 2011 at 1:20 am
    Also, it's for a good cause, moving the blocks out of main heap using
    direct byte buffers or some other more native-like facility (if DBB's
    don't work).
  • Ryan Rawson at Jul 9, 2011 at 1:26 am
    The overhead in a byte buffer is the extra integers to keep track of the
    mark, position, limit.

    I am not sure that putting the block cache out of heap is the way to go.
    Getting faster local dfs reads is important, and if you run hbase on top of
    Mapr, these things are taken care of for you.
  • Jason Rutherglen at Jul 9, 2011 at 1:47 am
    There are a couple of things here: one is direct byte buffers to put the
    blocks outside of heap, the other is MMap'ing the blocks directly from
    the underlying HDFS file.

    I think they both make sense. And I'm not sure MapR's solution will
    be that much better if the latter is implemented in HBase.
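    The two mechanisms Jason contrasts, sketched minimally (the path and
    block size are illustrative only):

        import java.io.RandomAccessFile;
        import java.nio.ByteBuffer;
        import java.nio.MappedByteBuffer;
        import java.nio.channels.FileChannel;

        public class OffHeapBlockDemo {
            public static void main(String[] args) throws Exception {
                // Approach 1: copy block bytes into memory outside the Java heap.
                ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024);

                // Approach 2: map the block file itself; reads are then served
                // from the OS page cache without a copy onto the heap.
                try (RandomAccessFile raf = new RandomAccessFile("/tmp/blk_12345", "r")) {
                    MappedByteBuffer mapped =
                        raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
                    byte first = mapped.get(0);
                }
            }
        }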
  • Ryan Rawson at Jul 9, 2011 at 2:06 am
    Hey,

    When running on top of Mapr, hbase has fast cached access to locally stored
    files; the Mapr client ensures that. Likewise, hdfs should also ensure that
    local reads are fast and come out of cache as necessary, e.g. the kernel
    block cache.

    I wouldn't support mmap; it would require 2 different read path
    implementations. You will never know when a read is not local.

    Hdfs needs to provide faster local reads imo. Managing the block cache
    off-heap might work, but you also might get there and find the DBB
    accounting overhead kills it.
  • Jason Rutherglen at Jul 9, 2011 at 2:19 am

    When running on top of Mapr, hbase has fast cached access to locally stored
    files; the Mapr client ensures that. Likewise, hdfs should also ensure that
    local reads are fast and come out of cache as necessary, e.g. the kernel
    block cache.
    Agreed! However I don't see how that's possible today. E.g., it'd
    require more of a byte buffer type of API to HDFS: random reads not
    using streams. It's easy to add.

    I think the biggest win for HBase with MapR is the lack of the
    NameNode issues and snapshotting. In particular, snapshots are pretty
    much a standard RDBMS feature.
    Managing the block cache off-heap might work but you also might get there
    and find the DBB accounting overhead kills it.
    Lucene uses/abuses ref counting, so I'm familiar with the downsides.
    When it works, it's great; when it doesn't, it's a nightmare to debug.
    It is possible to make it work though. I don't think there would be
    overhead from it, i.e., any pool of objects implements ref counting.

    It'd be nice to not have a block cache; however, it's necessary for
    caching compressed [on-disk] blocks.
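    The stream-free random-read contract Jason is asking for might look like
    the following (a hypothetical interface; at the time Hadoop's
    PositionedReadable offered positional reads into a byte[], but nothing
    ByteBuffer-shaped):

        import java.io.IOException;
        import java.nio.ByteBuffer;

        /** Hypothetical positional-read API: no seek, no shared stream state. */
        public interface RandomReadable {
            /**
             * Reads up to buf.remaining() bytes starting at the given file
             * position into buf and returns the count read. Independent of any
             * stream cursor, so concurrent readers don't interfere.
             */
            int read(long position, ByteBuffer buf) throws IOException;
        }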
  • M. C. Srivas at Jul 9, 2011 at 7:25 pm

    On Fri, Jul 8, 2011 at 6:47 PM, Jason Rutherglen wrote:

    There are a couple of things here: one is direct byte buffers to put the
    blocks outside of heap, the other is MMap'ing the blocks directly from
    the underlying HDFS file.
    I think they both make sense. And I'm not sure MapR's solution will
    be that much better if the latter is implemented in HBase.
    There're some major issues with mmap'ing the local hdfs file (the "block")
    directly:
    (a) no checksums to detect data corruption from bad disks
    (b) when a disk does fail, the dfs could start reading from an alternate
    replica ... but that option is lost when mmap'ing and the RS will crash
    immediately
    (c) security is completely lost, but that is minor given hbase's current
    status

    For those hbase deployments that don't care about losing (a) and (b),
    especially (b), it's definitely a viable option that gives good perf.

    At MapR, we did consider similar direct-access capability and rejected it
    due to the above concerns.


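    To keep point (a) while mmap'ing, the client would have to re-verify
    chunks itself against the stored block metadata. A minimal sketch with
    CRC32 (the chunk size and layout are illustrative; this is not the actual
    HDFS checksum file format):

        import java.nio.ByteBuffer;
        import java.util.zip.CRC32;

        public class MmapChecksum {
            static final int CHUNK = 512; // bytes covered per checksum, illustrative

            /** Verifies one chunk of a mapped block against its expected CRC. */
            static boolean verifyChunk(ByteBuffer block, int chunkIndex, long expected) {
                ByteBuffer dup = block.duplicate(); // leave the shared cursor alone
                dup.position(chunkIndex * CHUNK);
                byte[] chunk = new byte[Math.min(CHUNK, dup.remaining())];
                dup.get(chunk);
                CRC32 crc = new CRC32();
                crc.update(chunk);
                return crc.getValue() == expected;
            }
        }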
  • Ryan Rawson at Jul 9, 2011 at 10:13 pm
    I think my general point is we could hack up the hbase source, add
    refcounting, circumvent the gc, etc., or we could demand more from the dfs.

    If a variant of HDFS-347 was committed, reads could come from the Linux
    buffer cache and life would be good.

    The choice isn't fast hbase vs slow hbase; there are elements of bugs there
    as well.
  • Doug Meil at Jul 10, 2011 at 1:05 am
    re: "If a variant of hdfs-347 was committed,"

    I agree with what Ryan is saying here, and I'd like to second (third?
    fourth?) the call to keep pushing for HDFS improvements. Anything else is
    coding around the bigger I/O issue.


  • Ryan Rawson at Jul 10, 2011 at 2:12 am
    No lines of hbase were changed to run on Mapr. Mapr implements the hdfs API
    and uses jni to get local data. If hdfs wanted to, it could use more
    sophisticated methods to get data rapidly from local disk to a client's
    memory space...as Mapr does.
  • Andrew Purtell at Jul 10, 2011 at 4:26 pm

    I agree with what Ryan is saying here, and I'd like to second (third?
    fourth?) keep pushing for HDFS improvements.  Anything else is coding
    around the bigger I/O issue.

    The Facebook code drop, not the 0.20-append branch with its clean history but rather the hairball without (shame), has an HDFS patched with the same approach as Ryan's HDFS-347, but in addition it also checksums the blocks and caches NameNode metadata. I might swap out Ryan's HDFS-347 patch locally with an extraction of these changes.

    I've also been considering backporting the (stale) HADOOP-4801/HADOOP-6311 approach. Jason, it looks like you've recently updated those issues?

    Best regards,


    - Andy

    Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

  • Jason Rutherglen at Jul 10, 2011 at 9:53 pm
    Andrew,

    I fully agree. I opened HDFS-2004 to this end; however, it was (oddly)
    shot down. I think HBase usage of HDFS is divergent from the
    traditional MapReduce usage. MapR addresses these issues, as does some
    of the Facebook-related work.

    I think HBase should work at a lower level than the traditional HDFS
    APIs; thus the only patches required for HDFS are ones that make it
    more malleable for the requirements of HBase.
    Ryan's HDFS-347 but in addition it also checksums the blocks and caches NameNode metadata
    Sounds good, I'm interested in checking that out.
  • Li Pi at Jul 9, 2011 at 1:31 am
    If you do that, you'll have to do a bit of reference counting. I'm working
    on a slab-allocated solution.
  • Jason Rutherglen at Jul 9, 2011 at 1:48 am
    Reference counting is doable. Can you describe what the advantages
    are of the slab allocated solution?
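    For context, a slab allocator here typically means carving one large
    direct buffer into fixed-size slices recycled through a free list, so the
    cached bytes never churn the GC. A minimal sketch (illustrative only, not
    Li Pi's implementation):

        import java.nio.ByteBuffer;
        import java.util.concurrent.ConcurrentLinkedQueue;

        /** Hands out fixed-size off-heap slabs and recycles them on release. */
        public final class SlabAllocator {
            private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();

            public SlabAllocator(int slabSize, int slabCount) {
                ByteBuffer backing = ByteBuffer.allocateDirect(slabSize * slabCount);
                for (int i = 0; i < slabCount; i++) {
                    backing.limit((i + 1) * slabSize);
                    backing.position(i * slabSize);
                    free.add(backing.slice()); // independent view of one slab
                }
            }

            /** Returns a cleared slab, or null when exhausted (caller must evict). */
            public ByteBuffer allocate() {
                ByteBuffer b = free.poll();
                if (b != null) b.clear();
                return b;
            }

            /** Safe only if no other reference to b survives - hence ref counting. */
            public void release(ByteBuffer b) {
                free.add(b);
            }
        }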
  • Ryan Rawson at Jul 9, 2011 at 2:32 am

    On Jul 8, 2011 7:19 PM, "Jason Rutherglen" wrote:
    Agreed! However I don't see how that's possible today. E.g., it'd
    require more of a byte buffer type of API to HDFS: random reads not
    using streams. It's easy to add.
    I don't think it's as easy as you say. And even using the stream API Mapr
    delivers a lot more performance. And this is from my own tests, not a white
    paper.
    I think the biggest win for HBase with MapR is the lack of the
    NameNode issues and snapshotting. In particular, snapshots are pretty
    much a standard RDBMS feature.
    That is good too - if you are using hbase in real time prod you need to look
    at Mapr.

    But even beyond that the performance improvements are insane. We are talking
    like 8-9x perf on my tests. Not to mention substantially reduced latency.

    I'll repeat again, local accelerated access is going to be a required
    feature. It already is.

    I investigated using dbb once upon a time; I concluded that managing the ref
    counts would be a nightmare, and the better solution was to copy keyvalues
    out of the dbb during scans.

    Injecting refcount code seems like a worse remedy than the problem. Hbase
    doesn't have as many bugs but explicit ref counting everywhere seems
    dangerous. Especially when a perf solution is already here. Use Mapr or
    hdfs-347/local reads.
  • Jason Rutherglen at Jul 9, 2011 at 2:52 am

    Especially when a perf solution is already here. Use Mapr or
    hdfs-347/local reads.
    Right. It goes back to avoiding GC and performing memory deallocation
    manually (like C). I think this makes sense given the number of
    issues people have with HBase and GC (more so than Lucene for
    example). MapR doesn't help with the GC issues. If MapR had a JNI
    interface into an external block cache then that'd be a different
    story. :) And I'm sure it's quite doable.
    But even beyond that the performance improvements are insane. We are talking
    like 8-9x perf on my tests. Not to mention substantially reduced latency.
    Was the comparison against HDFS-347?
  • Li Pi at Jul 9, 2011 at 2:54 am
    I have a slab allocated cache coded up, testing in YCSB right now :).
  • Ted Dunning at Jul 9, 2011 at 6:19 pm
    MapR does help with the GC because it *does* have a JNI interface into an
    external block cache.

    Typical configurations with MapR trim HBase down to the minimal viable size
    and increase the file system cache correspondingly.
  • Jason Rutherglen at Jul 9, 2011 at 10:49 pm
    I'm a little confused; I was told none of the HBase code changed with MapR.
    If the HBase (not the OS) block cache has a JNI implementation, then that
    part of the HBase code changed.
  • Ted Dunning at Jul 10, 2011 at 6:15 am
    No. The JNI is below the HDFS-compatible API. Thus the changed code is in
    the hadoop.jar and associated jars and .so's that MapR supplies.

    The JNI still runs in the HBase memory image, though, so it can make data
    available faster.

    The cache involved includes the cache of disk blocks (not HBase memcache
    blocks) in the JNI and in the filer sub-system.

    The detailed reasons why more caching in the file system and less in HBase
    makes the overall system faster are not completely worked out, but the
    general outlines are pretty clear. There are likely several factors at work
    in any case, including less GC cost due to a smaller memory footprint,
    caching compressed blocks instead of Java structures, and simplification
    due to a clean memory hand-off with associated strong demarcation of where
    different memory allocators have jurisdiction.
  • Jonathan Gray at Jul 10, 2011 at 8:00 am
    There are plenty of arguments in both directions for caching above the DB, in the DB, or under the DB/in the FS. I have significant interest in supporting large heaps and reducing GC issues within the HBase RegionServer and I am already running with local fs reads. I don't think a faster dfs makes HBase caching irrelevant or the conversation a non-starter.

    To get back to the original question, I ended up trying this once. I wrote a rough implementation of a slab allocator a few months ago to dive in and see what it would take. The big challenge is KeyValue and its various comparators. The ByteBuffer API can be maddening at times but it can be done. I ended up somewhere slightly more generic, where KeyValue was taking a ByteBlock which contained ref counting and a reference to the allocator it came from, in addition to a ByteBuffer.

    The easy way to rely on DirectByteBuffers and the like would be to make a copy on read into a normal byte[]; then there's no need to worry about ref counting or revamping KV. Of course, at the cost of short-term allocations. In my experience, you can tune the GC around this and the cost really becomes CPU.

    I'm in the process of re-implementing some of this stuff on top of the HFile v2 that is coming soon. Once that goes in, this gets much easier at the HFile and block cache level (a new wrapper around ByteBuffer called HFileBlock which can be used for ref counting and such, instead of introducing huge changes for caching stuff).

    JG

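    A minimal sketch of the ByteBlock shape JG describes: a ref-counted handle
    that returns its memory to the allocator on the last release. The names,
    and the SlabAllocator it leans on (see the slab sketch earlier in the
    thread), are hypothetical, not the actual implementation:

        import java.nio.ByteBuffer;
        import java.util.concurrent.atomic.AtomicInteger;

        /** Ref-counted cached block tied to the allocator that owns its memory. */
        public final class ByteBlock {
            private final ByteBuffer buf;
            private final SlabAllocator owner;
            private final AtomicInteger refCount = new AtomicInteger(1);

            public ByteBlock(ByteBuffer buf, SlabAllocator owner) {
                this.buf = buf;
                this.owner = owner;
            }

            public ByteBuffer buffer() {
                return buf.duplicate(); // private position/limit per reader
            }

            public void retain() {
                refCount.incrementAndGet();
            }

            public void release() {
                if (refCount.decrementAndGet() == 0) {
                    owner.release(buf); // last reference out: recycle the slab
                }
            }

            /** The copy-on-read alternative: one short-lived allocation, no ref counting. */
            public byte[] copy() {
                ByteBuffer d = buf.duplicate();
                byte[] out = new byte[d.remaining()];
                d.get(out);
                return out;
            }
        }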
  • Jason Rutherglen at Jul 10, 2011 at 10:05 pm
    Ted,

    Interesting. I think we need to take a deeper look at why essentially
    turning off the caching of uncompressed blocks doesn't [seem to]
    matter. My guess is it's cheaper to decompress on the fly than to hog
    the system IO cache with JVM heap usage.

    I.e., CPU is cheaper than disk IO.

    Further (I asked this previously), where is the general CPU usage in
    HBase? Binary search on keys for seeking, skip list reads and writes,
    and [maybe] MapReduce jobs? The rest should more or less be in the
    noise (or is general Java overhead).

    I'd be curious to know the avg CPU consumption of an active HBase system.
  • Jonathan Gray at Jul 11, 2011 at 7:19 pm
    In my experience, CPU usage on HBase is very high for highly concurrent applications. You can expect the CMS GC to chew up 2-3 cores at sufficient throughput and the remaining cores to be spent in CSLM/MemStore, KeyValue comparators, queues, etc.
  • Andrew Purtell at Jul 11, 2011 at 8:31 pm

    Further, (I asked this previously), where is the general CPU usage in
    HBase?  Binary search on keys for seeking, skip list reads and writes,
    and [maybe] MapReduce jobs?
    If you are running colocated MapReduce jobs, then it could be the user code of course.

    Otherwise it depends on workload.

    For our apps I observe the following top line items when profiling:

    - KV comparators: By far the most common operation, searching keys, writing HFiles, etc.

    - MemStore CSLM ops: Especially if upserting

    - Servicing RPCs: Writable marshall/unmarshall, monitors

    - Concurrent GC

    It generally looks good but MemStore can be improved, especially for the upsert case.

    Reminds me I need to profile the latest. It's been a few weeks.

    Best regards,

    - Andy

    Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

  • Jason Rutherglen at Jul 12, 2011 at 6:11 am
    - MemStore CSLM ops: Especially if upserting
    A quick thought on that one: perhaps it'd be helped by limiting the
    aggregate size of the CSLM, e.g., skip lists at too large a size start
    to degrade in performance. Something like multiple CSLMs could work?
    Grow a CSLM to a given size, then start a new one.
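    A minimal sketch of what Jason is suggesting: cap the active skip list and
    roll to a fresh one, with reads consulting newer generations first
    (entirely illustrative; byte[] keys would need a real comparator such as
    unsigned lexicographic order):

        import java.util.Comparator;
        import java.util.concurrent.ConcurrentLinkedDeque;
        import java.util.concurrent.ConcurrentSkipListMap;

        /** Bounds each CSLM's entry count so no single skip list grows too deep. */
        public final class TieredMemStore {
            private final int entryCap;
            private final Comparator<byte[]> cmp;
            private volatile ConcurrentSkipListMap<byte[], byte[]> active;
            private final ConcurrentLinkedDeque<ConcurrentSkipListMap<byte[], byte[]>> frozen =
                new ConcurrentLinkedDeque<>();
            private int activeCount; // guarded by the synchronized put

            public TieredMemStore(int entryCap, Comparator<byte[]> cmp) {
                this.entryCap = entryCap;
                this.cmp = cmp;
                this.active = new ConcurrentSkipListMap<>(cmp);
            }

            public synchronized void put(byte[] key, byte[] value) {
                if (++activeCount > entryCap) { // roll: freeze the full list
                    frozen.addFirst(active);
                    active = new ConcurrentSkipListMap<>(cmp);
                    activeCount = 1;
                }
                active.put(key, value);
            }

            public byte[] get(byte[] key) {
                byte[] v = active.get(key); // newest generation wins
                if (v != null) return v;
                for (ConcurrentSkipListMap<byte[], byte[]> m : frozen) {
                    v = m.get(key);
                    if (v != null) return v;
                }
                return null;
            }
        }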
