FAQ
If I use an md5 hash + timestamp rowkey, would HBase automatically detect the
difference in ranges and perform splits? How does a split work in such cases,
or is it still advisable to manually split the regions?


  • Michael Stack at Aug 30, 2012 at 4:19 am

    On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia wrote:
    If I use an md5 hash + timestamp rowkey, would HBase automatically detect the
    difference in ranges and perform splits? How does a split work in such cases,
    or is it still advisable to manually split the regions?
    Yes.

    On how split works, when a region hits the maximum configured size, it
    splits in two.

    Manual splitting can be useful when you know your distribution and
    you'd save on hbase doing it for you. It can speed up bulk loads for
    instance.

    St.Ack
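
    The rowkey scheme asked about above can be sketched as follows (an
    illustrative Python sketch, not HBase client code; the helper name
    make_rowkey is made up for this example):

```python
import hashlib
import struct

LONG_MAX = 2**63 - 1  # Java Long.MAX_VALUE, used to reverse the timestamp

def make_rowkey(user_id: str, ts_millis: int) -> bytes:
    """Rowkey = md5(userid) + reverse timestamp.

    The md5 prefix spreads users uniformly over the keyspace; the
    reversed timestamp makes a user's newest rows sort first in a scan.
    """
    prefix = hashlib.md5(user_id.encode("utf-8")).digest()  # 16 bytes
    suffix = struct.pack(">q", LONG_MAX - ts_millis)        # 8 bytes, big-endian
    return prefix + suffix

# All rows for one user share the same 16-byte prefix, so they stay
# contiguous (and in one region), while different users scatter randomly.
k_old = make_rowkey("alice", 1346284800000)
k_new = make_rowkey("alice", 1346284900000)
assert k_old[:16] == k_new[:16]
assert k_new < k_old  # the later timestamp sorts first under the reversal
```

    Because the prefix is a uniform hash, inserts spread evenly, which is why
    the automatic split behavior described above works without help.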
  • Mohit Anchlia at Aug 30, 2012 at 4:39 am

    On Wed, Aug 29, 2012 at 9:19 PM, Stack wrote:
    On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia wrote:
    If I use an md5 hash + timestamp rowkey, would HBase automatically detect the
    difference in ranges and perform splits? How does a split work in such cases,
    or is it still advisable to manually split the regions?
    What logic would you recommend to split the table into multiple regions
    when using md5 hash?

    Yes.

    On how split works, when a region hits the maximum configured size, it
    splits in two.

    Manual splitting can be useful when you know your distribution and
    you'd save on hbase doing it for you. It can speed up bulk loads for
    instance.

    St.Ack
  • Michael Stack at Aug 30, 2012 at 5:50 am

    On Wed, Aug 29, 2012 at 9:38 PM, Mohit Anchlia wrote:
    On Wed, Aug 29, 2012 at 9:19 PM, Stack wrote:

    On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia <mohitanchlia@gmail.com>
    wrote:
    If I use an md5 hash + timestamp rowkey, would HBase automatically detect the
    difference in ranges and perform splits? How does a split work in such cases,
    or is it still advisable to manually split the regions?
    What logic would you recommend to split the table into multiple regions
    when using md5 hash?
    It's hard to know how well your inserts will spread over the md5
    namespace ahead of time. You could try sampling or just let HBase
    take care of the splits for you (Is there a problem w/ your letting
    HBase do the splits?)

    St.Ack
  • Mohit Anchlia at Aug 30, 2012 at 2:36 pm

    On Wed, Aug 29, 2012 at 10:50 PM, Stack wrote:
    On Wed, Aug 29, 2012 at 9:38 PM, Mohit Anchlia wrote:
    On Wed, Aug 29, 2012 at 9:19 PM, Stack wrote:

    On Wed, Aug 29, 2012 at 3:56 PM, Mohit Anchlia <mohitanchlia@gmail.com>
    wrote:
    If I use an md5 hash + timestamp rowkey, would HBase automatically detect the
    difference in ranges and perform splits? How does a split work in such cases,
    or is it still advisable to manually split the regions?
    What logic would you recommend to split the table into multiple regions
    when using md5 hash?
    It's hard to know how well your inserts will spread over the md5
    namespace ahead of time. You could try sampling or just let HBase
    take care of the splits for you (Is there a problem w/ your letting
    HBase do the splits?)

    From what I've read it's advisable to do manual splits since you are able
    to spread the load in a more predictable way. If I am missing something
    please let me know.

    St.Ack
  • Michael Stack at Aug 30, 2012 at 10:46 pm

    On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia wrote:
    From what I've read it's advisable to do manual splits since you are able
    to spread the load in a more predictable way. If I am missing something
    please let me know.
    Where did you read that?
    St.Ack
  • Ian Varley at Aug 30, 2012 at 11:27 pm
    The Facebook devs have mentioned in public talks that they pre-split their tables and don't use automated region splitting. But as far as I remember, the reason for that isn't predictability of spreading load, so much as predictability of uptime & latency (they don't want an automated split to happen at a random busy time). Maybe that's what you mean, Mohit?

    Ian

    On Aug 30, 2012, at 5:45 PM, Stack wrote:

    On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia wrote:
    From what I've read it's advisable to do manual splits since you are able
    to spread the load in a more predictable way. If I am missing something
    please let me know.


    Where did you read that?
    St.Ack
  • Amandeep Khurana at Aug 30, 2012 at 11:31 pm
    Also, you might have read that an initial loading of data can be better
    distributed across the cluster if the table is pre-split rather than
    starting with a single region and splitting (possibly aggressively,
    depending on the throughput) as the data loads in. Once you are in a stable
    state with regions distributed across the cluster, there is really no
    benefit in terms of spreading load by managing splitting manually v/s
    letting HBase do it for you. At that point it's about what Ian mentioned -
    predictability of latencies by avoiding splits happening at a busy time.
    On Thu, Aug 30, 2012 at 4:26 PM, Ian Varley wrote:

    The Facebook devs have mentioned in public talks that they pre-split their
    tables and don't use automated region splitting. But as far as I remember,
    the reason for that isn't predictability of spreading load, so much as
    predictability of uptime & latency (they don't want an automated split to
    happen at a random busy time). Maybe that's what you mean, Mohit?

    Ian

    On Aug 30, 2012, at 5:45 PM, Stack wrote:

    On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia wrote:
    From what I've read it's advisable to do manual splits since you are able
    to spread the load in a more predictable way. If I am missing something
    please let me know.


    Where did you read that?
    St.Ack
  • Mohit Anchlia at Aug 31, 2012 at 12:05 am
    In general isn't it better to split the regions so that the load can be
    spread across the cluster to avoid hotspots?

    I read about pre-splitting here:

    http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
    On Thu, Aug 30, 2012 at 4:30 PM, Amandeep Khurana wrote:

    Also, you might have read that an initial loading of data can be better
    distributed across the cluster if the table is pre-split rather than
    starting with a single region and splitting (possibly aggressively,
    depending on the throughput) as the data loads in. Once you are in a stable
    state with regions distributed across the cluster, there is really no
    benefit in terms of spreading load by managing splitting manually v/s
    letting HBase do it for you. At that point it's about what Ian mentioned -
    predictability of latencies by avoiding splits happening at a busy time.
    On Thu, Aug 30, 2012 at 4:26 PM, Ian Varley wrote:

    The Facebook devs have mentioned in public talks that they pre-split their
    tables and don't use automated region splitting. But as far as I remember,
    the reason for that isn't predictability of spreading load, so much as
    predictability of uptime & latency (they don't want an automated split to
    happen at a random busy time). Maybe that's what you mean, Mohit?

    Ian

    On Aug 30, 2012, at 5:45 PM, Stack wrote:

    On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia <mohitanchlia@gmail.com>
    wrote:
    From what I've read it's advisable to do manual splits since you are able
    to spread the load in a more predictable way. If I am missing something
    please let me know.


    Where did you read that?
    St.Ack
  • Michael Stack at Aug 31, 2012 at 6:52 am

    On Thu, Aug 30, 2012 at 5:04 PM, Mohit Anchlia wrote:
    In general isn't it better to split the regions so that the load can be
    spread across the cluster to avoid hotspots?
    Time series data is a particular case [1] and the sematextians have
    tools to help w/ that particular loading pattern. Is time series your
    loading pattern? If so, yes, you need to employ some smarts (tsdb
    schema and write tricks or the hbasewd tool) to avoid hotspotting. But
    hotspotting is an issue apart from splits; you can split all you want
    and if your row keys are time series, splitting won't undo the hotspotting.

    You would split to distribute load over the cluster and HBase should
    be doing this for you w/o need of human intervention (caveat the
    reasons you might want to manually split as listed above by AK and
    Ian).

    St.Ack
    1. http://hbase.apache.org/book.html#rowkey.design
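
    The hbasewd-style write trick mentioned above can be sketched like this
    (a hedged illustration in Python, not the actual hbasewd API; the hash
    choice and the name salted_key are assumptions for this example):

```python
NUM_BUCKETS = 8  # number of key ranges to spread sequential writes over

def salted_key(sequential_key: bytes) -> bytes:
    """Prefix a sequential (e.g. timestamp) key with a stable one-byte
    bucket so otherwise-monotonic writes land in NUM_BUCKETS different
    key ranges instead of hammering one region."""
    bucket = sum(sequential_key) % NUM_BUCKETS  # any stable hash works
    return bytes([bucket]) + sequential_key

# The same key always maps to the same bucket, so reads stay possible:
assert salted_key(b"20120830") == salted_key(b"20120830")
# The cost: a time-range read becomes NUM_BUCKETS parallel scans,
# one per bucket prefix, merged client-side.
```

    This is the spreading trade-off the thread is circling: salting fixes
    hotspotting, while splitting alone does not.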
  • Mohit Anchlia at Aug 31, 2012 at 2:55 pm

    On Thu, Aug 30, 2012 at 11:52 PM, Stack wrote:
    On Thu, Aug 30, 2012 at 5:04 PM, Mohit Anchlia wrote:
    In general isn't it better to split the regions so that the load can be
    spread across the cluster to avoid hotspots?
    Time series data is a particular case [1] and the sematextians have
    tools to help w/ that particular loading pattern. Is time series your
    loading pattern? If so, yes, you need to employ some smarts (tsdb
    schema and write tricks or the hbasewd tool) to avoid hotspotting. But
    hotspotting is an issue apart from splits; you can split all you want
    and if your row keys are time series, splitting won't undo the hotspotting.

    My data is time series, and to get random distribution while still keeping
    a user's keys in the same region I am thinking of using
    md5(userid)+reversetimestamp as a row key. But with this type of key how
    can one do pre-splits? I have 30 nodes.

    You would split to distribute load over the cluster and HBase should
    be doing this for you w/o need of human intervention (caveat the
    reasons you might want to manually split as listed above by AK and
    Ian).

    St.Ack
    1. http://hbase.apache.org/book.html#rowkey.design
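
    Since an md5 prefix is uniform over the 128-bit keyspace, pre-split
    points for 30 regions can be computed by slicing that space evenly, the
    same idea behind the HexStringSplit algorithm in HBase's RegionSplitter
    utility. A minimal sketch (the function name presplit_points is made up;
    how the resulting byte strings are handed to table creation is left out):

```python
def presplit_points(num_regions: int, key_bytes: int = 16) -> list:
    """Evenly spaced split boundaries for a uniformly distributed
    (e.g. md5-prefixed) keyspace: num_regions - 1 boundary keys."""
    space = 2 ** (8 * key_bytes)  # 2^128 for a 16-byte md5 prefix
    return [
        (i * space // num_regions).to_bytes(key_bytes, "big")
        for i in range(1, num_regions)
    ]

# 29 boundaries carve the keyspace into 30 regions for a 30-node cluster.
splits = presplit_points(30)
```

    That said, as the reply below notes, if the key spread isn't known ahead
    of time it is reasonable to let HBase split on its own.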
  • Michael Stack at Aug 31, 2012 at 3:32 pm

    On Fri, Aug 31, 2012 at 7:55 AM, Mohit Anchlia wrote:
    My data is time series, and to get random distribution while still keeping
    a user's keys in the same region I am thinking of using
    md5(userid)+reversetimestamp as a row key. But with this type of key how
    can one do pre-splits? I have 30 nodes.
    If you don't know the key spread ahead of time, let HBase do the
    splitting for you?
    St.Ack
  • Doug Meil at Aug 31, 2012 at 1:09 pm
    Stack, re: "Where did you read that?", I think he might also be referring
    to this...

    http://hbase.apache.org/book.html#important_configurations





    On 8/30/12 8:04 PM, "Mohit Anchlia" wrote:

    In general isn't it better to split the regions so that the load can be
    spread across the cluster to avoid hotspots?

    I read about pre-splitting here:

    http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
    On Thu, Aug 30, 2012 at 4:30 PM, Amandeep Khurana wrote:

    Also, you might have read that an initial loading of data can be better
    distributed across the cluster if the table is pre-split rather than
    starting with a single region and splitting (possibly aggressively,
    depending on the throughput) as the data loads in. Once you are in a stable
    state with regions distributed across the cluster, there is really no
    benefit in terms of spreading load by managing splitting manually v/s
    letting HBase do it for you. At that point it's about what Ian mentioned -
    predictability of latencies by avoiding splits happening at a busy time.

    On Thu, Aug 30, 2012 at 4:26 PM, Ian Varley <ivarley@salesforce.com>
    wrote:
    The Facebook devs have mentioned in public talks that they pre-split their
    tables and don't use automated region splitting. But as far as I remember,
    the reason for that isn't predictability of spreading load, so much as
    predictability of uptime & latency (they don't want an automated split to
    happen at a random busy time). Maybe that's what you mean, Mohit?

    Ian

    On Aug 30, 2012, at 5:45 PM, Stack wrote:

    On Thu, Aug 30, 2012 at 7:35 AM, Mohit Anchlia <mohitanchlia@gmail.com>
    wrote:
    From what I've read it's advisable to do manual splits since you are able
    to spread the load in a more predictable way. If I am missing something
    please let me know.


    Where did you read that?
    St.Ack
  • Michael Stack at Aug 31, 2012 at 3:30 pm

    On Fri, Aug 31, 2012 at 6:09 AM, Doug Meil wrote:

    Stack, re: "Where did you read that?", I think he might also be referring
    to this...

    http://hbase.apache.org/book.html#important_configurations
    I'd say we need to revisit that paragraph. It gives a 'wrong'
    impression. It starts out w/ a blanket statement that the user should do
    manual splitting. I filed
    https://issues.apache.org/jira/browse/HBASE-6701.

    St.Ack

Discussion Overview
group: user
categories: hbase, hadoop
posted: Aug 29, '12 at 10:57p
active: Aug 31, '12 at 3:32p
posts: 14
users: 5
website: hbase.apache.org
