Hi there,

I am pretty new to HBase and I am trying to understand the best
practice for scanning on two or more partial components of the
row key.

For example, I have a row key like: orderId-timeStamp-item. The
orderId has nothing to do with the timeStamp, and I have a requirement to
scan rows for certain orderIds (a range of orderIds) within a certain
time period. I am not sure if it is possible to perform two
partial scans: one for the orderId and another for the timeStamp.

Also, using a regular expression on the row key might work, but it
is more expensive. So I am wondering what the best practice would be
for solving such a problem.


Thanks in advance,

James

  • Ian Varley at Feb 14, 2012 at 6:01 pm
    James,

    Are your orderIds ordered? You say "a range of orderIds", which implies that they are (i.e. they're sequential numbers like 001, 002, etc., not hashes or random values). If so, then a single scan can hit the rows for multiple contiguous orderIds (you'd set the start and stop rows based on a prefix of the row key that's just the length of the orderId).
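
As a rough sketch of the prefix-based start/stop rows described above: the snippet below simulates a table as a sorted list of row-key strings rather than calling HBase. The zero-padded key layout, `make_key`, `scan`, and `order_range_bounds` are all assumptions made for illustration, not HBase API.

```python
from bisect import bisect_left

def make_key(order_id: int, ts: int, item: str) -> str:
    # Assumed layout: zero-padded fields, so string order matches numeric order.
    return f"{order_id:03d}-{ts:010d}-{item}"

def scan(sorted_keys, start_row, stop_row):
    """Return keys in [start_row, stop_row), like a scan with start/stop rows."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

def order_range_bounds(first_id: int, last_id: int):
    # Start at the first orderId prefix; stop at the prefix of the orderId
    # just past the last one (the stop row is exclusive).
    # Assumes last_id + 1 still fits in the three-digit width.
    return f"{first_id:03d}-", f"{last_id + 1:03d}-"

keys = sorted(make_key(o, t, i) for o, t, i in [
    (1, 100, "a"), (2, 100, "a"), (2, 250, "b"), (3, 300, "a"), (4, 50, "a"),
])
start, stop = order_range_bounds(2, 3)
rows = scan(keys, start, stop)
print(rows)
```

Only the contiguous block of rows for orderIds 002 and 003 is touched; rows for 001 and 004 are never read.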

    Another question: are the time ranges you're scanning a big or small proportion of all the rows for each order id? If you generally expect to return a majority of the rows per each order, then a single scan (starting with the lowest orderId, and proceeding to the highest) is possibly still a good fit. You can also apply timestamp filters (which enables an optimization to exclude storefiles that couldn't possibly contain values in that timestamp range); that only works if the timestamps on your cells match the timestamp in the row key.
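
To make the trade-off in the single-scan approach concrete, here is a small simulation that scans one orderId range and filters on the timestamp embedded in the row key, counting how many rows are examined versus returned. It does not model the storefile-exclusion optimization mentioned above; the key layout and the helper names (`parse_ts`, `scan_with_time_filter`) are assumptions.

```python
from bisect import bisect_left

def parse_ts(row_key: str) -> int:
    # Assumed key layout: orderId-timeStamp-item, fields separated by "-".
    return int(row_key.split("-")[1])

def scan_with_time_filter(sorted_keys, start_row, stop_row, t_min, t_max):
    """Scan [start_row, stop_row) and keep rows with t_min <= ts < t_max."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    examined, matched = 0, []
    for key in sorted_keys[lo:hi]:
        examined += 1  # every row in the orderId range is still read
        if t_min <= parse_ts(key) < t_max:
            matched.append(key)
    return matched, examined

keys = [
    "001-0000000100-a", "001-0000000900-b",
    "002-0000000150-a", "002-0000000800-b",
    "003-0000000120-a",
]
matched, examined = scan_with_time_filter(keys, "001-", "003-", 100, 200)
print(matched, examined)
```

If most rows in the range match the time window, `examined` stays close to `len(matched)` and the single scan is a good fit; if only a few match, most of the reads are wasted.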

    Alternatively, if you expect to return only a small portion of the records (i.e. you keep a lot of items with a wide range of timestamps in each orderId, but you only want to retrieve a small set of them), you might want to do one scan per orderId. You can choose how much parallelism to put into it by controlling that yourself (i.e. use a thread per scan on the client side); you could theoretically do a thread per orderId, but of course, if you have a very large number of them, that could be harmful.
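
The one-scan-per-orderId idea with client-controlled parallelism might look roughly like this. It is again a simulation over a sorted list of keys, with an assumed zero-padded key layout; `scan_order`, `scan_orders_parallel`, and the `max_workers` cap are illustrative choices, not part of HBase.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

def scan(sorted_keys, start_row, stop_row):
    """Return keys in [start_row, stop_row)."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

def scan_order(sorted_keys, order_id, t_min, t_max):
    # Both bounds carry the orderId prefix, so only this order's rows
    # with t_min <= ts < t_max are read.
    start = f"{order_id:03d}-{t_min:010d}"
    stop = f"{order_id:03d}-{t_max:010d}"
    return scan(sorted_keys, start, stop)

def scan_orders_parallel(sorted_keys, order_ids, t_min, t_max, max_workers=4):
    # A bounded pool caps the number of concurrent scans, instead of
    # spawning one thread per orderId.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_order = pool.map(
            lambda oid: scan_order(sorted_keys, oid, t_min, t_max), order_ids
        )
        return [row for rows in per_order for row in rows]

keys = [
    "001-0000000100-a", "001-0000000900-b",
    "002-0000000150-a", "002-0000000800-b",
    "003-0000000120-a",
]
rows = scan_orders_parallel(keys, [1, 2], 100, 200)
print(rows)
```

Each per-order scan reads only the rows inside the time window, at the cost of one scan (and its round trips) per orderId.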

    A regular expression doesn't get you past the fundamental requirement, which is that at the server side, it has to look at every row (excepting special optimizations like the timestamp one I mentioned above).
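
A toy illustration of that point: with no start or stop row to bound it, a regular-expression match has to be tested against every key. The pattern and the `regex_scan` helper below are assumptions for a zero-padded orderId-timeStamp-item layout, and the sorted list stands in for the table.

```python
import re

def regex_scan(sorted_keys, pattern):
    """Test every key against the pattern and count how many were examined."""
    regex = re.compile(pattern)
    examined, matched = 0, []
    for key in sorted_keys:  # no start/stop row, so every key is tested
        examined += 1
        if regex.match(key):
            matched.append(key)
    return matched, examined

keys = [
    "001-0000000100-a", "001-0000000900-b",
    "002-0000000150-a", "002-0000000800-b",
    "003-0000000120-a",
]
# orderIds 001-002 with timestamps 100-199 (assumed fixed-width fields)
matched, examined = regex_scan(keys, r"00[12]-00000001\d\d-")
print(matched, examined)
```

The result set is the same as a bounded scan would return, but `examined` equals the total number of rows.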

    Your best bet is to implement it a couple ways, with real data, and see which ones seem to work the fastest.

    Ian

  • James Young at Feb 15, 2012 at 2:31 am
    Thank you Ian! Yes, the orderIds are ordered.

    I might try the timeStamp filter, but it still doesn't provide an early-out
    feature, and I'm not sure how it would perform. Do you think it
    might be worth writing a custom filter to do the two partial scans?

    Thanks again.
    James
  • NNever at Feb 15, 2012 at 2:35 pm
    Hi James, I'm new to HBase too.
    How about this:

    From the range of orderIds, take the first ID.
    Step 1: set this ID as the start row, then look up the closest existing
    ID (fetch only one row).
    Step 2: with the fetched ID, call setStartRow(fetchedID-startTimestamp)
    and setStopRow(fetchedID-endTimestamp), and scan.
    Step 3: use the fetched ID to build a new start row, then look up the
    next closest ID (again fetching only one row).
    Then repeat Steps 2 and 3 until you reach the end of the ID range.

    I think this will be fast without using a Filter (it relies only on the
    lexicographic ordering of the row keys). The only problem is that there may
    be too many RPC calls. To solve that, you can use an Endpoint to run those
    scans on the RegionServer and combine the results in a single RPC call.
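
The loop in the steps above can be sketched as follows. This is a simulation over a sorted list of keys, not HBase client or Endpoint code; the zero-padded key layout and the `skip_scan` helper are assumptions, and `seeks` simply counts the one-row lookups that would each be an RPC from the client.

```python
from bisect import bisect_left

def skip_scan(sorted_keys, first_id, last_id, t_min, t_max):
    """Alternate between a one-row lookup for the next orderId and a bounded
    scan of that orderId's time window [t_min, t_max)."""
    results, seeks = [], 0
    probe = f"{first_id:03d}-"
    end = f"{last_id + 1:03d}-"  # exclusive end of the orderId range
    while True:
        # Steps 1 and 3: find the single closest row at or after the probe.
        idx = bisect_left(sorted_keys, probe)
        seeks += 1
        if idx == len(sorted_keys) or sorted_keys[idx] >= end:
            break
        order_prefix = sorted_keys[idx].split("-")[0]
        # Step 2: bounded scan over this orderId's time window.
        lo = bisect_left(sorted_keys, f"{order_prefix}-{t_min:010d}")
        hi = bisect_left(sorted_keys, f"{order_prefix}-{t_max:010d}")
        results.extend(sorted_keys[lo:hi])
        # Jump past all remaining rows of this orderId.
        probe = f"{int(order_prefix) + 1:03d}-"
    return results, seeks

keys = [
    "001-0000000100-a", "001-0000000900-b",
    "002-0000000150-a",
    "005-0000000130-a", "005-0000000700-b",
    "007-0000000110-a",
]
results, seeks = skip_scan(keys, 1, 5, 100, 200)
print(results, seeks)
```

Missing orderIds (003 and 004 here) cost nothing extra, because each lookup lands directly on the next ID that actually exists.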



Discussion Overview
group: user @
categories: hbase, hadoop
posted: Feb 14, '12 at 5:46p
active: Feb 15, '12 at 2:35p
posts: 4
users: 3
website: hbase.apache.org
