Hello,

can anybody share some insights on how timerange/timestamp filters are
processed?

Basically we intend to use timerange/timestamp filters to process
relatively new data, as judged by insertion timestamp.

- How does the process of skipping records and/or regions work if one
uses timerange filters?
- I also wonder, do timestamps change when e.g. running a major
compaction?
- As data grows over the years, is there any chance that regions with
"older" rows stay "stable" in such a way that they can be skipped very
quickly when querying data with a timerange filter of e.g. the last
three years?

Thanks,
Thomas


  • Stuart Smith at Dec 14, 2011 at 5:39 pm
    Hello Thomas,

    Someone here could probably provide more help, but to start you off: the only way I've filtered on timestamps is to do a scan and filter out rows one by one. This definitely sounds like something coprocessors could help with, but I don't really understand those yet, so someone else will have to step up, or you can dig into the documentation about them (AFAIK, it's a little bit of custom code that runs on the regionservers and can pre-process your gets, but don't quote me on that!).

    But I can say that a major compaction should not affect them - I've never seen it happen, and if it does, I believe that's a bug.

    Take care,
      -stu



    ________________________________
    From: Steinmaurer Thomas <[email protected]>
    To: [email protected]
    Sent: Wednesday, December 14, 2011 12:38 AM
    Subject: Questions on timestamps, insights on how timerange/timestamp filter are processed?

  • Carson Hoffacker at Dec 14, 2011 at 6:30 pm
    The timerange scan is able to leverage metadata in each of the HFiles. Each
    HFile should store information about the timerange associated with the data
    within the HFile. If the timerange associated with the HFile does not
    overlap the timerange you are interested in, that HFile will be
    skipped completely. This can significantly increase scan performance.
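
A minimal sketch of the file-skipping idea described above, in plain Java with no HBase dependency. The class `TimeRangeSkipSketch`, the `FileTimeRange` type, and the file names are hypothetical stand-ins for the per-HFile metadata, not the actual HBase internals. The query range is treated as half-open, [min, max), which is the convention `Scan.setTimeRange` uses.

```java
import java.util.ArrayList;
import java.util.List;

public class TimeRangeSkipSketch {

    /** Hypothetical stand-in for per-file metadata: smallest and largest cell timestamp in one file. */
    static final class FileTimeRange {
        final String name;
        final long minTs; // inclusive
        final long maxTs; // inclusive

        FileTimeRange(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        /** True if this file may hold cells inside the half-open query range [queryMin, queryMax). */
        boolean mayContain(long queryMin, long queryMax) {
            return maxTs >= queryMin && minTs < queryMax;
        }
    }

    /** Returns the names of the files that must be read; all others can be skipped unopened. */
    static List<String> filesToRead(List<FileTimeRange> files, long queryMin, long queryMax) {
        List<String> selected = new ArrayList<>();
        for (FileTimeRange f : files) {
            if (f.mayContain(queryMin, queryMax)) {
                selected.add(f.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileTimeRange> files = new ArrayList<>();
        files.add(new FileTimeRange("file-a", 100L, 199L));
        files.add(new FileTimeRange("file-b", 200L, 299L));
        files.add(new FileTimeRange("file-c", 300L, 399L));
        // file-a ends before the query range starts, so only file-b and file-c are read.
        System.out.println(filesToRead(files, 250L, 350L)); // prints [file-b, file-c]
    }
}
```

Note that a file is skipped only when its range lies entirely outside the query range; a file that overlaps the range even partially still has to be read.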

    However, when these files get compacted and the data is merged into a
    smaller number of files, the time range associated with each file
    widens, so fewer files can be skipped. I don't think it works this way
    out of the box, but I believe you can be smart about how you manage
    compactions over time to get the behavior that you want. You could have
    compactions compact all the data from January 2011 into a single file,
    and then compact all the data from February 2011 into a different file.
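
The effect of compaction on file time ranges can be illustrated with a small plain-Java sketch. It does not use any HBase compaction API; `CompactionRangeSketch`, `Range`, and `mergePerBucket` are hypothetical names, fixed-width buckets stand in for calendar months, and each input file is assumed to fall entirely within one bucket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CompactionRangeSketch {

    /** Hypothetical per-file time range, both ends inclusive. */
    static final class Range {
        final long minTs;
        final long maxTs;

        Range(long minTs, long maxTs) {
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        @Override
        public String toString() {
            return "[" + minTs + ", " + maxTs + "]";
        }
    }

    /** Merging files yields one file whose range covers every input range. */
    static Range merge(List<Range> inputs) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (Range r : inputs) {
            min = Math.min(min, r.minTs);
            max = Math.max(max, r.maxTs);
        }
        return new Range(min, max);
    }

    /** Groups files into buckets by their minimum timestamp and merges each bucket on its own. */
    static Map<Long, Range> mergePerBucket(List<Range> inputs, long bucketWidth) {
        Map<Long, List<Range>> buckets = new TreeMap<>();
        for (Range r : inputs) {
            long bucket = Math.floorDiv(r.minTs, bucketWidth);
            buckets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(r);
        }
        Map<Long, Range> merged = new TreeMap<>();
        for (Map.Entry<Long, List<Range>> e : buckets.entrySet()) {
            merged.put(e.getKey(), merge(e.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Range> files = new ArrayList<>();
        files.add(new Range(0, 40));
        files.add(new Range(50, 90));
        files.add(new Range(100, 140));
        files.add(new Range(150, 190));
        // One big compaction: a single file spanning everything, so no query can skip it.
        System.out.println(merge(files)); // prints [0, 190]
        // Bucketed compaction: one file per bucket, each with a narrow range.
        System.out.println(mergePerBucket(files, 100)); // prints {0=[0, 90], 1=[100, 190]}
    }
}
```

With the bucketed result, a query for timestamps 100 and later would only need the second file, whereas the single merged file would always have to be read.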

    -Carson
  • Sam Seigal at Dec 14, 2011 at 7:35 pm
    That is an interesting comment. How would you enforce this in practice?
    Can you give more details?
  • Stuart Smith at Dec 14, 2011 at 11:37 pm
    Ah. Thanks for clarifying my wrong answer!

    The only time I had to deal with timestamps I had to go through the Thrift API...
    Never noticed setTimeRange in the Scan() Java API :)

    So now I'm curious: if I use this and it can't skip HFiles, is there any performance gain from doing this vs. doing it client side?
    Or is it basically the same amount of work, a full scan checking & skipping timestamps?


    Take care,
      -stu



  • Carson Hoffacker at Dec 15, 2011 at 4:36 am
    I believe it's the same amount of work.
  • Steinmaurer Thomas at Dec 15, 2011 at 9:32 am
    In our tests, filtering rows by timestamp was much faster than using
    a filter that results in a full table scan. However, I question how
    reliable it is to use the internal timestamp to detect new data, and
    whether this still scales as the amount of data grows over the years.

    Regards,
    Thomas


Discussion Overview
group: user @
categories: hbase, hadoop
posted: Dec 14, '11 at 8:59a
active: Dec 15, '11 at 9:32a
posts: 7
users: 4
website: hbase.apache.org
