In our tests, filtering rows by timestamp was much faster than using
a filter that results in a full table scan. But I question the
reliability of using the internal timestamp to detect new data, and
whether this still scales as the amount of data grows over the years.
Regards,
Thomas
-----Original Message-----
From: Carson Hoffacker
Sent: Thursday, December 15, 2011 05:36
To: [email protected]; Stuart Smith
Subject: Re: Questions on timestamps, insights on how
timerange/timestamp filter are processed?
I believe it's the same amount of work.
On Wed, Dec 14, 2011 at 3:37 PM, Stuart Smith wrote:
Ah. Thanks for clarifying my wrong answer.. !
The only time I had to deal with timestamps, I had to go through the
Thrift API ...
Never noticed the setTimeRange in the Scan() java API :)
So now I'm curious.. If I use this and it can't skip HFiles.. is there
any performance gain from doing this vs doing it client side?
Or is it basically the same amount of work - a full scan checking &
skipping timestamps.. ?
Take care,
-stu
________________________________
From: Carson Hoffacker <[email protected]>
To: [email protected]; Stuart Smith <[email protected]>
Sent: Wednesday, December 14, 2011 10:29 AM
Subject: Re: Questions on timestamps, insights on how
timerange/timestamp filter are processed?
The timerange scan is able to leverage metadata in each of the HFiles.
Each HFile should store information about the timerange associated
with the data within the HFile. If the timerange associated with the
HFile does not overlap the timerange you are interested in, that
HFile will be skipped completely. This can significantly increase scan
performance.
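To make that concrete, here is a minimal, self-contained model of the per-file skip check (plain Java; the class and method names are illustrative, this is not HBase's actual internal code). A time-range scan over [min, max) can skip any file whose recorded time range falls entirely outside the scan's range:

```java
// Illustrative model of per-HFile time-range metadata and the overlap
// check a time-range scan can use to skip whole files. NOT HBase's
// actual code, just a sketch of the idea.
final class HFileMeta {
    final long minTs, maxTs; // time range covered by the file's cells

    HFileMeta(long minTs, long maxTs) {
        this.minTs = minTs;
        this.maxTs = maxTs;
    }

    // The file can be skipped entirely when its time range does not
    // overlap the scan's half-open [scanMin, scanMax) range.
    boolean canSkip(long scanMin, long scanMax) {
        return maxTs < scanMin || minTs >= scanMax;
    }
}

public class TimeRangeSkipDemo {
    public static void main(String[] args) {
        HFileMeta oldFile = new HFileMeta(1_000L, 2_000L);
        HFileMeta newFile = new HFileMeta(5_000L, 9_000L);

        long scanMin = 4_000L, scanMax = 10_000L;
        System.out.println(oldFile.canSkip(scanMin, scanMax)); // true: old data, skipped
        System.out.println(newFile.canSkip(scanMin, scanMax)); // false: overlaps, must be read
    }
}
```

A client-side filter never gets to run this check, which is where the performance difference comes from.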
However, when these files get compacted and the data is merged into a
smaller number of files, the time range associated with each file
increases. I don't think it works this way out of the box, but I
believe you can be smart about how you manage compactions over time to
get the behavior that you want. You could have compactions compact all
the data from January 2011 into a single file, and then compact all
the data from February 2011 into a different file.
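A sketch of that bucketing idea (purely illustrative plain Java, not an actual HBase compaction policy): if a custom compaction strategy grouped cells by calendar month into one output file per month, each file's recorded time range would stay narrow enough to skip:

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: groups cell timestamps by calendar month, the way
// a custom compaction strategy might write one file per month so each
// file's time-range metadata stays narrow.
public class MonthlyBucketDemo {
    static Map<YearMonth, List<Long>> bucketByMonth(List<Long> timestamps) {
        Map<YearMonth, List<Long>> buckets = new TreeMap<>();
        for (long ts : timestamps) {
            YearMonth ym = YearMonth.from(
                    Instant.ofEpochMilli(ts).atZone(ZoneOffset.UTC));
            buckets.computeIfAbsent(ym, k -> new ArrayList<>()).add(ts);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Two cells from January 2011, one from February 2011 (UTC millis).
        List<Long> ts = List.of(1294000000000L, 1295000000000L, 1297000000000L);
        System.out.println(bucketByMonth(ts).keySet()); // [2011-01, 2011-02]
    }
}
```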
-Carson
On Wed, Dec 14, 2011 at 9:39 AM, Stuart Smith wrote:
Hello Thomas,
Someone here could probably provide more help, but to start you
off: the only way I've filtered timestamps is to do a scan and just
filter out rows one by one.
rows one by one. This definitely sounds like something coprocessors
could help with, but I don't really understand those yet, so someone
else will have to step up.. or you can really dig into the
documentation about them (AFAIK, it's a little bit of custom code
that runs on the regionservers that can pre-process your gets.. but
don't quote me on that!).
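For reference, the client-side approach described above boils down to scanning everything and dropping rows whose timestamp falls outside the range. A plain-Java sketch (not real HBase client code; the Row record stands in for scan results):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of client-side time filtering: the server returns every row
// and the client discards those outside the range. Unlike a server-side
// setTimeRange scan, nothing can be skipped early.
public class ClientSideFilterDemo {
    record Row(String key, long ts) {}

    static List<Row> filterByTime(List<Row> scanned, long min, long max) {
        return scanned.stream()
                .filter(r -> r.ts() >= min && r.ts() < max) // half-open [min, max)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> all = List.of(
                new Row("a", 1_000L), new Row("b", 5_000L), new Row("c", 9_000L));
        System.out.println(filterByTime(all, 4_000L, 8_000L)); // keeps only "b"
    }
}
```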
But I can say that a major compaction should not affect them - I've
never seen it happen, and if it does, I believe that's a bug.
Take care,
-stu
________________________________
From: Steinmaurer Thomas <[email protected]>
To: [email protected]
Sent: Wednesday, December 14, 2011 12:38 AM
Subject: Questions on timestamps, insights on how
timerange/timestamp filter are processed?
Hello,
can anybody share some insights on how timerange/timestamp filters
are processed?
Basically, we intend to use timerange/timestamp filters to process
fairly new data, based on the insertion timestamp.
- How does the process of skipping records and/or regions work, if
one uses timerange filters?
- I also wonder: do timestamps change when e.g. running a major
compaction?
- If data grows over the years, is there any chance that regions
with "older" rows stay "stable" in a way that they can be skipped
very quickly when querying data with a timerange filter of e.g. the
last three years?
Thanks,
Thomas