I've been looking at doing the following TODO item:

Allow ORDER BY ... LIMIT # to select high/low value without sort or index
using a sequential scan for highest/lowest values

Right now, if no index exists, ORDER BY ... LIMIT # requires us to sort all
values to return the high/low value. The idea is instead to do a
sequential scan to find the high/low value, thus avoiding the sort.
MIN/MAX already does this, but not for LIMIT > 1.

I think this is pretty important to cover at some point, because really _not_
doing this is just wrong. We're simply not supporting the correct plan for this
type of query. Currently we're using an O(n log n) plan when the right plan
would usually be O(n). (Strictly it's O(n log m), where m is the limit, but m
is usually small and not interesting.)

The way I see to do this is to keep using a Sort node with a tuplesort, but
to arrange for the maximum number of tuples needed to be passed down to the
tuplesort so it can throw out tuples as it sorts.

My plan is to have tuplesort reuse the existing heap code it uses for tape
merges, only keeping the memtuples array in a max-heap (instead of the
min-heap it uses now -- that means having a tuplesortstate flag indicating
which order and having the tuplesort_heap* functions check that flag). When it
reaches the limit it can throw away either the new element or the top element
on every insert.
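
To illustrate, here is a minimal standalone sketch of the bounded max-heap
idea, using plain ints in place of SortTuples. The names and structure here
are just illustrative, not actual tuplesort code:

    #include <stdio.h>

    #define LIMIT 5                 /* the LIMIT from the query */

    static int heap[LIMIT];         /* max-heap: heap[0] is the largest kept */
    static int heap_count = 0;

    static void swap(int i, int j)
    {
        int tmp = heap[i]; heap[i] = heap[j]; heap[j] = tmp;
    }

    static void sift_up(int pos)
    {
        while (pos > 0 && heap[(pos - 1) / 2] < heap[pos])
        {
            swap(pos, (pos - 1) / 2);
            pos = (pos - 1) / 2;
        }
    }

    static void sift_down(int pos)
    {
        for (;;)
        {
            int child = 2 * pos + 1;

            if (child >= heap_count)
                break;
            if (child + 1 < heap_count && heap[child + 1] > heap[child])
                child++;
            if (heap[pos] >= heap[child])
                break;
            swap(pos, child);
            pos = child;
        }
    }

    /* Offer one value; retain only the LIMIT smallest values seen so far. */
    static void bounded_insert(int v)
    {
        if (heap_count < LIMIT)
        {
            heap[heap_count++] = v;
            sift_up(heap_count - 1);
        }
        else if (v < heap[0])
        {
            heap[0] = v;            /* the new value evicts the largest kept */
            sift_down(0);
        }
        /* otherwise v cannot be in the top LIMIT: throw it away at once */
    }

    int main(void)
    {
        int data[] = {42, 7, 19, 3, 88, 1, 56, 23, 11, 64};

        for (int i = 0; i < 10; i++)
            bounded_insert(data[i]);

        /* heap now holds the five smallest values {1,3,7,11,19}, heap-ordered */
        for (int i = 0; i < heap_count; i++)
            printf("%d ", heap[i]);
        printf("\n");
        return 0;
    }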

I considered using a simple insertion sort instead, but was worried that the
performance would degrade as the limit clause grows large. I don't think
that's a major use case, but I don't like the idea of an O(n^2) algorithm
lying in wait to ambush someone.

Also, because heap sort is slower than qsort (on average anyways) it makes
sense to not bother with the heap until the number of tuples grows well beyond
the limit or until it would otherwise spill to disk.
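
The switch-over test could then be as simple as the following, where the
factor of two is a guess rather than a measured threshold:

    #include <stdbool.h>

    /*
     * Hypothetical heuristic: keep accumulating tuples for a plain qsort
     * until the count clearly exceeds the limit, or until we would have
     * to spill to disk anyway.
     */
    static bool
    should_switch_to_bounded_heap(int memtupcount, int bound, bool would_spill)
    {
        return would_spill || memtupcount > 2 * bound;
    }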

To actually get this information to the tuplesort, it has to be fed down to
the SortState from the LimitState somehow, and I'm not sure how to do that.
There isn't currently any abstract interface between nodes for passing
information like this.

The simple solution is that ExecLimit could just peek at its outerPlanState,
and if it's a SortState it can set some fields so the SortState knows to
pass the information to the tuplesort.
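
Something along these lines, as a sketch only (the bounded/bound fields are
hypothetical additions to SortState):

    #include "postgres.h"

    #include "nodes/execnodes.h"

    /*
     * If Limit's immediate child is a Sort, hand it the maximum number of
     * tuples it will ever need to keep: offset + count.
     */
    static void
    pass_limit_hint(LimitState *node)
    {
        PlanState *child = outerPlanState(node);

        if (IsA(child, SortState))
        {
            SortState *sortstate = (SortState *) child;

            sortstate->bounded = true;          /* hypothetical field */
            sortstate->bound = node->offset + node->count;
        }
    }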

I've also considered a more abstract interface, such as adding an ExecAdvise()
function that would pass some sort of structure (a node?) down with the
information. This seems like overkill for a single integer, but I wonder if
there would be other consumers of such an interface.

The current eflags could be swallowed up by this, though I don't see any
particular advantage. More realistically, a Unique node could also inform a
Sort node that it can throw away duplicates as it sorts. A limit could even be
passed *through* a Unique node as long as the Sort understands how to handle
the combination properly. In other areas, a Hash Aggregate could start
throwing away elements once the number of elements in the hash grows to the
limit.

Alternatively we could have Limit(Sort()), Unique(Sort()), and
Limit(Unique(Sort())) be handled by new types of Sort nodes entirely and not
introduce the Limit and Unique nodes at all. I would worry about duplicated
code in that case, though; in particular, it seems like there would be cases
where we still want to use qsort rather than throw away unneeded tuples, and
not throwing away unneeded tuples means reimplementing all of nodeLimit in
nodeSort for those cases. And that doesn't help with other cases like
Hash Aggregate.

Or am I overthinking this and having some state nodes peek inside other state
nodes is normal?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

  • Tom Lane at Sep 15, 2006 at 4:53 pm

    Gregory Stark writes:
    I've been looking at doing the following TODO item:
    Allow ORDER BY ... LIMIT # to select high/low value without sort or index
    using a sequential scan for highest/lowest values
    I think this is pretty important to cover at some point, because really _not_
    doing this is just wrong.
    I can't get all *that* excited about it, since an index solves the
    problem.
    The way I see to do this is to keep using a Sort node with a tuplesort, but
    to arrange for the maximum number of tuples needed to be passed down to the
    tuplesort so it can throw out tuples as it sorts.
    The implementation that was proposed in the earlier discussion did not
    involve hacking the sort code beyond recognition ;-).

    I believe a better way to think about this would be as an aggregate that
    remembers the top N rows. It can't quite be an aggregate as it stands
    (unless we want to invent aggregates that can return SETOF?) but I
    think there might be some useful overlap with the SQL2003
    window-function concept.

    regards, tom lane
  • Andrew Dunstan at Sep 15, 2006 at 5:00 pm

    Tom Lane wrote:
    (unless we want to invent aggregates that can return SETOF?)
    Doesn't sound like a bad idea at all ...

    cheers

    andrew
  • Gregory Stark at Sep 15, 2006 at 7:12 pm

    Tom Lane writes:

    I believe a better way to think about this would be as an aggregate that
    remembers the top N rows.
    Wouldn't such a thing just be a reimplementation of a tuplesort though? I
    mean, it's storing tuples you feed it, sorting them, and spitting them back
    out in sorted order.

    What would you do if the set of tuples turned out to be larger than you
    expected and didn't fit in memory? Create a tuplesort and pass them on to it?

    I've already looked at tuplesort and the changes there are minimal. The hard
    part is what to do in the planner and executor to get the information to the
    tuplesort. Do we want the plan to look the way it does now, or should we use
    some new sort of node that consolidates the limit and the sort in the same
    place?

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Alvaro Herrera at Sep 15, 2006 at 7:24 pm

    Gregory Stark wrote:
    Tom Lane <tgl@sss.pgh.pa.us> writes:
    I believe a better way to think about this would be as an aggregate that
    remembers the top N rows.
    Wouldn't such a thing just be a reimplementation of a tuplesort though? I
    mean, it's storing tuples you feed it, sorting them, and spitting them back
    out in sorted order.
    I don't know if this is the same thing you are talking about, but Oleg
    talked to me at the conference about "partial sort", which AFAICS is
    about the same thing you are talking about. I think Teodor submitted a
    patch to implement it, which was rejected for not being general
    enough.

    --
    Alvaro Herrera http://www.CommandPrompt.com/
    PostgreSQL Replication, Consulting, Custom Development, 24x7 support
  • Gregory Stark at Sep 15, 2006 at 10:05 pm

    Alvaro Herrera writes:

    I don't know if this is the same thing you are talking about, but Oleg
    talked to me at the conference about "partial sort", which AFAICS is
    about the same thing you are talking about. I think Teodor submitted a
    patch to implement it, which was rejected for not being general
    enough.

    Oof, you have a long memory. Oleg does reference such a thing in his 2002 post
    that ended up resulting in the TODO item. I can't find the original patch but
    I doubt any patch against 7.1 is going to be all that helpful in understanding
    what to do today.

    I'm also confused how he only saw a factor of 6 improvement in reading the top
    100 out of a million. I would expect much better.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Nicolas Barbier at Sep 16, 2006 at 9:34 am

    2006/9/16, Gregory Stark <stark@enterprisedb.com>:

    Alvaro Herrera <alvherre@commandprompt.com> writes:
    I don't know if this is the same thing you are talking about, but Oleg
    talked to me at the conference about "partial sort", which AFAICS is
    about the same thing you are talking about. I think Teodor submitted a
    patch to implement it, which was rejected for not being general
    enough.
    Oof, you have a long memory. Oleg does reference such a thing in his 2002 post
    that ended up resulting in the TODO item. I can't find the original patch but
    I doubt any patch against 7.1 is going to be all that helpful in understanding
    what to do today.

    I'm also confused how he only saw a factor of 6 improvement in reading the top
    100 out of a million. I would expect much better.
    For example, consider the case in which 6 passes are needed to do the
    full sort. Then, for a "partial sort", at least the first of these
    passes has to be fully executed, because one needs to read at least
    all the data once to find the "top n". So even if the remaining passes
    were free, the speedup could not be better than a factor of 6.

    greetings,
    Nicolas
  • Gregory Stark at Sep 15, 2006 at 7:22 pm

    Tom Lane writes:

    Gregory Stark <stark@enterprisedb.com> writes:
    I think this is pretty important to cover at some point, because really _not_
    doing this is just wrong.
    I can't get all *that* excited about it, since an index solves the
    problem.
    Well, I'm not all *that* excited about it either; it's just another plan, and
    there are an infinite number of possible plans out there we could implement
    for various corner cases.

    But just in case it's not clear to anyone, the usual use case for this is
    paging results on a web page. As much as I normally try to convince people
    they don't want to do it that way, they usually do end up with it implemented
    using limit/offset. And Postgres is currently absolutely *awful* at running
    those queries.

    Often the killer requirement that makes it infeasible to create an index is
    precisely that they want to be able to sort on any of a long list of possible
    keys. Creating dozens of indexes on every table isn't too appealing.

    And in any case the query is often a join where the data in the sort key isn't
    even all coming from the same table or where you need to use other indexes to
    fetch the data prior to the sort.

    I won't discourage anyone from working on OLAP queries, and this is indeed a
    similar idea. I suspect the same functionality in tuplesort, being able to
    set a maximum number of tuples to keep, will be useful there too.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Mark at Sep 15, 2006 at 8:41 pm

    On Fri, Sep 15, 2006 at 08:22:50PM +0100, Gregory Stark wrote:
    But just in case it's not clear to anyone, the usual use case for
    this is paging results on a web page. As much as I normally try to
    convince people they don't want to do it that way, they usually do
    end up with it implemented using limit/offset. And Postgres
    is currently absolutely *awful* at running those queries.
    I'm curious, as I may be such an offender. What alternatives exist?

    I think you mean the concept of showing a page of information that
    you can navigate forwards and backwards by a page, or select a range.

    What alternatives to limit/offset exist? If there are thousands or
    more results, I have trouble with the idea that the entire result set
    should be queried and cached, displaying only a fraction.

    I think most or all of the times I do this, an index is available,
    so perhaps that is why I don't find this issue problematic?

    For implementation, I think something simple is best:

    - limit X offset Y

    This means keeping a set of X+Y tuples as follows:

    1) If the set of X+Y tuples still has room, insert the row using a binary
    search that retains the ordering characteristics that would be had if
    limit/offset had not been used.

    2) If the row is less than the X+Y'th tuple, remove the X+Y'th
    tuple from the set, and insert the row as per 1).

    3) Otherwise, ignore the tuple.

    At the end, you return from the set starting at Y+1, and ending at Y+X.

    If X+Y tuples would take up too much memory, this plan should be
    skipped, and the general routines used instead.
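
    A rough standalone sketch of the above, with ints standing in for
    tuples and the sizes picked arbitrarily:

        #include <stdio.h>
        #include <string.h>

        #define X 5                     /* the LIMIT */
        #define Y 2                     /* the OFFSET */
        #define KEEP (X + Y)

        static int set[KEEP];           /* kept sorted ascending */
        static int set_count = 0;

        /* Binary search: first position whose value is greater than v. */
        static int insert_pos(int v)
        {
            int lo = 0, hi = set_count;

            while (lo < hi)
            {
                int mid = (lo + hi) / 2;

                if (set[mid] <= v)
                    lo = mid + 1;
                else
                    hi = mid;
            }
            return lo;
        }

        static void offer(int v)
        {
            if (set_count == KEEP)
            {
                if (v >= set[KEEP - 1])
                    return;             /* case 3: ignore the tuple */
                set_count--;            /* case 2: drop the X+Y'th tuple */
            }
            /* case 1: ordered insert, preserving the sort order */
            int pos = insert_pos(v);

            memmove(&set[pos + 1], &set[pos],
                    (set_count - pos) * sizeof(int));
            set[pos] = v;
            set_count++;
        }

        int main(void)
        {
            int data[] = {42, 7, 19, 3, 88, 1, 56, 23, 11, 64};

            for (int i = 0; i < 10; i++)
                offer(data[i]);

            /* return rows Y+1 through Y+X, i.e. set[Y] .. set[Y+X-1] */
            for (int i = Y; i < set_count; i++)
                printf("%d ", set[i]);
            printf("\n");
            return 0;
        }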

    Cheers,
    mark

    --
    mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
    . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
    \/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
    \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
    One ring to rule them all, one ring to find them, one ring to bring them all
    and in the darkness bind them...

    http://mark.mielke.cc/
  • Gregory Stark at Sep 15, 2006 at 9:06 pm

    mark@mark.mielke.cc writes:

    On Fri, Sep 15, 2006 at 08:22:50PM +0100, Gregory Stark wrote:

    I'm curious, as I may be such an offender. What alternatives exist?

    I think you mean the concept of showing a page of information that
    you can navigate forwards and backwards a page, or select a range.

    What alternatives to limit/offset exist? If there are thousands or
    more results, I have trouble with the idea that the entire result set
    should be queried and cached, displaying only a fraction.

    I think most or all of the times I do this, an index is available,
    so perhaps that is why I don't find this issue problematic?
    If you have a unique index, and instead of using OFFSET you pass along the
    last key of the previous page, then you can use a WHERE clause on the indexed
    column to go straight to the correct page rather than using OFFSET.

    So for example, if you're displaying bank transactions sorted by
    transaction_id, you have the "next page" button pass along
    "start_transaction_id=nnn", where nnn is the last transaction_id of the
    previous page. Then on the next page you do a query with
    "WHERE transaction_id > ?" and pass that value. You still use
    ORDER BY transaction_id and LIMIT.
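
    In libpq terms, the page query might look something like this (table and
    column names are invented for the example, and error handling is omitted):

        #include <libpq-fe.h>

        /* Fetch the next page of 20 transactions after the given key. */
        static PGresult *
        fetch_next_page(PGconn *conn, const char *last_transaction_id)
        {
            const char *params[1] = { last_transaction_id };

            return PQexecParams(conn,
                                "SELECT transaction_id, amount"
                                "  FROM transactions"
                                " WHERE transaction_id > $1"
                                " ORDER BY transaction_id"
                                " LIMIT 20",
                                1,          /* one parameter */
                                NULL,       /* server deduces its type */
                                params,
                                NULL, NULL, /* text-format parameter */
                                0);         /* text-format result */
        }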

    This has the upside that your query always takes the same amount of time.

    Using OFFSET means later pages take longer, possibly much longer, than earlier
    pages. Possibly so much longer that it causes enough i/o to bring down your
    web server etc.

    It does have lots of downsides as well. You cannot provide direct links to
    later pages, aside from the next page, and it's difficult to provide a proper
    "previous page" button, etc. Early in the web's history these were reasonable
    trade-offs, but nowadays it's hard to convince people that their $bazillion
    machine can't handle sorting a couple thousand records for each page view and
    that they should sacrifice the user experience for that. So I've pretty much
    given up on that battle.
    For implementation, I think something simple is best:
    [...]

    You just described an insertion sort. Even if I went with insertion sort
    instead of heap sort, I think it makes sense to do it inside tuplesort.
    If X+Y tuples would take up too much memory, this plan should be
    skipped, and the general routines used instead.
    And that's a big part of why...


    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
  • Mark at Sep 16, 2006 at 12:31 am

    On Fri, Sep 15, 2006 at 10:06:16PM +0100, Gregory Stark wrote:
    I'm curious, as I may be such an offender. What alternatives exist?
    ...
    What alternatives to limit/offset exist? If there are thousands or
    more results, I have trouble with the idea that the entire result set
    should be queried and cached, displaying only a fraction.
    If you have a unique index, and instead of using OFFSET you pass
    along the last key of the previous page, then you can use a WHERE
    clause on the indexed column to go straight to the correct page
    rather than using OFFSET. So for example, if you're displaying bank
    transactions sorted by transaction_id, you have the "next page"
    button pass along "start_transaction_id=nnn", where nnn is the
    last transaction_id of the previous page. Then on the next page you
    do a query with "WHERE transaction_id > ?" and pass that value. You
    still use ORDER BY transaction_id and LIMIT.
    I found benefits to doing things this way that were not related to
    performance. If the number of items leading up to your page changes,
    remembering the offset can result in listing a completely different
    page than you intended when paging forward or backward. On my pages,
    I prefer to define one of the items as the item I am looking at, and
    page seeking is always +/- 1 page from that item. This means that I
    am pretty close to what you are suggesting, except that, because I do
    this for functional reasons and not for performance reasons, I am
    doing something worse.

    I use COUNT(*) and WHERE as you describe above to map this identifier
    to an offset, and then a second SELECT with LIMIT/OFFSET to fetch the
    object and those that follow on the page.

    According to your suggestion, I think this means I should track the
    identifier with the last offset, displaying the offset to the user for
    information purposes only, not using it for any queries, and then use
    WHERE and LIMIT?

    I tried this out. EXPLAIN ANALYZE tells me that for a random
    offset=200, limit=20 case I tried, the simple change takes it from
    index scanning 207 rows to find 7 rows, to index scanning 7 rows to
    find 7 rows. Sweet. Unfortunately, the time to complete is unchanged,
    around 1.3+/-0.2 milliseconds. Looks like my system has bigger
    bottlenecks. :-)

    Thanks,
    mark

    --
    mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
    . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
    \/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
    \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
    One ring to rule them all, one ring to find them, one ring to bring them all
    and in the darkness bind them...

    http://mark.mielke.cc/
  • Martijn van Oosterhout at Sep 15, 2006 at 7:22 pm

    On Fri, Sep 15, 2006 at 05:30:27PM +0100, Gregory Stark wrote:
    Also, because heap sort is slower than qsort (on average anyways) it makes
    sense to not bother with the heap until the number of tuples grows well beyond
    the limit or until it would otherwise spill to disk.
    The thought that comes to mind is to leave the sorting as is, but
    change the code that writes to the tapes to stop writing once it hits
    the limit. So each tape will never have more than N tuples, where N is
    the limit. This would be fairly unobtrusive because none of the other
    code actually needs to care.
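
    In toy form, something like this (the names are made up, and the real
    tape code is of course more involved):

        #include <stdio.h>

        #define LIMIT 10

        /*
         * When dumping a sorted in-memory run to a tape, stop after LIMIT
         * tuples: within one sorted run, tuples past the limit can never
         * be part of the final top-LIMIT result.
         */
        static void
        dump_run_to_tape(const int *run, int run_len, FILE *tape)
        {
            int n = run_len < LIMIT ? run_len : LIMIT;

            for (int i = 0; i < n; i++)
                fprintf(tape, "%d\n", run[i]);
        }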
    Alternatively we could have Limit(Sort()), Unique(Sort()), and
    Limit(Unique(Sort())) be handled by new types of Sort nodes entirely and not
    introduce the Limit and Unique nodes at all.
    I don't think it's easy to merge Unique and Sort, mainly because the
    fields you're doing the Unique on are probably not the fields you're
    sorting on, so you're probably not saving much.

    However, merging Unique/Distinct/GroupBy is another avenue that has
    been considered.

    In general LIMIT is not handled badly, because we stop executing once we
    have the required number of tuples. Only nodes that materialize are a
    problem, with Sort being the common one.
    Or am I overthinking this and having some state nodes peek inside other state
    nodes is normal?
    I don't think so. In general the parser and planner poke around quite a
    bit, but once the optimizer has been over it, the plan has to be
    static, for rescans, backward scans, executing stored plans, etc.

    Have a nice day,
    --
    Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
    From each according to his ability. To each according to his ability to litigate.
  • Gregory Stark at Sep 15, 2006 at 7:35 pm

    Martijn van Oosterhout writes:
    On Fri, Sep 15, 2006 at 05:30:27PM +0100, Gregory Stark wrote:
    Also, because heap sort is slower than qsort (on average anyways) it makes
    sense to not bother with the heap until the number of tuples grows well beyond
    the limit or until it would otherwise spill to disk.
    The thought that comes to mind is to leave the sorting as is, but
    change the code that writes to the tapes to stop writing once it hits
    the limit. So each tape will never have more than N tuples, where N is
    the limit. This would be fairly unobtrusive because none of the other
    code actually needs to care.
    I'm sorry, I forgot to mention that part of my analysis. Once you reach disk,
    any chance of optimising the limit case is pretty much out the window. You
    could trim some tuples from each tape, but unless you're sorting truly
    stupendous amounts of data I doubt it would really help much.

    I think it only makes sense to look at the in-memory case. Instead of
    qsorting thousands of records or, worse, spilling millions of records to disk
    and doing an external sort only to use the top 10 and throw the rest away, we
    throw tuples away before they accumulate in memory in the first place.
    Alternatively we could have Limit(Sort()), Unique(Sort()), and
    Limit(Unique(Sort())) be handled by new types of Sort nodes entirely and not
    introduce the Limit and Unique nodes at all.
    I don't think it's easy to merge Unique and Sort, mainly because the
    fields you're doing the Unique on are probably not the fields you're
    sorting on, so you're probably not saving much.
    On the contrary, I think the vast majority of the time you have a Unique(Sort)
    it will be on the same key, because it will be caused by a SELECT DISTINCT.
    Now that I actually test it, I see there are more nodes that could implement
    this (namely GroupAgg), so I'm thinking more and more about having an abstract
    way to pass information down through the nodes rather than handling just
    Limit/Sort specifically.

    --
    Gregory Stark
    EnterpriseDB http://www.enterprisedb.com
