From: Ron Peacetree <rjpeace@earthlink.net>
Sent: Sep 24, 2005 6:30 AM
Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?

... the amount of IO done is the most
important of the things that you should be optimizing for in
choosing an external sorting algorithm.

<snip>

Since sorting is a fundamental operation in many parts of a DBMS,
this is a Big Deal.

This discussion has gotten my creative juices flowing. I'll post
some Straw Man algorithm sketches after I've done some more
thought.

As a thought experiment, I've been considering the best way to sort 1TB
(2^40B) of 2-4KB (2^11-2^12B) records. That's 2^28-2^29 records.

Part I: A Model of the System
The performance of such external sorts is limited by HD IO, then
memory IO, and finally CPU throughput.

On commodity HW, single HD IO is ~1/2048 (realistic worst case) to
~1/128 (best case: no more than one seek every ~14.7ms for a ~50MB/s
7200rpm SATA II HD) of the throughput of RAM.

RAID HD IO will be in the range from as low as a single HD (RAID 1) to
~1/8 (a RAID system saturating the external IO bus) the throughput of
RAM.

RAM is ~1/8-1/16 the throughput and ~128x the latency of the data
pathways internal to the CPU.

This model suggests that HD IO will greatly dominate every other
factor, particularly if we are talking about a single HD rather than a
peripheral bus saturating RAID subsystem. If at all possible, we want
to access the HD subsystem only once for each data item, and we want
to avoid seeking more than the critical number of seeks implied above
when doing it. It also suggests that at a minimum, it's worth it to
spend ~8 memory operations or ~64 CPU operations to avoid a HD access.
Far more than that if we are talking about a single random access.

It's worth spending ~128 CPU operations to avoid a single random RAM
access, and literally 10's or even 100's of thousands of CPU operations to
avoid a random HD access. In addition, there are many indications in
current ECE and IT literature that the performance gaps between these
pieces of computer systems are increasing and expected to continue to do
so for the foreseeable future. In short, _internal_ sorts have some, and are
going to increasingly have more, of the same IO problems usually
associated with external sorts.


Part II: a Suggested Algorithm
The simplest case is one where we have to order the data using a key that
only has two values.

Given 2^40B of data using 2KB or 4KB per record, the most compact
representation we can make of such a data set is to assign a 32b (4B) RID
or Rptr for location + a 1b key for each record. Just the RIDs would take up
1.25GB (250M records) or 2.5GB (500M records) -- enough space that even
an implied ordering of records may not fit into RAM.

Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive in terms
of IO operations than sorting the actual 1TB of data.

That IO cost can be lowered even further if, instead of actually physically
sorting the RIDs, we assign each RID to the appropriate category inside the CPU
as we scan the data set and append the entries in a category from CPU cache
to a RAM file in one IO burst whenever said category gets full inside the CPU.
We can do the same with either RAM file to HD whenever they get full. The
sorted order of the data is found by concatenating the appropriate files at the
end of the process.
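
For concreteness, a rough Python sketch of this two-category pass (the
names and buffer sizes are purely illustrative; real code would size the
buffer to the CPU cache and the category files to RAM or HD):

    BUF = 8192                        # stand-in for a CPU-cache-sized batch

    def two_value_sort(records, key_of):
        files = ([], [])              # the two category files (RAM or HD)
        bufs = ([], [])               # the two in-CPU buffers
        for rid, rec in enumerate(records):   # one sequential scan of the data
            k = key_of(rec)                   # k is 0 or 1
            bufs[k].append(rid)
            if len(bufs[k]) == BUF:           # flush a full buffer in one burst
                files[k].extend(bufs[k])
                bufs[k].clear()
        for k in (0, 1):                      # final partial flushes
            files[k].extend(bufs[k])
        return files[0] + files[1]    # concatenation gives the sorted order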

As simple as this example is, it has many of the characteristics we are looking for:
A= We access each piece of data once on HD and in RAM.
B= We do the minimum amount of RAM and HD IO, and almost no random IO in
either case.
C= We do as much work as possible within the CPU.
D= This process is stable. Equal keys stay in the original order they are encountered.

To generalize this method, we first need our 1b Key to become a sufficiently
large Key or KeyPrefix to be useful, yet not so big as to be CPU cache unfriendly.

Cache lines (also sometimes called "blocks") are usually 64B= 512b in size.
Therefore our RID+Key or KeyPrefix should never be larger than this. For a 2^40B
data set, a 5B RID leaves us with potentially as much as 59B of Key or KeyPrefix.
Since the data can't take on more than 40b worth of different values (actually 500M= 29b
for our example), we have more than adequate space for Key or KeyPrefix. We just
have to figure out how to use it effectively.
A typical CPU L2 cache can hold 10's or 100's of thousands of such cache lines.
That's enough that we should be able to do a significant amount of useful work within
the CPU w/o having to go off-die.
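
The space arithmetic, spelled out (a 1MB L2 is used purely as an assumed
example):

    CACHE_LINE = 64             # bytes per cache line
    RID = 5                     # bytes needed to address a 2^40B data set
    print(CACHE_LINE - RID)     # 59 bytes left over for a Key or KeyPrefix
    L2 = 1 << 20                # an example 1MB L2 cache
    print(L2 // CACHE_LINE)     # 16384 cache lines resident at once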

The data structure we are using to represent the sorted data also needs to be
generalized. We want a space efficient DS that allows us to find any given element in
as few accesses as possible and that allows us to insert new elements or reorganize
the DS as efficiently as possible. This being a DB discussion list, a B+ tree seems like
a fairly obvious suggestion ;-)

A B+ tree where each element is no larger than a cache line and no node is larger than
what fits into L2 cache can be created dynamically as we scan the data set via any of
the fast, low IO methods well known for doing so. Since the L2 cache can hold 10's of
thousands of cache lines, it should be easy to make sure that the B+ tree has something
like 1000 elements per node (making the base of the logarithm for access being at least
1000). The log base 1000 of 500M is ~2.9, so that means that even in the absolute
worst case where every one of the 500M records is unique we can find any given
element in less than 3 accesses of the B+ tree. Increasing the order of the B+ tree is
an option to reduce average accesses even further.
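
A quick check of that arithmetic (assuming a fan-out of 1000):

    import math
    records = 500_000_000      # worst case: every one of the 500M Keys distinct
    fanout = 1000              # elements per cache-resident node
    print(math.log(records, fanout))             # ~2.90
    print(math.ceil(math.log(records, fanout)))  # at most 3 node accesses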

Since the DS representing the sorted order of the data is a B+ tree, it's very "IO friendly"
if we need to store part or all of it on HD.

In a multiprocessor environment, we can assign chunks of the data set to different
CPUs, let them build their independent B+ trees to represent the data in sorted order from
their POV, and then merge the B+ trees very efficiently into one overall DS to represent
the sorted order of the entire data set.

Finally, since these are B+ trees, we can keep them around and easily update them at will
for frequently used sorting conditions.

What do people think?

Ron


  • Dann Corbit at Sep 26, 2005 at 9:13 pm

    -----Original Message-----
    From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
    owner@postgresql.org] On Behalf Of Ron Peacetree
    Sent: Monday, September 26, 2005 10:47 AM
    To: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
    Subject: [HACKERS] [PERFORM] A Better External Sort?
    <snip>
    If at all possible, we want
    to access the HD subsystem only once for each data item,
    If you can achieve that, I think you should be given a Nobel Prize, and
    I mean that sincerely. I also think that your analysis is interesting.
    <snip>
    In short, _internal_ sorts have some, and are going to increasingly
    have more, of the same IO problems usually associated with external
    sorts.
    Knuth has made the observation (confirmed by others) that 40% of
    mainframe CPU cycles are spent on sorting. Hence, any sort of
    optimization in this area is a potential for enormous savings.
    Part II: a Suggested Algorithm
    The simplest case is one where we have to order the data using a key that
    only has two values.
    I suggest testing against a very large class of distributions. All of
    the common statistical models are a start (Gaussian, Poisson, etc.) and
    also single value, two distinct values, to some limit.
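
    For instance, a toy key generator along these lines (Python; the names
    and distributions are just placeholders) could drive such tests:

        import random

        def make_keys(n, dist):
            # toy key generators for exercising a sort implementation
            if dist == "single":
                return [42] * n
            if dist == "two-valued":
                return [random.randrange(2) for _ in range(n)]
            if dist == "uniform":
                return [random.randrange(1 << 32) for _ in range(n)]
            if dist == "gaussian":
                return [int(random.gauss(0, 1_000_000)) for _ in range(n)]
            raise ValueError(dist)
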
    <snip>
    What do people think?
    I think that your analysis is very interesting. I would like to see the
    result of the experiment.

    I think that the btrees are going to be O(n*log(n)) in construction of
    the indexes in disk access unless you memory map them [which means you
    would need stupendous memory volume] and so I cannot say that I really
    understand your idea yet. Can you draw a picture of it for me? (I am
    dyslexic and understand things far better when I can visualize it).
  • Jonah H. Harris at Sep 26, 2005 at 9:27 pm
    Ron,

    Having rested my brain for the last few days, your theory made for
    interesting reading... Rather than argue the technical specs, I'd love to
    see an implementation :)

    -Jonah
    On 9/26/05, Dann Corbit wrote:


    <snip>


    --
    Respectfully,

    Jonah H. Harris, Database Internals Architect
    EnterpriseDB Corporation
    http://www.enterprisedb.com/
  • Ron Peacetree at Sep 27, 2005 at 1:10 am

    From: Dann Corbit <DCorbit@connx.com>
    Sent: Sep 26, 2005 5:13 PM
    To: Ron Peacetree <rjpeace@earthlink.net>, pgsql-hackers@postgresql.org,
    pgsql-performance@postgresql.org
    Subject: RE: [HACKERS] [PERFORM] A Better External Sort?

    I think that the btrees are going to be O(n*log(n)) in construction of
    the indexes in disk access unless you memory map them [which means you
    would need stupendous memory volume] and so I cannot say that I really
    understand your idea yet.
    Traditional algorithms for the construction of Btree variants (B, B+, B*, ...)
    don't require O(nlgn) HD accesses. These shouldn't either.

    Let's start by assuming that an element is <= in size to a cache line and a
    node fits into L1 DCache. To make the discussion more concrete, I'll use a
    64KB L1 cache + a 1MB L2 cache only as an example.

    Simplest case: the Key has few enough distinct values that all Keys or
    KeyPrefixes fit into L1 DCache (for a 64KB cache with 64B lines, that's
    <= 1000 different values. More if we can fit more than 1 element into
    each cache line.).

    As we scan the data set coming in from HD, we compare the Key or KeyPrefix
    to the sorted list of Key values in the node. This can be done in O(lgn) using
    Binary Search or O(lglgn) using a variation of Interpolation Search.
    If the Key value exists, we append this RID to the list of RIDs having the
    same Key:
    If the RAM buffer of this list of RIDs is full we append it and the current
    RID to the HD list of these RIDs.
    Else we insert this new Key value into its proper place in the sorted list of Key
    values in the node and start a new RID list for that Key value.

    We allocate room for a CPU write buffer so we can schedule RAM writes to
    the RAM lists of RIDs so as to minimize the randomness of them.

    When we are finished scanning the data set from HD, the sorted node with
    RID lists for each Key value contains the sort order for the whole data set.
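
    A rough sketch of that inner loop (Python standing in for what would
    really be cache-conscious C; names and the flush threshold are
    illustrative):

        import bisect

        NODE_KEYS = []        # sorted Key/KeyPrefix values in the node
        RID_LISTS = {}        # Key -> in-RAM buffer of RIDs for that Key
        BUF = 1024            # stand-in for the CPU write buffer size

        def add(rid, key, spill_to_hd):
            i = bisect.bisect_left(NODE_KEYS, key)   # O(lgn) search in the node
            if i == len(NODE_KEYS) or NODE_KEYS[i] != key:
                NODE_KEYS.insert(i, key)             # new Key: start a new RID list
                RID_LISTS[key] = []
            RID_LISTS[key].append(rid)
            if len(RID_LISTS[key]) >= BUF:           # buffer full: one burst out
                spill_to_hd(key, RID_LISTS[key])
                RID_LISTS[key].clear()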

    Notice that almost all of the random data access is occurring within the CPU
    rather than in RAM or HD, and that we are accessing RAM or HD only when
    absolutely needed.

    Next simplest case: Multiple nodes, but they all fit in the CPU cache(s).
    In the given example CPU, we will be able to fit at least 1000 elements per
    node and 2^20/2^16= up to 16 such nodes in this CPU. We use a node's
    worth of space as a RAM write buffer, so we end up with room for 15 such
    nodes in this CPU. This is enough for a 2 level index to at least 15,000
    distinct Key value lists.

    All of the traditional tricks for splitting a Btree node and redistributing
    elements within them during insertion or splitting for maximum node
    utilization can be used here.

    The most general case: There are too many nodes to fit within the CPU
    cache(s). The root node now points to at least 1000 nodes
    since each element in the root node points to another node. A full 2 level
    index is now enough to point to at least 10^6 distinct Key value lists, and
    3 levels will index more distinct Key values than is possible in our 1TB,
    500M record example.

    We can use some sort of node use prediction algorithm like LFU to decide
    which node should be moved out of CPU when we have to replace one of
    the nodes in the CPU. The nodes in RAM or on HD can be arranged to
    maximize streaming IO behavior and minimize random access IO
    behavior.

    As you can see, both the RAM and HD IO are as minimized as possible,
    and what such IO there is has been optimized for streaming behavior.

    Can you draw a picture of it for me? (I am dyslexic and understand things
    far better when I can visualize it).
    Not much for pictures. Hopefully the explanation helps?

    Ron
  • Tom Lane at Sep 27, 2005 at 1:42 am

    Ron Peacetree writes:
    Let's start by assuming that an element is <= in size to a cache line and a
    node fits into L1 DCache. [ much else snipped ]
    So far, you've blithely assumed that you know the size of a cache line,
    the sizes of L1 and L2 cache, and that you are working with sort keys
    that you can efficiently pack into cache lines. And that you know the
    relative access speeds of the caches and memory so that you can schedule
    transfers, and that the hardware lets you get at that transfer timing.
    And that the number of distinct key values isn't very large.

    I don't see much prospect that anything we can actually use in a
    portable fashion is going to emerge from this line of thought.

    regards, tom lane
  • Josh Berkus at Sep 27, 2005 at 4:15 pm
    Ron,

    I've somehow missed part of this thread, which is a shame since this is
    an area of primary concern for me.

    Your suggested algorithm seems to be designed to relieve I/O load by
    making more use of the CPU. (if I followed it correctly). However,
    that's not PostgreSQL's problem; currently for us external sort is a
    *CPU-bound* operation, half of which is value comparisons. (oprofiles
    available if anyone cares)

    So we need to look, instead, at algorithms which make better use of
    work_mem to lower CPU activity, possibly even at the expense of I/O.

    --Josh Berkus
  • Ron Peacetree at Sep 27, 2005 at 5:09 am
    SECOND ATTEMPT AT POST. Web mailer appears to have
    eaten first one. I apologize in advance if anyone gets two
    versions of this post.
    =r
    From: Tom Lane <tgl@sss.pgh.pa.us>
    Sent: Sep 26, 2005 9:42 PM
    Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

    So far, you've blithely assumed that you know the size of a cache line,
    the sizes of L1 and L2 cache,
    NO. I used exact values only as examples. Realistic examples drawn
    from an extensive survey of past, present, and what I could find out
    about future systems; but only examples nonetheless. For instance,
    Hennessy and Patterson 3ed points out that 64B cache lines perform
    optimally for caches between 16KB and 256KB. The same
    source as well as sources specifically on CPU memory hierarchy
    design points out that we are not likely to see L1 caches larger than
    256KB in the foreseeable future.

    The important point was the idea of an efficient Key, rather than
    Record, sort using a CPU cache friendly data structure with provably
    good space and IO characteristics based on a reasonable model of
    current and likely future single box computer architecture (although
    it would be fairly easy to extend it to include the effects of
    networking.)

    No a priori exact or known values are required for the method to work.

    and that you are working with sort keys that you can efficiently pack
    into cache lines.
    Not "pack". "map". n items cannot take on more than n values. n
    values can be represented in lgn bits. Less efficient mappings can
    also work. Either way I demonstrated that we have plenty of space in
    a likely and common cache line size. Creating a mapping function
    to represent m values in lgm bits is a well known hack, and if we keep
    track of minimum and maximum values for fields during insert and
    delete operations, we can even create mapping functions fairly easily.
    (IIRC, Oracle does keep track of minimum and maximum field
    values.)
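
    A toy illustration of such a mapping, assuming integer fields with
    tracked minimum and maximum values (the name and details are mine, not
    anything Oracle or PostgreSQL actually does):

        def key_prefix(value, lo, hi, bits=32):
            # order-preserving map of a field value in [lo, hi] onto a
            # fixed-width integer prefix; prefixes compare the same way
            # the original values do
            if hi == lo:
                return 0
            return (value - lo) * ((1 << bits) - 1) // (hi - lo)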

    And that you know the relative access speeds of the caches and
    memory so that you can schedule transfers,
    Again, no. I created a reasonable model of a computer system that
    holds remarkably well over a _very_ wide range of examples. I
    don't need the numbers to be exactly right to justify my approach
    to this problem or understand why other approaches may have
    downsides. I just have to get the relative performance of the
    system components and the relative performance gap between them
    reasonably correct. The stated model does that very well.

    Please don't take my word for it. Go grab some random box:
    laptop, desktop, unix server, etc and try it for yourself. Part of the
    reason I published the model was so that others could examine it.

    and that the hardware lets you get at that transfer timing.
    Never said anything about this, and in fact I do not need any such.

    And that the number of distinct key values isn't very large.
    Quite the opposite in fact. I went out of my way to show that the
    method still works well even if every Key is distinct. It is _more
    efficient_ when the number of distinct keys is small compared to
    the number of data items, but it works as well as any other Btree
    would when all n of the Keys are distinct. This is just a CPU cache
    and more IO friendly Btree, not some magical and unheard of
    technique. It's just as general purpose as Btrees usually are.

    I'm simply looking at the current and likely future state of computer
    systems architecture and coming up with a slight twist on how to use
    already well known and characterized techniques, not trying to start
    a revolution.


    I'm trying very hard NOT to waste anyone's time around here.
    Including my own.
    Ron
  • Jonah H. Harris at Sep 27, 2005 at 12:49 pm
    Ron,

    Again, if you feel strongly enough about the theory to argue it, I recommend
    that you spend your time constructively; create an implementation of it.
    Citing academics is cool and all, but code speaks louder than theory in this
    case. As Tom mentioned, this has to be portable. Making assumptions about
    computing architectures (especially those in the future), is fine for
    theory, but not practical for something that needs to be maintained in the
    real-world. Go forth and write thy code.

    -Jonah
    On 9/27/05, Ron Peacetree wrote:

    <snip>


    --
    Respectfully,

    Jonah H. Harris, Database Internals Architect
    EnterpriseDB Corporation
    http://www.enterprisedb.com/
  • Kevin Grittner at Sep 27, 2005 at 3:22 pm
    I can't help wondering how a couple thousand context switches per
    second would affect the attempt to load disk info into the L1 and
    L2 caches. That's pretty much the low end of what I see when the
    server is under any significant load.
  • Ron Peacetree at Sep 27, 2005 at 5:15 pm

    From: Josh Berkus <josh@agliodbs.com>
    Sent: Sep 27, 2005 12:15 PM
    To: Ron Peacetree <rjpeace@earthlink.net>
    Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

    I've somehow missed part of this thread, which is a shame since this is
    an area of primary concern for me.

    Your suggested algorithm seems to be designed to relieve I/O load by
    making more use of the CPU. (if I followed it correctly).
    The goal is to minimize all IO load. Not just HD IO load, but also RAM
    IO load. Particularly random access IO load of any type (for instance:
    "the pointer chasing problem").

    In addition, the design replaces explicit data or explicit key manipulation
    with the creation of a smaller, far more CPU and IO efficient data
    structure (essentially a CPU cache friendly Btree index) of the sorted
    order of the data.

    That Btree can be used to generate a physical reordering of the data
    in one pass, but that's the weakest use for it. The more powerful
    uses involve allowing the Btree to persist and using it for more
    efficient re-searches or combining it with other such Btrees (either as
    a step in task distribution across multiple CPUs or as a more efficient
    way to do things like joins by manipulating these Btrees rather than
    the actual records.)

    However, that's not PostgreSQL's problem; currently for us external
    sort is a *CPU-bound* operation, half of which is value comparisons.
    (oprofiles available if anyone cares)

    So we need to look, instead, at algorithms which make better use of
    work_mem to lower CPU activity, possibly even at the expense of I/O.
    I suspect that even the highly efficient sorting code we have is
    suffering more pessimal CPU IO behavior than what I'm presenting.
    Jim Gray's external sorting contest web site points out that memory IO
    has become a serious problem for most of the contest entries.

    Also, I'll bet the current code manipulates more data.

    Finally, there's the possibility of reusing the product of this work to a
    degree and in ways that we can't with our current sorting code.


    Now all we need is resources and time to create a prototype.
    Since I'm not likely to have either any time soon, I'm hoping that
    I'll be able to explain this well enough that others can test it.

    *sigh* I _never_ have enough time or resources any more...
    Ron
  • Jeffrey W. Baker at Sep 27, 2005 at 5:26 pm

    On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:

    That Btree can be used to generate a physical reordering of the data
    in one pass, but that's the weakest use for it. The more powerful
    uses involve allowing the Btree to persist and using it for more
    efficient re-searches or combining it with other such Btrees (either as
    a step in task distribution across multiple CPUs or as a more efficient
    way to do things like joins by manipulating these Btrees rather than
    the actual records.)
    Maybe you could describe some concrete use cases. I can see what you
    are getting at, and I can imagine some advantageous uses, but I'd like
    to know what you are thinking.

    Specifically I'd like to see some cases where this would beat sequential
    scan. I'm thinking that in your example of a terabyte table with a
    column having only two values, all the queries I can think of would be
    better served with a sequential scan.

    Perhaps I believe this because you can now buy as much sequential I/O as
    you want. Random I/O is the only real savings.

    -jwb
  • Ron Peacetree at Sep 28, 2005 at 4:03 pm

    From: "Jeffrey W. Baker" <jwbaker@acm.org>
    Sent: Sep 27, 2005 1:26 PM
    To: Ron Peacetree <rjpeace@earthlink.net>
    Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
    On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:

    <snip>
    Maybe you could describe some concrete use cases. I can see what
    you are getting at, and I can imagine some advantageous uses, but
    I'd like to know what you are thinking.
    1= In a 4P box, we split the data in RAM into 4 regions and create
    a CPU cache friendly Btree using the method I described for each CPU.
    The 4 Btrees can be merged in a more time and space efficient manner
    than the original records to form a Btree that represents the sorted order
    of the entire data set. Any of these Btrees can be allowed to persist to
    lower the cost of doing similar operations in the future (Updating the
    Btrees during inserts and deletes is cheaper than updating the original
    data files and then redoing the same sort from scratch in the future.)
    Both the original sort and future such sorts are made more efficient
    than current methods.

    2= We use my method to sort two different tables. We now have these
    very efficient representations of a specific ordering on these tables. A
    join operation can now be done using these Btrees rather than the
    original data tables that involves less overhead than many current
    methods.

    3= We have multiple such Btrees for the same data set representing
    sorts done using different fields (and therefore different Keys).
    Calculating a sorted order for the data based on a composition of
    those Keys is now cheaper than doing the sort based on the composite
    Key from scratch. When some of the Btrees exist and some of them
    do not, there is a tradeoff calculation to be made. Sometimes it will be
    cheaper to do the sort from scratch using the composite Key.
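
    Returning to 1= above, a crude single-box sketch (plain sorted runs
    stand in for the per-CPU cache-friendly Btrees; worker and data sizes
    are arbitrary):

        import heapq
        from multiprocessing import Pool

        def build_run(chunk):
            # stand-in for building one CPU-cache-friendly Btree over a chunk
            return sorted(chunk)

        def parallel_sort(data, nworkers=4):
            step = (len(data) + nworkers - 1) // nworkers
            chunks = [data[i:i + step] for i in range(0, len(data), step)]
            with Pool(nworkers) as pool:
                runs = pool.map(build_run, chunks)   # one run per CPU
            return list(heapq.merge(*runs))          # cheap final merge

        if __name__ == "__main__":
            import random
            data = [random.randrange(1000) for _ in range(100_000)]
            assert parallel_sort(data) == sorted(data)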

    Specifically I'd like to see some cases where this would beat sequential
    scan. I'm thinking that in your example of a terabyte table with a
    column having only two values, all the queries I can think of would be
    better served with a sequential scan.
    In my original example, a sequential scan of the 1TB of 2KB or 4KB
    records, => 250M or 500M records of data, being sorted on a binary
    value key will take ~1000x more time than reading in the ~1GB Btree
    I described that used a Key+RID (plus node pointers) representation
    of the data.

    Just to clarify the point further,
    1TB of 1B records => 2^40 records of at most 256 distinct values.
    1TB of 2B records => 2^39 records of at most 2^16 distinct values.
    1TB of 4B records => 2^38 records of at most 2^32 distinct values.
    1TB of 5B records => 200B records of at most 200B distinct values.
    From here on, the number of possible distinct values is limited by the
    number of records.
    100B records are used in the "Indy" version of Jim Gray's sorting
    contests, so 1TB => 10B records.
    2KB-4KB is the most common record size I've seen in enterprise
    class DBMS (so I used this value to make my initial example more
    realistic).

    Therefore the vast majority of the time representing a data set by Key
    will use less space than the original record. Less space used means
    less IO to scan the data set, which means faster scan times.

    This is why index files work in the first place, right?

    Perhaps I believe this because you can now buy as much sequential I/O
    as you want. Random I/O is the only real savings.
    1= No, you cannot "buy as much sequential IO as you want". Even
    with an infinite budget, there are physical and engineering limits. Long
    before you reach those limits, you will pay exponentially increasing costs
    for linearly increasing performance gains. So even if you _can_ buy a
    certain level of sequential IO, it may not be the most efficient way to
    spend money.

    2= Most RW IT professionals have far from an infinite budget. Just traffic
    on these lists shows how severe the typical cost constraints usually are.
    OTOH, if you have an infinite IT budget, care to help a few less fortunate
    than yourself? After all, even a large constant subtracted from infinity
    is still infinity... ;-)

    3= No matter how fast you can do IO, IO remains the most expensive
    part of the performance equation. The fastest and cheapest IO you can
    do is _no_ IO. As long as we trade cheaper RAM and even cheaper CPU
    operations for IO correctly, more space efficient data representations will
    always be a Win because of this.
  • Jeffrey W. Baker at Sep 29, 2005 at 4:27 am

    On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:
    <snip>
    In my original example, a sequential scan of the 1TB of 2KB or 4KB
    records, => 250M or 500M records of data, being sorted on a binary
    value key will take ~1000x more time than reading in the ~1GB Btree
    I described that used a Key+RID (plus node pointers) representation
    of the data.
    You are engaging in a lengthy and verbose exercise in mental
    masturbation, because you have not yet given a concrete example of a
    query where this stuff would come in handy. A common, general-purpose
    case would be the best.

    We can all see that the method you describe might be a good way to sort
    a very large dataset with some known properties, which would be fine if
    you are trying to break the terasort benchmark. But that's not what
    we're doing here. We are designing and operating relational databases.
    So please explain the application.

    Your main example seems to focus on a large table where a key column has
    constrained values. This case is interesting in proportion to the
    number of possible values. If I have billions of rows, each having one
    of only two values, I can think of a trivial and very fast method of
    returning the table "sorted" by that key: make two sequential passes,
    returning the first value on the first pass and the second value on the
    second pass. This will be faster than the method you propose.
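
    In pseudo-Python, something like this (scan, key, and the two values
    are placeholders):

        def two_pass_order(scan, key, first, second):
            # pass 1: every row holding the first key value, in heap order
            for row in scan():
                if row[key] == first:
                    yield row
            # pass 2: every row holding the second key value
            for row in scan():
                if row[key] == second:
                    yield row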

    I think an important aspect you have failed to address is how much of
    the heap you must visit after the sort is complete. If you are
    returning every tuple in the heap then the optimal plan will be very
    different from the case when you needn't.

    -jwb

    PS: Whatever mailer you use doesn't understand or respect threading nor
    attribution. Out of respect for the list's readers, please try a mailer
    that supports these 30-year-old fundamentals of electronic mail.
  • Josh Berkus at Sep 29, 2005 at 4:54 pm
    Jeff, Ron,

    First off, Jeff, please take it easy. We're discussing 8.2 features at
    this point and there's no reason to get stressed out at Ron. You can
    get plenty stressed out when 8.2 is near feature freeze. ;-)


    Regarding use cases for better sorts:

    The biggest single area where I see PostgreSQL external sort sucking is
    on index creation on large tables. For example, for free version of
    TPCH, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's
    hardware, but over 3 hours to create each index on that table. This
    means that over all our load into TPCH takes 4 times as long to create
    the indexes as it did to bulk load the data.

    Anyone restoring a large database from pg_dump is in the same situation.
    Even worse, if you have to create a new index on a large table on a
    production database in use, because the I/O from the index creation
    swamps everything.

    Following an index creation, we see that 95% of the time required is the
    external sort, which averages 2mb/s. This is with separate drives for
    the WAL, the pg_tmp, the table and the index. I've confirmed that
    increasing work_mem beyond a small minimum (around 128mb) had no benefit
    on the overall index creation speed.


    --Josh Berkus
  • Luke Lonergan at Sep 29, 2005 at 5:07 pm
    Josh,
    On 9/29/05 9:54 AM, "Josh Berkus" wrote:

    Following an index creation, we see that 95% of the time required is the
    external sort, which averages 2mb/s. This is with seperate drives for
    the WAL, the pg_tmp, the table and the index. I've confirmed that
    increasing work_mem beyond a small minimum (around 128mb) had no benefit
    on the overall index creation speed.
    Yuuuup! That about sums it up - regardless of taking 1 or 2 passes through
    the heap being sorted, 1.5 - 2 MB/s is the wrong number. This is not
    necessarily an algorithmic problem, but is an optimization problem with
    Postgres that must be fixed before it can be competitive.

    We read/write to/from disk at 240MB/s and so 2 passes would run at a net
    rate of 120MB/s through the sort set if it were that efficient.

    Anyone interested in tackling the real performance issue? (flame bait, but
    for a worthy cause :-)

    - Luke
  • David Fetter at Sep 29, 2005 at 5:18 pm

    On Thu, Sep 29, 2005 at 10:06:52AM -0700, Luke Lonergan wrote:
    Josh,
    On 9/29/05 9:54 AM, "Josh Berkus" wrote:

    Following an index creation, we see that 95% of the time required
    is the external sort, which averages 2mb/s. This is with seperate
    drives for the WAL, the pg_tmp, the table and the index. I've
    confirmed that increasing work_mem beyond a small minimum (around
    128mb) had no benefit on the overall index creation speed.
    Yuuuup! That about sums it up - regardless of taking 1 or 2 passes
    through the heap being sorted, 1.5 - 2 MB/s is the wrong number.
    This is not necessarily an algorithmic problem, but is a
    optimization problem with Postgres that must be fixed before it can
    be competitive.

    We read/write to/from disk at 240MB/s and so 2 passes would run at a
    net rate of 120MB/s through the sort set if it were that efficient.

    Anyone interested in tackling the real performance issue? (flame
    bait, but for a worthy cause :-)
    I'm not sure that it's flamebait, but what do I know? Apart from the
    nasty number (1.5-2 MB/s), what other observations do you have to
    hand? Any ideas about what things are not performing here? Parts of
    the code that could bear extra scrutiny? Ideas on how to fix same in
    a cross-platform way?

    Cheers,
    D
    --
    David Fetter david@fetter.org http://fetter.org/
    phone: +1 510 893 6100 mobile: +1 415 235 3778

    Remember to vote!
  • Jeffrey W. Baker at Sep 29, 2005 at 5:44 pm

    On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote:
    Josh,
    On 9/29/05 9:54 AM, "Josh Berkus" wrote:

    Following an index creation, we see that 95% of the time required is the
    external sort, which averages 2mb/s. This is with seperate drives for
    the WAL, the pg_tmp, the table and the index. I've confirmed that
    increasing work_mem beyond a small minimum (around 128mb) had no benefit
    on the overall index creation speed.
    Yuuuup! That about sums it up - regardless of taking 1 or 2 passes through
    the heap being sorted, 1.5 - 2 MB/s is the wrong number.
    Yeah this is really bad ... approximately the speed of GNU sort.

    Josh, do you happen to know how many passes are needed in the multiphase
    merge on your 60GB table?

    Looking through tuplesort.c, I have a couple of initial ideas. Are we
    allowed to fork here? That would open up the possibility of using the
    CPU and the I/O in parallel. I see that tuplesort.c also suffers from
    the kind of postgresql-wide disease of calling all the way up and down a
    big stack of software for each tuple individually. Perhaps it could be
    changed to work on vectors.

    I think the largest speedup will be to dump the multiphase merge and
    merge all tapes in one pass, no matter how large M. Currently M is
    capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
    the tape. It could be done in a single pass heap merge with N*log(M)
    comparisons, and, more importantly, far less input and output.
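
    Roughly this, with the tapes modeled as sorted iterators (a sketch of
    the idea, not of tuplesort.c):

        import heapq

        def one_pass_merge(tapes):
            # pop the smallest head among all M tapes: O(log M) comparisons
            # per tuple, and each tape is read exactly once
            heap = []
            for i, tape in enumerate(tapes):
                first = next(tape, None)
                if first is not None:
                    heap.append((first, i))
            heapq.heapify(heap)
            while heap:
                value, i = heapq.heappop(heap)
                yield value
                nxt = next(tapes[i], None)
                if nxt is not None:
                    heapq.heappush(heap, (nxt, i))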

    I would also recommend using an external process to asynchronously
    feed the tuples into the heap during the merge.

    What's the timeframe for 8.2?

    -jwb
  • Josh Berkus at Sep 29, 2005 at 5:59 pm
    Jeff,
    Josh, do you happen to know how many passes are needed in the multiphase
    merge on your 60GB table?
    No, any idea how to test that?
    I think the largest speedup will be to dump the multiphase merge and
    merge all tapes in one pass, no matter how large M. Currently M is
    capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
    the tape. It could be done in a single pass heap merge with N*log(M)
    comparisons, and, more importantly, far less input and output.
    Yes, but the evidence suggests that we're actually not using the whole 1GB
    of RAM ... maybe using only 32MB of it which would mean over 200 passes
    (I'm not sure of the exact math). Just fixing our algorithm so that it
    used all of the work_mem permitted might improve things tremendously.
    I would also recommend using an external processes to asynchronously
    feed the tuples into the heap during the merge.

    What's the timeframe for 8.2?
    Too far out to tell yet. Probably 9mo to 1 year, that's been our history.

    --
    --Josh

    Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Jeffrey W. Baker at Sep 29, 2005 at 6:16 pm

    On Thu, 2005-09-29 at 11:03 -0700, Josh Berkus wrote:
    Jeff,
    Josh, do you happen to know how many passes are needed in the multiphase
    merge on your 60GB table?
    No, any idea how to test that?
    I would just run it under the profiler and see how many times
    beginmerge() is called.

    -jwb
  • Josh Berkus at Sep 29, 2005 at 6:24 pm
    Jeff,
    I would just run it under the profiler and see how many times
    beginmerge() is called.
    Hmm, I'm not seeing it at all in the oprofile results on a 100million-row
    sort.

    --
    --Josh

    Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Luke Lonergan at Sep 30, 2005 at 4:22 am
    Jeff,
    On 9/29/05 10:44 AM, "Jeffrey W. Baker" wrote:

    On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote:
    Looking through tuplesort.c, I have a couple of initial ideas. Are we
    allowed to fork here? That would open up the possibility of using the
    CPU and the I/O in parallel. I see that tuplesort.c also suffers from
    the kind of postgresql-wide disease of calling all the way up and down a
    big stack of software for each tuple individually. Perhaps it could be
    changed to work on vectors.
    Yes!
    I think the largest speedup will be to dump the multiphase merge and
    merge all tapes in one pass, no matter how large M. Currently M is
    capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
    the tape. It could be done in a single pass heap merge with N*log(M)
    comparisons, and, more importantly, far less input and output.
    Yes again, see above.
    I would also recommend using an external process to asynchronously
    feed the tuples into the heap during the merge.
    Simon Riggs is working this idea a bit - it's slightly less interesting to
    us because we already have a multiprocessing executor. Our problem is that
    4 x slow is still far too slow.
    What's the timeframe for 8.2?
    Let's test it out in Bizgres!

    - Luke
  • Tom Lane at Oct 1, 2005 at 6:02 am

    "Jeffrey W. Baker" <jwbaker@acm.org> writes:
    I think the largest speedup will be to dump the multiphase merge and
    merge all tapes in one pass, no matter how large M. Currently M is
    capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
    the tape. It could be done in a single pass heap merge with N*log(M)
    comparisons, and, more importantly, far less input and output.
    I had more or less despaired of this thread yielding any usable ideas
    :-( but I think you have one here. The reason the current code uses a
    six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
    edition) shows that there's not much incremental gain from using more
    tapes ... if you are in the regime where number of runs is much greater
    than number of tape drives. But if you can stay in the regime where
    only one merge pass is needed, that is obviously a win.

    I don't believe we can simply legislate that there be only one merge
    pass. That would mean that, if we end up with N runs after the initial
    run-forming phase, we need to fit N tuples in memory --- no matter how
    large N is, or how small work_mem is. But it seems like a good idea to
    try to use an N-way merge where N is as large as work_mem will allow.
    We'd not have to decide on the value of N until after we've completed
    the run-forming phase, at which time we've already seen every tuple
    once, and so we can compute a safe value for N as work_mem divided by
    largest_tuple_size. (Tape I/O buffers would have to be counted too
    of course.)
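
    A back-of-the-envelope sketch of that calculation, assuming we know the
    largest tuple size after run forming and charge each input tape one
    tuple plus one I/O buffer (the names and the buffer figure are
    illustrative only):

    #include <stddef.h>

    size_t choose_merge_order(size_t work_mem_bytes,
                              size_t largest_tuple_bytes,
                              size_t tape_buffer_bytes,
                              size_t nruns)
    {
        /* Each input tape needs room for one tuple plus its I/O buffer. */
        size_t per_tape = largest_tuple_bytes + tape_buffer_bytes;
        size_t n = work_mem_bytes / per_tape;

        if (n < 2)
            n = 2;          /* can't merge fewer than two runs at a time */
        if (n > nruns)
            n = nruns;      /* no point in a wider merge than we have runs */
        return n;
    }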

    It's been a good while since I looked at the sort code, and so I don't
    recall if there are any fundamental reasons for having a compile-time-
    constant value of the merge order rather than choosing it at runtime.
    My guess is that any inefficiencies added by making it variable would
    be well repaid by the potential savings in I/O.

    regards, tom lane
  • Simon Riggs at Oct 1, 2005 at 9:29 am

    On Sat, 2005-10-01 at 02:01 -0400, Tom Lane wrote:
    "Jeffrey W. Baker" <jwbaker@acm.org> writes:
    I think the largest speedup will be to dump the multiphase merge and
    merge all tapes in one pass, no matter how large M. Currently M is
    capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
    the tape. It could be done in a single pass heap merge with N*log(M)
    comparisons, and, more importantly, far less input and output.
    I had more or less despaired of this thread yielding any usable ideas
    :-( but I think you have one here. The reason the current code uses a
    six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
    edition) shows that there's not much incremental gain from using more
    tapes ... if you are in the regime where number of runs is much greater
    than number of tape drives. But if you can stay in the regime where
    only one merge pass is needed, that is obviously a win.

    I don't believe we can simply legislate that there be only one merge
    pass. That would mean that, if we end up with N runs after the initial
    run-forming phase, we need to fit N tuples in memory --- no matter how
    large N is, or how small work_mem is. But it seems like a good idea to
    try to use an N-way merge where N is as large as work_mem will allow.
    We'd not have to decide on the value of N until after we've completed
    the run-forming phase, at which time we've already seen every tuple
    once, and so we can compute a safe value for N as work_mem divided by
    largest_tuple_size. (Tape I/O buffers would have to be counted too
    of course.)

    It's been a good while since I looked at the sort code, and so I don't
    recall if there are any fundamental reasons for having a compile-time-
    constant value of the merge order rather than choosing it at runtime.
    My guess is that any inefficiencies added by making it variable would
    be well repaid by the potential savings in I/O.
    Well, perhaps Knuth is not untouchable!

    So we merge R runs with N variable rather than N=6.

    Pick N so that N >= 6 and N <= R, with N limited by memory, sufficient
    to allow long sequential reads from the temp file.

    Looking at the code, in selectnewtape() we decide on the connection
    between run number and tape number. This gets executed during the
    writing of initial runs, which was OK when the run->tape mapping was
    known ahead of time because of fixed N.

    To do this it sounds like we'd be better to write each run out to its
    own personal runtape, taking the assumption that N is very large. Then
    when all runs are built, re-assign the run numbers to tapes for the
    merge. That is likely to be a trivial mapping unless N isn't large
    enough to fit in memory. That idea should be easily possible because the
    tape numbers were just abstract anyway.
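
    A rough sketch of that deferred mapping, with hypothetical names (this
    is not the existing selectnewtape() logic): every run gets its own
    logical tape while it is written, and runs are bound to merge tapes only
    once the real merge order is known.

    #include <stdlib.h>

    int *assign_runs_to_tapes(int nruns, int merge_order)
    {
        int ntapes = (nruns < merge_order) ? nruns : merge_order;
        int *run_to_tape = malloc(nruns * sizeof(int));

        for (int r = 0; r < nruns; r++)
            run_to_tape[r] = r % ntapes;   /* trivial in the common case */
        return run_to_tape;
    }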

    Right now, I can't see any inefficiencies from doing this. It uses
    memory better and Knuth shows that using more tapes is better anyhow.
    Keeping track of more tapes isn't too bad, even for hundreds or even
    thousands of runs/tapes.

    Tom, it's your idea, so you have first dibs. I'm happy to code this up if
    you choose not to, once I've done my other immediate chores.

    That just leaves these issues for a later time:
    - CPU and I/O interleaving
    - CPU cost of abstract data type comparison operator invocation

    Best Regards, Simon Riggs
  • Greg Stark at Oct 1, 2005 at 4:17 pm

    Tom Lane writes:

    "Jeffrey W. Baker" <jwbaker@acm.org> writes:
    I think the largest speedup will be to dump the multiphase merge and
    merge all tapes in one pass, no matter how large M. Currently M is
    capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
    the tape. It could be done in a single pass heap merge with N*log(M)
    comparisons, and, more importantly, far less input and output.
    I had more or less despaired of this thread yielding any usable ideas
    :-( but I think you have one here. The reason the current code uses a
    six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
    edition) shows that there's not much incremental gain from using more
    tapes ... if you are in the regime where number of runs is much greater
    than number of tape drives. But if you can stay in the regime where
    only one merge pass is needed, that is obviously a win.
    Is that still true when the multiple tapes are being multiplexed onto a single
    actual file on disk?

    That brings up one of my pet features though. The ability to declare multiple
    temporary areas on different spindles and then have them be used on a rotating
    basis. So a sort could store each tape on a separate spindle and merge them
    together at full sequential i/o speed.
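
    A sketch of that rotation, purely illustrative (the directory names and
    helper here are made up): each new tape file lands on the next spindle
    in turn, so each tape sees sequential I/O on its own disk.

    #include <stdio.h>

    static const char *temp_areas[] = { "/disk0/pgsort", "/disk1/pgsort", "/disk2/pgsort" };
    static const int   ntemp_areas  = 3;
    static int         next_area    = 0;

    /* Build the path for the next tape file, advancing the rotation. */
    void next_tape_path(char *buf, size_t buflen, int tapenum)
    {
        snprintf(buf, buflen, "%s/tape.%d", temp_areas[next_area], tapenum);
        next_area = (next_area + 1) % ntemp_areas;
    }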

    This would make the tradeoff between multiway merges and many passes even
    harder to find though. The broader the multiway merges the more sort areas
    would be used which would increase the likelihood of another sort using the
    same sort area and hurting i/o performance.

    --
    greg
  • Tom Lane at Oct 1, 2005 at 3:45 pm

    Josh Berkus writes:
    The biggest single area where I see PostgreSQL external sort sucking is
    on index creation on large tables. For example, for free version of
    TPCH, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's
    hardware, but over 3 hours to create each index on that table. This
    means that over all our load into TPCH takes 4 times as long to create
    the indexes as it did to bulk load the data.
    ...
    Following an index creation, we see that 95% of the time required is the
    external sort, which averages 2mb/s. This is with separate drives for
    the WAL, the pg_tmp, the table and the index. I've confirmed that
    increasing work_mem beyond a small minimum (around 128mb) had no benefit
    on the overall index creation speed.
    These numbers don't seem to add up. You have not provided any details
    about the index key datatypes or sizes, but I'll take a guess that the
    raw data for each index is somewhere around 10GB. The theory says that
    the runs created during the first pass should on average be about twice
    work_mem, so at 128mb work_mem there should be around 40 runs to be
    merged, which would take probably three passes with six-way merging.
    Raising work_mem to a gig should result in about five runs, needing only
    one pass, which is really going to be as good as it gets. If you could
    not see any difference then I see little hope for the idea that reducing
    the number of merge passes will help.
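
    (A back-of-the-envelope check of those figures, assuming runs of roughly
    twice work_mem: 10GB / (2 x 128MB) gives about 40 initial runs; a
    six-way merge takes 40 runs to 7, 7 to 2, and 2 to 1, i.e. three merge
    passes. At 1GB of work_mem the same data forms about five runs, which a
    single six-way merge finishes in one pass.)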

    Umm ... you were raising maintenance_work_mem, I trust, not work_mem?

    We really need to get some hard data about what's going on here. The
    sort code doesn't report any internal statistics at the moment, but it
    would not be hard to whack together a patch that reports useful info
    in the form of NOTICE messages or some such.

    regards, tom lane
  • Josh Berkus at Oct 3, 2005 at 8:36 pm
    Tom,
    Raising work_mem to a gig should result in about five runs, needing only
    one pass, which is really going to be as good as it gets. If you could
    not see any difference then I see little hope for the idea that reducing
    the number of merge passes will help.
    Right. It *should have*, but didn't seem to. Example of a simple sort
    test of 100 million random-number records

    1M 3294 seconds
    16M 1107 seconds
    256M 1209 seconds
    512M 1174 seconds
    512M with 'not null' for column that is indexed 1168 seconds
    Umm ... you were raising maintenance_work_mem, I trust, not work_mem?
    Yes.
    We really need to get some hard data about what's going on here. The
    sort code doesn't report any internal statistics at the moment, but it
    would not be hard to whack together a patch that reports useful info
    in the form of NOTICE messages or some such.
    Yeah, I'll do this as soon as the patch is finished. Always useful to
    gear up the old TPC-H.

    --
    --Josh

    Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Gregory Maxwell at Oct 1, 2005 at 2:07 am

    On 9/28/05, Ron Peacetree wrote:
    2= We use my method to sort two different tables. We now have these
    very efficient representations of a specific ordering on these tables. A
    join operation can now be done using these Btrees rather than the
    original data tables that involves less overhead than many current
    methods.
    If we want to make joins very fast we should implement them using RD
    trees. For the example cases where a join against a very large table
    will produce a much smaller output, a RD tree will provide pretty much
    the optimal behavior at a very low memory cost.

    On the subject of high speed tree code for in-core applications, you
    should check out http://judy.sourceforge.net/ . The performance
    (insert, remove, lookup, AND storage) is really quite impressive.
    Producing cache friendly code is harder than one might expect, and it
    appears the judy library has already done a lot of the hard work.
    Though it is *L*GPLed, so perhaps that might scare some here away from
    it. :) and good luck directly doing joins with a LC-TRIE. ;)
  • Ron Peacetree at Sep 28, 2005 at 11:49 pm
    In the interest of efficiency and "not reinventing the wheel", does anyone know
    where I can find C or C++ source code for a Btree variant with the following
    properties:

    A= Data elements (RIDs) are only stored in the leaves, Keys (actually
    KeyPrefixes; see "D" below) and Node pointers are only stored in the internal
    nodes of the Btree.

    B= Element redistribution is done as an alternative to node splitting in overflow
    conditions during Inserts whenever possible.

    C= Variable length Keys are supported.

    D= Node buffering with a reasonable replacement policy is supported.

    E= Since we will know beforehand exactly how many RID's will be stored, we
    will know a priori how much space will be needed for leaves, and will know the
    worst case for how much space will be required for the Btree internal nodes
    as well. This implies that we may be able to use an array, rather than linked
    list, implementation of the Btree. Less pointer chasing at the expense of more
    CPU calculations, but that's a trade-off in the correct direction.

    Such source would be a big help in getting a prototype together.
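
    A very rough sketch of node layouts with those properties (all names
    hypothetical, not taken from any existing library): internal nodes hold
    only key prefixes and child references, leaves hold only RIDs, and both
    kinds of node live in pre-sized arrays rather than a pointer-chased
    structure.

    #include <stdint.h>

    typedef uint32_t RID;               /* 4B record identifier */

    typedef struct {
        uint16_t nkeys;
        uint16_t prefix_off[64];        /* offsets into prefix_pool: variable-length
                                         * key prefixes (properties A, C) */
        uint32_t child[65];             /* array indexes of children, not pointers
                                         * (property E) */
        char     prefix_pool[1];        /* packed key prefixes follow */
    } InternalNode;

    typedef struct {
        uint16_t nrids;
        RID      rids[1];               /* data elements only in the leaves (property A) */
    } LeafNode;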

    Thanks in advance for any pointers or source,
    Ron
  • Ron Peacetree at Sep 29, 2005 at 12:25 am
    If I've done this correctly, there should not be anywhere near
    the number of context switches we currently see while sorting.

    Each unscheduled context switch represents something unexpected
    occurring or things not being where they are needed when they are
    needed. Reducing such circumstances to the absolute minimum
    was one of the design goals.

    Reducing the total amount of IO to the absolute minimum should
    help as well.

    Ron


    -----Original Message-----
    From: Kevin Grittner <Kevin.Grittner@wicourts.gov>
    Sent: Sep 27, 2005 11:21 AM
    Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

    I can't help wondering how a couple thousand context switches per
    second would affect the attempt to load disk info into the L1 and
    L2 caches. That's pretty much the low end of what I see when the
    server is under any significant load.
  • Ron Peacetree at Sep 29, 2005 at 6:21 am

    From: "Jeffrey W. Baker" <jwbaker@acm.org>
    Sent: Sep 29, 2005 12:27 AM
    To: Ron Peacetree <rjpeace@earthlink.net>
    Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
    Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

    You are engaging in a lengthy and verbose exercise in mental
    masturbation, because you have not yet given a concrete example of a
    query where this stuff would come in handy. A common, general-purpose
    case would be the best.
    ??? I posted =3= specific classes of common, general-purpose query
    operations where OES and the OES Btrees look like they should be
    superior to current methods:
    1= when splitting sorting or other operations across multiple CPUs
    2= when doing joins of different tables by doing the join on these Btrees
    rather than the original tables.
    3= when the opportunity arises to reuse OES Btree results of previous
    sorts for different keys in the same table. Now we can combine the
    existing Btrees to obtain the new order based on the composite key
    without ever manipulating the original, much larger, table.

    In what way are these examples not "concrete"?

    We can all see that the method you describe might be a good way to sort
    a very large dataset with some known properties, which would be fine if
    you are trying to break the terasort benchmark. But that's not what
    we're doing here. We are designing and operating relational databases.
    So please explain the application.
    This is a GENERAL method. It's based on CPU cache efficient Btrees that
    use variable length prefix keys and RIDs.
    It assumes NOTHING about the data or the system in order to work.
    I gave some concrete examples for the sake of easing explanation, NOT
    as an indication of assumptions or limitations of the method. I've even
    gone out of my way to prove that no such assumptions or limitations exist.
    Where in the world are you getting such impressions?

    Your main example seems to focus on a large table where a key column has
    constrained values. This case is interesting in proportion to the
    number of possible values. If I have billions of rows, each having one
    of only two values, I can think of a trivial and very fast method of
    returning the table "sorted" by that key: make two sequential passes,
    returning the first value on the first pass and the second value on the
    second pass. This will be faster than the method you propose.
    1= No that was not my main example. It was the simplest example used to
    frame the later more complicated examples. Please don't get hung up on it.

    2= You are incorrect. Since IO is the most expensive operation we can do,
    any method that makes two passes through the data at top scanning speed
    will take at least 2x as long as any method that only takes one such pass.

    I think an important aspect you have failed to address is how much of
    the heap you must visit after the sort is complete. If you are
    returning every tuple in the heap then the optimal plan will be very
    different from the case when you needn't.
    Hmmm. Not sure which "heap" you are referring to, but the OES Btree
    index is provably the lowest (in terms of tree height) and smallest
    possible CPU cache efficient data structure that one can make and still
    have all of the traditional benefits associated with a Btree representation
    of a data set.

    Nonetheless, returning a RID, or all RIDs with(out) the same Key, or all
    RIDs (not) within a range of Keys, or simply all RIDs in sorted order is
    efficient. Just as should be for a Btree (actually it's a B+ tree variant to
    use Knuth's nomenclature). I'm sure someone posting from acm.org
    recognizes how each of these Btree operations maps to various SQL
    features...

    I haven't been talking about query plans because they are orthogonal to
    the issue under discussion. If we use a layered model for PostgreSQL's
    architecture, this functionality is more primal than that of a query
    planner. ALL query plans that currently involve sorts will benefit from a
    more efficient way to do, or avoid, sorts.

    PS: Whatever mailer you use doesn't understand or respect threading nor
    attribution. Out of respect for the list's readers, please try a mailer
    that supports these 30-year-old fundamentals of electronic mail.
    That is an issue of infrastructure on the receiving side, not on the sending
    (my) side since even my web mailer seems appropriately RFC conformant.
    Everything seems to be going in the correct places and being properly
    organized on archival.postgres.org ...

    Ron
  • Pailloncy Jean-Gerard at Sep 29, 2005 at 11:12 am

    Your main example seems to focus on a large table where a key column has
    constrained values. This case is interesting in proportion to the number
    of possible values. If I have billions of rows, each having one of only
    two values, I can think of a trivial and very fast method of returning
    the table "sorted" by that key: make two sequential passes, returning the
    first value on the first pass and the second value on the second pass.
    This will be faster than the method you propose.
    1= No that was not my main example. It was the simplest example used to
    frame the later more complicated examples. Please don't get hung up on it.

    2= You are incorrect. Since IO is the most expensive operation we can do,
    any method that makes two passes through the data at top scanning speed
    will take at least 2x as long as any method that only takes one such pass.
    You do not get the point.
    Once you have the sorted references to the tuples, you still need to
    fetch the tuples themselves, check their visibility, etc., and return
    them to the client.

    So, if there are only 2 values in the column of a big table that is
    larger than available RAM, two seq scans of the table without any
    sorting is the fastest solution.

    Cordialement,
    Jean-Gérard Pailloncy
  • Pierre-Frédéric Caillaud at Sep 29, 2005 at 4:10 pm
    Just to add a little anarchy in your nice debate...

    Who really needs all the results of a sort on your terabyte table ?

    I guess not many people do a SELECT from such a table and want all the
    results. So, this leaves :
    - Really wanting all the results, to fetch using a cursor,
    - CLUSTER type things, where you really want everything in order,
    - Aggregates (Sort->GroupAggregate), which might really need to sort the
    whole table.
    - Complex queries where the whole dataset needs to be examined, in order
    to return a few values
    - Joins (again, the whole table is probably not going to be selected)
    - And the ones I forgot.

    However,

    Most likely you only want to SELECT N rows, in some ordering :
    - the first N (ORDER BY x LIMIT N)
    - last N (ORDER BY x DESC LIMIT N)
    - WHERE x>value ORDER BY x LIMIT N
    - WHERE x<value ORDER BY x DESC LIMIT N
    - and other variants

    Or, you are doing a Merge JOIN against some other table ; in that case,
    yes, you might need the whole sorted terabyte table, but most likely there
    are WHERE clauses in the query that restrict the set, and thus, maybe we
    can get some conditions or limit values on the column to sort.

    Also the new, optimized hash join, which is more memory efficient, might
    cover this case.

    Point is, sometimes, you only need part of the results of your sort. And
    the bigger the sort, the most likely it becomes that you only want part of
    the results. So, while we're in the fun hand-waving, new algorithm trying
    mode, why not consider this right from the start ? (I know I'm totally in
    hand-waving mode right now, so slap me if needed).

    I'd say your new, fancy sort algorithm needs a few more input values :

    - Range of values that must appear in the final result of the sort :
    none, minimum, maximum, both, or even a set of values from the other
    side of the join, hashed, or sorted.
    - LIMIT information (first N, last N, none)
    - Enhanced Limit information (first/last N values of the second column to
    sort, for each value of the first column) (the infamous "top10 by
    category" query)
    - etc.

    With this, the amount of data that needs to be kept in memory is
    dramatically reduced, from the whole table (even using your compressed
    keys, that's big) to something more manageable which will be closer to the
    size of the final result set which will be returned to the client, and
    avoid a lot of effort.
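
    A sketch of the LIMIT-aware case (ORDER BY x LIMIT N), illustrative
    only: keep just the best N keys seen so far in a bounded max-heap while
    scanning, so the working set is proportional to N rather than to the
    table.

    #include <stdlib.h>

    typedef struct { int *heap; int n; int limit; } TopN;

    TopN topn_create(int limit)
    {
        TopN t = { malloc(limit * sizeof(int)), 0, limit };
        return t;
    }

    /* Restore the max-heap property downward from position i. */
    static void sift_down_max(int *h, int n, int i)
    {
        for (;;) {
            int l = 2 * i + 1, r = l + 1, m = i;
            if (l < n && h[l] > h[m]) m = l;
            if (r < n && h[r] > h[m]) m = r;
            if (m == i) break;
            int t = h[i]; h[i] = h[m]; h[m] = t;
            i = m;
        }
    }

    /* Offer one scanned key; the heap always holds the N smallest so far. */
    void topn_offer(TopN *t, int key)
    {
        if (t->n < t->limit) {
            int i = t->n++;
            t->heap[i] = key;
            while (i > 0 && t->heap[(i - 1) / 2] < t->heap[i]) {
                int p = (i - 1) / 2;
                int tmp = t->heap[p]; t->heap[p] = t->heap[i]; t->heap[i] = tmp;
                i = p;
            }
        } else if (key < t->heap[0]) {
            t->heap[0] = key;       /* replace the current worst of the best N */
            sift_down_max(t->heap, t->n, 0);
        }
    }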

    So, this would not be useful in all cases, but when it applies, it would
    be really useful.

    Regards !
  • Andreas Zeugswetter at Sep 29, 2005 at 1:28 pm

    In my original example, a sequential scan of the 1TB of 2KB
    or 4KB records, => 250M or 500M records of data, being sorted
    on a binary value key will take ~1000x more time than reading
    in the ~1GB Btree I described that used a Key+RID (plus node
    pointers) representation of the data.
    Imho you seem to ignore the final step your algorithm needs of
    collecting the data rows. After you have sorted the keys, the collect
    step will effectively access the tuples in random order (given a
    sufficiently large key range).

    This random access is bad. It effectively allows a competing algorithm
    to read the whole data at least 40 times sequentially, or write the set
    20 times sequentially. (Those are the random/sequential ratios of
    modern discs.)

    Andreas
  • Jim C. Nasby at Oct 8, 2005 at 10:51 pm

    On Thu, Sep 29, 2005 at 03:28:27PM +0200, Zeugswetter Andreas DAZ SD wrote:
    In my original example, a sequential scan of the 1TB of 2KB
    or 4KB records, => 250M or 500M records of data, being sorted
    on a binary value key will take ~1000x more time than reading
    in the ~1GB Btree I described that used a Key+RID (plus node
    pointers) representation of the data.
    Imho you seem to ignore the final step your algorithm needs of
    collecting the data rows. After you have sorted the keys, the collect
    step will effectively access the tuples in random order (given a
    sufficiently large key range).

    This random access is bad. It effectively allows a competing algorithm
    to read the whole data at least 40 times sequentially, or write the set
    20 times sequentially. (Those are the random/sequential ratios of
    modern discs.)
    True, but there is a compromise... not shuffling full tuples around when
    sorting in memory. Do your sorting with pointers, then write the full
    tuples out to 'tape' if needed.
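
    A small sketch of that compromise, with illustrative types (not
    PostgreSQL code): sort an array of pointers to the tuples, then stream
    the full tuples out to "tape" in pointer order.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int key; char payload[2040]; } Tuple;   /* ~2KB records */

    static int cmp_tuple_ptr(const void *a, const void *b)
    {
        const Tuple *ta = *(Tuple *const *) a;
        const Tuple *tb = *(Tuple *const *) b;
        return (ta->key > tb->key) - (ta->key < tb->key);
    }

    /* Only pointers move during the sort; tuples are written once, in order. */
    void sort_and_spill(Tuple *tuples, size_t n, FILE *tape)
    {
        Tuple **ptrs = malloc(n * sizeof(Tuple *));
        for (size_t i = 0; i < n; i++)
            ptrs[i] = &tuples[i];
        qsort(ptrs, n, sizeof(Tuple *), cmp_tuple_ptr);
        for (size_t i = 0; i < n; i++)
            fwrite(ptrs[i], sizeof(Tuple), 1, tape);
        free(ptrs);
    }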

    Of course the other issue here is that as correlation improves it
    becomes better and better to do full pointer-based sorting.
    --
    Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
    Pervasive Software http://pervasive.com work: 512-231-6117
    vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
  • Dann Corbit at Sep 29, 2005 at 6:32 pm
    If I were to be nosy and poke around in this, what patches of code would
    I be interested in?
    -----Original Message-----
    From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
    owner@postgresql.org] On Behalf Of Josh Berkus
    Sent: Thursday, September 29, 2005 11:28 AM
    To: pgsql-hackers@postgresql.org
    Cc: Jeffrey W. Baker
    Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

    Jeff,
    I would just run it under the profiler and see how many times
    beginmerge() is called.
    Hmm, I'm not seeing it at all in the oprofile results on a
    100million-row
    sort.

    --
    --Josh

    Josh Berkus
    Aglio Database Solutions
    San Francisco

    ---------------------------(end of
    broadcast)---------------------------
    TIP 4: Have you searched our list archives?

    http://archives.postgresql.org
  • Ron Peacetree at Sep 30, 2005 at 2:03 am

    From: Pailloncy Jean-Gerard <jg@rilk.com>
    Sent: Sep 29, 2005 7:11 AM
    Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
    Jeff Baker:
    Your main example seems to focus on a large table where a key
    column has constrained values. This case is interesting in
    proportion to the number of possible values. If I have billions
    of rows, each having one of only two values, I can think of a
    trivial and very fast method of returning the table "sorted" by
    that key: make two sequential passes, returning the first value
    on the first pass and the second value on the second pass.
    This will be faster than the method you propose.
    Ron Peacetree:
    1= No that was not my main example. It was the simplest example
    used to frame the later more complicated examples. Please don't
    get hung up on it.

    2= You are incorrect. Since IO is the most expensive operation we
    can do, any method that makes two passes through the data at top
    scanning speed will take at least 2x as long as any method that only
    takes one such pass.
    You do not get the point.
    Once you have the sorted references to the tuples, you still need to
    fetch the tuples themselves, check their visibility, etc., and return
    them to the client.
    As PFC correctly points out elsewhere in this thread, =maybe= you
    have to do all that. The vast majority of the time people are not
    going to want to look at a detailed record by record output of that
    much data.

    The most common usage is to calculate or summarize some quality
    or quantity of the data and display that instead or to use the tuples
    or some quality of the tuples found as an intermediate step in a
    longer query process such as a join.

    Sometimes there's a need to see _some_ of the detailed records; a
    random sample or a region in a random part of the table or etc.
    It's rare that there is a RW need to actually list every record in a
    table of significant size.

    On the rare occasions where one does have to return or display all
    records in such large table, network IO and/or display IO speeds
    are the primary performance bottleneck. Not HD IO.

    Nonetheless, if there _is_ such a need, there's nothing stopping us
    from rearranging the records in RAM into sorted order in one pass
    through RAM (using at most space for one extra record) after
    constructing the cache conscious Btree index. Then the sorted
    records can be written to HD in RAM buffer sized chunks very
    efficiently.
    Repeating this process until we have stepped through the entire
    data set will take no more HD IO than one HD scan of the data
    and leave us with a permanent result that can be reused for
    multiple purposes. If the sorted records are written in large
    enough chunks, rereading them at any later time can be done
    at maximum HD throughput.

    In a total of two HD scans (one to read the original data, one
    to write out the sorted data) we can make a permanent
    rearrangement of the data. We've essentially created a
    cluster index version of the data.

    So, if there are only 2 values in the column of a big table that is
    larger than available RAM, two seq scans of the table without any
    sorting is the fastest solution.
    If you only need to do this once, yes this wins. OTOH, if you have
    to do this sort even twice, my method is better.

    regards,
    Ron
  • Ron Peacetree at Sep 30, 2005 at 2:57 am

    From: Zeugswetter Andreas DAZ SD <ZeugswetterA@spardat.at>
    Sent: Sep 29, 2005 9:28 AM
    Subject: RE: [HACKERS] [PERFORM] A Better External Sort?
    In my original example, a sequential scan of the 1TB of 2KB
    or 4KB records, => 250M or 500M records of data, being sorted
    on a binary value key will take ~1000x more time than reading
    in the ~1GB Btree I described that used a Key+RID (plus node
    pointers) representation of the data.
    Imho you seem to ignore the final step your algorithm needs of
    collecting the data rows. After you sorted the keys the collect
    step will effectively access the tuples in random order (given a
    sufficiently large key range).
    "Collecting" the data rows can be done for each RAM buffer full of
    of data in one pass through RAM after we've built the Btree. Then
    if desired those data rows can be read out to HD in sorted order
    in essentially one streaming burst. This combination of index build
    + RAM buffer rearrangement + write results to HD can be repeated
    as often as needed until we end up with an overall Btree index and
    a set of sorted sublists on HD. Overall HD IO for the process is only
    two effectively sequential passes through the data.

    Subsequent retrieval of the sorted information from HD can be
    done at full HD streaming speed and whatever we've decided to
    save to HD can be reused later if we desire.

    Hope this helps,
    Ron
  • Ron Peacetree at Sep 30, 2005 at 5:24 am

    From: Josh Berkus <josh@agliodbs.com>
    Sent: Sep 29, 2005 12:54 PM
    Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

    The biggest single area where I see PostgreSQL external
    sort sucking is on index creation on large tables. For
    example, for free version of TPCH, it takes only 1.5 hours to
    load a 60GB Lineitem table on OSDL's hardware, but over 3
    hours to create each index on that table. This means that
    over all our load into TPCH takes 4 times as long to create
    the indexes as it did to bulk load the data.
    Hmmm.
    60GB/5400secs= 11MBps. That's ssllooww. So the first
    problem is evidently our physical layout and/or HD IO layer
    sucks.

    Creating the table and then creating the indexes on the table
    is going to require more physical IO than if we created the
    table and the indexes concurrently in chunks and then
    combined the indexes on the chunks into the overall indexes
    for the whole table, so there's a potential speed-up.

    The method I've been talking about is basically a recipe for
    creating indexes as fast as possible with as few IO operations,
    HD or RAM, as possible and nearly no random ones, so it
    could help as well.

    OTOH, HD IO rate is the fundamental performance metric.
    As long as our HD IO rate is pessimal, so will the performance
    of everything else be. Why can't we load a table at closer to
    the peak IO rate of the HDs?

    Anyone restoring a large database from pg_dump is in the
    same situation. Even worse, if you have to create a new
    index on a large table on a production database in use,
    because the I/O from the index creation swamps everything.
    Fix for this in the works ;-)

    Following an index creation, we see that 95% of the time
    required is the external sort, which averages 2mb/s.
    Assuming decent HD HW, this is HORRIBLE.

    What kind of instrumenting and profiling has been done of
    the code involved?

    This is with separate drives for the WAL, the pg_tmp, the table
    and the index. I've confirmed that increasing work_mem
    beyond a small minimum (around 128mb) had no benefit on
    the overall index creation speed.
    No surprise. The process is severely limited by the abysmally
    slow HD IO.

    Ron
  • Josh Berkus at Sep 30, 2005 at 5:23 pm
    Ron,
    Hmmm.
    60GB/5400secs= 11MBps. That's ssllooww. So the first
    problem is evidently our physical layout and/or HD IO layer
    sucks.
    Actually, it's much worse than that, because the sort is only dealing
    with one column. As I said, monitoring the iostat our top speed was
    2.2mb/s.

    --Josh
  • Dann Corbit at Sep 30, 2005 at 5:44 pm

    -----Original Message-----
    From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
    owner@postgresql.org] On Behalf Of PFC
    Sent: Thursday, September 29, 2005 9:10 AM
    To: rjpeace@earthlink.net
    Cc: Pg Hackers; pgsql-performance@postgresql.org
    Subject: Re: [HACKERS] [PERFORM] A Better External Sort?


    Just to add a little anarchy in your nice debate...

    Who really needs all the results of a sort on your terabyte table ?

    Reports with ORDER BY/GROUP BY, and many other possibilities. 40% of
    mainframe CPU cycles are spent sorting. That is because huge volumes of
    data require lots of energy to be meaningfully categorized. Let's
    suppose that instead of a terabyte of data (or a petabyte or whatever)
    we have 10% of it. That's still a lot of data.
    I guess not many people do a SELECT from such a table and want all the
    results.
    What happens when they do? The cases where it is already fast are not
    very important. The cases where things go into the crapper are the ones
    that need attention.
    So, this leaves :
    - Really wanting all the results, to fetch using a cursor,
    - CLUSTER type things, where you really want everything in order,
    - Aggregates (Sort->GroupAggregate), which might really need to sort
    the whole table.
    - Complex queries where the whole dataset needs to be examined, in
    order to return a few values
    - Joins (again, the whole table is probably not going to be selected)
    - And the ones I forgot.

    However,

    Most likely you only want to SELECT N rows, in some ordering :
    - the first N (ORDER BY x LIMIT N)
    - last N (ORDER BY x DESC LIMIT N)
    For these, the QuickSelect algorithm is what is wanted. For example:
    #include <stdlib.h>
    typedef double Etype;

    extern Etype RandomSelect(Etype * A, size_t p, size_t r, size_t i);
    extern size_t RandRange(size_t a, size_t b);
    extern size_t RandomPartition(Etype * A, size_t p, size_t r);
    extern size_t Partition(Etype * A, size_t p, size_t r);

    /*
    **
    ** In the following code, every reference to CLR means:
    **
    ** "Introduction to Algorithms"
    ** By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest
    ** ISBN 0-07-013143-0
    */


    /*
    ** CLR, page 187
    */
    Etype RandomSelect(Etype A[], size_t p, size_t r, size_t i)
    {
        size_t q, k;

        if (p == r)
            return A[p];
        q = RandomPartition(A, p, r);
        k = q - p + 1;

        if (i <= k)
            return RandomSelect(A, p, q, i);
        else
            return RandomSelect(A, q + 1, r, i - k);
    }

    size_t RandRange(size_t a, size_t b)
    {
        size_t c = (size_t) ((double) rand() / ((double) RAND_MAX + 1) * (b - a));

        return c + a;
    }

    /*
    ** CLR, page 162
    */
    size_t RandomPartition(Etype A[], size_t p, size_t r)
    {
        size_t i = RandRange(p, r);
        Etype Temp;

        Temp = A[p];
        A[p] = A[i];
        A[i] = Temp;
        return Partition(A, p, r);
    }

    /*
    ** CLR, page 154
    */
    size_t Partition(Etype A[], size_t p, size_t r)
    {
        Etype x, temp;
        size_t i, j;

        x = A[p];
        i = p - 1;
        j = r + 1;

        for (;;) {
            do {
                j--;
            } while (!(A[j] <= x));
            do {
                i++;
            } while (!(A[i] >= x));
            if (i < j) {
                temp = A[i];
                A[i] = A[j];
                A[j] = temp;
            } else
                return j;
        }
    }
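
    A hypothetical usage example for the routines above (not part of the
    original post): select the 3rd-smallest of five values; note that i is
    1-based within A[p..r].

    #include <stdio.h>

    int main(void)
    {
        Etype a[] = { 9.0, 3.0, 7.0, 1.0, 5.0 };
        Etype third = RandomSelect(a, 0, 4, 3);   /* 3rd smallest of five */

        printf("3rd smallest = %g\n", third);     /* prints 5 */
        return 0;
    }
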
    - WHERE x>value ORDER BY x LIMIT N
    - WHERE x<value ORDER BY x DESC LIMIT N
    - and other variants

    Or, you are doing a Merge JOIN against some other table ; in that case,
    yes, you might need the whole sorted terabyte table, but most likely there
    are WHERE clauses in the query that restrict the set, and thus, maybe we
    can get some conditions or limit values on the column to sort.
    Where clause filters are to be applied AFTER the join operations,
    according to the SQL standard.
    Also the new, optimized hash join, which is more memory efficient, might
    cover this case.
    For == joins. Not every order by is applied to joins. And not every
    join is an equal join.
    Point is, sometimes, you only need part of the results of your sort. And
    the bigger the sort, the most likely it becomes that you only want part of
    the results.
    That is an assumption that will sometimes be true, and sometimes not.
    It is not possible to predict usage patterns for a general purpose
    database system.
    So, while we're in the fun hand-waving, new algorithm trying
    mode, why not consider this right from the start ? (I know I'm totally in
    hand-waving mode right now, so slap me if needed).

    I'd say your new, fancy sort algorithm needs a few more input values :

    - Range of values that must appear in the final result of the sort :
    none, minimum, maximum, both, or even a set of values from the other
    side of the join, hashed, or sorted.
    That will already happen (or it certainly ought to)
    - LIMIT information (first N, last N, none)
    That will already happen (or it certainly ought to -- I would be pretty
    surprised if it does not happen)
    - Enhanced Limit information (first/last N values of the second column to
    sort, for each value of the first column) (the infamous "top10 by
    category" query)
    - etc.
    All the filters will (at some point) be applied to the data unless they
    cannot be applied to the data by formal rule.
    With this, the amount of data that needs to be kept in memory is
    dramatically reduced, from the whole table (even using your compressed
    keys, that's big) to something more manageable which will be closer to the
    size of the final result set which will be returned to the client, and
    avoid a lot of effort.
    Sorting the minimal set is a good idea. Sometimes there is a big
    savings there. I would be pretty surprised if a large fraction of data
    that does not have to be included is actually processed during the
    sorts.
    So, this would not be useful in all cases, but when it applies, it would
    be really useful.
    No argument there. And if an algorithm is being reworked, it is a good
    idea to look at things like filtering to see if all filtering that is
    allowed by the language standard before the sort takes place is applied.
  • Ron Peacetree at Sep 30, 2005 at 8:21 pm
    That 11MBps was your =bulk load= speed. If just loading a table
    is this slow, then there are issues with basic physical IO, not just
    IO during sort operations.

    As I said, the obvious candidates are inefficient physical layout
    and/or flawed IO code.

    Until the basic IO issues are addressed, we could replace the
    present sorting code with infinitely fast sorting code and we'd
    still be scrod performance wise.

    So why does basic IO suck so badly?

    Ron


    -----Original Message-----
    From: Josh Berkus <josh@agliodbs.com>
    Sent: Sep 30, 2005 1:23 PM
    To: Ron Peacetree <rjpeace@earthlink.net>
    Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
    Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

    Ron,
    Hmmm.
    60GB/5400secs= 11MBps. That's ssllooww. So the first
    problem is evidently our physical layout and/or HD IO layer
    sucks.
    Actually, it's much worse than that, because the sort is only dealing
    with one column. As I said, monitoring the iostat our top speed was
    2.2mb/s.

    --Josh
  • Josh Berkus at Sep 30, 2005 at 8:37 pm
    Ron,
    That 11MBps was your =bulk load= speed. If just loading a table
    is this slow, then there are issues with basic physical IO, not just
    IO during sort operations.
    Oh, yeah. Well, that's separate from sort. See multiple posts on this
    list from the GreenPlum team, the COPY patch for 8.1, etc. We've been
    concerned about I/O for a while.

    Realistically, you can't do better than about 25MB/s on a single-threaded
    I/O on current Linux machines, because your bottleneck isn't the actual
    disk I/O. It's CPU. Databases which "go faster" than this are all, to
    my knowledge, using multi-threaded disk I/O.

    (and I'd be thrilled to get a consistent 25mb/s on PostgreSQL, but that's
    another thread ... )
    As I said, the obvious candidates are inefficient physical layout
    and/or flawed IO code.
    Yeah, that's what I thought too. But try sorting an 10GB table, and
    you'll see: disk I/O is practically idle, while CPU averages 90%+. We're
    CPU-bound, because sort is being really inefficient about something. I
    just don't know what yet.

    If we move that CPU-binding to a higher level of performance, then we can
    start looking at things like async I/O, O_Direct, pre-allocation etc. that
    will give us incremental improvements. But what we need now is a 5-10x
    improvement and that's somewhere in the algorithms or the code.

    --
    --Josh

    Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Michael Stone at Sep 30, 2005 at 11:10 pm

    On Fri, Sep 30, 2005 at 01:41:22PM -0700, Josh Berkus wrote:
    Realistically, you can't do better than about 25MB/s on a single-threaded
    I/O on current Linux machines,
    What on earth gives you that idea? Did you drop a zero?

    Mike Stone
  • Josh Berkus at Oct 3, 2005 at 8:30 pm
    Michael,
    Realistically, you can't do better than about 25MB/s on a
    single-threaded I/O on current Linux machines,
    What on earth gives you that idea? Did you drop a zero?
    Nope, LOTS of testing, at OSDL, GreenPlum and Sun. For comparison, A
    Big-Name Proprietary Database doesn't get much more than that either.

    --
    --Josh

    Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Jeffrey W. Baker at Oct 3, 2005 at 8:42 pm

    On Mon, 2005-10-03 at 13:34 -0700, Josh Berkus wrote:
    Michael,
    Realistically, you can't do better than about 25MB/s on a
    single-threaded I/O on current Linux machines,
    What on earth gives you that idea? Did you drop a zero?
    Nope, LOTS of testing, at OSDL, GreenPlum and Sun. For comparison, A
    Big-Name Proprietary Database doesn't get much more than that either.
    I find this claim very suspicious. I get single-threaded reads in
    excess of 1GB/sec with XFS and > 250MB/sec with ext3.

    -jwb
  • Josh Berkus at Oct 3, 2005 at 9:14 pm
    Jeff,
    Nope, LOTS of testing, at OSDL, GreenPlum and Sun. For comparison, A
    Big-Name Proprietary Database doesn't get much more than that either.
    I find this claim very suspicious. I get single-threaded reads in
    excess of 1GB/sec with XFS and > 250MB/sec with ext3.
    Database reads? Or raw FS reads? It's not the same thing.

    Also, we're talking *write speed* here, not read speed.

    I also find *your* claim suspicious, since there's no way XFS is 300% faster
    than ext3 for the *general* case.

    --
    Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Luke Lonergan at Oct 3, 2005 at 9:28 pm
    Jeff, Josh,
    On 10/3/05 2:16 PM, "Josh Berkus" wrote:

    Jeff,
    Nope, LOTS of testing, at OSDL, GreenPlum and Sun. For comparison, A
    Big-Name Proprietary Database doesn't get much more than that either.
    I find this claim very suspicious. I get single-threaded reads in
    excess of 1GB/sec with XFS and > 250MB/sec with ext3.
    Database reads? Or raw FS reads? It's not the same thing.

    Also, we're talking *write speed* here, not read speed.
    I think you are both talking past each other here. I'll state what I
    *think* each of you are saying:

    Josh: single threaded DB writes are limited to 25MB/s

    My opinion: Not if they're done better than they are now in PostgreSQL.
    PostgreSQL COPY is still CPU limited at 12MB/s on a super fast Opteron. The
    combination of WAL and heap writes while this is the case is about 50MB/s,
    which is far from the limit of the filesystems we test on that routinely
    perform at 250MB/s on ext2 writing in sequential 8k blocks.

    There is no reason that we couldn't do triple the current COPY speed by
    reducing the CPU overhead in parsing and attribute conversion. We've talked
    this to death, and implemented much of the code to fix it, but there's much
    more to do.

    Jeff: Plenty of FS bandwidth to be had on Linux, observed 250MB/s on ext3
    and 1,000MB/s on XFS.

    Wow - can you provide a link or the results from the XFS test? Is this 8k
    blocksize sequential I/O? How many spindles and what controller are you
    using? Inquiring minds want to know...

    - Luke
  • Jeffrey W. Baker at Oct 3, 2005 at 9:32 pm

    On Mon, 2005-10-03 at 14:16 -0700, Josh Berkus wrote:
    Jeff,
    Nope, LOTS of testing, at OSDL, GreenPlum and Sun. For comparison, A
    Big-Name Proprietary Database doesn't get much more than that either.
    I find this claim very suspicious. I get single-threaded reads in
    excess of 1GB/sec with XFS and > 250MB/sec with ext3.
    Database reads? Or raw FS reads? It's not the same thing.
    Just reading files off the filesystem. These are input rates I get with
    a specialized sort implementation. 1GB/sec is not even especially
    wonderful, I can get that on two controllers with 24-disk stripe set.

    I guess database reads are different, but I remain unconvinced that they
    are *fundamentally* different. After all, a tab-delimited file (my sort
    workload) is a kind of database.
    Also, we're talking *write speed* here, not read speed.
    Ok, I did not realize. Still you should see 250-300MB/sec
    single-threaded sequential output on ext3, assuming the storage can
    provide that rate.
    I also find *your* claim suspicious, since there's no way XFS is 300% faster
    than ext3 for the *general* case.
    On a single disk you wouldn't notice, but XFS scales much better when
    you throw disks at it. I get a 50MB/sec boost from the 24th disk,
    whereas ext3 stops scaling after 16 disks. For writes both XFS and ext3
    top out around 8 disks, but in this case XFS tops out at 500MB/sec while
    ext3 can't break 350MB/sec.

    I'm hopeful that in the future the work being done at ClusterFS will
    make ext3 on-par with XFS.

    -jwb
  • Josh Berkus at Oct 3, 2005 at 9:59 pm
    Jeffrey,
    I guess database reads are different, but I remain unconvinced that they
    are *fundamentally* different. After all, a tab-delimited file (my sort
    workload) is a kind of database.
    Unfortunately, they are ... because of CPU overheads. I'm basing what's
    "reasonable" for data writes on the rates which other high-end DBs can
    make. From that, 25mb/s or even 40mb/s for sorts should be achievable
    but doing 120mb/s would require some kind of breakthrough.
    On a single disk you wouldn't notice, but XFS scales much better when
    you throw disks at it. I get a 50MB/sec boost from the 24th disk,
    whereas ext3 stops scaling after 16 disks. For writes both XFS and ext3
    top out around 8 disks, but in this case XFS tops out at 500MB/sec while
    ext3 can't break 350MB/sec.
    That would explain it. I seldom get more than 6 disks (and 2 channels) to
    test with.

    --
    --Josh

    Josh Berkus
    Aglio Database Solutions
    San Francisco
  • Hannu Krosing at Oct 3, 2005 at 9:43 pm

    On E, 2005-10-03 at 14:16 -0700, Josh Berkus wrote:
    Jeff,
    Nope, LOTS of testing, at OSDL, GreenPlum and Sun. For comparison, A
    Big-Name Proprietary Database doesn't get much more than that either.
    I find this claim very suspicious. I get single-threaded reads in
    excess of 1GB/sec with XFS and > 250MB/sec with ext3.
    Database reads? Or raw FS reads? It's not the same thing.
    Just FYI, I ran a count(*) on a 15.6GB table on a lightly loaded db and
    it ran in 163 sec. (Dual opteron 2.6GHz, 6GB RAM, 6 x 74GB 15k disks in
    RAID10, reiserfs). A little less than 100MB/sec.

    After this I ran count(*) over a 2.4GB file from another tablespace on
    another device (4x142GB 10k disks in RAID10) and it ran in 22.5 sec on
    the first run and 12.5 on the second.

    db=# show shared_buffers ;
    shared_buffers
    ----------------
    196608
    (1 row)

    db=# select version();
    version
    --------------------------------------------------------------------------------------------
    PostgreSQL 8.0.3 on x86_64-pc-linux-gnu, compiled by GCC cc (GCC) 3.3.6
    (Debian 1:3.3.6-7)
    (1 row)


    --
    Hannu Krosing <hannu@skype.net>
  • Luke Lonergan at Oct 3, 2005 at 9:52 pm
    Hannu,
    On 10/3/05 2:43 PM, "Hannu Krosing" wrote:

    Just FYI, I ran a count(*) on a 15.6GB table on a lightly loaded db and
    it ran in 163 sec. (Dual opteron 2.6GHz, 6GB RAM, 6 x 74GB 15k disks in
    RAID10, reiserfs). A little less than 100MB/sec.
    This confirms our findings - sequential scan is CPU limited at about 120MB/s
    per single threaded executor. This is too slow for fast file systems like
    we're discussing here.

    Bizgres MPP gets 250MB/s by running multiple scanners, but we still chew up
    unnecessary amounts of CPU.
    After this I ran count(*) over a 2.4GB file from another tablespace on
    another device (4x142GB 10k disks in RAID10) and it ran in 22.5 sec on
    the first run and 12.5 on the second.
    You're getting caching effects here.

    - Luke
