Brian Hurt wrote:

While we're blue-skying things, I've had an idea for a sorting algorithm kicking around for a couple of years that might be interesting. It's a variation on heapsort to make it significantly more block-friendly. I have no idea whether the idea would work, or how well it'd work, but it might be worth kicking around.

Now, the core idea of heapsort is that the array is put into heap order -- basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the 0-based array version here). The problem is that, assuming the length of a is larger than memory, a[2i+1] is likely to be on a different page or block than a[i]. That means every time you have to bubble down a new element, you end up reading O(log N) blocks -- and that's *per element*.
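
To make the access pattern concrete, here is what the bubble-down (sift-down) looks like for a plain 0-based max-heap -- a purely illustrative C sketch, not taken from any existing implementation. Each iteration jumps from index i to 2i+1 or 2i+2, so once the array is much larger than memory, nearly every step of the walk lands on a different page or block:

    #include <stddef.h>

    /* Classic max-heap sift-down: restore a[i] >= a[2i+1] and
     * a[i] >= a[2i+2].  The index roughly doubles every iteration,
     * so the walk touches O(log N) positions that are far apart. */
    static void sift_down(int *a, size_t n, size_t i)
    {
        for (;;) {
            size_t l = 2 * i + 1;
            size_t r = 2 * i + 2;
            size_t largest = i;

            if (l < n && a[l] > a[largest]) largest = l;
            if (r < n && a[r] > a[largest]) largest = r;
            if (largest == i)
                break;                  /* heap order restored */

            int tmp = a[i]; a[i] = a[largest]; a[largest] = tmp;
            i = largest;                /* descend into the changed child */
        }
    }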

The variation is to instead work with blocks, so you have a block of entries b[i], and you change the definition of heap order so that min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]). Also, during bubble down, you need to be careful to only change the minimum value of one of the two child blocks b[2i+1] and b[2i+2]. Other than that, the algorithm works as normal. The advantage of doing it this way is that while each bubble down still touches O(log N) blocks, you get an entire block's worth of results for your effort. Make your blocks large enough (say, 1/4 the size of work_mem) and you greatly reduce N, the number of blocks you have to deal with, and get much better I/O (when you're reading, you're reading megabytes at a shot).
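
Concretely, the modified invariant might look something like this (the Block type, the fixed BLOCK_ENTRIES, and recomputing min/max on every call are simplifications for the sketch -- a real version would cache each block's min and max so the comparisons never touch the block contents):

    #include <stdbool.h>
    #include <stddef.h>

    #define BLOCK_ENTRIES 1024          /* illustrative; think ~work_mem/4 */

    typedef struct {
        int    entries[BLOCK_ENTRIES];
        size_t n;                       /* entries in use, assumed >= 1 */
    } Block;

    static int block_min(const Block *b)
    {
        int m = b->entries[0];
        for (size_t i = 1; i < b->n; i++)
            if (b->entries[i] < m) m = b->entries[i];
        return m;
    }

    static int block_max(const Block *b)
    {
        int m = b->entries[0];
        for (size_t i = 1; i < b->n; i++)
            if (b->entries[i] > m) m = b->entries[i];
        return m;
    }

    /* Block-level heap order: every entry of a parent block is >=
     * every entry of its two child blocks, which the min/max
     * comparisons capture exactly. */
    static bool block_heap_ok(const Block *b, size_t nblocks, size_t i)
    {
        size_t l = 2 * i + 1, r = 2 * i + 2;
        if (l < nblocks && block_min(&b[i]) < block_max(&b[l])) return false;
        if (r < nblocks && block_min(&b[i]) < block_max(&b[r])) return false;
        return true;
    }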

Now, there are boatloads of complexities I'm glossing over here. This is more of a sketch of the idea. But it's something to consider.

There are three advantages to this proposal that I've since thought of:

1) Of the two child blocks b[2i+1] and b[2i+2], the one with the larger minimum element is the one we might replace. In other words, if min(b[2i+1]) > min(b[2i+2]) and min(b[i]) < min(b[2i+1]), then we know we're going to want blocks b[4i+3] and b[4i+4] before we're done with blocks b[2i+1] and b[2i+2]. The point here is that this would work wonders with the posix_fadvise/asyncio ideas kicking around -- it'd be easy for the code to keep two large writes and two large reads going pretty much constantly (a sketch of the prefetch step follows this list).

2) There is some easy parallelization available. I'm not sure how much this is worth, but the bubble-down code is fairly easy to parallelize. If we have two bubble-downs going on in parallel, once they go down different branches (one thread goes to block b[2i+1] while the other goes to b[2i+2]) they no longer interact. Blocks near the root of the heap would be contended over, and multiple threads mean smaller blocks to keep the total memory footprint the same. Personally, I think the asyncio idea above is more likely to be worthwhile (a locking sketch follows the list all the same).

3) It's possible to perform the sort lazily. You have the initial O(N) pass over the list, but after that each output block costs only O(log N). If it's likely that only the first part of the result is needed, then much of the work can be avoided (see the sketch after the list).
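
Here is a rough sketch of the prefetch step from point 1, assuming the blocks live in a single file laid out flat (block i at byte offset i * block_bytes); the function name, the fd, and the layout are assumptions made for the illustration, not part of the proposal:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stddef.h>

    /* Once bubble-down has decided to descend into child block c (the
     * child with the larger minimum), ask the kernel to start reading
     * c's own children, b[2c+1] and b[2c+2], before we need them. */
    static void prefetch_children(int fd, size_t c, size_t nblocks,
                                  size_t block_bytes)
    {
        for (size_t child = 2 * c + 1; child <= 2 * c + 2; child++) {
            if (child < nblocks)
                (void) posix_fadvise(fd, (off_t)(child * block_bytes),
                                     (off_t) block_bytes,
                                     POSIX_FADV_WILLNEED);
        }
    }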
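
And a sketch of the lock discipline point 2 implies, shown over plain ints standing in for whole blocks (the Heap type and names are illustrative). Locks are always taken parent before children and left before right, i.e. in increasing index order, so two concurrent descents cannot deadlock, and they stop contending as soon as they take different branches:

    #include <pthread.h>
    #include <stddef.h>

    typedef struct {
        int             *a;     /* a[i]'s children are a[2i+1], a[2i+2] */
        pthread_mutex_t *lock;  /* lock[i] guards a[i] */
        size_t           n;
    } Heap;

    /* Hand-over-hand (lock-coupling) sift-down. */
    static void bubble_down(Heap *h, size_t i)
    {
        pthread_mutex_lock(&h->lock[i]);
        for (;;) {
            size_t l = 2 * i + 1, r = 2 * i + 2, largest = i;

            if (l < h->n) pthread_mutex_lock(&h->lock[l]);
            if (r < h->n) pthread_mutex_lock(&h->lock[r]);

            if (l < h->n && h->a[l] > h->a[largest]) largest = l;
            if (r < h->n && h->a[r] > h->a[largest]) largest = r;

            if (largest == i) {         /* heap order holds: release all */
                if (r < h->n) pthread_mutex_unlock(&h->lock[r]);
                if (l < h->n) pthread_mutex_unlock(&h->lock[l]);
                pthread_mutex_unlock(&h->lock[i]);
                return;
            }

            /* Swap, then release the parent and the child we are not
             * descending into; keep holding the lock on `largest`. */
            int tmp = h->a[i]; h->a[i] = h->a[largest]; h->a[largest] = tmp;
            if (largest == l && r < h->n) pthread_mutex_unlock(&h->lock[r]);
            if (largest == r)             pthread_mutex_unlock(&h->lock[l]);
            pthread_mutex_unlock(&h->lock[i]);
            i = largest;
        }
    }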
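
Finally, the lazy sort from point 3 is just heapify-then-pop: build_heap() is the one O(N) pass, and each subsequent heap_pop() costs O(log N), so a caller that needs only the first k results does O(N + k log N) work instead of a full sort. Again over plain ints, reusing the sift_down sketched earlier:

    #include <stddef.h>

    static void sift_down(int *a, size_t n, size_t i);  /* as sketched above */

    /* One O(N) pass establishes heap order over all n elements. */
    static void build_heap(int *a, size_t n)
    {
        size_t i = n / 2;
        while (i-- > 0)
            sift_down(a, n, i);
    }

    /* Each call yields the next-largest element for O(log N) work;
     * the caller may stop whenever it has seen enough of the result. */
    static int heap_pop(int *a, size_t *n)
    {
        int top = a[0];
        a[0] = a[--*n];      /* move the last element to the root... */
        sift_down(a, *n, 0); /* ...and bubble it down */
        return top;
    }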

Brian