Hello hackers,

I've been toying around with the freelist in btrees, getting lots of
deadlocks and bootstrap problems. I think I've learned something about
the code, though. Just for the record: I already read Lehman and Yao's
paper and I think I grok it.

There are a few ideas I'd like to present to get some feedback on the
implementation, hoping that anything I miss may be caught by someone
with more experience in these things.


First is about the freelist structure. I had originally proposed that the
freelist should keep an array of BlockNumbers in the metapage. Tom
argued that the number that would fit in there was too small, and proposed
that the first 2000 or so freepages be recorded in the metapage, and
that the rest should each have a link to the next freepage.

I propose instead a hybrid approach: the metapage has an array of
BlockNumbers, _and_ a pointer to a "next freelist page". That page has
another array of BlockNumbers and another link to the next freelist page.
This allows for easier "compaction" of the freelist, an operation which
should be done on a regular basis (with each VACUUM FULL, for example).
The list of freelist pages should actually be doubly linked; that way,
the compaction process can take the BlockNumbers from the last page and
put them on the first page to fill it up, etc. (Remember that each
newpage operation has to sequentially scan the freelist, and write a
zero over the entry it takes.)
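
To make the layout more concrete, here is a minimal standalone sketch;
the struct and field names and the array sizes are invented for
illustration, not taken from nbtree.h:

    /* Hypothetical sketch of the hybrid freelist layout; names and
     * sizes are invented for illustration only. */
    #include <stdint.h>

    typedef uint32_t BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
    #define META_FREELIST_SLOTS 32   /* whatever fits in the metapage */
    #define PAGE_FREELIST_SLOTS 500  /* whatever fits in a whole page */

    /* Part of the metapage: a small in-line array plus a link to the
     * first dedicated freelist page. */
    typedef struct BTMetaFreeList
    {
        BlockNumber btm_free[META_FREELIST_SLOTS];  /* 0 = unused slot */
        BlockNumber btm_first_freelist_page;
    } BTMetaFreeList;

    /* A dedicated freelist page: another array plus forward and
     * backward links, so compaction can walk from the tail back
     * toward the head. */
    typedef struct BTFreeListPageData
    {
        BlockNumber btf_free[PAGE_FREELIST_SLOTS];
        BlockNumber btf_next;   /* next freelist page, or Invalid */
        BlockNumber btf_prev;   /* previous page, for compaction */
    } BTFreeListPageData;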


The second idea is about how to do the free page detection. Each time a
tuple is deleted the page is checked to see if it's empty. If it is,
it's added to a "candidate-empty" list. At the end of the
btbulkdelete() operation, we have a list of pages that are candidates
for adding into the freelist. For each one, the page is
exclusive-locked, and rechecked if it's empty. If it is, the parent is
also exclusive-locked (beware of deadlock here!) and also its left and
right siblings. In the parent, the item pointing to this page is
deleted; in the siblings, the side pointers are updated (high key
on the left sibling also?). Then this page is added to the freelist.
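
As a toy sketch of that recheck-under-lock step (the struct and the
"locking" are stand-ins for the real buffer manager calls, and the
parent/sibling work is only commented):

    /* Toy sketch of the end-of-btbulkdelete pass over the
     * candidate-empty list; types and "locks" are stand-ins. */
    #include <stdbool.h>

    typedef struct ToyPage
    {
        bool locked;    /* stand-in for an exclusive buffer lock */
        int  nitems;    /* live index tuples on the page */
    } ToyPage;

    /* Returns true if the page was still empty and may be freed. */
    static bool
    toy_try_free_candidate(ToyPage *page)
    {
        /* exclusive-lock and recheck: a tuple may have been
         * inserted since the candidate list was built */
        page->locked = true;
        if (page->nitems != 0)
        {
            page->locked = false;   /* not empty anymore; skip it */
            return false;
        }
        /*
         * Real code would now exclusive-lock the parent and the
         * left/right siblings (beware of deadlock), delete the
         * parent's item for this page, fix the siblings' side
         * pointers (and maybe the left sibling's high key), and
         * only then put the page on the freelist.
         */
        page->locked = false;
        return true;
    }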

For the btree_xlog_delete() operation the recovery should be similar,
but the state should be checked with every operation, not with a
candidate-empty list.

On the "add-to-free-list" operation, the page is checked to see if it's
the last page of the relation. If it is, the page just before is
checked for emptyness (using the BTP_FREE flag) iteratively until a
nonfree page is found. All those pages are deleted from the freelist.
Then the relation is shrinked in that many pages.
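
A toy sketch of that shrinking pass, using a plain bool array to stand
in for checking BTP_FREE in each page's opaque data:

    /* Toy sketch: walk back from the end of the relation while pages
     * are marked free, then truncate.  Names are invented. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /* Returns the new length (in blocks) after dropping the free tail. */
    static BlockNumber
    toy_shrink_tail(const bool *page_is_free, BlockNumber nblocks)
    {
        BlockNumber new_nblocks = nblocks;

        /* never touch block 0 (the metapage) */
        while (new_nblocks > 1 && page_is_free[new_nblocks - 1])
        {
            /* real code also removes this block from the freelist */
            new_nblocks--;
        }
        /* real code would now truncate the relation to new_nblocks */
        return new_nblocks;
    }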


Third and last idea: how the _bt_getbuf() operation is modified. I
think the best way is to create a new _bt_newbuf() operation that grabs
a page from the freelist, and use that wherever _bt_getbuf() is called
with P_NEW as the BlockNumber in the current code (the current
_bt_getbuf() will then have an Assert(blkno != P_NEW)). To prevent
deadlocks, the newroot operation does not get a page from the freelist;
it always extends the relation.
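
The control flow I have in mind, as a standalone toy (both helpers are
stubs, not the real freelist or storage-manager code):

    /* Toy sketch of the proposed _bt_newbuf(): try the freelist
     * first, except for newroot, which always extends the relation
     * so it cannot deadlock against the freelist/metapage locks. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    /* stubs standing in for the real freelist and extension code */
    static BlockNumber freelist_take(void)   { return 0; /* empty */ }
    static BlockNumber extend_relation(void)
    {
        static BlockNumber n = 3;   /* pretend the index has 3 blocks */
        return n++;
    }

    static BlockNumber
    toy_bt_newbuf(bool for_newroot)
    {
        if (!for_newroot)
        {
            BlockNumber blkno = freelist_take();

            if (blkno != 0)
                return blkno;           /* reuse a freed page */
        }
        return extend_relation();       /* newroot, or freelist empty */
    }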


Comments? I think I've put too many things in one email. Sorry for
this.

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"The eagle never lost so much time as
when he submitted to learn from the crow." (William Blake, quoted by Nobody)


  • Mark Kirkwood at Oct 22, 2002 at 7:18 am
    Dear hackers,


    I have recently been playing with DB2 8.1 Beta. It has introduced a
    feature to enable index clustering on more than one key. This reminded
    me of a previous thread on HACKERS about index access anding/bitmaps in
    Firebird. So anyway, here is a little snip from the 8.1 manual as an FYI.

    -- snip

    As the name implies, MDC tables cluster data on more than one dimension.
    Each dimension is determined by a column or set of columns that you
    specify in the ORGANIZE BY DIMENSIONS clause of the CREATE TABLE
    statement. When you create an MDC table, the following two kinds of
    indexes are created automatically:

    * A dimension-block index, which contains pointers to each occupied
    block for a single dimension.
    * A composite block index, which contains all dimension key columns.
    The composite block index is used to maintain clustering during
    insert and update activity.

    The optimizer considers dimension-block index scan plans when it
    determines the most efficient access plan for a particular query. When
    queries have predicates on dimension values, the optimizer can use the
    dimension block index to identify, and fetch from, only extents that
    contain these values. In addition, because extents are physically
    contiguous pages on disk, this results in more efficient performance and
    minimizes I/O.

    -- snipped


    regards

    Mark
  • Manfred Koizar at Oct 23, 2002 at 2:18 pm
    Alvaro,

    some time ago I started to collect ideas for btree reorganization but
    never got beyond brainstorming. Maybe it helps if I add them to your
    ideas ...

    On Tue, 22 Oct 2002 00:12:30 -0300, Alvaro Herrera wrote:
    > I propose instead a hybrid approach: the metapage has an array of
    > BlockNumbers, _and_ a pointer to a "next freelist page". That page has
    > another array of BlockNumbers and another link to the next freelist page.
    > This allows for easier "compaction" of the freelist, an operation which
    > should be done on a regular basis (with each VACUUM FULL, for example).
    > The list of freelist pages should actually be doubly linked; that way,
    > the compaction process can take the BlockNumbers from the last page and
    > put them on the first page to fill it up, etc. (Remember that each
    > newpage operation has to sequentially scan the freelist, and write a
    > zero over the entry it takes.)
    What do you mean by "sequentially scan the freelist"? Scan the array
    of page numbers on the meta page or scan the whole list of freelist
    pages? I'd take a page number from the meta page. When the list of
    free pages on the meta page is empty, read the first freelist page,
    move its freelist to the metapage, let the meta page point to its
    successor, and return the (formerly) first freelist page as the new
    page.

    Putting a page on the free list: If there is room for one more page
    number on the meta page, put the page number there and we are
    finished. Otherwise move the freelist from the meta page to the new
    page and insert this page at the start of the freelist chain.

    To avoid freelist ping-pong you might want to move only, say, 90% or
    95% of the freelist from the metapage to the new free page ...

    I don't know if we need to doubly link the list. When do we have to
    scan it backwards?
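
    As a toy sketch of those get/put paths (everything here is invented
    for illustration; the refill and spill steps are only commented
    since they involve real page I/O):

        /* Toy model of the metapage free-page cache. */
        #include <stdint.h>

        typedef uint32_t BlockNumber;
        #define META_SLOTS 8            /* tiny on purpose */

        typedef struct ToyMeta
        {
            BlockNumber free[META_SLOTS];       /* 0 = empty slot */
            int         nfree;
            BlockNumber first_freelist_page;    /* 0 = none */
        } ToyMeta;

        /* Take a free page; 0 means "extend the relation instead". */
        static BlockNumber
        toy_get_free_page(ToyMeta *meta)
        {
            if (meta->nfree > 0)
                return meta->free[--meta->nfree];

            if (meta->first_freelist_page != 0)
            {
                /*
                 * Real code would read this page, copy its array of
                 * block numbers into the metapage, point the metapage
                 * at its successor, and hand the page itself out as
                 * the new page.
                 */
                return meta->first_freelist_page;
            }
            return 0;       /* freelist exhausted */
        }

        /* Put a page back on the freelist. */
        static void
        toy_put_free_page(ToyMeta *meta, BlockNumber blkno)
        {
            if (meta->nfree < META_SLOTS)
            {
                meta->free[meta->nfree++] = blkno;
                return;
            }
            /*
             * Metapage cache full: real code would move most (say
             * 90%) of the entries onto blkno, link blkno in as the
             * new head of the chain, and keep the other ~10% here so
             * the next allocation doesn't immediately pull the chain
             * page back in (the ping-pong mentioned above).
             */
        }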

    > Each time a
    > tuple is deleted the page is checked to see if it's empty. If it is,
    > it's added to a "candidate-empty" list.
    At this point the page is marked as "dead" (by setting a new flag in
    BTPageOpaqueData.btpo_flags) and the lock is released (not sure about
    releasing or keeping the pin). A dead page is not yet available for
    reuse. It is always empty, so it is immediately skipped by index
    scans. The insertion code is modified to never put an index tuple
    onto a dead page (you can put it onto the right sibling without
    breaking consistency).
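
    Roughly, with a hypothetical BTP_DEAD bit (the flag name and its
    value are invented; only btpo_flags itself exists today):

        /* Sketch of a "dead page" flag in the page opaque data. */
        #include <stdint.h>

        #define BTP_DEAD (1 << 3)   /* hypothetical bit, for illustration */

        typedef struct ToyBTPageOpaque
        {
            uint16_t btpo_flags;
        } ToyBTPageOpaque;

        /* Empty, skipped by scans, but not yet reusable. */
        static void
        mark_page_dead(ToyBTPageOpaque *opaque)
        {
            opaque->btpo_flags |= BTP_DEAD;
        }

        /* Insertions test this and move right instead of inserting. */
        static int
        page_is_dead(const ToyBTPageOpaque *opaque)
        {
            return (opaque->btpo_flags & BTP_DEAD) != 0;
        }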

    > At the end of the
    > btbulkdelete() operation, we have a list of pages that are candidates
    > for adding into the freelist.
    There are several possible approaches:
    (a) Maintain a list of dead pages in memory
    (b) Rescan the whole index for dead pages
    (c) Handle each dead page immediately

    Each of these has its drawbacks: (a) leaves behind dead pages when the
    reorganization crashes, (b) might have to read lots of non-dead pages,
    (c) is inefficient if there are adjacent dead pages.
    > For each one, the page is
    > exclusive-locked, and rechecked if it's empty.
    The "dead"-flag guarantees emptiness, just assert it's still dead.
    (Can VACUUMs run concurrently? If yes, this algorithm has to be
    adjusted.) Also no need to lock the page now.
    > If it is, the parent is
    > also exclusive-locked (beware of deadlock here!) and also its left and
    > right siblings. In the parent, the item pointing to this page is
    > deleted; in the siblings, the side pointers are updated (high key
    > on the left sibling also?).
    Do one page at a time: Exclusively lock the parent page, remove the
    index tuple pointing to the dead page, unlock parent page. Lock one
    sibling, update pointer, unlock. Lock the other sibling, update
    pointer, unlock.
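
    In toy form, never holding more than one exclusive lock at a time
    (types and lock calls are stand-ins):

        /* Toy sketch: unlink a dead page, one page at a time. */
        typedef struct ToyBuf { int locked; } ToyBuf;

        static void lock_exclusive(ToyBuf *b) { b->locked = 1; }
        static void unlock_buf(ToyBuf *b)     { b->locked = 0; }

        static void
        toy_unlink_dead_page(ToyBuf *parent, ToyBuf *left, ToyBuf *right)
        {
            lock_exclusive(parent);
            /* remove the index tuple pointing at the dead page */
            unlock_buf(parent);

            lock_exclusive(left);
            /* make the left sibling's right-link skip the dead page */
            unlock_buf(left);

            lock_exclusive(right);
            /* make the right sibling's left-link skip the dead page */
            unlock_buf(right);
        }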

    Attention! Maybe there's a problem lurking regarding recent splits of
    the left sibling! Have to think more about this ...

    Now we have reached a state where the dead page will not be visited by
    any new index scan. There might still be danger from scans that were
    paused just before they were about to touch the dead page. Have to
    look at the code whether pins overlap when a scan is sent down from a
    parent page or left/right from a sibling ...
    > Then this page is added to the freelist.
    > On the "add-to-free-list" operation, the page is checked to see if it's
    > the last page of the relation. If it is, the page just before is
    > checked for emptiness (using the BTP_FREE flag) iteratively until a
    > nonfree page is found.
    Special handling for freelist pages required here.
    > All those pages are deleted from the freelist.
    You don't have to delete them explicitly. Just store the maximum free
    page number on the meta page and let the free page search ignore (and
    set to 0) all page numbers greater than this number. Making this work
    with newroot always extending the relation needs some thought ...
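
    Something along these lines, with 0 meaning an empty freelist slot
    as in the original proposal (names invented):

        /* Toy sketch: ignore (and zero out) freelist entries that
         * point beyond the truncation point, instead of deleting
         * them explicitly at truncation time. */
        #include <stdint.h>

        typedef uint32_t BlockNumber;

        static BlockNumber
        toy_freelist_take(BlockNumber *freelist, int nslots,
                          BlockNumber max_free)
        {
            for (int i = 0; i < nslots; i++)
            {
                BlockNumber blk = freelist[i];

                if (blk == 0)
                    continue;           /* empty slot */
                if (blk > max_free)
                {
                    freelist[i] = 0;    /* past the truncated tail */
                    continue;
                }
                freelist[i] = 0;        /* claim this block */
                return blk;
            }
            return 0;   /* nothing usable; caller extends the relation */
        }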

    > Then the relation is shrunk by that many pages.
    > [...]
    > To prevent deadlocks, the newroot operation
    > does not get a page from the freelist; it always extends the relation.
    > I think I've put too many things in one email. Sorry for this.
    Me too :-)

    Servus
    Manfred
