If people are happy with Tom's suggestion of using '*' instead of 'all'
in pg_hba.conf I will prepare a patch for it.

(I will also replace the ugly long IPv6 localhost netmask with a CIDR mask).

cheers

andrew


  • Peter Eisentraut at Dec 18, 2003 at 5:29 pm

    Andrew Dunstan wrote:
    If people are happy with Tom's suggestion of using '*' instead of
    'all' in pg_hba.conf I will prepare a patch for it.
    Well, while we're breaking stuff in the name of improvement, what about
    databases named "*" or databases with spaces in their names?
  • Andrew Dunstan at Dec 18, 2003 at 5:49 pm

    Peter Eisentraut wrote:
    Andrew Dunstan wrote:

    If people are happy with Tom's suggestion of using '*' instead of
    'all' in pg_hba.conf I will prepare a patch for it.
    Well, while we're breaking stuff in the name of improvement, what about
    databases named "*" or databases with spaces in their names?
    Good point. Perhaps we need to provide for an escaping mechanism in the
    routines that parse the file, although personally I have little sympathy
    for anyone who names a database '*'. I think it comes into the category
    of "Doctor, it hurts when I do this" ... "Then stop doing that." Spaces
    are a more likely problem, especially when we get W32 native users.

    cheers

    andrew
  • Andrew Dunstan at Dec 18, 2003 at 8:18 pm

    I wrote:

    Peter Eisentraut wrote:
    Andrew Dunstan wrote:

    If people are happy with Tom's suggestion of using '*' instead of
    'all' in pg_hba.conf I will prepare a patch for it.

    Well, while we're breaking stuff in the name of improvement, what
    about databases named "*" or databases with spaces in their names?
    Good point. Perhaps we need to provide for an escaping mechanism in
    the routines that parse the file, although personally I have little
    sympathy for anyone who names a database '*'. I think it comes into
    the category of "Doctor, it hurts when I do this" ... "Then stop doing
    that." Spaces are a more likely problem, especially when we get W32
    native users.

    Looking at the code I discover that there is already provision covering
    spaces etc., because you can quote names. It's even documented ;-)

    The minimal disturbance change might be to teach the parser to
    distinguish between a quoted 'all' and an unquoted 'all', and forget the
    '*' idea. Alternatively, do the same sort of thing, but replacing 'all'
    with '*'. A patch for the first would be quite tiny - similar for '*'
    except for extra doc and sample file changes.

    cheers

    andrew
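The "minimal disturbance" change above hinges on the parser tracking whether a token was quoted. A small illustrative sketch (not PostgreSQL's actual parser; function names here are invented) of how a tokenizer can distinguish a database literally named "all" from the unquoted keyword all:

```python
def tokenize(line):
    """Return a list of (token, was_quoted) pairs for one config-file line."""
    tokens, i, n = [], 0, len(line)
    while i < n:
        if line[i].isspace():
            i += 1
            continue
        if line[i] == '"':                 # quoted token: read to closing quote
            j = line.index('"', i + 1)
            tokens.append((line[i + 1:j], True))
            i = j + 1
        else:                              # bare token: read to next whitespace
            j = i
            while j < n and not line[j].isspace():
                j += 1
            tokens.append((line[i:j], False))
            i = j
    return tokens

def is_all_keyword(token, was_quoted):
    # Only an *unquoted* all keeps its magic meaning.
    return token == "all" and not was_quoted
```

With this, `"all"` in pg_hba.conf matches only a database actually named all, while a bare all stays the wildcard, which is the backwards-compatible behaviour Tom argues for below.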
  • Tom Lane at Dec 19, 2003 at 5:12 am

    Andrew Dunstan writes:
    The minimal disturbance change might be to teach the parser to
    distinguish between a quoted 'all' and an unquoted 'all', and forget the
    '*' idea.
    Probably we ought to go with that, on backwards-compatibility grounds.

    regards, tom lane
  • Marc G. Fournier at Dec 19, 2003 at 3:30 pm

    On Fri, 19 Dec 2003, Tom Lane wrote:

    Andrew Dunstan <andrew@dunslane.net> writes:
    The minimal disturbance change might be to teach the parser to
    distinguish between a quoted 'all' and an unquoted 'all', and forget the
    '*' idea.
    Probably we ought to go with that, on backwards-compatibility grounds.
    why not do both, but deprecate the use of all in the docs? say with an
    eventual goal of removing the use of all altogether in 2 releases?

    ----
    Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
    Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
  • Andrew Dunstan at Dec 19, 2003 at 3:42 pm

    Marc G. Fournier wrote:
    On Fri, 19 Dec 2003, Tom Lane wrote:


    Andrew Dunstan <andrew@dunslane.net> writes:

    The minimal disturbance change might be to teach the parser to
    distinguish between a quoted 'all' and an unquoted 'all', and forget the
    '*' idea.
    Probably we ought to go with that, on backwards-compatibility grounds.
    why not do both, but deprecate the use of all in the docs? say with an
    eventual goal of removing the use of all altogether in 2 releases?

    The extra value is fairly low IMNSHO if we can distinguish between a
    magical and a non-magical 'all' - see the patch I just submitted.

    cheers

    andrew
  • Tom Lane at Dec 19, 2003 at 5:12 pm

    Andrew Dunstan writes:
    Marc G. Fournier wrote:
    why not do both, but deprecate the use of all in the docs? say with an
    eventual goal of removing the use of all altogether in 2 releases?
    The extra value is fairly low IMNSHO if we can distinguish between a
    magical and a non-magical 'all' - see the patch I just submitted.
    Also, your point about the special sameuser and samegroup keywords is a
    good one. We should make those all work consistently (ie, quoting makes
    it not a keyword). Going in the "*" direction would only make sense
    if that were the only special case --- but it isn't. I don't think we
    want to start choosing random symbols for sameuser, samegroup, and other
    stuff we might think of in future.

    regards, tom lane
  • Andrew Dunstan at Dec 19, 2003 at 5:48 pm

    Tom Lane wrote:
    Andrew Dunstan <andrew@dunslane.net> writes:

    Marc G. Fournier wrote:

    why not do both, but deprecate the use of all in the docs? say with an
    eventual goal of removing the use of all altogether in 2 releases?
    The extra value is fairly low IMNSHO if we can distinguish between a
    magical and a non-magical 'all' - see the patch I just submitted.
    Also, your point about the special sameuser and samegroup keywords is a
    good one. We should make those all work consistently (ie, quoting makes
    it not a keyword). Going in the "*" direction would only make sense
    if that were the only special case --- but it isn't. I don't think we
    want to start choosing random symbols for sameuser, samegroup, and other
    stuff we might think of in future.

    Right. Revised patch sent to patches list.

    cheers

    andrew
  • Michael Brusser at Jan 16, 2004 at 5:03 pm
    Is there a way to force the database to load
    a frequently-accessed table into cache and keep it there?

    Thanks,
    Mike.
  • Reinoud van Leeuwen at Jan 16, 2004 at 5:21 pm

    On Fri, Jan 16, 2004 at 12:00:08PM -0500, Michael Brusser wrote:
    Is there a way to force database to load
    a frequently-accessed table into cache and keep it there?
    If it is frequently accessed, I guess it would be in the cache
    permanently....

    --
    __________________________________________________
    "Nothing is as subjective as reality"
    Reinoud van Leeuwen reinoud.v@n.leeuwen.net
    http://www.xs4all.nl/~reinoud
    __________________________________________________
  • Neil Conway at Jan 16, 2004 at 5:39 pm

    Michael Brusser writes:
    Is there a way to force database to load a frequently-accessed table
    into cache and keep it there?
    No.

    BTW, this idea has been suggested in the past, so check the archives
    for the prior discussions on this topic. The usual response is that
    the PostgreSQL bufmgr should already be placing hot pages into the cache,
    so there isn't really a need for this mechanism. (And if the bufmgr
    doesn't do this well enough, we should improve the bufmgr -- as Jan
    has done for 7.5)

    -Neil
  • Scott.marlowe at Jan 16, 2004 at 5:47 pm

    On Fri, 16 Jan 2004, Michael Brusser wrote:

    Is there a way to force database to load
    a frequently-accessed table into cache and keep it there?
    Nope. But there is a new cache buffer handler that may make it into 7.5
    that would make that happen automagically.
  • Simon Riggs at Jan 22, 2004 at 11:04 am
    This discussion seems likely to have a major effect on DBT-3 (DSS-type)
    performance from PostgreSQL...
    On Fri, 16 Jan 2004, Michael Brusser wrote:

    Is there a way to force database to load
    a frequently-accessed table into cache and keep it there?
    Scott Marlowe replied...

    Nope. But there is a new cache buffer handler that may make it into 7.5
    that would make that happen automagically.
    The important question here is "what forces blocks out of cache?" rather
    than thinking about how to directly keep them there.
    Jeroen T. Vermeulen wrote:
    Sent: Friday, January 16, 2004 23:02
    Subject: [HACKERS] Reverse scans?

    Would it be doable, and would it be useful, to try to alternate the
    directions of table and index scans every time each table/index was
    fully scanned?

    I was thinking that it could help cache performance at various levels
    in cases where data at the end of a large table, say, that remained in
    memory after a scan, would otherwise be flushed out by a new scan of the
    same table. If the next scan of the same table was to go in the other
    direction, any remains of the last time around that were still in the
    filesystem cache, buffer pool, hard disk cache etc. would stand a greater
    chance of being reused.
    Jeroen's idea is a good one when we consider the current behaviour,
    which is that large scans are placed into the block buffer cache, which
    then forces other data out. I would like to question the latter
    behaviour, so we can address the cause and not just the symptom.

    Earlier versions of Oracle had a parameter called something like
    SMALL_TABLE_THRESHOLD. Any scan of a table larger than that threshold
    was NOT placed into buffer cache, but was consumed directly by the
    shadow process (the backend). Teradata also uses a similar buffer
    control technique for large table scans.

    If a table is too large to fit into the buffer cache, it clearly wasn't
    going to be cached properly in the first place; Jeroen's idea only works
    well for tables near to the size of the cache. If the table is MUCH
    bigger then it will have very little gain. Good original thinking,
    though I'm not sure it's worth it.

    Oracle 9i now offers some variety for buffer cache management (as does
    DB2). You can specify at the tablespace and object level whether to use
    one of two buffer pools, KEEP or RECYCLE. The idea is to separate out
    the two types of blocks - ones that are there because they're well used
    and other blocks that are there at the moment, but unlikely to stay.

    My suggestion would be to:
    - split the buffer cache into two, just as Oracle does: KEEP & RECYCLE.
    This could default to KEEP=66% of total memory available, but could also
    be settable by init parameter.
    [changes to the memory management routines]
    - if we do a scan on a table whose size in blocks is more than some
    fraction (25%?) of KEEP bufferpool then we place the blocks into RECYCLE
    bufferpool. This can be decided immediately following optimization,
    rather than including it within the optimizer decision process since we
    aren't going to change the way the statement executes, we're just going
    to stop it from having an adverse effect on other current or future
    statements.
    [additional test to set parameter, then work out where to note it]

    Notice that I haven't suggested that the KEEP/RECYCLE option could be
    specified at table level. That optionality sounds like a lot of extra
    work, when what is needed is the automatic avoidance of cache-spoiling
    behaviour. (This would still mean that very large indexes with random
    request patterns would spoil the cache... maybe implement that later?)

    This would remove most reasons for spoiling the cache and blocks would
    then leave the cache only when they were genuinely no longer wanted.

    Any comments?? Takers?
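Simon's two-pool rule could be sketched as follows. The pool names and the 25% fraction come from the proposal above; everything else (the function name, block-count arguments) is hypothetical, since PostgreSQL has no KEEP/RECYCLE pools:

```python
KEEP_FRACTION = 0.25   # proposed threshold: 25% of the KEEP pool

def choose_pool(table_blocks, keep_pool_blocks):
    """Decide, immediately following optimization, which buffer pool a
    scan's blocks should go to. Illustrative sketch only; the pool names
    follow Oracle 9i as described above."""
    if table_blocks > KEEP_FRACTION * keep_pool_blocks:
        return "RECYCLE"   # large scan: keep it from evicting hot blocks
    return "KEEP"          # small table: allow normal caching
```

The point of deciding after planning is that the choice never changes how the statement executes, only where its blocks land.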
  • Jan Wieck at Jan 22, 2004 at 3:22 pm
    Simon,

    have you read src/backend/storage/buffer/README of current CVS tip?

    The algorithm in the new replacement strategy is an attempt to figure
    that SMALL_TABLE_THRESHOLD automatically. Do you see anything that can
    be improved in that algorithm?


    Jan

    Simon Riggs wrote:
    ...



    --
    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me. #
    #================================================== JanWieck@Yahoo.com #
  • Simon Riggs at Jan 23, 2004 at 11:09 am

    Jan Wieck wrote:

    have you read src/backend/storage/buffer/README of current CVS tip?

    The algorithm in the new replacement strategy is an attempt to figure
    that SMALL_TABLE_THRESHOLD automatically. Do you see anything that can
    be improved in that algorithm?
    Jan,

    I've read src/backend/storage/buffer/README rev 1.6 as you suggest. The
    new algorithm looks great - many thanks for implementing that.

    I'm not able to improve on this for the general case - I especially like
    the automatic management that it gives, allowing you to avoid additional
    DBA set parameters (and the coding to add these option
    settings/keywords).

    My concern was for DBT-3 performance and general Decision Support (DSS)
    workloads, where a large proportion of table scans occur (though not in
    the DBT-3 single-threaded test). The new strategy is much better than the older
    one and is likely to have a positive effect in this area. I don't think,
    right now, that anything further should be changed, in the interests of
    stability.

    For the record/for the future: My observation was that two commercial
    databases focused on DSS use a strategy which in terms of the new ARC
    implementation is effectively: "place blocks in T1 (RECENCY/RECYCLE
    buffer) and NEVER promote them to T2 (FREQUENCY/KEEP buffer)" when they
    do large object scans.

    In the new README, you note that:
    StrategyHintVacuum(bool vacuum_active)

    Because vacuum reads all relations of the entire database
    through the buffer manager, it can greatly disturb the
    buffer replacement strategy. This function is used by vacuum
    to inform that all subsequent buffer lookups are caused
    by vacuum scanning relations.
    ...I would say that scans of very large tables also "greatly disturb the
    buffer replacement strategy", i.e. have exactly the same effect on the
    cache as the Vacuum utility.

    You'd clearly thought of the idea before me, though with regard to
    Vacuum.

    If we know ahead of time that a large scan is going to have this effect,
    why wait for the ARC to play its course, why not take exactly the same
    action?
    Have large scans call StrategyHint also. (Maybe rename it...?)...of
    course, some extra code to establish it IS a large scan...
    ...large table lookup should wait until a shared catalog cache is
    implemented

    Anyway, this idea can wait at least until we have extensive performance
    tuning on DBT-3 with 7.5. Thanks again for adding the new algorithm.

    Best Regards, Simon

  • Jan Wieck at Jan 23, 2004 at 1:11 pm
    Simon,

    thanks for the time to give this further thought.


    Simon Riggs wrote:
    If we know ahead of time that a large scan is going to have this effect,
    why wait for the ARC to play its course, why not take exactly the same
    action?
    Have large scans call StrategyHint also. (Maybe rename it...?)...of
    course, some extra code to establish it IS a large scan...
    ...large table lookup should wait until a shared catalog cache is
    implemented
    The problem with this is a) how to detect that something will be a large
    scan, and b) how to decide what is a large scan in the first place.

    Large sequential scans in warehousing are often part of more complex
    join operations. And just because something returns a large number of
    result rows doesn't mean that the input data was that much.

    As for the definition of "large" itself, this depends on the size of the
    buffer cache and the access pattern of the application. As you surely
    have noticed, the usual sizes are B1+T1 = T2+B2 = C in the algorithm.
    Buffers evicted from T1 are remembered in B1, and because of that even
    repeated sequential scans of the same large relation will only cycle
    through T1 blocks, never cause any turbulence in T2 or B2.

    The only thing that will affect T2 and B2 dramatically by adjusting the
    cache split point is multiple scanning of more than one significantly
    large but smaller-than-C table. Scanning the same large but
    smaller-than-C table over and over will have it in T2 after the second
    scan, where it belongs. But having two tables A and B that are both just
    smaller than C and having an access pattern like A, A, B, B, A, A, ...
    will cause many B1 hits and thereby increase the target T1 size. And it
    must be exactly that access pattern, because A, A, A, B, B, B, A, A,
    A, ... produces a complete MISS on the first, a B1 hit on the second and
    a B2 hit on the third scan, so it will move the split point up and down
    evenly.

    Honestly, I don't even know what type of application could possibly
    produce such a screwed access pattern. And I am absolutely confident one
    can find corner cases to bring down Oracle's complicated configuration
    harness more easily.
    Anyway, this idea can wait at least until we have extensive performance
    tuning on DBT-3 with 7.5. Thanks again for adding the new algorithm.
    Everyone is always welcome to try and show that something can be
    improved. And we are in the middle of the 7.5 development cycle, so feel
    free to hack around.


    Jan

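Jan's claim that repeated sequential scans of a large relation only cycle through T1 can be checked with a toy model. This is a deliberately simplified sketch of an ARC-style directory (the real algorithm in the README also adapts the T1 target size via B1/B2 hits, which is omitted here, and the class name is invented):

```python
from collections import OrderedDict

class ToyARC:
    """Simplified ARC-style cache: T1 (seen once), T2 (seen again while
    tracked), ghost list B1. The adaptive split-point logic of the real
    algorithm is deliberately left out."""

    def __init__(self, c):
        self.c = c
        self.t1 = OrderedDict()   # resident, referenced once
        self.t2 = OrderedDict()   # resident, referenced repeatedly
        self.b1 = OrderedDict()   # ghosts: recently evicted from T1

    def access(self, blk):
        if blk in self.t2:
            self.t2.move_to_end(blk)     # T2 hit: move to MRU end
            return "T2"
        if blk in self.t1 or blk in self.b1:
            self.t1.pop(blk, None)       # second reference (or ghost hit):
            self.b1.pop(blk, None)       # promote into T2
            self.t2[blk] = True
            self._trim()
            return "promote"
        self.t1[blk] = True              # cold miss
        self._trim()
        return "miss"

    def _trim(self):
        # Keep residents within c blocks; T1 victims leave a ghost in B1.
        while len(self.t1) + len(self.t2) > self.c:
            if self.t1:
                victim, _ = self.t1.popitem(last=False)
                self.b1[victim] = True
            else:
                self.t2.popitem(last=False)
        while len(self.b1) > self.c:     # B1 is bounded too
            self.b1.popitem(last=False)
```

Running a relation much larger than the cache through this model repeatedly never promotes anything: by the time a block comes round again, both its T1 entry and its B1 ghost have already been recycled, so T2 is untouched, exactly as described above.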
  • Simon Riggs at Jan 26, 2004 at 12:55 pm
    Jan,

    Happy to continue the discussion... though without changing my suggestion
    that we defer any further, more specialised improvements for now.
    Jan Wieck replied to...
    Simon Riggs wrote:
    If we know ahead of time that a large scan is going to have this
    effect,
    why wait for the ARC to play its course, why not take exactly the
    same
    action?
    Have large scans call StrategyHint also. (Maybe rename it...?)...of
    course, some extra code to establish it IS a large scan...
    ...large table lookup should wait until a shared catalog cache is
    implemented
    The problem with this is a) how to detect that something will be a large
    scan, and b) how to decide what is a large scan in the first place.
    My thoughts are that we know immediately prior to execution whether or
    not a plan calls for a full table scan (FTS). We also know the table and
    therefore its size. A large table in this context is one that would
    disrupt the cache if it made it onto T2. We can discuss an appropriate
    and usefully simple rule, perhaps sizeoftable(T) > 2*C???
    Large sequential scans in warehousing are often part of more complex
    join operations.
    Yes, I agree. PostgreSQL is particularly prone to this currently,
    because of the high number of plans that resolve to FTS. Complexity of
    plan shouldn't affect the basic situation that we are reading all the
    blocks of a table, putting them sequentially into T1 and then
    working on them. Plan complexity may increase the time that a T1 block
    stays in memory, with a subsequent increase in the probability of
    promotion to T2.
    And just because something returns a large number of
    result rows doesn't mean that the input data was that much.
    I agree also that overall execution time may be unrelated to whether a
    "large" table is involved. The number of output rows shouldn't have any
    effect on input rows and thus data blocks that need to be cached.

    (Jan gives a detailed analysis...ending with)
    Honestly, I don't even know what type of application could possibly
    produce such a screwed access pattern. And I am absolutely confident one
    can find corner cases to bring down Oracle's complicated configuration
    harness more easily.
    I agree with everything you say. The algorithm copes well with almost
    every sequential pattern of access and there is significant benefit from
    ignoring the very very very rare cases that might give it problems.

    My thoughts are about multiple concurrent accesses, specifically FTS on
    large tables, rather than sequential ones.
    Buffers evicted from T1 are remembered in B1, and because of that even
    repeated sequential scans of the same large relation will only cycle
    through T1 blocks, never cause any turbulence in T2 or B2.
    If we have a situation where a single backend makes repeated scans of
    the same table, these will be sequential and will have no effect on T2.

    In a DW situation, you are likely to have one or more very popular large
    tables (maybe think of this as the "Fact table", if you have a
    dimensional design). The tables are large and therefore query execution
    times will be extended (and accepted by user). In this situation it is
    very likely that: i) a single user/app submits multiple requests from
    other windows/threads
    Or simply,
    ii) multiple users access the popular table

    The common effect will be concurrent, rather than sequential, access to
    the popular table. Different SQL statements will have different plans
    and will perform scans of the same table at different rates because of
    other joins, more complex WHERE clauses etc. Like waves at a beach
    moving at different rates. Every time one scan catches up with another,
    it will cause T1 hits for almost the whole of the T1 list, promoting all
    of these blocks to the top of the T2 MRU and thus spoiling the cache -
    if it hits one it will probably hit most of them. This will not happen
    ALL the time, but I don't want it to happen EVER. Even in DW situation,
    I still want to be inserting data regularly (that's how the table got
    big!), so I have index blocks and other stuff that I want almost
    permanently in memory. Concurrent access via an index might have the
    same effect, though less dramatically.

    The closer the size of a table is to C, the greater the likelihood that
    these spurious cache hits will occur. (Of course, it might be argued
    that these are worthwhile and genuine cache hits - I argue that they are
    not wanted and this is the main basis of my discussion). Of course, if a
    table does fit in memory then that is very good. If a table was, say,
    2*C, then spurious cache hits will occur often and spoil the whole of
    T2.

    The DBT-3 workload is very similar to TPC-D/TPC-H workload. The test
    consists of a power test (all queries sequentially in random order) and
    a throughput test (1 or more concurrent streams, each stream executing
    all queries in a random order). When this benchmark first came out most
    vendors chose to perform the throughput test with only 1 stream (though
    with parallel processing)...I would say one reason for this is poor
    cache management...hence recent changes in various commercial products.

    In summary, I believe there is a reasonably common effect in DW
    situations where concurrent query access to large and popular tables
    will result in undesirable cache spoiling. This effect will still occur
    even after the ARC improvements are introduced - though in every other
    case I can think of, the ARC code is a major improvement on earlier
    strategies and should be hailed as a major improvement in automatic
    performance adaptation.

    There are two solution ideas:
    i) change the code so that FTS on large tables use the "no cache"
    strategy that has already been developed to support Vacuum.
    ii) more complex: synchronise the FTS of the large table so that all
    backends that want scans produce only one set of I/Os and they share the
    block many times (yet still don't put it in cache!). FTS don't start at
    "the beginning" every time, they start wherever a current scan has got
    to, then loop back round at end (so average of two concurrent scans is
    1.5 times as much I/O as a single FTS - a 25% saving on I/O). A more
    detailed explanation may be required - this technique is in commercial
    use within the Teradata rdbms. Implementing it would take some doing...

    Best Regards

    Simon
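Idea (ii), the Teradata-style shared scan, can be sketched like this; the coordinator object and its interface are purely illustrative, not a proposed implementation:

```python
class ScanCoordinator:
    """Toy model of synchronised full table scans: a new scan starts at
    the block an in-progress scan has reached, then wraps around, so
    concurrent scans share most of their reads. Illustrative only."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.position = 0        # last block any scan reported reading

    def start_block(self):
        return self.position     # new scans join at the current position

    def report(self, blockno):
        self.position = blockno

def full_scan(coord, read_block):
    start = coord.start_block()
    for i in range(coord.nblocks):
        blk = (start + i) % coord.nblocks   # loop back round at the end
        read_block(blk)
        coord.report(blk)
```

Because a joining scan reads the same blocks as the scan already in flight, most of its reads hit blocks still in memory; only the wrap-around portion needs fresh I/O, which is where the "two concurrent scans cost roughly 1.5 scans of I/O" figure above comes from.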
  • Jan Wieck at Jan 26, 2004 at 2:38 pm

    Simon Riggs wrote:
    Jan, [...]
    My thoughts are about multiple concurrent accesses, specifically FTS on
    large tables, rather than sequential ones.
    Single or multiple backends is irrelevant here because a data block only
    exists once, and therefore we have only one shared buffer cache.
    Buffers evicted from T1 are remembered in B1, and because of that even
    repeated sequential scans of the same large relation will only cycle
    through T1 blocks, never cause any turbulence in T2 or B2.
    If we have a situation where a single backend makes repeated scans of
    the same table, these will be sequential and will have no effect on T2.
    You really have to look at this a bit more global, not table related.
    The strategy of ARC is this:

    In an unknown access pattern, if a specific block is accessed less
    frequently than every C requests, then it will only go into T1, age, get
    evicted, and its CDB moves to B1, from which it will get removed and be
    forgotten. Every block that is accessed more frequently than every C
    requests will, after its last access, be in one of the four queues of
    the directory and immediately go into T2.

    The adjustment of the target T1 size is an attempt to catch as many
    newcomers as possible. If an application does many inserts, it will
    access new blocks very soon again, so that a small T1 is sufficient to
    hold them in memory until their next access, where they move into T2.
    For an application that does non-uniform random access to blocks (there
    are always bestsellers and less frequently asked items), a larger T1
    might better satisfy that access pattern.
    In a DW situation, you are likely to have one or more very popular large
    tables (maybe think of this as the "Fact table", if you have a
    dimensional design). The tables are large and therefore query execution
    times will be extended (and accepted by user). In this situation it is
    very likely that: i) a single user/app submits multiple requests from
    other windows/threads
    Or simply,
    ii) multiple users access the popular table
    If that causes its blocks to be requested more frequently than every
    C lookups, they belong in T2.
    The common effect will be concurrent, rather than sequential, access to
    the popular table. Different SQL statements will have different plans
    and will perform scans of the same table at different rates because of
    other joins, more complex WHERE clauses etc. Like waves at a beach
    moving at different rates. Every time one scan catches up with another,
    it will cause T1 hits for almost the whole of the T1 list, promoting all
    of these blocks to the top of the T2 MRU and thus spoiling the cache -
    if it hits one it will probably hit most of them. This will not happen
    ALL the time, but I don't want it to happen EVER. Even in a DW situation,
    I still want to be inserting data regularly (that's how the table got
    big!), so I have index blocks and other stuff that I want almost
    permanently in memory. Concurrent access via an index might have the
    same effect, though less dramatically.

    The closer the size of a table is to C, the greater the likelihood that
    these spurious cache hits will occur. (Of course, it might be argued
    that these are worthwhile and genuine cache hits - I argue that they are
    not wanted, and this is the main basis of my discussion.) Of course, if a
    table does fit in memory then that is very good. If a table were, say,
    2*C, then spurious cache hits would occur often and spoil the whole of
    T2.
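    The effect being argued here can be made concrete with a toy access
    stream (an illustrative sketch, not PostgreSQL code): interleave two
    scans of a table of size 2*C, the second trailing the first by a small
    lag, and every block of the table is re-requested well within C
    requests - exactly the condition under which ARC promotes it into T2.

    ```python
    C = 100                # cache size, in blocks
    N = 2 * C              # popular table, twice the cache size
    LAG = 20               # scan B trails scan A by 20 blocks

    # Build the interleaved request stream of the two concurrent scans.
    stream = []
    for i in range(N + LAG):
        if i < N:
            stream.append(i)            # scan A reads block i
        if i - LAG >= 0:
            stream.append(i - LAG)      # scan B re-reads the same block shortly after

    # Measure the gap (in requests) between the two accesses to each block.
    gaps, last = {}, {}
    for pos, blk in enumerate(stream):
        if blk in last:
            gaps[blk] = pos - last[blk]
        last[blk] = pos

    # Every one of the table's 2*C blocks is requested twice within at most
    # 2*LAG+1 requests - far more often than "once per C requests" - so each
    # scores a T1 hit and would be promoted into T2, spoiling the cache.
    assert len(gaps) == N
    assert max(gaps.values()) <= 2 * LAG + 1
    ```
    
    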
    How can any generic algorithm ever sense that when the application is
    accessing the same blocks multiple times, it should NOT cache them? Are
    you asking for finely granulated tuning of cache priorities and
    behaviour on a per-table basis?
    The DBT-3 workload is very similar to TPC-D/TPC-H workload. The test
    consists of a power test (all queries sequentially in random order) and
    a throughput test (1 or more concurrent streams, each stream executing
    all queries in a random order). When this benchmark first came out most
    vendors chose to perform the throughput test with only 1 stream (though
    with parallel processing)...I would say one reason for this is poor
    cache management...hence recent changes in various commercial products.

    In summary, I believe there is a reasonably common effect in DW
    situations where concurrent query access to large and popular tables
    will result in undesirable cache spoiling. This effect will still occur
    even after the ARC improvements are introduced - though in every other
    case I can think of, the ARC code is a major improvement on earlier
    strategies and should be hailed as a major improvement in automatic
    performance adaptation.

    There are two solution ideas:
    i) change the code so that full table scans (FTS) on large tables use
    the "no cache" strategy that has already been developed to support
    Vacuum.
    ii) more complex: synchronise the FTS of the large table so that all
    backends that want scans produce only one set of I/Os and they share the
    block many times (yet still don't put it in cache!). FTS don't start at
    "the beginning" every time, they start wherever a current scan has got
    to, then loop back round at end (so average of two concurrent scans is
    1.5 times as much I/O as a single FTS - a 25% saving on I/O). A more
    detailed explanation may be required - this technique is in commercial
    use within the Teradata rdbms. Implementing it would take some doing...
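    The claimed saving is easy to check with a toy I/O count (a sketch of
    the Teradata-style behaviour described above, not an implementation):
    scan B joins at whatever block scan A has reached, piggybacks on A's
    reads up to the end of the table at no extra I/O cost, then wraps
    around and reads the skipped leading portion alone.

    ```python
    N = 1000          # blocks in the table

    def synced_scan_ios(start_offset):
        """Total physical reads for two synchronized scans, where scan B
        joins while scan A is at block `start_offset`."""
        ios_a = N                 # A reads the whole table once
        ios_b = start_offset      # B's wrap-around portion is the only unshared I/O
        return ios_a + ios_b

    # Averaged over a uniformly random join point, two concurrent scans cost
    # about 1.5 * N physical reads instead of 2 * N - the 25% saving claimed.
    avg = sum(synced_scan_ios(k) for k in range(N)) / N
    assert abs(avg - 1.5 * N) < 1     # mean join offset is ~N/2
    ```
    
    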
    What will the configuration of all that look like? You are using several
    business terms a human brain can imagine to describe various access
    patterns you want to be treated specially. In the whole system catalog
    and all the way down to the buffer cache, we only have some file and
    block number, maybe the size of it too but that's not guaranteed (think
    of blind writes by a backend of another DB). So how do we express what
    you want in some algorithm that we can put into the strategy?


    Jan

    --
    #======================================================================#
    # It's easier to get forgiveness for being wrong than for being right. #
    # Let's break this rule - forgive me. #
    #================================================== JanWieck@Yahoo.com #
  • Simon Riggs at Jan 26, 2004 at 4:22 pm
    Jan,

    I think we should suspend further discussion for now...in summary:

    ARC Buffer management is an important new performance feature for 7.5;
    the implementation is a good one and should have positive benefit for
    everybody's workload. ARC will adapt to a variety of situations and has
    been designed to allow Vacuum to avoid interfering with user
    applications.

    That's the important bit: The implementation notes are detailed; I've
    read them a few times to ensure I've got it straight. I am confident
    that the situation I described CAN exist with regard to multiple
    concurrent queries performing full table scans upon a single large
    table. Further debate on that point is continuing because of my poor
    explanation of that situation; forgive me. Thanks very much for your
    further explanations and examples.

    I will take a more practical tack on this now: providing evidence of a
    real query mix that exhibits the described properties and quantifying
    the effects and their frequency. If it IS worth it, and I accept that it
    may not be, I'll have a hack at the very specialised improvement I was
    suggesting, for very specific workload types.

    Best Regards

    Simon Riggs
  • Tom Lane at Jan 22, 2004 at 3:41 pm

    "Simon Riggs" <simon@2ndquadrant.com> writes:
    Any comments?? Takers?
    Um, did you read the discussion of the ARC buffer management algorithm
    that's already been implemented for 7.5?

    The main objection I have to doing it Oracle's way is that that creates
    more parameters that DBAs have to get right for reasonable performance.
    Self-tuning algorithms are better, when available.

    regards, tom lane
  • Simon Riggs at Jan 23, 2004 at 11:09 am

    Jan Wieck wrote:

    have you read src/backend/storage/buffer/README of current CVS tip?
    Tom Lane wrote:

    Um, did you read the discussion of the ARC buffer management algorithm
    that's already been implemented for 7.5?
    Tom, Jan,

    No, I hadn't read this. Thank you both for your time and trouble to
    point this out for me, which I was not aware of.

    My understanding, possibly faulty, was that if work is completed, it
    appears on the TODO list with a dash in front of it. The new cache
    management strategy isn't mentioned there, so I was not aware that any
    work was completed (or even in progress). No finger pointing, just an
    observation of how the development process works...

    If the TODO-list-with-dash isn't the correct place to have looked, is
    there another list of committed changes for the next release? The latest
    README in CVS doesn't have a list of "what's new in 7.5" or similar.

    Do we need such a list? (I'd be happy to compile and maintain this if it
    agreed that it is a good idea to have such a document or process as
    separate from TODO - I'll be doing this anyway before I pass further
    comments!)

    Regards, Simon
  • Tom Lane at Jan 23, 2004 at 3:13 pm

    "Simon Riggs" <simon@2ndquadrant.com> writes:
    If the TODO-list-with-dash isn't the correct place to have looked, is
    there another list of committed changes for the next release?
    We tend to rely on the CVS commit logs as the definitive source. You
    can pull the info from the CVS server (I use cvs2cl.pl to format the
    results nicely), or read the archives of pgsql-committers.

    In theory there should be a section at the head of release.sgml
    mentioning the major changes done-so-far, but for various reasons this
    hasn't gotten installed in the 7.5 branch yet. (Look at the CVS
    versions during 7.4 development to see how we did it last time.)

    As far as the ARC change goes, I believe Jan still considers it a
    work-in-progress, so it may not be appropriate to list yet anyway.
    (Jan, where are you on that exactly?)
    Do we need such a list? (I'd be happy to compile and maintain this if it
    agreed that it is a good idea to have such a document or process as
    separate from TODO - I'll be doing this anyway before I pass further
    comments!)
    If you wanted to go through the existing 7.5 commits and write up a
    new done-so-far section, it'd save someone else (like me or Bruce)
    from having to do it sometime soon ...

    regards, tom lane
  • Neil Conway at Jan 23, 2004 at 8:21 pm

    Tom Lane writes:
    In theory there should be a section at the head of release.sgml
    mentioning the major changes done-so-far, but for various reasons
    this hasn't gotten installed in the 7.5 branch yet. (Look at the
    CVS versions during 7.4 development to see how we did it last time.)
    Well, keep in mind we didn't do it very effectively in 7.4 :-) The
    vast majority of changes weren't recorded there, and the ones that
    were had to be fleshed out quite a lot in the actual release notes.

    The last time that someone (Peter and myself, IIRC) suggested that we
    really incrementally maintain the release notes during the development
    cycle, Bruce said that he personally finds it more comfortable to
    summarize the CVS changelogs all at once shortly before we release the
    first beta. AFAIR that's where the discussion ended.

    -Neil
  • Tom Lane at Jan 23, 2004 at 8:40 pm

    Neil Conway writes:
    Tom Lane <tgl@sss.pgh.pa.us> writes:
    In theory there should be a section at the head of release.sgml
    mentioning the major changes done-so-far, but for various reasons
    this hasn't gotten installed in the 7.5 branch yet. (Look at the
    CVS versions during 7.4 development to see how we did it last time.)
    Well, keep in mind we didn't do it very effectively in 7.4 :-) The
    vast majority of changes weren't recorded there, and the ones that
    were had to be fleshed out quite a lot in the actual release notes.
    The last time that someone (Peter and myself, IIRC) suggested that we
    really incrementally maintain the release notes during the development
    cycle, Bruce said that he personally finds it more comfortable to
    summarize the CVS changelogs all at once shortly before we release the
    first beta. AFAIR that's where the discussion ended.
    It's fine with me if Bruce prefers to build the release notes directly
    from the change logs. As I saw it, the purpose of the temporary list of
    things-done-so-far is not to be the raw material for the release notes.
    It's to let alpha testers know about major changes that they might want
    to test. As such, it's fine that it's incomplete.

    The other way we could handle this goal is to be a tad more vigorous about
    checking off items as "done" in the TODO list. However, Bruce generally
    doesn't bother to make a new entry in the TODO list if someone does
    something that wasn't in the list to begin with, and so I'm not sure
    it's the right vehicle.

    regards, tom lane
  • Simon Riggs at Jan 26, 2004 at 12:55 pm
    OK, I will attempt to draw together this information as currently
    stands. If this makes any sense, we can discuss what the
    requirement/process is for regular maintenance (daily/weekly/monthly
    etc).

    Understood to mean "changes in next release (current progress)" - items
    that have been completed/committed since last release, for the purpose
    of informing developers/testers what's new PRIOR to full release.

    Leaving unobstructed the functions of
    - TODO list - a combined list of desired work items (Bruce)
    - Release Notes - final list of features of a release (Bruce)

    This should help alpha testing, which should allow more control of what
    actually does get released (and therefore what the contents of Release
    Notes should be)

    Best Regards, Simon
    -----Original Message-----
    From: Tom Lane
    Sent: Friday, January 23, 2004 20:40
    To: Neil Conway
    Cc: simon@2ndquadrant.com; 'Jan Wieck'; 'Postgresql Hackers'
    Subject: Re: 7.5 change documentation

    Neil Conway <neilc@samurai.com> writes:
    Tom Lane <tgl@sss.pgh.pa.us> writes:
    In theory there should be a section at the head of release.sgml
    mentioning the major changes done-so-far, but for various reasons
    this hasn't gotten installed in the 7.5 branch yet. (Look at the
    CVS versions during 7.4 development to see how we did it last
    time.)
    Well, keep in mind we didn't do it very effectively in 7.4 :-) The
    vast majority of changes weren't recorded there, and the ones that
    were had to be fleshed out quite a lot in the actual release notes.
    The last time that someone (Peter and myself, IIRC) suggested that we
    really incrementally maintain the release notes during the development
    cycle, Bruce said that he personally finds it more comfortable to
    summarize the CVS changelogs all at once shortly before we release the
    first beta. AFAIR that's where the discussion ended.
    It's fine with me if Bruce prefers to build the release notes directly
    from the change logs. As I saw it, the purpose of the temporary list of
    things-done-so-far is not to be the raw material for the release notes.
    It's to let alpha testers know about major changes that they might want
    to test. As such, it's fine that it's incomplete.

    The other way we could handle this goal is to be a tad more vigorous about
    checking off items as "done" in the TODO list. However, Bruce generally
    doesn't bother to make a new entry in the TODO list if someone does
    something that wasn't in the list to begin with, and so I'm not sure
    it's the right vehicle.

    regards, tom lane
  • Bruce Momjian at Jan 26, 2004 at 5:53 pm

    Tom Lane wrote:
    Neil Conway <neilc@samurai.com> writes:
    It's fine with me if Bruce prefers to build the release notes directly
    from the change logs. As I saw it, the purpose of the temporary list of
    things-done-so-far is not to be the raw material for the release notes.
    It's to let alpha testers know about major changes that they might want
    to test. As such, it's fine that it's incomplete.

    The other way we could handle this goal is to be a tad more vigorous about
    checking off items as "done" in the TODO list. However, Bruce generally
    doesn't bother to make a new entry in the TODO list if someone does
    something that wasn't in the list to begin with, and so I'm not sure
    it's the right vehicle.
    Right. I see TODO as a way for us to remember our limitations, and to
    document them for our users. Once an item is completed, it didn't seem
    necessary to put it on the TODO list.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 359-1001
    + If your life is a hard drive, | 13 Roberts Road
    + Christ can be your backup. | Newtown Square, Pennsylvania 19073
  • Bruce Momjian at Jan 26, 2004 at 5:28 pm

    Tom Lane wrote:
    "Simon Riggs" <simon@2ndquadrant.com> writes:
    If the TODO-list-with-dash isn't the correct place to have looked, is
    there another list of committed changes for the next release?
    We tend to rely on the CVS commit logs as the definitive source. You
    can pull the info from the CVS server (I use cvs2cl.pl to format the
    results nicely), or read the archives of pgsql-committers.

    In theory there should be a section at the head of release.sgml
    mentioning the major changes done-so-far, but for various reasons this
    hasn't gotten installed in the 7.5 branch yet. (Look at the CVS
    versions during 7.4 development to see how we did it last time.)

    As far as the ARC change goes, I believe Jan still considers it a
    work-in-progress, so it may not be appropriate to list yet anyway.
    (Jan, where are you on that exactly?)
    Do we need such a list? (I'd be happy to compile and maintain this if it
    agreed that it is a good idea to have such a document or process as
    separate from TODO - I'll be doing this anyway before I pass further
    comments!)
    If you wanted to go through the existing 7.5 commits and write up a
    new done-so-far section, it'd save someone else (like me or Bruce)
    from having to do it sometime soon ...
    Doesn't Robert Treat's News Bits list all the major changes weekly?
    That would be a good source.

    --
    Bruce Momjian | http://candle.pha.pa.us
    pgman@candle.pha.pa.us | (610) 359-1001
    + If your life is a hard drive, | 13 Roberts Road
    + Christ can be your backup. | Newtown Square, Pennsylvania 19073
  • Simon Riggs at Jan 28, 2004 at 12:31 am

    Bruce Momjian wrote
    Tom Lane wrote:
    "Simon Riggs" <simon@2ndquadrant.com> writes:
    If the TODO-list-with-dash isn't the correct place to have looked, is
    there another list of committed changes for the next release?
    We tend to rely on the CVS commit logs as the definitive source. You
    can pull the info from the CVS server (I use cvs2cl.pl to format the
    results nicely), or read the archives of pgsql-committers.

    In theory there should be a section at the head of release.sgml
    mentioning the major changes done-so-far, but for various reasons this
    hasn't gotten installed in the 7.5 branch yet. (Look at the CVS
    versions during 7.4 development to see how we did it last time.)

    As far as the ARC change goes, I believe Jan still considers it a
    work-in-progress, so it may not be appropriate to list yet anyway.
    (Jan, where are you on that exactly?)
    Do we need such a list? (I'd be happy to compile and maintain this if
    it agreed that it is a good idea to have such a document or process as
    separate from TODO - I'll be doing this anyway before I pass further
    comments!)
    If you wanted to go through the existing 7.5 commits and write up a
    new done-so-far section, it'd save someone else (like me or Bruce)
    from having to do it sometime soon ...
    Doesn't Robert Treat's News Bits list all the major changes weekly?
    That would b e a good source.
    Bruce - The excellent work that both you and Robert do is a slightly
    different view from what I had in mind - I agree they are all aspects of
    the same information. I'm posting a first output of this now, so we can
    discuss whether such a thing is useful, and/or whether it can ever be
    all of useful/accurate/timely.

    I'll happily add this to the HEAD of release.sgml, though let's agree
    the content/direction first, before I spend time on a more formal
    publication mechanism.

    Best Regards, Simon Riggs
