Anyone have ideas why an index with blevel 3 is averaging 13 LIOs per
lookup/row? I'm working on the query below, which ran for about 10
hours - the majority of which was spent doing 13 logical IOs per row in
this index. (BT_TWO_VARCHAR2_INDEX below... 121M buffers / 8792K starts
= 13 IOs per start) Performance on this query has been degrading
rapidly over the past month or two.

-Jeremy

PS... 8k block size; segment of index in question is 100G. Query is
pulling 7 million rows from a join of two billion row tables...
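
(For reference, the 13-per-start figure comes from the per-operation rowsource
statistics shown below; a query along these lines - assuming the cursor is
still in the library cache - pulls the same per-operation numbers:)

select id, operation, options, object_name,
       starts, cr_buffer_gets,
       round(cr_buffer_gets / nullif(starts, 0)) as gets_per_start
from   v$sql_plan_statistics_all
where  sql_id = '8suhywrkmpj5c'
and    child_number = 0
order  by id;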



SQL> select * from
table(dbms_xplan.display_cursor('8suhywrkmpj5c',null,'ALLSTATS'));

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------
SQL_ID 8suhywrkmpj5c, child number 0
-------------------------------------
SELECT LOTS_OF_FIELDS...

Plan hash value: 4164942971

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                  | Starts | E-Rows | A-Rows |   A-Time    | Buffers | Reads |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                       |      1 |        |  8792K | 09:54:06.81 |    142M | 8552K |
|   1 |  NESTED LOOPS                  |                       |      1 |     65 |  8792K | 09:54:06.81 |    142M | 8552K |
|   2 |   PARTITION RANGE SINGLE       |                       |      1 |     65 |  8792K | 00:26:50.23 |     11M |  367K |
|   3 |    TABLE ACCESS BY INDEX ROWID | BIG_TABLE_ONE         |      1 |     65 |  8792K | 00:26:44.48 |     11M |  367K |
|*  4 |     INDEX RANGE SCAN           | BT_ONE_DATE_INDEX     |      1 |     63 |  8792K | 00:16:39.09 |   2637K |  288K |
|   5 |   TABLE ACCESS BY INDEX ROWID  | BIG_TABLE_TWO         |  8792K |      1 |  8792K | 09:26:57.84 |    130M | 8185K |
|*  6 |    INDEX UNIQUE SCAN           | BT_TWO_VARCHAR2_INDEX |  8792K |      1 |  8792K | 08:55:58.31 |    121M | 7908K |
-----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("BT1"."LAST_MODIFIED_DATETIME">=TO_DATE(' 2011-11-13
07:15:00', 'syyyy-mm-dd hh24:mi:ss') AND
"BT1"."LAST_MODIFIED_DATETIME"<TO_DATE(' 2011-11-14
07:15:00', 'syyyy-mm-dd hh24:mi:ss'))
6 - access("BT2"."JOIN_VARCHAR2"="BT1"."JOIN_VARCHAR2")

====================================================================

INDEX_NAME : BT_TWO_VARCHAR2_INDEX
INDEX_TYPE : NORMAL
TABLE_NAME : BIG_TABLE_TWO
TABLE_TYPE : TABLE
UNIQUENESS : UNIQUE
COMPRESSION : DISABLED
PREFIX_LENGTH :
TABLESPACE_NAME : INDEX01
INI_TRANS : 2
MAX_TRANS : 255
INITIAL_EXTENT : 131072
NEXT_EXTENT : 1048576
MIN_EXTENTS : 1
MAX_EXTENTS : 2147483645
PCT_INCREASE :
PCT_THRESHOLD :
INCLUDE_COLUMN :
FREELISTS :
FREELIST_GROUPS :
PCT_FREE : 10
LOGGING : YES
BLEVEL : 3
LEAF_BLOCKS : 9268569
DISTINCT_KEYS : 1107885846
AVG_LEAF_BLOCKS_PER_KEY : 1
AVG_DATA_BLOCKS_PER_KEY : 1
CLUSTERING_FACTOR : 1107885846
STATUS : VALID
NUM_ROWS : 1107885846
SAMPLE_SIZE : 138537
LAST_ANALYZED : 07-aug-2011 14:00:56
DEGREE : 1
INSTANCES : 1
PARTITIONED : NO
TEMPORARY : N
GENERATED : N
SECONDARY : N
BUFFER_POOL : DEFAULT
FLASH_CACHE : DEFAULT
CELL_FLASH_CACHE : DEFAULT
USER_STATS : NO
DURATION :
PCT_DIRECT_ACCESS :
ITYP_OWNER :
ITYP_NAME :
PARAMETERS :
GLOBAL_STATS : YES
DOMIDX_STATUS :
DOMIDX_OPSTATUS :
FUNCIDX_STATUS :

====================================================================

TABLE_NAME : BIG_TABLE_TWO
COLUMN_NAME : JOIN_VARCHAR2
DATA_TYPE : VARCHAR2
DATA_TYPE_MOD :
DATA_TYPE_OWNER :
DATA_LENGTH : 36
DATA_PRECISION :
DATA_SCALE :
NULLABLE : Y
COLUMN_ID : 12
DEFAULT_LENGTH :
DATA_DEFAULT :
NUM_DISTINCT : 1093215934
LOW_VALUE :
30303030303030382D316634372D313165302D623932312D3030393066623236
HIGH_VALUE :
66666666666666632D366236352D313165302D393933312D3030393066623263
DENSITY : .000000000914844154395423
NUM_NULLS : 0
NUM_BUCKETS : 254
LAST_ANALYZED : 06-aug-2011 12:33:29
SAMPLE_SIZE : 13496
CHARACTER_SET_NAME : CHAR_CS
CHAR_COL_DECL_LENGTH : 36
GLOBAL_STATS : YES
USER_STATS : NO
AVG_COL_LEN : 37
CHAR_LENGTH : 36
CHAR_USED : B
V80_FMT_IMAGE : NO
DATA_UPGRADED : YES
HISTOGRAM : HEIGHT BALANCED


  • Jonathan Lewis at Nov 17, 2011 at 2:21 pm
    Jeremy,

    How much other activity was going on in the 10 hours ?

    Before checking anything else I would look at the session stats (or instance
    activity stats if you don't have session stats) for undo application:
    data blocks consistent reads - undo records applied
    CR blocks created
    and the two relating to transaction table consistent reads.

    Each "undo record applied" will increment "consistent gets - examination".

    There are a couple of related observations here:
    http://jonathanlewis.wordpress.com/2011/05/08/consistent-gets-3/

    If you've got the AWR or statspack reports for the 10 hours you could check these
    figures hour by hour to see if they increase over time (a classic problem with
    long-running queries is that the longer they take to run the more work they do
    generating consistent read copies, which makes them take longer to run, which
    ...)


    Regards

    Jonathan Lewis
    http://jonathanlewis.wordpress.com


    ----- Original Message -----
    From: "Jeremy Schneider" <[email protected]>
    To: <[email protected]>
    Sent: Thursday, November 17, 2011 11:34 AM
    Subject: index with very high LIO


    Anyone have ideas why an index with blevel 3 is averaging 13 LIOs per
    lookup/row? I'm working on the query below, which ran for about 10
    hours - the majority of which was spent doing 13 logical IOs per row in
    this index. (BT_TWO_VARCHAR2_INDEX below... 121M buffers / 8792K starts
    = 13 IOs per start) Performance on this query has been degrading
    rapidly over the past month or two.

    --
    http://www.freelists.org/webpage/oracle-l
  • Jeremy Schneider at Nov 18, 2011 at 5:52 pm
    This is a very busy system, lots happened during that 10 hours. I don't
    have session-specific stats, so the best I could do was look at some AWR
    stats over a 30-minute window - but this particular instance wasn't very
    busy, so luckily these statistics mainly reflect this query. I looked
    at the very last 30 minutes since I figured it would be the
    worst-performing time.

    This SQL statement accounted for 90% of buffer gets on the instance
    during the snapshot. Here are the statistics Jonathan mentioned:

    AWR snap covers 30:04 == 1804s
    Total Logical Reads: 10,823.1/s
    SQL 8suhywrkmpj5c: 17,617,831 gets ... 9,765 gets/sec

    TOTAL        PER-SECOND   STATISTIC
    148,983           82.61   CR blocks created
    15,406,266     8,542.71   consistent gets - examination
    168,222           93.28   data blocks consistent reads - undo records applied
    21,173            11.74   db block gets
    160                0.09   transaction tables consistent read rollbacks
    7,929,413      4,396.82   transaction tables consistent reads - undo records applied
    1,379,540        764.95   undo change vector size
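
    (These are deltas across one AWR interval, pulled from DBA_HIST_SYSSTAT with
    something along these lines - the snap ids are placeholders for the 30-minute
    window:)

    select cur.stat_name, cur.value - prev.value as delta
    from   dba_hist_sysstat cur, dba_hist_sysstat prev
    where  cur.stat_name = prev.stat_name
    and    cur.dbid = prev.dbid
    and    cur.instance_number = prev.instance_number
    and    prev.snap_id = 1234          -- start of the 30-minute window
    and    cur.snap_id  = 1235          -- end of the 30-minute window
    and    cur.stat_name in ('CR blocks created',
                             'consistent gets - examination',
                             'data blocks consistent reads - undo records applied',
                             'db block gets',
                             'transaction tables consistent read rollbacks',
                             'transaction tables consistent reads - undo records applied',
                             'undo change vector size')
    order  by cur.stat_name;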

    I'm not an expert on this... does this say that, per second:
    - the query is only actually processing 11 data blocks
    - it's applying 8.5 thousand undo records to get them
    - this requires 9.7 thousand actual block accesses

    What other stats would be interesting? Am I reading these statistics
    correctly? Any further observations that can be made?

    -Jeremy

    --
    http://www.ardentperf.com
    +1 312-725-9249

    Jeremy Schneider
    Chicago


    On 11/17/2011 08:20 AM, Jonathan Lewis wrote:
    Jeremy,

    How much other activity was going on in the 10 hours ?

    Before checking anything else I would look at the session stats (or instance
    activity stats if you don't have session stats) for undo application:
    data blocks consistent reads - undo records applied
    CR blocks created
    and the two relating to transaction table consistent reads.

    Each "undo record applied" will increment "consistent gets - examination".

    There are a couple of related observations here:
    http://jonathanlewis.wordpress.com/2011/05/08/consistent-gets-3/

    If you've got the AWR or statspack reports for the 10 hours you could check these
    figures hour by hour to see if they increase over time (a classic problem with
    long-running queries is that the longer they take to run the more work they do
    generating consistent read copies, which makes them take longer to run, which
    ...)


    Regards

    Jonathan Lewis
    http://jonathanlewis.wordpress.com


    ----- Original Message -----
    From: "Jeremy Schneider" <[email protected]>
    To: <[email protected]>
    Sent: Thursday, November 17, 2011 11:34 AM
    Subject: index with very high LIO


    Anyone have ideas why an index with blevel 3 is averaging 13 LIOs per
    lookup/row? I'm working on the query below, which ran for about 10
    hours - the majority of which was spent doing 13 logical IOs per row in
    this index. (BT_TWO_VARCHAR2_INDEX below... 121M buffers / 8792K starts
    = 13 IOs per start) Performance on this query has been degrading
    rapidly over the past month or two.

    --
    http://www.freelists.org/webpage/oracle-l

    --
    http://www.freelists.org/webpage/oracle-l
  • Jonathan Lewis at Nov 18, 2011 at 11:32 pm
    The most dramatic feature is the 7.9M transaction tables undo records applied to
    get 160 transaction table searches read-consistent.
    Assume a transaction updated one of your target objects some time close to the
    start of your query and issued a commit after the block had been flushed from
    memory so that the block was not subject to commit cleanout.
    Assume your query reaches the updated block a few hours later - when a lot of
    other activity has been going on.

    Your query can see that the block changed at some time in the past, but has to
    check the undo segment header transaction table slot to see whether it committed
    before or after the query started. Since many transactions have committed since
    that moment the transaction table slot will have been overwritten many times and
    previous versions of the slot recorded in the undo segment. This means the
    session has to walk through a chain of undo blocks to take the transaction table
    back in time.

    According to these figures, you have done this 160 times in the 30 minutes, and
    had to read 7.9 M undo blocks (a lot of history) while doing so. (It's quite
    possible that a lot of those reads were physical reads of the undo segment, of
    course). This is probably a large component of your overhead - but the figures
    for this 30 minutes don't seem to be representative of the figures for the whole
    query.

    You have another 7M examinations to account for - but only 168,000 undo records
    applied for read consistency; this suggests that most of the 7M are from the
    unique access to the table, which (assuming 4 buffer gets per row) would be
    about 1.75M of the 8.7M rows accessed. It would be good to check the figures
    across the whole interval (anything that looks like "%consistent%", as well as
    the stuff you've got so far, and "table fetch by rowid").
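
    (i.e. something along these lines, captured at intervals or taken from the
    AWR/statspack snapshots:)

    select name, value
    from   v$sysstat
    where  name like '%consistent%'
       or  name = 'table fetch by rowid'
    order  by name;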



    Regards

    Jonathan Lewis
    http://jonathanlewis.wordpress.com


    ----- Original Message -----
    From: "Jeremy Schneider" <[email protected]>
    To: <[email protected]>
    Sent: Friday, November 18, 2011 5:51 PM
    Subject: Re: index with very high LIO

    TOTAL PER-SECOND STATISTIC
    148,983 82.61 CR blocks created
    15,406,266 8,542.71 consistent gets - examination
    168,222 93.28 data blocks consistent reads - undo records applied
    21,173 11.74 db block gets
    160 0.09 transaction tables consistent read rollbacks
    7,929,413 4,396.82 transaction tables consistent reads - undo
    records applied
    1,379,540 764.95 undo change vector size

    I'm not an expert on this... does this say that, per second:
    - the query is only actually processing 11 data blocks
    - it's applying 8.5 thousand undo records to get them
    - this requires 9.7 thousand actual block accesses

    What other stats would be interesting? Am I reading these statistics
    correctly? Any further observations that can be made?


    --
    http://www.freelists.org/webpage/oracle-l
  • Jeremy Schneider at Nov 28, 2011 at 11:58 am
    I got a little swamped with wrapping up some work... but I didn't forget
    about this. :) I finally looked closer at the AWR data. Since the
    report ran for 10 hours, I have 20 snapshots. When I graphed it, I
    found an interesting characteristic: the number of consistent gets per
    row rises to over 40 at one point, then decreases back down to 10 by the
    end of the run. However, the ratio of transaction table rollbacks vs
    records applied just steadily increases... as expected. So I'm no
    expert, but it looks less like undo is causing all the extra logical
    I/O on that one specific index. But I'm at a loss for what else it
    could be...
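
    (The per-period numbers behind the graphs below came straight from the
    DBA_HIST_SQLSTAT delta columns, roughly like this:)

    select snap_id,
           rows_processed_delta,
           buffer_gets_delta,
           round(buffer_gets_delta / nullif(rows_processed_delta, 0), 1) as gets_per_row
    from   dba_hist_sqlstat
    where  sql_id = '8suhywrkmpj5c'
    order  by snap_id;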

    For each 30-minute period, rows vs logical I/O from DBA_HIST_SQLSTAT:
    www.ardentperf.com/wp-content/uploads/2011/11/more-research_html_7e8de68c.jpg

    For each 30-minute period, trans table rollback vs undo records applied:
    http://www.ardentperf.com/wp-content/uploads/2011/11/more-research_html_7559fa03.jpg

    FYI, for each 30-minute period, consistent gets for this SQL vs the
    entire instance:
    http://www.ardentperf.com/wp-content/uploads/2011/11/more-research_html_m5273695a.jpg

    -Jeremy

    --
    http://www.ardentperf.com
    +1 312-725-9249

    Jeremy Schneider
    Chicago

    On 11/18/2011 05:32 PM, Jonathan Lewis wrote:
    The most dramatic feature is the 7.9M transaction tables undo records applied to
    get 160 transaction table searches read-consistent.
    Assume a transaction updated one of your target objects some time close to the
    start of your query and issued a commit after the block had been flushed from
    memory so that the block was not subject to commit cleanout.
    Assume your query reaches the updated block a few hours later - when a lot of
    other activity has been going on.

    Your query can see that the block changed at some time in the past, but has to
    check the undo segment header transaction table slot to see whether it committed
    before or after the query started. Since many transactions have committed since
    that moment the transaction table slot will have been overwritten many times and
    previous versions of the slot recorded in the undo segment. This means the
    session has to walk through a chain of undo blocks to take the transaction table
    back in time.

    According to these figures, you have done this 160 times in the 30 minutes, and
    had to read 7.9 M undo blocks (a lot of history) while doing so. (It's quite
    possible that a lot of those reads were physical reads of the undo segment, of
    course). This is probably a large component of your overhead - but the figures
    for this 30 minutes don't seem to be representative of the figures for the whole
    query.

    You have another 7M examinations to account for - but only 168,000 undo records
    applied for read consistency; this suggests that most of the 7M are from the
    unique access to the table, which (assuming 4 buffer gets per row) would be
    about 1.75M of the 8.7M rows accessed. It would be good to check the figures
    across the whole interval (anything that looks like "%consistent%", as well as
    the stuff you've got so far, and "table fetch by rowid").



    Regards

    Jonathan Lewis
    http://jonathanlewis.wordpress.com


    ----- Original Message -----
    From: "Jeremy Schneider" <[email protected]>
    To: <[email protected]>
    Sent: Friday, November 18, 2011 5:51 PM
    Subject: Re: index with very high LIO

    TOTAL PER-SECOND STATISTIC
    148,983 82.61 CR blocks created
    15,406,266 8,542.71 consistent gets - examination
    168,222 93.28 data blocks consistent reads - undo records applied
    21,173 11.74 db block gets
    160 0.09 transaction tables consistent read rollbacks
    7,929,413 4,396.82 transaction tables consistent reads - undo
    records applied
    1,379,540 764.95 undo change vector size

    I'm not an expert on this... does this say that, per second:
    - the query is only actually processing 11 data blocks
    - it's applying 8.5 thousand undo records to get them
    - this requires 9.7 thousand actual block accesses

    What other stats would be interesting? Am I reading these statistics
    correctly? Any further observations that can be made?


    --
    http://www.freelists.org/webpage/oracle-l

    --
    http://www.freelists.org/webpage/oracle-l
  • Greg Rahn at Nov 17, 2011 at 7:26 pm
    Am I reading this right? 7 million rows joined using NESTED LOOPS
    join? Wouldn't a parallel hash join be better/faster?

    On Thu, Nov 17, 2011 at 3:34 AM, Jeremy Schneider
    wrote:
    PS... 8k block size; segment of index in question is 100G.  Query is
    pulling 7 million rows from a join of two billion row tables...
    --
    Regards,
    Greg Rahn
    http://structureddata.org
    --
    http://www.freelists.org/webpage/oracle-l
  • Jeremy Schneider at Nov 17, 2011 at 8:08 pm
    Yeah, you're not the first one to point that out... I'm doing some
    tests now. FWIW it's 7 million out of over a billion rows that are all
    stuffed into a 400GB heap table with the worst possible clustering
    factor on the foreign key we're using to join. Also I haven't been able
    to get more than 20MB/s total throughput with parallel full scans,
    although single-block reads go at a somewhat decent rate of 5ms. But
    all the same, I'm checking out the hash join... likely it will turn out
    to be more efficient.

    I should have checked this out more in-depth a while ago... I got a
    little carried away with maths and mistakenly thought that it was
    totally unrealistic. This is the problem with ruling things out based
    on "calculations". There's little to verify your correctness. It's
    much easier to find your mistakes when you rule things out with
    *testing* instead... I should know this by now; for some reason I'm
    still making the occasional poor judgement in the heat of the moment. :)

    -Jeremy

    On 11/17/2011 01:24 PM, Greg Rahn wrote:
    Am I reading this right? 7 million rows joined using NESTED LOOPS
    join? Wouldn't a parallel hash join be better/faster?

    On Thu, Nov 17, 2011 at 3:34 AM, Jeremy Schneider
    wrote:
    PS... 8k block size; segment of index in question is 100G. Query is
    pulling 7 million rows from a join of two billion row tables...
    --
    http://www.freelists.org/webpage/oracle-l
  • Greg Rahn at Nov 18, 2011 at 5:34 am
    So a few more comments:
    - Even if you had perfect clustering factor, I still think a PX + FTS
    + HJ is a much better solution for this type/size of problem. Would
    you really want to get 7 million rows one rowid at a time?
    - 20MB/s is about 1/10 of what 1 single process (serial execution) can
    do on a modern processor -- for CPU light operations like count(*)
    with FTS, you should be ~200MB/s. Need to investigate why that is so
    low.
    - On a modern system (say any 5400-5600 Xeon system) this query should
    not run 10 hours, I'd wager it should take less than 10 minutes using
    PX.
    - Is BT1.LAST_MODIFIED_DATETIME the range partition key on this table
    (it probably should be if it's a frequent filter)? Daily partitions
    might work quite nicely here.
    - Your query uses November dates of LAST_MODIFIED_DATETIME, yet your
    stats are from 06-aug-2011. It probably would not hurt to refresh
    those - they are 2 months old.
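
    (Illustrative only - not tested against this schema - but the kind of hinted
    rewrite I have in mind, with an arbitrary DOP of 8:)

    select /*+ leading(bt1) use_hash(bt2) full(bt1) full(bt2)
               parallel(bt1 8) parallel(bt2 8) */
           ...LOTS_OF_FIELDS...
    from   big_table_one bt1,
           big_table_two bt2
    where  bt2.join_varchar2 = bt1.join_varchar2
    and    bt1.last_modified_datetime >= to_date('2011-11-13 07:15:00', 'yyyy-mm-dd hh24:mi:ss')
    and    bt1.last_modified_datetime <  to_date('2011-11-14 07:15:00', 'yyyy-mm-dd hh24:mi:ss');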

    On Thu, Nov 17, 2011 at 12:06 PM, Jeremy Schneider
    wrote:
    Yeah, you're not the first one to point that out...  I'm doing some
    tests now.  FWIW it's 7 million out of over a billion rows that are all
    stuffed into a 400GB heap table with the worst possible clustering
    factor on the foreign key we're using to join.  Also I haven't been able
    to get more than 20MB/s total throughput with parallel full scans,
    although single-block reads go at a somewhat decent rate of 5ms.  But
    all the same, I'm checking out the hash join... likely it will turn out
    to be more efficient.

    I should have checked this out more in-depth awhile ago...  I got a
    little carried away with maths and mistakenly thought that it was
    totally unrealistic.  This is the problem with ruling things out based
    on "calculations".  There's little to verify your correctness.  It's
    much easier to find your mistakes when you rule things out with
    *testing* instead...  I should know this by now, for some reason I'm
    still making the occasional poor judgements in the heat of the moment.  :)

    -Jeremy

    On 11/17/2011 01:24 PM, Greg Rahn wrote:
    Am I reading this right?  7 million rows joined using NESTED LOOPS
    join?  Wouldn't a parallel hash join be better/faster?

    On Thu, Nov 17, 2011 at 3:34 AM, Jeremy Schneider
    wrote:
    PS... 8k block size; segment of index in question is 100G.  Query is
    pulling 7 million rows from a join of two billion row tables...


    --
    Regards,
    Greg Rahn
    http://structureddata.org
    --
    http://www.freelists.org/webpage/oracle-l
  • Jeremy Schneider at Nov 18, 2011 at 6:47 pm
    The reason for 20MB/s is the storage network. They're using DNFS with a
    single (bonded-backup-mode) 1G network card. (FYI, tcp window sizes are
    16M.) Last night I ran a parallel(64) query and I think that I was
    still only getting something around 30 MB/s throughput w/direct path...
    but it's tough to piece together 64 tracefiles to get an accurate total
    throughput measure. (By another measure I saw 60 MB/s but I think that
    might have included some cache effect...) At any rate the best
    theoretical throughput possible on a one gig card is around 100MB/s. I don't think
    it's possible to full scan two 400 GB tables over a 1GB network link in
    10 minutes, is it? Also, this is not a warehouse system... it's
    *heavily* operational, and the operational workload is running
    satisfactorily with 1G cards. In fact this query is trying to extract a
    day's worth of data to get it INTO a warehouse.

    I do think the slow link skews the usual points where it becomes
    beneficial to switch from index to hash methods. I'd wager that on this
    system, a couple million rows were probably faster with the index.
    (Whereas generally a hash might win for this many rows.) Recently this
    query has inched toward 10 million, and perhaps now a hash join is going
    to be better -- but it's still going to take many hours.

    But I'm thinking that I've finally started to inch toward the real
    problem: rollbacks from other activity on the system -- in fact this
    could also explain why, recently, the problem seemed to accelerate
    exponentially. This is why I originally asked about the 13 gets/row on
    the index... and I think Jonathan's pointer got me moving in the right
    direction. In fact, we would still have the same problem with a hash
    join - because it's still going to take several hours on this busy
    system with only a 1GB storage network.

    I'm thinking that the real answer is to use shorter-running queries...
    something as simple as getting 2-4 hours of data at a time instead of 24
    hours at once could in fact completely resolve this problem. Which,
    interestingly, might switch us back to nested loop joins again since
    we're going to be getting a lot less data at once.
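
    (Roughly what I mean - the same extract driven in, say, 2-hour slices; the
    bind names here are made up:)

    select ...LOTS_OF_FIELDS...
    from   big_table_one bt1,
           big_table_two bt2
    where  bt2.join_varchar2 = bt1.join_varchar2
    and    bt1.last_modified_datetime >= :win_start
    and    bt1.last_modified_datetime <  :win_end;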

    Still in progress... so additional thoughts (from anyone on the list)
    are welcomed.

    -Jeremy

    PS... we have also been discussing stats and partitioning internally...
    I just view the above issues as primary.

    --
    http://www.ardentperf.com
    +1 312-725-9249

    Jeremy Schneider
    Chicago

    On 11/17/2011 11:32 PM, Greg Rahn wrote:
    So a few more comments:
    - Even if you had perfect clustering factor, I still think a PX + FTS
    + HJ is a much better solution for this type/size of problem. Would
    you really want to get 7 million rows one rowid at a time?
    - 20MB/s is about 1/10 of what 1 single process (serial execution) can
    do on a modern processor -- for CPU light operations like count(*)
    with FTS, you should be ~200MB/s. Need to investigate why that is so
    low.
    - On a modern system (say any 5400-5600 Xeon system) this query should
    not run 10 hours, I'd wager it should take less than 10 minutes using
    PX.
    - Is BT1.LAST_MODIFIED_DATETIME the range partition key on this table
    (it probably should be if it's a frequent filter)? Daily partitions
    might work quite nicely here.
    - Your query uses November dates of LAST_MODIFIED_DATETIME, yet your
    stats are from 06-aug-2011. It probably would not hurt to refresh
    those - they are 2 months old.

    On Thu, Nov 17, 2011 at 12:06 PM, Jeremy Schneider
    wrote:
    Yeah, you're not the first one to point that out... I'm doing some
    tests now. FWIW it's 7 million out of over a billion rows that are all
    stuffed into a 400GB heap table with the worst possible clustering
    factor on the foreign key we're using to join. Also I haven't been able
    to get more than 20MB/s total throughput with parallel full scans,
    although single-block reads go at a somewhat decent rate of 5ms. But
    all the same, I'm checking out the hash join... likely it will turn out
    to be more efficient.

    I should have checked this out more in-depth awhile ago... I got a
    little carried away with maths and mistakenly thought that it was
    totally unrealistic. This is the problem with ruling things out based
    on "calculations". There's little to verify your correctness. It's
    much easier to find your mistakes when you rule things out with
    *testing* instead... I should know this by now, for some reason I'm
    still making the occasional poor judgements in the heat of the moment. :)

    -Jeremy

    On 11/17/2011 01:24 PM, Greg Rahn wrote:
    Am I reading this right? 7 million rows joined using NESTED LOOPS
    join? Wouldn't a parallel hash join be better/faster?

    On Thu, Nov 17, 2011 at 3:34 AM, Jeremy Schneider
    wrote:
    PS... 8k block size; segment of index in question is 100G. Query is
    pulling 7 million rows from a join of two billion row tables...
    --
    http://www.freelists.org/webpage/oracle-l
  • Greg Rahn at Nov 19, 2011 at 12:14 am
    The big problem I see here is simply that the lack of hardware
    infrastructure is requiring what I refer to as "unnatural acts of
    engineering" in order to do this extract. While this system may not
    require 10GB/s of scan bandwidth, less than 100MB/s is really
    insufficient for any production system these days -- especially one
    that requires multi hundred GB scans. To put it in perspective, one
    can get more than 100MB/s from the SSD drive in a laptop.

    I'd think that partitioning would be a big help here to reduce the
    data required off disk (thinking txns & time series data), but to get
    there you would have to rebuild two 400GB tables and their indexes over
    that slow storage connection.

    BTW, I'm not sure why you need trace files from PX servers to
    calculate IO bandwidth - just monitor the network tput or just get a
    rough number from elapsed time and the amount of data read from disk.

    On Fri, Nov 18, 2011 at 10:46 AM, Jeremy Schneider
    wrote:
    The reason for 20MB/s is the storage network.  They're using DNFS with a
    single (bonded-backup-mode) 1G network card.  (FYI, tcp window sizes are
    16M.)  Last night I ran a parallel(64) query and I think that I was
    still only getting something around 30 MB/s throughput w/direct path...
    but it's tough to piece together 64 tracefiles to get an accurate total
    throughput measure.  (By another measure I saw 60 MB/s but I think that
    might have included some cache effect...)  At any rate the best
    theoretical possible on a one gig card is around 100MB/s.  I don't think
    it's possible to full scan two 400 GB tables over a 1GB network link in
    10 minutes, is it?  Also, this is not a warehouse system... it's
    *heavily* operational, and the operational workload is running
    satisfactorily with 1G cards.  In fact this query is trying to extract a
    day's worth of data to get it INTO a warehouse.

    I do think the slow link skews the usual points where it becomes
    beneficial to switch from index to hash methods.  I'd wager that on this
    system, a couple million rows were probably faster with the index.
    (Whereas generally a hash might win for this many rows.)  Recently this
    query has inched toward 10 million, and perhaps now a hash join is going
    to be better -- but it's still going to take many hours.
    --
    Regards,
    Greg Rahn
    http://structureddata.org
    --
    http://www.freelists.org/webpage/oracle-l
  • Jeremy Schneider at Nov 21, 2011 at 7:48 pm
    Ok, here are a few useful figures. I think I'm finally doing my math
    right. For this customer:
    35 MB/s throughput (parallel direct path)
    6 ms response times
    400GB tables
    time to full-scan the table:
    400GB x 1024 = 409600 MB
    409600 MB / 35 = 11703 sec
    11703 / 60 / 60 = 3.25 hrs

    how many rows in 3.25 hrs with range scan (about 1.2 blocks per row)?
    6 ms / 1000 = 0.006 sec/block
    1.2 blocks/row * 0.006 = 0.0072 sec/row
    11703 sec / 0.0072 = about 1.6 million rows in 3.25 hrs

    how many rows in 3.25 hrs with index join (about 4 LIOs per row)?
    4 blocks/row * 0.006 = 0.024 sec/row
    11703 sec / 0.024 = about 500,000 rows in 3.25 hrs

    I guess there's also a case where you can hash join using the index, but
    I'm skipping that one for now...

    ==========
    However there's another factor skewing things - it's only the rows *from
    disk* that count - and this customer has a HUGE buffer cache, and we're
    seeing over 90% BCHR on some of these million-row index-based executions.
    The direct path FTS will never benefit from the buffer cache. So
    accounting for this cache effect, the breakpoints become 5 million for the
    second table and 16 million for the first. This lines up with testing
    we've done. It seems that index path is still most efficient for the first
    table. And quite conveniently, as soon as the CBO got cardinality
    estimates right, it picked the index on the first table and the FTS on the
    second. :)
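
    (Back-of-the-envelope, assuming ~90% of the index/table gets are cached, this
    is roughly how those breakpoints fall out:)

    second table, 4 LIOs/row: 4 x 0.10 x 0.006 = 0.0024 sec/row
    11703 sec / 0.0024 = about 5 million rows
    first table, 1.2 LIOs/row: 1.2 x 0.10 x 0.006 = 0.00072 sec/row
    11703 sec / 0.00072 = about 16 million rows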

    Regarding their lack of hardware, it's worth pointing out that this is the
    first and only query I'm aware of to get millions of rows on this system.
    Also, I believe that the data volume has been growing over the past few
    months - so it's only recently that this solo query moved close to the 10
    million row range. It's a *very* busy transactional system - so total
    throughput isn't a major point. And 6ms is decent response time.

    My initial 30MB/s *was* a rough estimate from elapsed time and data
    sizes... I was only looking at trace files to verify the measurement...
    also the first test ran in the middle of the night and I wasn't at the
    console. :) I reran some small tests today. Had trouble with iptraf; it
    seemed to give me numbers that were clearly wrong... but I did get
    byte-counts from ifconfig and verified about 30-35 MB/s max throughput on
    the adapter during a parallel run.

    (hrishy - right, there were three ways to get the throughput estimates!)

    -Jeremy

    On Fri, Nov 18, 2011 at 6:12 PM, Greg Rahn wrote:

    The big problem I see here is simply that the lack hardware
    infrastructure is requiring what I refer to as "unnatural acts of
    engineering" in order to do this extract. While this system may not
    require 10GB/s of scan bandwidth, less than 100MB/s is really
    insufficient for any production system these days -- especially one
    that requires multi hundred GB scans. To put it in perspective, one
    can get more than 100MB/s from the SSD drive in a laptop.

    I'd think that partitioning would be a big help here to reduce the
    data required off disk (thinking txns & time series data) but if you
    have to rebuild two 400GB tables and their indexes over that slow
    storage connection to get there.

    BTW, I'm not sure why you need trace files from PX servers to
    calculate IO bandwidth - just monitor the network tput or just get a
    rough number from elapsed time and the amount of data read from disk.
    --
    http://www.ardentperf.com
    +1 312-725-9249

    Jeremy Schneider
    Chicago


    --
    http://www.freelists.org/webpage/oracle-l
  • Greg Rahn at Nov 21, 2011 at 8:18 pm
    Curious:
    - could you leverage a parallel NLJ to engage more CPUs on the problem
    to reduce the run time? I'm assuming that if you have such a high BCHR
    it must be a CPU-bound query then, right? How many CPUs do you have? (I
    guess that depends on how fast you need this to run and how many
    resources are available.)
    - do they use RMAN or back this up via filer snaps?
    - what did you change to get different CBO cardinality estimates? Stats?
    - any chance of partitioning helping out?

    Can you put a number on HUGE [buffer cache]? My HUGE may be different
    than your HUGE. :D

    I do agree with you that if you can only get 35MB/s then this system
    can't really leverage PX FTS the way a more modern system could (like
    a 2 socket Xeon with 1-2 GB/s).

    On Mon, Nov 21, 2011 at 11:46 AM, Jeremy Schneider
    wrote:
    Ok, here are a few useful figures.  I think I'm finally doing my math
    right.  For this customer:
    35 MB/s throughput (parallel direct path)
    6 ms response times
    400GB tables

    time to full-scan the table:
    400GB x 1024 = 409600 MB
    409600 MB / 35 = 11703 sec
    11703 / 60 / 60 = 3.25 hrs

    how many rows in 3.25 hrs with range scan (about 1.2 blocks per row)?
    6 ms / 1000 = 0.006 sec/block
    1.2 blocks/row * 0.006 = 0.0072 sec/row
    11703 sec / 0.0072 = about 1.6 million rows in 3.25 hrs

    how many rows in 3.25 hrs with index join (about 4 LIOs per row)?
    4 blocks/row * 0.006 = 0.024 sec/row
    11703 sec / 0.024 = about 500,000 rows in 3.25 hrs

    I guess there's also a case where you can hash join using the index, but i'm
    skipping that one for now...

    ==========
    However there's another factor skewing things - it's only the rows *from
    disk* that count - and this customer has a HUGE buffer cache, and were
    seeing over 90% BCHR on some of these million-row index-based executions.
    The direct path FTS will never benefit from the buffer cache.  So accounting
    for this cache effect, the breakpoints become 5 million for the second table
    and 16 million for the first.  This lines up with testing we've done.  It
    seems that index path is still most efficient for the first table.  And
    quite conveniently, as soon as the CBO got cardinality estimates right, it
    picked the index on the first table and the FTS on the second.  :)

    Regarding their lack hardware, it's worth pointing out that this is the
    first and only query I'm aware of to get millions of rows on this system.
    Also, I believe that the data volume has been growing over the past few
    months - so it's only recently that this solo query moved close to the 10
    million row range.  It's a *very* busy transactional system - so total
    throughput isn't a major point.  And 6ms is decent response time.

    My initial 30MB/s *was* a rough estimate from elapsed time and data sizes...
    I was only looking at trace files to verify the measurement... also the
    first test ran in the middle of the night and I wasn't at the console.  :)
    I reran some small tests today.  Had trouble with iptraf; seemed to give me
    numbers that were clearly wrong... but I did get byte-counts from ifconfig
    and verified about 30-35 MB/s max throughput on the adapter during a
    parallel run.
    --
    Regards,
    Greg Rahn
    http://structureddata.org
    --
    http://www.freelists.org/webpage/oracle-l
  • Mark W. Farnham at Nov 22, 2011 at 3:23 pm
    It is a pity that there seems to be no way to use an index to grab block
    segments of a table to fts (and yes, then refilter the rows).

    True, in the polar case of a lousy cluster factor an index might have one
    lonely row in each block, or worse for the idea of segmented FTS, one row
    per 1 megabyte (or 4 megabyte) chunk of rows.

    But even with a lousy cluster factor, all the desired rows by the index
    might be (now polar good case) in a few megabytes of contiguous blocks of
    FTS scan order. I suppose this would be an adventure to build. You'd need to
    evaluate a higher order cluster by physical segment chunk analysis for a
    specific key or set of keys on the fly to see whether that was likely to be
    better than just scanning the whole thing. Setting off scans in parallel one
    per contiguous chunk paired with some number of scans picking up a list of
    onesie-twosie blocks might FTS some quite small fraction of the whole table,
    having read a small part of the index (but not using the index entries to
    yank single table blocks).

    Just a thought. Might be more effort than value, but I think in the long
    haul everything just gets too big for a full FTS, even in parallel. Heh, the
    software effort would have been more defensible 20 years ago.

    Regards,

    mwf

    -----Original Message-----
    From: [email protected]
    On Behalf Of Greg Rahn
    Sent: Monday, November 21, 2011 3:17 PM
    To: Jeremy Schneider
    Cc: [email protected]
    Subject: Re: index with very high LIO

    Curious:
    - could you leverage a parallel NLJ to engage more CPUs on the problem to
    reduce the run time? I'm assuming that if you have such a high BCHR it must
    be a CPU bound query then, right? How many CPUs do you have? ( I guess that
    depends on how fast you need this to run and how much resources are
    available).
    - do they use RMAN or back this up via filer snaps?
    - what did you change to get different CBO cardinality estimates? Stats?
    - any chance of partitioning helping out?

    Can you put a number on HUGE [buffer cache]? My HUGE may be different than
    your HUGE. :D

    I do agree with you that if you can only get 35MB/s then this system can't
    really leverage PX FTS the way a more modern system could (like a 2 socket
    Xeon with 1-2 GB/s).

    On Mon, Nov 21, 2011 at 11:46 AM, Jeremy Schneider
    wrote:
    Ok, here are a few useful figures.  I think I'm finally doing my math
    right.  For this customer:
    35 MB/s throughput (parallel direct path)
    6 ms response times
    400GB tables

    time to full-scan the table:
    400GB x 1024 = 409600 MB
    409600 MB / 35 = 11703 sec
    11703 / 60 / 60 = 3.25 hrs

    how many rows in 3.25 hrs with range scan (about 1.2 blocks per row)?
    6 ms / 1000 = 0.006 sec/block
    1.2 blocks/row * 0.006 = 0.0072 sec/row
    11703 sec / 0.0072 = about 1.6 million rows in 3.25 hrs

    how many rows in 3.25 hrs with index join (about 4 LIOs per row)?
    4 blocks/row * 0.006 = 0.024 sec/row
    11703 sec / 0.024 = about 500,000 rows in 3.25 hrs

    I guess there's also a case where you can hash join using the index,
    but i'm skipping that one for now...

    ==========
    However there's another factor skewing things - it's only the rows
    *from
    disk* that count - and this customer has a HUGE buffer cache, and were
    seeing over 90% BCHR on some of these million-row index-based executions.
    The direct path FTS will never benefit from the buffer cache.  So
    accounting for this cache effect, the breakpoints become 5 million for
    the second table and 16 million for the first.  This lines up with
    testing we've done.  It seems that index path is still most efficient
    for the first table.  And quite conveniently, as soon as the CBO got
    cardinality estimates right, it picked the index on the first table
    and the FTS on the second.  :)

    Regarding their lack hardware, it's worth pointing out that this is
    the first and only query I'm aware of to get millions of rows on this system.
    Also, I believe that the data volume has been growing over the past
    few months - so it's only recently that this solo query moved close to
    the 10 million row range.  It's a *very* busy transactional system -
    so total throughput isn't a major point.  And 6ms is decent response time.

    My initial 30MB/s *was* a rough estimate from elapsed time and data sizes...
    I was only looking at trace files to verify the measurement... also
    the first test ran in the middle of the night and I wasn't at the
    console.  :) I reran some small tests today.  Had trouble with iptraf;
    seemed to give me numbers that were clearly wrong... but I did get
    byte-counts from ifconfig and verified about 30-35 MB/s max throughput
    on the adapter during a parallel run.
    --
    Regards,
    Greg Rahn
    http://structureddata.org
    --
    http://www.freelists.org/webpage/oracle-l


    --
    http://www.freelists.org/webpage/oracle-l
  • Greg Rahn at Nov 22, 2011 at 9:34 pm

    On Tue, Nov 22, 2011 at 7:15 AM, Mark W. Farnham wrote:
    It is a pity that there seems to be no way to use an index to grab block
    segments of a table to fts (and yes, then refilter the rows).
    How does that help here? The system is limited by data movement between
    storage and host so retrieving more data would not seem to add
    any benefit to me. Having just 35MB/s of bandwidth, sending any extra data
    seems counterproductive -- even if it means getting the filter IOPS bound
    (which I doubt it is).

    Just a thought. Might be more effort than value, but I think in the long
    haul everything just gets too big for a full FTS, even in parallel. Heh,
    the
    software effort would have been more defensible 20 years ago.
    I'd say it's quite the opposite -- it is faster and more efficient to
    leverage a parallel sequential scan and filter out rows when it's a "many
    rows" problem than to get 1 block at a time for a "many rows" problem
    -- especially if your storage is rotating rust. Access via rowid is great
    for a "few rows" problem and "few" has an absolute ceiling IMO. This is
    the same reason that star transformation using bitmap indexes works great
    if the number of rowid gets is few, but once you cross into the 1 to 5
    million range it degrades and a FTS with the appropriate hardware will beat
    it out. The caveat here is that FTS does not mean read every single row
    blindly -- it means encompassing features like partitioning or storage
    index type functionality that leverages ordering/clustering.

    The high-level challenge here that I see is the client trying to fit the
    foot to the shoe, not the shoe to the foot, which is resulting in some
    edge-case engineering. As one of my university CS professors often said, "just
    because it works, doesn't mean it's right."
  • Mark W. Farnham at Nov 23, 2011 at 2:47 am
    Obviously I did not make myself clear. In the case where an FTS wins, a
    partial FTS should win even bigger.

    Omitting the potentially large gaps in an FTS (the ranges of blocks
    [not rows] required can be obtained from the index) gives you a less expensive
    FTS. So a partial FTS would win whenever the cost of determining the block
    ranges required is less than the cost saved by being a smaller FTS. A
    similar code path could potentially mitigate the empty front problem as well
    for non-index related FTS. (In case the "empty front problem" is in any way
    mysterious: that is when completely empty blocks near the beginning of the
    extent map remain completely empty for long periods, either because they are
    down the freelist or ASSM bitmap allocation stack, or because new rows only
    come in via direct load above the high water mark, or because there simply are
    no new rows coming in after a big delete for long enough to matter. Just as
    you can construct a block map from an index, an empty block map could be
    maintained: deleting the last row below the high water mark in a block records
    that block in the map, and inserting the first row into an old block takes it
    back out. Then any old FTS could skip "large enough to yield a benefit" empty
    block ranges. This would be trickier but more useful than the easier fix of
    recording a low water mark [which also is apparently insufficiently appealing
    to implement].)

    You're right that if the real problem on this system is that data is moved
    slowly you want to move less data. This would only win when the FTS is
    moving less data in the first place.

    I don't get where you translated block ranges into rowid access from what I
    wrote, but just to be clear: I'm not suggesting that at all. I'm suggesting
    that in many cases significant numbers of block ranges don't need to be
    visited at all by the FTS.

    Regards,

    mwf



    -----Original Message-----
    From: [email protected]
    On Behalf Of Greg Rahn
    Sent: Tuesday, November 22, 2011 4:33 PM
    To: Mark W. Farnham
    Cc: [email protected]
    Subject: Re: index with very high LIO
    On Tue, Nov 22, 2011 at 7:15 AM, Mark W. Farnham wrote:
    It is a pity that there seems to be no way to use an index to grab
    block segments of a table to fts (and yes, then refilter the rows).
    How does that help here? The system is limited by data movement between
    storage and host so retrieving more data would not seem to add any benefit
    to me. Having just 35MB/s of bandwidth, sending any extra data seems
    counterproductive -- even if it means getting the filter IOPS bound (which
    I doubt it is).

    Just a thought. Might be more effort than value, but I think in the long
    haul everything just gets too big for a full FTS, even in parallel.
    Heh, the software effort would have been more defensible 20 years ago.
    I'd say it's quite the opposite -- it is faster and more efficient to
    leverage a parallel sequential scan and filter out rows when it's a "many
    rows" problem than to get 1 block at a time for a "many rows" problem
    -- especially if your storage is rotating rust. Access via rowid is great
    for a "few rows" problem and "few" has an absolute ceiling IMO. This is
    the same reason that star transformation using bitmap indexes works great
    if the number of rowid gets is few, but once you cross into the 1 to 5
    million range it degrades and a FTS with the appropriate hardware will beat
    it out. The caveat here is that FTS does not mean read every single row
    blindly -- it means encompassing features like partitioning or storage
    index type functionality that leverages ordering/clustering.

    The high level challenge here that I see is the client trying to fit the
    foot to the shoe, not the shoe to the foot which is resulting in some edge
    case engineering. As one of my university CS professors often said, "just
    because it works, doesn't mean it's right."
  • Hrishy at Nov 19, 2011 at 11:30 am
    Hi Greg, Jeremy
    Slightly off-topic, but I am curious to know how to measure that the throughput is 20MB/s or 30MB/s.



    ________________________________
    From: Jeremy Schneider <[email protected]>
    To: Greg Rahn <[email protected]>
    Cc: "[email protected]" <[email protected]>
    Sent: Saturday, 19 November 2011, 0:16
    Subject: Re: index with very high LIO

    The reason for 20MB/s is the storage network.  They're using DNFS with a
    single (bonded-backup-mode) 1G network card.  (FYI, tcp window sizes are
    16M.)  Last night I ran a parallel(64) query and I think that I was
    still only getting something around 30 MB/s throughput w/direct path...
    but it's tough to piece together 64 tracefiles to get an accurate total
    throughput measure.  (By another measure I saw 60 MB/s but I think that
    might have included some cache effect...)  At any rate the best
    theoretical possible on a one gig card is around 100MB/s.  I don't think
    it's possible to full scan two 400 GB tables over a 1GB network link in
    10 minutes, is it?  Also, this is not a warehouse system... it's
    *heavily* operational, and the operational workload is running
    satisfactorily with 1G cards.  In fact this query is trying to extract a
    day's worth of data to get it INTO a warehouse.

    I do think the slow link skews the usual points where it becomes
    beneficial to switch from index to hash methods.  I'd wager that on this
    system, a couple million rows were probably faster with the index.
    (Whereas generally a hash might win for this many rows.)  Recently this
    query has inched toward 10 million, and perhaps now a hash join is going
    to be better -- but it's still going to take many hours.

    But I'm thinking that I've finally started to inch toward the real
    problem: rollbacks from other activity on the system -- in fact this
    could also explain why, recently, the problem seemed to accelerate
    exponentially.  This is why I originally asked about the 13 gets/row on
    the index... and I think Jonathan's pointer got me moving in the right
    direction.  In fact, we would still have the same problem with a hash
    join - because it's still going to take several hours on this busy
    system with only a 1GB storage network.

    I'm thinking that the real answer is to use shorter-running queries...
    something as simple as getting 2-4 hours of data at a time instead of 24
    hours at once could in fact completely resolve this problem.  Which
    interestingly, might switch us back to nested loop joins again since
    we're going to be getting a lot less data at once.

    Still in progress... so additional thoughts (from anyone on the list)
    are welcomed.

    -Jeremy

    PS... we have also been discussing stats and partitioning internally...
    I just view the above issues as primary.

    --
    http://www.ardentperf.com
    +1 312-725-9249

    Jeremy Schneider
    Chicago

    On 11/17/2011 11:32 PM, Greg Rahn wrote:
    So a few more comments:
    - Even if you had perfect clustering factor, I still think a PX + FTS
    + HJ is a much better solution for this type/size of problem.  Would
    you really want to get 7 million rows one rowid at a time?
    - 20MB/s is about 1/10 of what 1 single process (serial execution) can
    do on a modern processor -- for CPU light operations like count(*)
    with FTS, you should be ~200MB/s.  Need to investigate why that is so
    low.
    - On a modern system (say any 5400-5600 Xeon system) this query should
    not run 10 hours, I'd wager it should take less than 10 minutes using
    PX.
    - Is BT1.LAST_MODIFIED_DATETIME the range partition key on this table
    (it probably should be if it's a frequent filter)?  Daily partitions
    might work quite nicely here.
    - Your query uses November dates of  LAST_MODIFIED_DATETIME, yet your
    stats are from 06-aug-2011.  It probably would not hurt to refresh
    those - they are 2 months old.

    On Thu, Nov 17, 2011 at 12:06 PM, Jeremy Schneider
    wrote:
    Yeah, you're not the first one to point that out...  I'm doing some
    tests now.  FWIW it's 7 million out of over a billion rows that are all
    stuffed into a 400GB heap table with the worst possible clustering
    factor on the foreign key we're using to join.  Also I haven't been able
    to get more than 20MB/s total throughput with parallel full scans,
    although single-block reads go at a somewhat decent rate of 5ms.  But
    all the same, I'm checking out the hash join... likely it will turn out
    to be more efficient.

    I should have checked this out more in-depth awhile ago...  I got a
    little carried away with maths and mistakenly thought that it was
    totally unrealistic.  This is the problem with ruling things out based
    on "calculations".  There's little to verify your correctness.  It's
    much easier to find your mistakes when you rule things out with
    *testing* instead...  I should know this by now, for some reason I'm
    still making the occasional poor judgements in the heat of the moment.  :)

    -Jeremy

    On 11/17/2011 01:24 PM, Greg Rahn wrote:
    Am I reading this right?  7 million rows joined using NESTED LOOPS
    join?  Wouldn't a parallel hash join be better/faster?

    On Thu, Nov 17, 2011 at 3:34 AM, Jeremy Schneider
    wrote:
    PS... 8k block size; segment of index in question is 100G.  Query is
    pulling 7 million rows from a join of two billion row tables...
    --
    http://www.freelists.org/webpage/oracle-l
    --
    http://www.freelists.org/webpage/oracle-l
