FAQ
Hi,

Please tell me how density is calculated in case of histograms.

I've a table with very skewed data:

scott_at_ORADB10G> select col_skew, count(*) from tab_skew group by col_skew
order by 1;

COL_SKEW COUNT(*)

---------- ----------

1 10
2 10
3 10
4 10
5 10
6 10
7 10
8 10
9 10
10 9910

10 rows selected.

scott_at_ORADB10G> exec dbms_stats.delete_table_stats(user,'TAB_SKEW');

PL/SQL procedure successfully completed.

Creating width-based histograms, #buckets = #distinct values
scott_at_ORADB10G> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW',
method_opt=>'FOR COLUMNS COL_SKEW size 10');

PL/SQL procedure successfully completed.

scott_at_ORADB10G> select * from dba_tab_col_statistics where
table_name='TAB_SKEW' and column_name='COL_SKEW';

OWNER TABLE_NA COLUMN_NAME NUM_DISTINCT LOW_VA HIGH_V DENSITY

NUM_NULLS NUM_BUCKETS LAST_ANAL SAMPLE_SIZE GLO USE AVG_COL_LEN HISTOGRAM

------ -------- -------------------- ------------ ------ ------ ----------

---------- ----------- --------- ----------- --- --- -----------

SCOTT TAB_SKEW COL_SKEW 10 C102 C10B
.00005 0 10 06-APR-09 10000 YES NO 3

FREQUENCY

Question: How density is calculated here? Also it's equal to Selectivity of
non-popular values.

scott_at_ORADB10G> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW',
method_opt=>'FOR COLUMNS COL_SKEW size 5');

PL/SQL procedure successfully completed.

Creating Height-based histograms, #buckets < #distinct values
scott_at_ORADB10G> select * from dba_tab_col_statistics where
table_name='TAB_SKEW' and column_name='COL_SKEW';

OWNER TABLE_NA COLUMN_NAME NUM_DISTINCT LOW_VA HIGH_V DENSITY

NUM_NULLS NUM_BUCKETS LAST_ANAL SAMPLE_SIZE GLO USE AVG_COL_LEN HISTOGRAM

------ -------- -------------------- ------------ ------ ------ ----------

---------- ----------- --------- ----------- --- --- -----------

SCOTT TAB_SKEW COL_SKEW 10 C102 C10B
.98209 0 5 06-APR-09 10000 YES NO 3

HEIGHT BALANCED

Question: How density is calculated here and how selectivity will be
estimated for popolar as well as non-popular values?

Any points/advice will be appreciated.

Neeraj.

Search Discussions

  • Martin Klier at Apr 6, 2009 at 12:04 pm
    Hi Neeraj,

    no straight answer to your question, but on which DB version are you
    testing?

    Keep the estimate_percent default values in mind. Maybe if you force
    dbms_stats to compute stats, the result will differ.

    --
    Mit freundlichem Gruß

    Martin Klier
    Senior Oracle Database Administrator
    ------------------------------------------------------------------------------

    Klug GmbH integrierte Systeme
    Lindenweg 13, D-92552 Teunz
    Tel.: +49 9671/9216-245
    Fax.: +49 9671/9216-112
    mailto: martin.klier_at_klug-is.de
    www.klug-is.de
    ------------------------------------------------------------------------------

    Geschäftsführer: Johann Klug, Roman Sorgenfrei
    Sitz der Gesellschaft: Teunz, USt-ID-Nr. DE175481608,
    HRB Nr. 2037, Amtsgericht Amberg

    oracle-l-bounce_at_freelists.org schrieb am 06.04.2009 08:47:30:
    Von:

    Neeraj Bhatia

    An:

    oracle-l@freelists.org

    Datum:

    06.04.2009 08:48

    Betreff:

    Density statistic calculation in case of histogram

    Gesendet von:

    oracle-l-bounce_at_freelists.org

    Hi,
    Please tell me how density is calculated in case of histograms.
    I've a table with very skewed data:
    scott_at_ORADB10G> select col_skew, count(*) from tab_skew group by
    col_skew order by 1;
    COL_SKEW   COUNT(*)
    ---------- ----------
    1         10
    2         10
    3         10
    4         10
    5         10
    6         10
    7         10
    8         10
    9         10
    10       9910
    10 rows selected.
    scott_at_ORADB10G> exec dbms_stats.delete_table_stats(user,'TAB_SKEW');
    PL/SQL procedure successfully completed.
    -- Creating width-based histograms, #buckets = #distinct values
    scott_at_ORADB10G> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW',
    method_opt=>'FOR COLUMNS COL_SKEW size 10');
    PL/SQL procedure successfully completed.
    scott_at_ORADB10G> select * from dba_tab_col_statistics where
    table_name='TAB_SKEW' and column_name='COL_SKEW';
    OWNER  TABLE_NA COLUMN_NAME          NUM_DISTINCT LOW_VA HIGH_V
    DENSITY  NUM_NULLS NUM_BUCKETS LAST_ANAL SAMPLE_SIZE GLO USE
    AVG_COL_LEN HISTOGRAM
    ------ -------- -------------------- ------------ ------ ------
    ---------- ---------- ----------- --------- ----------- --- ---
    ----------- ---------------
    SCOTT  TAB_SKEW COL_SKEW                       10 C102   C10B
    .00005          0          10 06-APR-09       10000 YES
    NO            3 FREQUENCY
    Question: How density is calculated here? Also it's equal to
    Selectivity of non-popular values.
    scott_at_ORADB10G> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW',
    method_opt=>'FOR COLUMNS COL_SKEW size 5');
    PL/SQL procedure successfully completed.
    -- Creating Height-based histograms, #buckets < #distinct values
    scott_at_ORADB10G> select * from dba_tab_col_statistics where
    table_name='TAB_SKEW' and column_name='COL_SKEW';
    OWNER  TABLE_NA COLUMN_NAME          NUM_DISTINCT LOW_VA HIGH_V
    DENSITY  NUM_NULLS NUM_BUCKETS LAST_ANAL SAMPLE_SIZE GLO USE
    AVG_COL_LEN HISTOGRAM
    ------ -------- -------------------- ------------ ------ ------
    ---------- ---------- ----------- --------- ----------- --- ---
    ----------- ---------------
    SCOTT  TAB_SKEW COL_SKEW                       10 C102   C10B
    .98209          0           5 06-APR-09       10000 YES
    NO            3 HEIGHT BALANCED
    Question: How density is calculated here and how selectivity will be
    estimated for popolar as well as non-popular values?

    Any points/advice will be appreciated.
    Neeraj.
    --
    http://www.freelists.org/webpage/oracle-l
  • Neeraj Bhatia at Apr 6, 2009 at 1:56 pm
    Hi Martin,

    Version: 10.2.0.1.0

    After reading an excellent paper "Histograms - Myths and Facts" by Wolfgang
    Breitling, some points get cleared.

    In case of width-based histograms, density= 1/ (2*number of distinct
    values)

    Thus come the figure 0.00005

    2) Without histograms, density should be equal to 1/number of distinct
    values. But in my case, it is null. I am not using method_opt while
    collecting statistics. The global default setting is FOR ALL INDEXED COLUMNS
    SIZE 1.

    3) In case of height-based histograms, I didn't get these calculations.

    density = Σ cnt2 / ( num_rows˜ * Σ cnt )

    “the sum of the squared frequencies of all non-popular values divided by the
    sum of the frequencies of all non-popular values times the count of rows
    with not null values of the histogram column”.

    Any comments will be appreciated.

    Neeraj.
    On Mon, Apr 6, 2009 at 5:34 PM, Martin Klier wrote:

    Hi Neeraj,

    no straight answer to your question, but on which DB version are you
    testing?

    Keep the estimate_percent default values in mind. Maybe if you force
    dbms_stats to compute stats, the result will differ.

    --
    Mit freundlichem Gruß


    Martin Klier
    Senior Oracle Database Administrator

    ------------------------------------------------------------------------------

    Klug GmbH integrierte Systeme
    Lindenweg 13, D-92552 Teunz
    Tel.: +49 9671/9216-245
    Fax.: +49 9671/9216-112
    mailto: martin.klier_at_klug-is.de
    www.klug-is.de

    ------------------------------------------------------------------------------

    Geschäftsführer: Johann Klug, Roman Sorgenfrei
    Sitz der Gesellschaft: Teunz, USt-ID-Nr. DE175481608,
    HRB Nr. 2037, Amtsgericht Amberg

    oracle-l-bounce_at_freelists.org schrieb am 06.04.2009 08:47:30:
    Von:

    Neeraj Bhatia

    An:

    oracle-l@freelists.org

    Datum:

    06.04.2009 08:48

    Betreff:

    Density statistic calculation in case of histogram

    Gesendet von:

    oracle-l-bounce_at_freelists.org

    Hi,
    Please tell me how density is calculated in case of histograms.
    I've a table with very skewed data:
    scott_at_ORADB10G> select col_skew, count(*) from tab_skew group by
    col_skew order by 1;
    COL_SKEW COUNT(*)
    ---------- ----------
    1 10
    2 10
    3 10
    4 10
    5 10
    6 10
    7 10
    8 10
    9 10
    10 9910
    10 rows selected.
    scott_at_ORADB10G> exec dbms_stats.delete_table_stats(user,'TAB_SKEW');
    PL/SQL procedure successfully completed.
    -- Creating width-based histograms, #buckets = #distinct values
    scott_at_ORADB10G> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW',
    method_opt=>'FOR COLUMNS COL_SKEW size 10');
    PL/SQL procedure successfully completed.
    scott_at_ORADB10G> select * from dba_tab_col_statistics where
    table_name='TAB_SKEW' and column_name='COL_SKEW';
    OWNER TABLE_NA COLUMN_NAME NUM_DISTINCT LOW_VA HIGH_V
    DENSITY NUM_NULLS NUM_BUCKETS LAST_ANAL SAMPLE_SIZE GLO USE
    AVG_COL_LEN HISTOGRAM
    ------ -------- -------------------- ------------ ------ ------
    ---------- ---------- ----------- --------- ----------- --- ---
    ----------- ---------------
    SCOTT TAB_SKEW COL_SKEW 10 C102 C10B
    .00005 0 10 06-APR-09 10000 YES
    NO 3 FREQUENCY
    Question: How density is calculated here? Also it's equal to
    Selectivity of non-popular values.
    scott_at_ORADB10G> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW',
    method_opt=>'FOR COLUMNS COL_SKEW size 5');
    PL/SQL procedure successfully completed.
    -- Creating Height-based histograms, #buckets < #distinct values
    scott_at_ORADB10G> select * from dba_tab_col_statistics where
    table_name='TAB_SKEW' and column_name='COL_SKEW';
    OWNER TABLE_NA COLUMN_NAME NUM_DISTINCT LOW_VA HIGH_V
    DENSITY NUM_NULLS NUM_BUCKETS LAST_ANAL SAMPLE_SIZE GLO USE
    AVG_COL_LEN HISTOGRAM
    ------ -------- -------------------- ------------ ------ ------
    ---------- ---------- ----------- --------- ----------- --- ---
    ----------- ---------------
    SCOTT TAB_SKEW COL_SKEW 10 C102 C10B
    .98209 0 5 06-APR-09 10000 YES
    NO 3 HEIGHT BALANCED
    Question: How density is calculated here and how selectivity will be
    estimated for popolar as well as non-popular values?

    Any points/advice will be appreciated.
    Neeraj.
    --
    http://www.freelists.org/webpage/oracle-l
  • Martin Klier at Apr 6, 2009 at 2:41 pm
    Nice to know!

    Maybe these papers are of interest for you as well, as much as they have
    been for me (they have been in this list for at least one time):

    http://method-r.com/downloads/doc_download/11-managing-statistics-for-optimal-query-performance-karen-morton

    http://richardfoote.wordpress.com/2008/01/04/dbms_stats-method_opt-default-behaviour-changed-in-10g-be-careful/

    HTH

    --
    Mit freundlichem Gruß

    Martin Klier
    Senior Oracle Database Administrator

    Klug GmbH integrierte Systeme
    Lindenweg 13, D-92552 Teunz
    Tel.: +49 9671/9216-245
    Fax.: +49 9671/9216-112
    mailto: martin.klier_at_klug-is.de
    www.klug-is.de

    Geschäftsführer: Johann Klug, Roman Sorgenfrei
    Sitz der Gesellschaft: Teunz, USt-ID-Nr. DE175481608,
    HRB Nr. 2037, Amtsgericht Amberg

    Neeraj Bhatia schrieb am 06.04.2009 15:56:10:
    Von:

    Neeraj Bhatia

    An:

    Martin Klier

    Kopie:

    oracle-l_at_freelists.org, oracle-l-bounce_at_freelists.org

    Datum:

    06.04.2009 15:57

    Betreff:

    Re: Density statistic calculation in case of histogram

    Hi Martin,

    Version: 10.2.0.1.0

    After reading an excellent paper "Histograms - Myths and Facts" by
    Wolfgang Breitling, some points get cleared.

    1)  In case of width-based histograms, density= 1/ (2*number of
    distinct values)

    Thus come the figure 0.00005

    2) Without histograms, density should be equal to 1/number of
    distinct values. But in my case, it is null. I am not using
    method_opt while collecting statistics. The global default setting
    is FOR ALL INDEXED COLUMNS SIZE 1.

    3) In case of height-based histograms, I didn't get these calculations.

    density =
    Σ cnt2 / ( num_rows˜ * Σ cnt )
    “the sum of the squared frequencies of all non-popular values
    divided by the sum of the frequencies of all non-popular values
    times the count of rows with not null values of the histogram column”.

    Any comments will be appreciated.

    Neeraj.
    On Mon, Apr 6, 2009 at 5:34 PM, Martin Klier wrote:
    Hi Neeraj,

    no straight answer to your question, but on which DB version are you
    testing?

    Keep the estimate_percent default values in mind. Maybe if you force
    dbms_stats to compute stats, the result will differ.

    --
    Mit freundlichem Gruß


    Martin Klier
    Senior Oracle Database Administrator
    Klug GmbH integrierte Systeme
    Lindenweg 13, D-92552 Teunz
    Tel.:  +49 9671/9216-245
    Fax.: +49 9671/9216-112
    mailto: martin.klier_at_klug-is.de
    www.klug-is.de
    Geschäftsführer: Johann Klug, Roman Sorgenfrei
    Sitz der Gesellschaft: Teunz, USt-ID-Nr. DE175481608,
    HRB Nr. 2037, Amtsgericht Amberg

    oracle-l-bounce_at_freelists.org schrieb am 06.04.2009 08:47:30:
    Von:

    Neeraj Bhatia

    An:

    oracle-l@freelists.org

    Datum:

    06.04.2009 08:48

    Betreff:

    Density statistic calculation in case of histogram

    Gesendet von:

    oracle-l-bounce_at_freelists.org

    Hi,
    Please tell me how density is calculated in case of histograms.
    I've a table with very skewed data:
    scott_at_ORADB10G> select col_skew, count(*) from tab_skew group by
    col_skew order by 1;
      COL_SKEW   COUNT(*)
    ---------- ----------
             1         10
             2         10
             3         10
             4         10
             5         10
             6         10
             7         10
             8         10
             9         10
            10       9910
    10 rows selected.
    scott_at_ORADB10G> exec dbms_stats.delete_table_stats(user,'TAB_SKEW');
    PL/SQL procedure successfully completed.
    -- Creating width-based histograms, #buckets = #distinct values
    scott_at_ORADB10G> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW',
    method_opt=>'FOR COLUMNS COL_SKEW size 10');
    PL/SQL procedure successfully completed.
    scott_at_ORADB10G> select * from dba_tab_col_statistics where
    table_name='TAB_SKEW' and column_name='COL_SKEW';
    OWNER  TABLE_NA COLUMN_NAME          NUM_DISTINCT LOW_VA HIGH_V
    DENSITY  NUM_NULLS NUM_BUCKETS LAST_ANAL SAMPLE_SIZE GLO USE
    AVG_COL_LEN HISTOGRAM
    ------ -------- -------------------- ------------ ------ ------
    ---------- ---------- ----------- --------- ----------- --- ---
    ----------- ---------------
    SCOTT  TAB_SKEW COL_SKEW                       10 C102   C10B
    .00005          0          10 06-APR-09       10000 YES
    NO            3 FREQUENCY
    Question: How density is calculated here? Also it's equal to
    Selectivity of non-popular values.
    scott_at_ORADB10G> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW',
    method_opt=>'FOR COLUMNS COL_SKEW size 5');
    PL/SQL procedure successfully completed.
    -- Creating Height-based histograms, #buckets < #distinct values
    scott_at_ORADB10G> select * from dba_tab_col_statistics where
    table_name='TAB_SKEW' and column_name='COL_SKEW';
    OWNER  TABLE_NA COLUMN_NAME          NUM_DISTINCT LOW_VA HIGH_V
    DENSITY  NUM_NULLS NUM_BUCKETS LAST_ANAL SAMPLE_SIZE GLO USE
    AVG_COL_LEN HISTOGRAM
    ------ -------- -------------------- ------------ ------ ------
    ---------- ---------- ----------- --------- ----------- --- ---
    ----------- ---------------
    SCOTT  TAB_SKEW COL_SKEW                       10 C102   C10B
    .98209          0           5 06-APR-09       10000 YES
    NO            3 HEIGHT BALANCED
    Question: How density is calculated here and how selectivity will be
    estimated for popolar as well as non-popular values?

    Any points/advice will be appreciated.
    Neeraj.
    †Ûiÿü0ÁúÞzX¬¶Ê+ƒün– {ú+iÉ^
  • Neeraj Bhatia at Apr 6, 2009 at 3:20 pm
    Basically I am interested to know how density is calculated in case of
    height-based histograms. Also, how CBO calculate selectivity.

    Thnx,
    Neeraj
    On Mon, Apr 6, 2009 at 8:11 PM, Martin Klier wrote:

    Nice to know!

    Maybe these papers are of interest for you as well, as much as they have
    been for me (they have been in this list for at least one time):


    http://method-r.com/downloads/doc_download/11-managing-statistics-for-optimal-query-performance-karen-morton


    http://richardfoote.wordpress.com/2008/01/04/dbms_stats-method_opt-default-behaviour-changed-in-10g-be-careful/

    HTH
    --
    Mit freundlichem Gruß


    Martin Klier
    Senior Oracle Database Administrator

    ------------------------------------------------------------------------------

    Klug GmbH integrierte Systeme
    Lindenweg 13, D-92552 Teunz
    Tel.: +49 9671/9216-245
    Fax.: +49 9671/9216-112
    mailto: martin.klier_at_klug-is.de
    www.klug-is.de

    ------------------------------------------------------------------------------

    Geschäftsführer: Johann Klug, Roman Sorgenfrei
    Sitz der Gesellschaft: Teunz, USt-ID-Nr. DE175481608,
    HRB Nr. 2037, Amtsgericht Amberg

    Neeraj Bhatia schrieb am 06.04.2009 15:56:10:
    Von:

    Neeraj Bhatia

    An:

    Martin Klier

    Kopie:

    oracle-l@freelists.org, oracle-l-bounce_at_freelists.org

    Datum:

    06.04.2009 15:57

    Betreff:

    Re: Density statistic calculation in case of histogram

    Hi Martin,

    Version: 10.2.0.1.0

    After reading an excellent paper "Histograms - Myths and Facts" by
    Wolfgang Breitling, some points get cleared.

    1) In case of width-based histograms, density= 1/ (2*number of
    distinct values)

    Thus come the figure 0.00005

    2) Without histograms, density should be equal to 1/number of
    distinct values. But in my case, it is null. I am not using
    method_opt while collecting statistics. The global default setting
    is FOR ALL INDEXED COLUMNS SIZE 1.

    3) In case of height-based histograms, I didn't get these calculations.

    density =
    Σ cnt2 / ( num_rows˜ * Σ cnt )
    “the sum of the squared frequencies of all non-popular values
    divided by the sum of the frequencies of all non-popular values
    times the count of rows with not null values of the histogram column”.

    Any comments will be appreciated.

    Neeraj.
    On Mon, Apr 6, 2009 at 5:34 PM, Martin Klier wrote:
    Hi Neeraj,

    no straight answer to your question, but on which DB version are you
    testing?

    Keep the estimate_percent default values in mind. Maybe if you force
    dbms_stats to compute stats, the result will differ.

    --
    Mit freundlichem Gruß


    Martin Klier
    Senior Oracle Database Administrator
    ------------------------------------------------------------------------------
    Klug GmbH integrierte Systeme
    Lindenweg 13, D-92552 Teunz
    Tel.: +49 9671/9216-245
    Fax.: +49 9671/9216-112
    mailto: martin.klier_at_klug-is.de
    www.klug-is.de
    ------------------------------------------------------------------------------
    Geschäftsführer: Johann Klug, Roman Sorgenfrei
    Sitz der Gesellschaft: Teunz, USt-ID-Nr. DE175481608,
    HRB Nr. 2037, Amtsgericht Amberg

    oracle-l-bounce_at_freelists.org schrieb am 06.04.2009 08:47:30:
    Von:

    Neeraj Bhatia

    An:

    oracle-l@freelists.org

    Datum:

    06.04.2009 08:48

    Betreff:

    Density statistic calculation in case of histogram

    Gesendet von:

    oracle-l-bounce_at_freelists.org

    Hi,
    Please tell me how density is calculated in case of histograms.
    I've a table with very skewed data:
    scott_at_ORADB10G> select col_skew, count(*) from tab_skew group by
    col_skew order by 1;
    COL_SKEW COUNT(*)
    ---------- ----------
    1 10
    2 10
    3 10
    4 10
    5 10
    6 10
    7 10
    8 10
    9 10
    10 9910
    10 rows selected.
    scott_at_ORADB10G> exec dbms_stats.delete_table_stats(user,'TAB_SKEW');
    PL/SQL procedure successfully completed.
    -- Creating width-based histograms, #buckets = #distinct values
    scott_at_ORADB10G> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW',
    method_opt=>'FOR COLUMNS COL_SKEW size 10');
    PL/SQL procedure successfully completed.
    scott_at_ORADB10G> select * from dba_tab_col_statistics where
    table_name='TAB_SKEW' and column_name='COL_SKEW';
    OWNER TABLE_NA COLUMN_NAME NUM_DISTINCT LOW_VA HIGH_V
    DENSITY NUM_NULLS NUM_BUCKETS LAST_ANAL SAMPLE_SIZE GLO USE
    AVG_COL_LEN HISTOGRAM
    ------ -------- -------------------- ------------ ------ ------
    ---------- ---------- ----------- --------- ----------- --- ---
    ----------- ---------------
    SCOTT TAB_SKEW COL_SKEW 10 C102 C10B
    .00005 0 10 06-APR-09 10000 YES
    NO 3 FREQUENCY
    Question: How density is calculated here? Also it's equal to
    Selectivity of non-popular values.
    scott_at_ORADB10G> exec dbms_stats.gather_table_stats(user, 'TAB_SKEW',
    method_opt=>'FOR COLUMNS COL_SKEW size 5');
    PL/SQL procedure successfully completed.
    -- Creating Height-based histograms, #buckets < #distinct values
    scott_at_ORADB10G> select * from dba_tab_col_statistics where
    table_name='TAB_SKEW' and column_name='COL_SKEW';
    OWNER TABLE_NA COLUMN_NAME NUM_DISTINCT LOW_VA HIGH_V
    DENSITY NUM_NULLS NUM_BUCKETS LAST_ANAL SAMPLE_SIZE GLO USE
    AVG_COL_LEN HISTOGRAM
    ------ -------- -------------------- ------------ ------ ------
    ---------- ---------- ----------- --------- ----------- --- ---
    ----------- ---------------
    SCOTT TAB_SKEW COL_SKEW 10 C102 C10B
    .98209 0 5 06-APR-09 10000 YES
    NO 3 HEIGHT BALANCED
    Question: How density is calculated here and how selectivity will be
    estimated for popolar as well as non-popular values?

    Any points/advice will be appreciated.
    Neeraj.
    --
    http://www.freelists.org/webpage/oracle-l
  • Alberto Dell'Era at Apr 8, 2009 at 10:14 pm

    On Mon, Apr 6, 2009 at 17:20, Neeraj Bhatia wrote:
    Basically I am interested to know how density is calculated in case of
    height-based histograms. Also, how CBO calculate selectivity.
    I wouldn't investigate too much the density in 10.x since the formula
    has changed (to a very intuitive one) in 11gR1:

    http://www.adellera.it/investigations/11g_newdensity/index.html

    so, the 10.x formula has been "deprecated" by the kernel developers.

    I did find a statistical justification for the 10.x computation, but
    I have never "published" it for the above reasons, it is no longer
    important.

    hth
    alberto
  • Neeraj Bhatia at Apr 9, 2009 at 1:19 am
    Thanks Alberto for the update. Actually I am investigating the calculations
    in R10 and my plan was to later cover R11 as well. Since still most of the
    systems are on R10 and in future also (atleast 2-3 years), I am very
    interested.

    Having said that, your references and update will be very helpful. I just
    downloaded the stuff and reading :-)

    On Thu, Apr 9, 2009 at 3:44 AM, Alberto Dell'Era
    wrote:
    On Mon, Apr 6, 2009 at 17:20, Neeraj Bhatia wrote:
    Basically I am interested to know how density is calculated in case of
    height-based histograms. Also, how CBO calculate selectivity.
    I wouldn't investigate too much the density in 10.x since the formula
    has changed (to a very intuitive one) in 11gR1:

    http://www.adellera.it/investigations/11g_newdensity/index.html

    so, the 10.x formula has been "deprecated" by the kernel developers.

    I did find a statistical justification for the 10.x computation, but
    I have never "published" it for the above reasons, it is no longer
    important.

    hth
    alberto
    --
    http://www.freelists.org/webpage/oracle-l
  • Alberto Dell'Era at Apr 9, 2009 at 8:21 am

    On Thu, Apr 9, 2009 at 03:19, Neeraj Bhatia wrote:
    Thanks Alberto for the update. Actually I am investigating the calculations
    in R10 and my plan was to later cover R11 as well. Since still most of the
    systems are on R10 and in future also (atleast 2-3 years), I am very
    interested.
    Ok, please find below my interpretation about the meaning of density,
    retrieved from an old offline discussion.

    First of all, for a filter predicate "where column=constant", density is
    not used for popular values. So, consider only the case of constant
    being not popular, and imagine removing the popular values from the
    table (which is what dbms_stats does).

    Of course the estimated cardinality of a filter predicate "where
    column=constant" "has to be" the expected value of the cardinality,
    that is, the average value of the number of rows retrieved over all possible
    values of constant.
    If we knew the probability mass function (PMF) of "constant" (i.e. we could
    characterize the workload perfectly), the expected value would be

    sum( count(constant) * PMF (constant) ), over all values of constant

    dbms_stats knows count(constant) perfectly, knows nothing about the PMF.

    If we assume PMF(constant) = count(constant) / num_rows, that is,
    the more frequently constant is represented in the table, the higher
    the probability of being requested by the client is
    => we get the square-based "OldDensity" formula.

    If we assume PMF(constant) = 1 / num_distinct, that is, we assume
    that each *distinct value* has the same probability of being requested
    by the client
    regardless of how frequently it is represented in the table
    => we get the "NewDensity" formula (which is also the most intuitive).

    Remember that the above figures must be adjusted by num_rows,
    since the estimated cardinality is density * num_rows, where num_rows
    is the total number of rows of the original table.

    hth
    Alberto

    --
    Alberto Dell'Era
    "the more you know, the faster you go"
    --
    http://www.freelists.org/webpage/oracle-l
  • Neeraj Bhatia at Apr 9, 2009 at 2:25 am
    Thanks Alberto! It was w good read (11g new density).

    One question: You have mentioned that CBO classifies at run-time the
    histogram type as it's stored in DD using a heuristic. As per my knowledge,
    based on NDV and #Buckets requested, CBO decides whether to create HB or
    Freq histogram and save the type in data dictionary, which is visible
    through dba_tab_columns.histogram column.

    Please correct me!

    On Thu, Apr 9, 2009 at 3:44 AM, Alberto Dell'Era
    wrote:
    On Mon, Apr 6, 2009 at 17:20, Neeraj Bhatia wrote:
    Basically I am interested to know how density is calculated in case of
    height-based histograms. Also, how CBO calculate selectivity.
    I wouldn't investigate too much the density in 10.x since the formula
    has changed (to a very intuitive one) in 11gR1:

    http://www.adellera.it/investigations/11g_newdensity/index.html

    so, the 10.x formula has been "deprecated" by the kernel developers.

    I did find a statistical justification for the 10.x computation, but
    I have never "published" it for the above reasons, it is no longer
    important.

    hth
    alberto
    --
    http://www.freelists.org/webpage/oracle-l
  • Alberto Dell'Era at Apr 9, 2009 at 8:06 am

    On Thu, Apr 9, 2009 at 04:25, Neeraj Bhatia wrote:
    One question: You have mentioned that CBO classifies at run-time the
    histogram type as it's stored in DD using a heuristic. As per my knowledge,
    based on NDV and #Buckets requested, CBO decides whether to create HB or
    Freq histogram and save the type in data dictionary, which is visible
    through dba_tab_columns.histogram column.
    Actually the histogram type is NOT stored in the data dictionary, since
    the CBO does not need (very surprisingly at first sight) to know the
    histogram type,
    since frequency and height-based histograms are used the same by the internal
    formulae ... see the following paper for further details:

    http://www.adellera.it/investigations/join_over_histograms/JoinOverHistograms.pdf

    dba_tab_columns.histogram is a view on the data dictionary, and the
    column "histogram" is calculated, not retrieved (as you can easily
    check yourself).
    Here is an excerpt from the same paper:
    "10g dba_tab_columns.histogram is an heuristic based on num_rows,
    num_buckets, num_distinct and density. It is also frequently wrong for
    FHs, since it contains the factor density*num_buckets <= 0.5 that is
    very prone to rounding errors (e.g. density*num_buckets = 0.5000001
    instead of 0.5). Also, the type of histogram reported in the 10053
    event is for information only, probably based on the same formula used
    for dba_tab_columns.histogram
    in 10g, and with the same weaknesses."

    hth
    Alberto

    --
    Alberto Dell'Era
    "the more you know, the faster you go"
    --
    http://www.freelists.org/webpage/oracle-l
  • Greg Rahn at Apr 6, 2009 at 2:42 pm

    On Mon, Apr 6, 2009 at 6:56 AM, Neeraj Bhatia wrote:
    2) Without histograms, density should be equal to 1/number of distinct
    values. But in my case, it is null. I am not using method_opt while
    collecting statistics. The global default setting is FOR ALL INDEXED COLUMNS
    SIZE 1.
    I would strongly suggest revisiting that METHOD_OPT choice. It is
    probably one of the worst ones to choose IMO. Here is why:
    http://structureddata.org/2008/10/14/dbms_stats-method_opt-and-for-all-indexed-columns/
  • Yong Huang at Apr 7, 2009 at 7:45 pm
    Neeraj,

    The calculation of density in case of a histogram is documented in the US patent 6732085:
    http://www.freepatentsonline.com/6732085.html
    Jonathan Lewis's CBO book p.172 says the same: "sum of the square of the frequency of the nonpopular values / (number of nonnull rows * number of nonpopular nonull rows)", where "frequency" is called repetition count in the patent.

    Intuitive understanding is beyond me. But it's not hard to apply the proposed formula. In your case of 5 bucket height-balanced histogram, it is:

    bucket# 1 2 3 4 5
    col_skew 1-10 10 10 10 10

    10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2
    -------------------------------------------- = 0.001
    10000 * 90

    So the correct answer is 0.001. If you run your test in 10.2.0.4, you'll get this number. In 10.2.0.1, you get 0.98209 for some reason. You would "accidentally" get that number if you applied the formula to your 10-bucket frequency (you call width-based) histogram:

    bucket# 1 2 3 4 5 6 7 8 9 10
    col_skew 1 2 3 4 5 6 7 8 9 10

    10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+9910^2
    --------------------------------------------------- = .98209
    10000 * 10000

    Yong Huang

    Neeraj Bhatia wrote:

    Version: 10.2.0.1.0

    After reading an excellent paper "Histograms - Myths and Facts" by Wolfgang
    Breitling, some points get cleared.

    In case of width-based histograms, density= 1/ (2*number of distinct
    values)

    Thus come the figure 0.00005

    2) Without histograms, density should be equal to 1/number of distinct
    values. But in my case, it is null. I am not using method_opt while
    collecting statistics. The global default setting is FOR ALL INDEXED COLUMNS
    SIZE 1.

    3) In case of height-based histograms, I didn't get these calculations.

    density = Σ cnt2 / ( num_rows˜ * Σ cnt )

    “the sum of the squared frequencies of all non-popular values divided by the
    sum of the frequencies of all non-popular values times the count of rows
    with not null values of the histogram column�.
  • Neeraj Bhatia at Apr 8, 2009 at 1:51 pm
    Thanks a lot Yong for very straight forward answer and reference. It helped
    me and now concept it clear to me.
    I'll also install newer patch set and check densities on that.

    BTW, have you get a chance to look into Wolfgang's presentation, where he
    discussed the calculation of density for various buckets?

    Again thanks a lot!
    Neeraj Bhatia
    On Wed, Apr 8, 2009 at 1:15 AM, Yong Huang wrote:


    Neeraj,

    The calculation of density in case of a histogram is documented in the US
    patent 6732085:
    http://www.freepatentsonline.com/6732085.html
    Jonathan Lewis's CBO book p.172 says the same: "sum of the square of the
    frequency of the nonpopular values / (number of nonnull rows * number of
    nonpopular nonull rows)", where "frequency" is called repetition count in
    the patent.

    Intuitive understanding is beyond me. But it's not hard to apply the
    proposed formula. In your case of 5 bucket height-balanced histogram, it is:

    bucket# 1 2 3 4 5
    col_skew 1-10 10 10 10 10

    10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2
    -------------------------------------------- = 0.001
    10000 * 90

    So the correct answer is 0.001. If you run your test in 10.2.0.4, you'll
    get this number. In 10.2.0.1, you get 0.98209 for some reason. You would
    "accidentally" get that number if you applied the formula to your 10-bucket
    frequency (you call width-based) histogram:

    bucket# 1 2 3 4 5 6 7 8 9 10
    col_skew 1 2 3 4 5 6 7 8 9 10

    10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+9910^2
    --------------------------------------------------- = .98209
    10000 * 10000

    Yong Huang


    Neeraj Bhatia wrote:

    Version: 10.2.0.1.0

    After reading an excellent paper "Histograms - Myths and Facts" by Wolfgang
    Breitling, some points get cleared.

    1) In case of width-based histograms, density= 1/ (2*number of distinct
    values)

    Thus come the figure 0.00005

    2) Without histograms, density should be equal to 1/number of distinct
    values. But in my case, it is null. I am not using method_opt while
    collecting statistics. The global default setting is FOR ALL INDEXED
    COLUMNS
    SIZE 1.
    3) In case of height-based histograms, I didn't get these calculations.

    density = Σ cnt2 / ( num_rows˜ * Σ cnt )

    “the sum of the squared frequencies of all non-popular values divided by
    the
    sum of the frequencies of all non-popular values times the count of rows
    with not null values of the histogram column�.



    --
    http://www.freelists.org/webpage/oracle-l
  • Yong Huang at Apr 9, 2009 at 2:16 am
    You're welcome. I found a 10.2.0.2 database and tried your test case. The problem of density calculation in 10.2.0.1 disappears in 10.2.0.2.

    I spent many hours reading Wolfgang's CBO articles in 2003 and 2004. You don't need to ask, because it's not possible for me to miss any important articles or books on Oracle.

    To add to the list of formulas of density calculation, there's one more rarely mentioned. Tom Kyte showed a "limited"-published Metalink document at
    http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2969235095639
    That may be applicable only to certain old versions.

    Yong Huang

    On Wed, 4/8/09, Neeraj Bhatia wrote:

    From: Neeraj Bhatia
    Subject: Re: Density statistic calculation in case of histogram
    To: yong321_at_yahoo.com
    Cc: oracle-l@freelists.org
    Date: Wednesday, April 8, 2009, 8:51 AM

    Thanks a lot Yong for very straight forward answer and reference. It helped
    me and now concept it clear to me. I'll also install newer patch set and
    check densities on that.

    BTW, have you get a chance to look into Wolfgang's presentation, where he
    discussed the calculation of density for various buckets?

    Again thanks a lot!
    Neeraj Bhatia

    On Wed, Apr 8, 2009 at 1:15 AM, Yong Huang
    wrote:
    Neeraj,

    The calculation of density in case of a histogram is
    documented in the US
    patent 6732085:
    http://www.freepatentsonline.com/6732085.html
    Jonathan Lewis's CBO book p.172 says the same:
    "sum of the square of the
    frequency of the nonpopular values / (number of
    nonnull rows * number of
    nonpopular nonull rows)", where
    "frequency" is called repetition count in
    the patent.

    Intuitive understanding is beyond me. But it's not
    hard to apply the
    proposed formula. In your case of 5 bucket
    height-balanced histogram, it is:
    bucket# 1 2 3 4 5
    col_skew 1-10 10 10 10 10

    10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2
    -------------------------------------------- = 0.001
    10000 * 90

    So the correct answer is 0.001. If you run your test
    in 10.2.0.4, you'll
    get this number. In 10.2.0.1, you get 0.98209 for some
    reason. You would
    "accidentally" get that number if you
    applied the formula to your 10-bucket
    frequency (you call width-based) histogram:

    bucket# 1 2 3 4 5 6 7 8 9 10
    col_skew 1 2 3 4 5 6 7 8 9 10
    10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+10^2+9910^2
    --------------------------------------------------- = .98209
    10000 * 10000

    Yong Huang


    Neeraj Bhatia wrote:

    Version: 10.2.0.1.0

    After reading an excellent paper "Histograms -
    Myths and Facts" by Wolfgang
    Breitling, some points get cleared.

    1) In case of width-based histograms, density= 1/
    (2*number of distinct
    values)

    Thus come the figure 0.00005

    2) Without histograms, density should be equal to
    1/number of distinct
    values. But in my case, it is null. I am not using
    method_opt while
    collecting statistics. The global default setting is
    FOR ALL INDEXED
    COLUMNS
    SIZE 1.
    3) In case of height-based histograms, I didn't
    get these calculations.
    density = Σ cnt2 / ( num_rows˜ * Σ cnt )

    “the sum of the squared frequencies of all
    non-popular values divided by
    the
    sum of the frequencies of all non-popular values times
    the count of rows
    with not null values of the histogram column�.


    --
    http://www.freelists.org/webpage/oracle-l
  • Yong Huang at Apr 9, 2009 at 8:46 pm
    Alberto's article is very interesting. It looks like Oracle keeps making changes to histogram algorithm. Not answering Neeraj's question. I want to point out that even the type, height-balanced vs frequency, is determined differently in different versions. Between 10gR1 and 10.2.0.3, the type is determined by

    SQL> select text from dba_views where view_name = 'ALL_TAB_COLS';
    ...
    case when nvl(h.row_cnt,0) = 0 then 'NONE'

    when (h.bucket_cnt > 255
    or
    (h.bucket_cnt > h.distcnt and h.row_cnt = h.distcnt
    and h.density*h.bucket_cnt*2 <= 1))
    then 'FREQUENCY'
    else 'HEIGHT BALANCED'

    In 10.2.0.4, the "*2" bit on the 5th line is dropped. In 11gR1, "<=" changes to "<" (in addition to the drop of "*2"). We normally determine the type by the difference between bucket count and distinct value count. But the above formula tells us there may be some more criteria. Bucket_cnt>255 is not possible; it's there for historical reason? Row_cnt=distcnt, and the density vs bucket_cnt comparison? And how can density become part of type determination? It would cause density calculation to be postponed till the type has been determined, causing circular logic.

    Yong Huang

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouporacle-l @
categoriesoracle
postedApr 6, '09 at 6:47a
activeApr 9, '09 at 8:46p
posts15
users5
websiteoracle.com

People

Translate

site design / logo © 2022 Grokbase