OR Dear Me!  When Cost Based OR  Expansion Gets it Wrong – Part 3

…continued from Part 1 and Part 2.

OK, so next we are going to return to original query before with no index in place and review the optimizer’s estimate of how many rows will be returned, and try to deduce where it came from, then in Part 4 we’ll compare with the behavior after the index is created. I’m going to omit the BETWEEN 0 and 1 and BETWEEN 98 and 99 predicates, they are edge cases which complicate the calculations, without significantly affecting the observed behavior.

Throughout the article, I will be referring to sections in Jonathan Lewis‘s book, Cost-Based Oracle Fundamentals (CBOF).

I’m going to refer to each individual BETWEEN predicate as Bn (where n is a number between 1 and 48), so Bn is shorthand for C BETWEEN (2n) AND (2n+1). I’ll refer to Selectivity as SEL, and Cardinality (the number of rows returned after applying relevant predicates) as CARD.

So we need to understand how the optimizer calculates that CARD(BP1 OR BP2 OR … BP48)=76,939, rather than the 96,000 that are actually returned as shown below.

SQL> CREATE TABLE t (
  2      c NUMBER(2, 0) NOT NULL,
  3      p VARCHAR2(4000)
  4* );
SQL>
SQL> INSERT /*+append*/ INTO t
  2      SELECT
  3          mod(ROWNUM, 100) c,
  4          rpad('X', 4000)  padding
  5      FROM
  6          dual
  7      CONNECT BY
  8*         ROWNUM <= 100000;
SQL>
SQL> COMMIT;
SQL>
SQL> EXEC dbms_stats.gather_table_stats(null, 't', method_opt=>'for all columns size 1');
SQL>
SQL> SELECT /*+ gather_plan_statistics */ DISTINCT
  2                       TRIM(p) p
  3                   FROM
  4                       t WHERE
  5                                        (c BETWEEN  2 AND  3) OR (c BETWEEN  4 AND  5) OR (c BETWEEN  6 AND  7) OR (c BETWEEN  8 AND  9)
  6            OR (c BETWEEN 10 AND 11) OR (c BETWEEN 12 AND 13) OR (c BETWEEN 14 AND 15) OR (c BETWEEN 16 AND 17) OR (c BETWEEN 18 AND 19)
  7            OR (c BETWEEN 20 AND 21) OR (c BETWEEN 22 AND 23) OR (c BETWEEN 24 AND 25) OR (c BETWEEN 26 AND 27) OR (c BETWEEN 28 AND 29)
  8            OR (c BETWEEN 30 AND 31) OR (c BETWEEN 32 AND 33) OR (c BETWEEN 34 AND 35) OR (c BETWEEN 36 AND 37) OR (c BETWEEN 38 AND 39)
  9            OR (c BETWEEN 40 AND 41) OR (c BETWEEN 42 AND 43) OR (c BETWEEN 44 AND 45) OR (c BETWEEN 46 AND 47) OR (c BETWEEN 48 AND 49)
 10            OR (c BETWEEN 50 AND 51) OR (c BETWEEN 52 AND 53) OR (c BETWEEN 54 AND 55) OR (c BETWEEN 56 AND 57) OR (c BETWEEN 58 AND 59)
 11            OR (c BETWEEN 60 AND 61) OR (c BETWEEN 62 AND 63) OR (c BETWEEN 64 AND 65) OR (c BETWEEN 66 AND 67) OR (c BETWEEN 68 AND 69)
 12            OR (c BETWEEN 70 AND 71) OR (c BETWEEN 72 AND 73) OR (c BETWEEN 74 AND 75) OR (c BETWEEN 76 AND 77) OR (c BETWEEN 78 AND 79)
 13            OR (c BETWEEN 80 AND 81) OR (c BETWEEN 82 AND 83) OR (c BETWEEN 84 AND 85) OR (c BETWEEN 86 AND 87) OR (c BETWEEN 88 AND 89)
 14*           OR (c BETWEEN 90 AND 91) OR (c BETWEEN 92 AND 93) OR (c BETWEEN 94 AND 95) OR (c BETWEEN 96 AND 97);
   P
____
X
SQL>
SQL> SELECT * FROM dbms_xplan.display_cursor(format=>'allstats last -predicate -cost -bytes');
                                                                                PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________
SQL_ID  11ucbv2rb2wq5, child number 0
-------------------------------------
SELECT /*+ gather_plan_statistics */ DISTINCT
...
Plan hash value: 1793979440
----------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:01.23 |     100K|    100K|
|   1 |  HASH UNIQUE       |      |      1 |      1 |      1 |00:00:01.23 |     100K|    100K|
|   2 |   TABLE ACCESS FULL| T    |      1 |  76939 |  96000 |00:00:04.33 |     100K|    100K|
----------------------------------------------------------------------------------------------

Let’s query the statistics to see what information the database has about table T and column C.

SELECT num_rows FROM user_tables WHERE table_name = 'T';
   NUM_ROWS
___________
     100000

SQL> SELECT histogram,
  2      num_nulls,
  3      num_distinct,
  4      utl_raw.cast_to_number(low_value)  low_value,
  5      utl_raw.cast_to_number(high_value) high_value
  6  FROM
  7      user_tab_cols
  8  WHERE
  9          table_name = 'T'
 10*     AND column_name = 'C';
   HISTOGRAM    NUM_NULLS    NUM_DISTINCT    LOW_VALUE    HIGH_VALUE
____________ ____________ _______________ ____________ _____________
NONE                    0             100            0            99

The values listed will be needed in subsequent calculations.

NUM_DISTINCT = 100
LOW_VALUE    = 0
HIGH_VALUE   = 99
NUM_ROWS     = 100,000

CBOR (p53 case 6) gives us the formula for the generic SEL(Bn) and CARD(Bn) into which we can insert the values from the statistics.

SEL(Bn) = 
((2n+1)-(2n)))/(HIGH_VALUE-LOW_VALUE)+2/NUM_DISTINCT
=1/99+2/100
=0.03010101...
CARD(Bn)=ROUND(SEL(Bn)*NUM_ROWS)
=0.030101*100000
=3010

As shown below for n=1, we can verify this value for any value of n, and also observe that this estimate (E-Rows=3010) is larger than the actual (A-Rows=2000) number of rows matched.

SQL> SELECT * FROM dbms_xplan.display_cursor(format=>'allstats last -predicate -cost -bytes');
                                                                                PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________
SQL_ID  6k4aym89jwk28, child number 0
-------------------------------------
SELECT /*+ gather_plan_statistics */  DISTINCT trim(p) p FROM t WHERE
(c BETWEEN  2 AND  3)
Plan hash value: 1793979440
----------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.68 |     100K|    100K|
|   1 |  HASH UNIQUE       |      |      1 |      1 |      1 |00:00:00.68 |     100K|    100K|
|   2 |   TABLE ACCESS FULL| T    |      1 |   3010 |   2000 |00:00:01.35 |     100K|    100K|
----------------------------------------------------------------------------------------------

Now we need to start combining the predicates, starting with CARD(B1 OR B2). Now whilst it should be obvious that there is no row that can be matched by both B1 and B2 (a number between 2 and 3 cannot also be between 4 and 5), the optimizer doesn’t know this (consider the combined predicates C BETWEEN 1 AND 3 OR C BETWEEN 2 AND 4) so it has to use the calculations (CBOF p56) below

SEL(B1 OR B2)
=SEL(B1) + SEL(B2) - SEL(B1)* SEL(B2)
=0.03010101+0.03010101-0.03010101*0.03010101
=0.05929594...

CARD(B1 OR B2)
=ROUND(CARD(B1 OR B2) * NUM_ROWS)
=ROUND(0.05929594*100000)
=5930

Again we can compare this with both what the optimizer tells us it has estimated, 5930, and the actual number of rows matched 4000.

SQL> SELECT /*+ gather_plan_statistics */  DISTINCT trim(p) p FROM t WHERE (c BETWEEN  2 AND  3) OR (c BETWEEN  4 AND  5);
   P
____
X
SQL>
SQL> SELECT * FROM dbms_xplan.display_cursor(format=>'allstats last -predicate -cost -bytes');
                                                                                PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________
SQL_ID  2rsbk7yf41ky4, child number 0
-------------------------------------
SELECT /*+ gather_plan_statistics */  DISTINCT trim(p) p FROM t WHERE
(c BETWEEN  2 AND  3) OR (c BETWEEN  4 AND  5)
Plan hash value: 1793979440
----------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.73 |     100K|    100K|
|   1 |  HASH UNIQUE       |      |      1 |      1 |      1 |00:00:00.73 |     100K|    100K|
|   2 |   TABLE ACCESS FULL| T    |      1 |   5930 |   4000 |00:00:02.76 |     100K|    100K|
----------------------------------------------------------------------------------------------

Calculation for CARD(B1 OR B2 OR B3) is as follows

 SEL(B1 OR B2 OR B3)
=SEL((B1 OR B2) OR B3)
=SEL(B1 OR B2) + SEL(B3) - SEL(B1 OR B2) * SEL(B3)
=SEL(B1 OR B2) + SEL(B3) - SEL(B1 OR B2) * SEL(B3)
=0.05929594+0.03010101-0.05929594*0.03010101
=0.08761208...
 CARD(B1 OR B2 OR B3)
=ROUND(SEL(B1 OR B2 OR B3)*NUM_ROWS)
=ROUND(0.08761208*100000)
=8761

Let’s compare with what the optimizer tells us…

SQL> SELECT /*+ gather_plan_statistics */  DISTINCT trim(p) p FROM t WHERE (c BETWEEN  2 AND  3) OR (c BETWEEN  4 AND  5) OR  (c BETWEEN  6 AND  7);
   P
____
X
SQL> SELECT * FROM dbms_xplan.display_cursor(format=>'allstats last -predicate -cost -bytes');
                                                                                PLAN_TABLE_OUTPUT
_________________________________________________________________________________________________
SQL_ID  d4sbkshm3dj0y, child number 0
-------------------------------------
SELECT /*+ gather_plan_statistics */  DISTINCT trim(p) p FROM t WHERE
(c BETWEEN  2 AND  3) OR (c BETWEEN  4 AND  5) OR  (c BETWEEN  6 AND  7)
Plan hash value: 1793979440
----------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      1 |00:00:00.71 |     100K|    100K|
|   1 |  HASH UNIQUE       |      |      1 |      1 |      1 |00:00:00.71 |     100K|    100K|
|   2 |   TABLE ACCESS FULL| T    |      1 |   8761 |   6000 |00:00:01.43 |     100K|    100K|
----------------------------------------------------------------------------------------------

I used a spreadsheet to continue this pattern up to CARD(B1 OR .. C48), which finishes with the final calculation..

SEL(B1 OR ... B48)
=SEL((B1 OR ... B47) OR B48)
=SEL(B1 OR ... B47) + SEL(B48) - (B1 OR ... B47) * SEL(B48)
=0.76223613+0.03010101-0.76223613*0.03010101
=0.76939306...
CARD(B1 OR ... B48)
=ROUND(SEL(B1 OR ... B48) * NUM_ROWS)
=ROUND(0.76939306*100000)
=76939

Fortunately this matches the optimizer has said is it’s estimated cardinality for the full query as we noted previously.

I think it’s useful at this point to compare how the estimated number of rows deviates from the actual number of rows matched as the number of predicates increases. This is shown in the chart below, and you can see that with less than 30 predicates the optimizer overestimates the number of rows returned, and that as the we get more than 30 predicates this turns into an under-estimation.

In the next article, we will compare this behavior with what happens case where Cost Based Or Expansion occurs.

Leave a Comment

Your email address will not be published. Required fields are marked *