Why does the CBO eliminate `distinct` and not `group by` in this query

distinctoptimizationoracleoracle-11g-r2

The CBO chooses to eliminate the distinct in the 'slow' query – presumably it can tell that the operation isn't needed because of the outer group by.

My questions are:

1) Why does it choose to do so in this case – I can't see a reason from the costings and predicted cardinalities in the plans

2) If it chooses to eliminate the distinct, why not apply the same logic and eliminate the group by?

testbed:

create table t1 as
select rownum product_id, mod(rownum,3)+1 company_id
from dual
connect by rownum<=500;

create table t2 as
select t1.product_id from t1 t1 cross join t1 t12;

create table t3 as
select distinct company_id from t1;

analyze table t1 compute statistics;
analyze table t2 compute statistics;
analyze table t3 compute statistics;

fast (55ms):

select company_id
from t1 
     join t2 using(product_id) 
     join ( select company_id
           from (select company_id from t1 group by company_id) 
                join t3 using(company_id) ) using(company_id)
group by company_id;

slow (5240ms):

select company_id
from t1 
     join t2 using(product_id)
     join ( select company_id 
            from (select distinct company_id from t1) 
                 join t3 using(company_id) ) using(company_id)
group by company_id;

SQLFiddle here.

Best Answer

Output is from a 11.2.0.4.6 Enterprise Edition database on Oracle Linux 7.1 x86-64 platform.

Lets start with question 2 and an easy example.

DISTINCT and GROUP BY are handled differently: the optimizer is able to completely eliminate a DISTINCT under certain circumstances, but it can not do the same with GROUP BY. Here is an example:

create table t4 as
select rownum product_id
from dual
connect by rownum<=5;

exec dbms_stats.gather_table_stats(user, 'T4');

alter session set statistics_level=all;

Note that setting statistics_level to ALL significantly increases the execution time for the original queries in the question.

The 2 queries will be:

select distinct product_id from t4

Plan hash value: 641655586

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |     4 (100)|      5 |00:00:00.01 |       |       |          |
|   1 |  HASH UNIQUE       |      |      1 |      5 |     4  (25)|      5 |00:00:00.01 |  2441K|  2441K| 1503K (0)|
|   2 |   TABLE ACCESS FULL| T4   |      1 |      5 |     3   (0)|      5 |00:00:00.01 |       |       |          |
-------------------------------------------------------------------------------------------------------------------

select product_id from t4 group by product_id

Plan hash value: 581042373

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |     4 (100)|      5 |00:00:00.01 |       |       |          |
|   1 |  HASH GROUP BY     |      |      1 |      5 |     4  (25)|      5 |00:00:00.01 |  2441K|  2441K|  863K (0)|
|   2 |   TABLE ACCESS FULL| T4   |      1 |      5 |     3   (0)|      5 |00:00:00.01 |       |       |          |
-------------------------------------------------------------------------------------------------------------------

This is what we expect. Full table scan and HASH UNIQUE for DISTINCT, HASH GROUP BY for GROUP BY. Now add a NOT NULL constraint and an index.

create index t4_i1 on t4(product_id);
alter table t4 modify (product_id not null);

select distinct product_id from t4

Plan hash value: 4231414870

-----------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |     2 (100)|      5 |00:00:00.01 |
|   1 |  SORT UNIQUE NOSORT|       |      1 |      5 |     2  (50)|      5 |00:00:00.01 |
|   2 |   INDEX FULL SCAN  | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
-----------------------------------------------------------------------------------------

select product_id from t4 group by product_id

Plan hash value: 1989519822

-------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |      1 |        |     1 (100)|      5 |00:00:00.01 |
|   1 |  SORT GROUP BY NOSORT|       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
|   2 |   INDEX FULL SCAN    | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
-------------------------------------------------------------------------------------------

The optimizer noticed the index on the relevant column, and because of the NOT NULL constraint it is able to use it avoid sorting the data for producing the unique values, because the data is already sorted in the index. Now add a UNIQUE constraint to this column:

alter table t4 add constraint t4_u1 unique (product_id) using index t4_i1;

select distinct product_id from t4

Plan hash value: 3974767428

---------------------------------------------------------------------------------------
| Id  | Operation        | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |      1 |        |     1 (100)|      5 |00:00:00.01 |
|   1 |  INDEX FULL SCAN | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
---------------------------------------------------------------------------------------

select product_id from t4 group by product_id

Plan hash value: 1989519822

-------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |      1 |        |     1 (100)|      5 |00:00:00.01 |
|   1 |  SORT GROUP BY NOSORT|       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
|   2 |   INDEX FULL SCAN    | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
-------------------------------------------------------------------------------------------

Notice how Oracle completely skipped SORT/HASH UNIQUE when using DISTINCT, but nothing changed with the GROUP BY query.

Enable tracing the optimizer when running the DISTINCT version:

alter system flush shared_pool;
alter session set events '10053 trace name context forever, level 1';
select distinct product_id from t4;

In the trace file we can see the following:

**************************
Query transformations (QT)
**************************
...
Eliminated SELECT DISTINCT from query block SEL$1 (#0)
...
********************************
COST-BASED QUERY TRANSFORMATIONS
********************************
...
Final query after transformations:******* UNPARSED QUERY IS *******
SELECT "T4"."PRODUCT_ID" "PRODUCT_ID" FROM "BP"."T4" "T4"
...

Note that this is a Query Transformation, but NOT a Cost-based Query Transformation. As you can see from the "Final query" part, the optimizer removed DISTINCT from the query. But there is no such optimization for GROUP BY. DISTINCT is used for retrieving distinct values, but GROUP BY is used for producing aggregates, not just distinct values. Even if the optimizer can skip sorting or hashing the data, it can not skip counting, adding, calculating the average, etc, and this is the important difference, so DISTINCT and GROUP BY are not handled in the same way (even if aggregates are not specified).

Another case of eliminating the DISTINCT part, when it is obviously unnecessary, for example:

alter session set "_complex_view_merging"=false;

This was set to prevent Complex View Merging Transformations like DISTINCT Placement, GROUP BY Placement, Subquery Unnesting appearing and making things complicated.

select distinct product_id from (select distinct product_id from (select distinct product_id from (select distinct product_id from t4)));

Plan hash value: 3974767428

---------------------------------------------------------------------------------------
| Id  | Operation        | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |      1 |        |     1 (100)|      5 |00:00:00.01 |
|   1 |  INDEX FULL SCAN | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
---------------------------------------------------------------------------------------

select product_id from (select distinct product_id from (select product_id from (select distinct product_id from t4 ) group by product_id)) group by product_id;

Plan hash value: 4029011489

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |      1 |        |     1 (100)|      5 |00:00:00.01 |       |       |          |
|   1 |  HASH GROUP BY         |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |  2441K|  2441K|  862K (0)|
|   2 |   VIEW                 |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
|   3 |    SORT GROUP BY NOSORT|       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
|   4 |     INDEX FULL SCAN    | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
------------------------------------------------------------------------------------------------------------------------

select product_id from (select product_id from (select product_id from (select product_id from t4 group by product_id) group by product_id) group by product_id) group by product_id;

Plan hash value: 1970696362

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |      1 |        |     1 (100)|      5 |00:00:00.01 |       |       |          |
|   1 |  HASH GROUP BY         |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |  2441K|  2441K|  883K (0)|
|   2 |   VIEW                 |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
|   3 |    HASH GROUP BY       |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |  2441K|  2441K|  848K (0)|
|   4 |     VIEW               |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
|   5 |      HASH GROUP BY     |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |  2441K|  2441K|  840K (0)|
|   6 |       VIEW             |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
|   7 |        HASH GROUP BY   |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |  2441K|  2441K|  848K (0)|
|   8 |         INDEX FULL SCAN| T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
------------------------------------------------------------------------------------------------------------------------

Notice how the optimizer completetely eliminated the DISTINCTs in the first and second queries, but not the GROUP BYs.

Unfortunately in these cases, since the DISTINCT elimination happens in subqueries ("views"), this information is not present in the optimizer trace, just like for the original queries in the question.

So now we know that DISTINCT and GROUP BY are handled indeed differently, lets go back to question 1.

To be continued in the next post...

(Both answers together exceed the 30000 characters limit.)

Related Solutions

Why does the CBO not chose a lower cost skip scan without stats

It seems the CBO does not consider a skip scan at all with dynamic sampling, is this true?

Actually this is really easy to verify, you can do this by enabling 10053 trace. You will see that the optimizer does not even consider skip scan at all. The reason for this, is the "_optimizer_skip_scan_guess" parameter. The default value for this parameter is FALSE, meaning the optimizer will not consider skip scan when all it has is "guessed" selectivity, which is the case with dynamic sampling.

If you set "_optimizer_skip_scan_guess" to TRUE, skip scan will be considered, this can be also confirmed with the 10053 trace again.

PS: your db_file_multiblock_read_count parameter seems to be lower than the default value. On my 11.2.0.4 sandbox, with the default value of 128, after collecting statistics on the table, index FFS had about one-third of the cost of index SS.

Edit: added output

SQL> create table foo( bar integer, baz integer, qux char(99), constraint pk_foo primary key (bar, baz) ) organization index compress;

Table created.

SQL> insert into foo(bar,baz) select mod(level,1000), level from dual connect by level<1000000;

999999 rows created.

SQL> commit;

Commit complete.

Original:

SQL> explain plan for
select * from foo where baz=1
union all
select /*+ full(foo) */ * from foo where baz=1
union all
select /*+ index_ss(foo) */ * from foo where baz=1;  2    3    4    5    6

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 1715140356

--------------------------------------------------------------------------------
| Id  | Operation             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |        |    66 |  8382 |  1302   (1)| 00:00:16 |
|   1 |  UNION-ALL            |        |       |       |            |          |
|*  2 |   INDEX FAST FULL SCAN| PK_FOO |    22 |  2794 |   546   (1)| 00:00:07 |
|*  3 |   INDEX FAST FULL SCAN| PK_FOO |    22 |  2794 |   546   (1)| 00:00:07 |
|*  4 |   INDEX SKIP SCAN     | PK_FOO |    22 |  2794 |   211   (0)| 00:00:03 |
--------------------------------------------------------------------------------

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("BAZ"=1)
   3 - filter("BAZ"=1)
   4 - access("BAZ"=1)
       filter("BAZ"=1)

Note
-----

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
   - dynamic sampling used for this statement (level=2)

Enable skip scan for guessed selectivity:

SQL> alter session set "_optimizer_skip_scan_guess"=true;

Session altered.

SQL> explain plan for
select * from foo where baz=1
union all
select /*+ full(foo) */ * from foo where baz=1
union all
select /*+ index_ss(foo) */ * from foo where baz=1;  2    3    4    5    6

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 3033162421

--------------------------------------------------------------------------------
| Id  | Operation             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |        |    66 |  8382 |   772   (1)| 00:00:10 |
|   1 |  UNION-ALL            |        |       |       |            |          |
|*  2 |   INDEX SKIP SCAN     | PK_FOO |    22 |  2794 |   211   (0)| 00:00:03 |
|*  3 |   INDEX FAST FULL SCAN| PK_FOO |    22 |  2794 |   350   (1)| 00:00:05 |
|*  4 |   INDEX SKIP SCAN     | PK_FOO |    22 |  2794 |   211   (0)| 00:00:03 |
--------------------------------------------------------------------------------

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BAZ"=1)
       filter("BAZ"=1)
   3 - filter("BAZ"=1)
   4 - access("BAZ"=1)
       filter("BAZ"=1)

Note

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
-----
   - dynamic sampling used for this statement (level=2)

24 rows selected.

Just for completeness:

SQL> analyze table foo compute statistics;

Table analyzed.

SQL> explain plan for
select * from foo where baz=1
union all
select /*+ full(foo) */ * from foo where baz=1
union all
select /*+ index_ss(foo) */ * from foo where baz=1;  2    3    4    5    6

Explained.

SQL>  select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 1715140356

--------------------------------------------------------------------------------
| Id  | Operation             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |        |     3 |    30 |  1717   (2)| 00:00:21 |
|   1 |  UNION-ALL            |        |       |       |            |          |
|*  2 |   INDEX FAST FULL SCAN| PK_FOO |     1 |    10 |   357   (3)| 00:00:05 |
|*  3 |   INDEX FAST FULL SCAN| PK_FOO |     1 |    10 |   357   (3)| 00:00:05 |
|*  4 |   INDEX SKIP SCAN     | PK_FOO |     1 |    10 |  1002   (0)| 00:00:13 |
--------------------------------------------------------------------------------

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("BAZ"=1)
   3 - filter("BAZ"=1)
   4 - access("BAZ"=1)
       filter("BAZ"=1)

19 rows selected.

Why the optimizer does not use an Index Unique Scan

I have a big Index Organized Table (IOT) with 74 columns

Why do you think that this table is a candidate for an IOT? IMHO, IOT's are meant to be narrow tables that don't change often. I have done testing with narrow tables and have found IOT's to do inserts at a slower rate than a traditional table with indexes. You may want to start by recreating this as a traditional table and then think about what indexes you want on the table. Also there is a note in this article on IOT's. Performance Tuning Guide

Best Answer

Related Solutions

Why does the CBO not chose a lower cost skip scan without stats

Why the optimizer does not use an Index Unique Scan

Related Question