Why does the CBO eliminate `distinct` and not `group by` in this query

distinctoptimizationoracleoracle-11g-r2

The CBO chooses to eliminate the distinct in the 'slow' query – presumably it can tell that the operation isn't needed because of the outer group by.

My questions are:

1) Why does it choose to do so in this case – I can't see a reason from the costings and predicted cardinalities in the plans

2) If it chooses to eliminate the distinct, why not apply the same logic and eliminate the group by?

testbed:

create table t1 as
select rownum product_id, mod(rownum,3)+1 company_id
from dual
connect by rownum<=500;

create table t2 as
select t1.product_id from t1 t1 cross join t1 t12;

create table t3 as
select distinct company_id from t1;

analyze table t1 compute statistics;
analyze table t2 compute statistics;
analyze table t3 compute statistics;

fast (55ms):

select company_id
from t1 
     join t2 using(product_id) 
     join ( select company_id
           from (select company_id from t1 group by company_id) 
                join t3 using(company_id) ) using(company_id)
group by company_id;

slow (5240ms):

select company_id
from t1 
     join t2 using(product_id)
     join ( select company_id 
            from (select distinct company_id from t1) 
                 join t3 using(company_id) ) using(company_id)
group by company_id;

SQLFiddle here.

Best Answer

Output is from a 11.2.0.4.6 Enterprise Edition database on Oracle Linux 7.1 x86-64 platform.

Lets start with question 2 and an easy example.

DISTINCT and GROUP BY are handled differently: the optimizer is able to completely eliminate a DISTINCT under certain circumstances, but it can not do the same with GROUP BY. Here is an example:

create table t4 as
select rownum product_id
from dual
connect by rownum<=5;

exec dbms_stats.gather_table_stats(user, 'T4');

alter session set statistics_level=all;

Note that setting statistics_level to ALL significantly increases the execution time for the original queries in the question.

The 2 queries will be:

select distinct product_id from t4

Plan hash value: 641655586

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |     4 (100)|      5 |00:00:00.01 |       |       |          |
|   1 |  HASH UNIQUE       |      |      1 |      5 |     4  (25)|      5 |00:00:00.01 |  2441K|  2441K| 1503K (0)|
|   2 |   TABLE ACCESS FULL| T4   |      1 |      5 |     3   (0)|      5 |00:00:00.01 |       |       |          |
-------------------------------------------------------------------------------------------------------------------

select product_id from t4 group by product_id

Plan hash value: 581042373

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |     4 (100)|      5 |00:00:00.01 |       |       |          |
|   1 |  HASH GROUP BY     |      |      1 |      5 |     4  (25)|      5 |00:00:00.01 |  2441K|  2441K|  863K (0)|
|   2 |   TABLE ACCESS FULL| T4   |      1 |      5 |     3   (0)|      5 |00:00:00.01 |       |       |          |
-------------------------------------------------------------------------------------------------------------------

This is what we expect. Full table scan and HASH UNIQUE for DISTINCT, HASH GROUP BY for GROUP BY. Now add a NOT NULL constraint and an index.

create index t4_i1 on t4(product_id);
alter table t4 modify (product_id not null);

select distinct product_id from t4

Plan hash value: 4231414870

-----------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |     2 (100)|      5 |00:00:00.01 |
|   1 |  SORT UNIQUE NOSORT|       |      1 |      5 |     2  (50)|      5 |00:00:00.01 |
|   2 |   INDEX FULL SCAN  | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
-----------------------------------------------------------------------------------------

select product_id from t4 group by product_id

Plan hash value: 1989519822

-------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |      1 |        |     1 (100)|      5 |00:00:00.01 |
|   1 |  SORT GROUP BY NOSORT|       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
|   2 |   INDEX FULL SCAN    | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
-------------------------------------------------------------------------------------------

The optimizer noticed the index on the relevant column, and because of the NOT NULL constraint it is able to use it avoid sorting the data for producing the unique values, because the data is already sorted in the index. Now add a UNIQUE constraint to this column:

alter table t4 add constraint t4_u1 unique (product_id) using index t4_i1;

select distinct product_id from t4

Plan hash value: 3974767428

---------------------------------------------------------------------------------------
| Id  | Operation        | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |      1 |        |     1 (100)|      5 |00:00:00.01 |
|   1 |  INDEX FULL SCAN | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
---------------------------------------------------------------------------------------

select product_id from t4 group by product_id

Plan hash value: 1989519822

-------------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |      1 |        |     1 (100)|      5 |00:00:00.01 |
|   1 |  SORT GROUP BY NOSORT|       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
|   2 |   INDEX FULL SCAN    | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
-------------------------------------------------------------------------------------------

Notice how Oracle completely skipped SORT/HASH UNIQUE when using DISTINCT, but nothing changed with the GROUP BY query.

Enable tracing the optimizer when running the DISTINCT version:

alter system flush shared_pool;
alter session set events '10053 trace name context forever, level 1';
select distinct product_id from t4;

In the trace file we can see the following:

**************************
Query transformations (QT)
**************************
...
Eliminated SELECT DISTINCT from query block SEL$1 (#0)
...
********************************
COST-BASED QUERY TRANSFORMATIONS
********************************
...
Final query after transformations:******* UNPARSED QUERY IS *******
SELECT "T4"."PRODUCT_ID" "PRODUCT_ID" FROM "BP"."T4" "T4"
...

Note that this is a Query Transformation, but NOT a Cost-based Query Transformation. As you can see from the "Final query" part, the optimizer removed DISTINCT from the query. But there is no such optimization for GROUP BY. DISTINCT is used for retrieving distinct values, but GROUP BY is used for producing aggregates, not just distinct values. Even if the optimizer can skip sorting or hashing the data, it can not skip counting, adding, calculating the average, etc, and this is the important difference, so DISTINCT and GROUP BY are not handled in the same way (even if aggregates are not specified).

Another case of eliminating the DISTINCT part, when it is obviously unnecessary, for example:

alter session set "_complex_view_merging"=false;

This was set to prevent Complex View Merging Transformations like DISTINCT Placement, GROUP BY Placement, Subquery Unnesting appearing and making things complicated.

select distinct product_id from (select distinct product_id from (select distinct product_id from (select distinct product_id from t4)));

Plan hash value: 3974767428

---------------------------------------------------------------------------------------
| Id  | Operation        | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |      1 |        |     1 (100)|      5 |00:00:00.01 |
|   1 |  INDEX FULL SCAN | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |
---------------------------------------------------------------------------------------

select product_id from (select distinct product_id from (select product_id from (select distinct product_id from t4 ) group by product_id)) group by product_id;

Plan hash value: 4029011489

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |      1 |        |     1 (100)|      5 |00:00:00.01 |       |       |          |
|   1 |  HASH GROUP BY         |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |  2441K|  2441K|  862K (0)|
|   2 |   VIEW                 |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
|   3 |    SORT GROUP BY NOSORT|       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
|   4 |     INDEX FULL SCAN    | T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
------------------------------------------------------------------------------------------------------------------------

select product_id from (select product_id from (select product_id from (select product_id from t4 group by product_id) group by product_id) group by product_id) group by product_id;

Plan hash value: 1970696362

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |      1 |        |     1 (100)|      5 |00:00:00.01 |       |       |          |
|   1 |  HASH GROUP BY         |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |  2441K|  2441K|  883K (0)|
|   2 |   VIEW                 |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
|   3 |    HASH GROUP BY       |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |  2441K|  2441K|  848K (0)|
|   4 |     VIEW               |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
|   5 |      HASH GROUP BY     |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |  2441K|  2441K|  840K (0)|
|   6 |       VIEW             |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
|   7 |        HASH GROUP BY   |       |      1 |      5 |     1   (0)|      5 |00:00:00.01 |  2441K|  2441K|  848K (0)|
|   8 |         INDEX FULL SCAN| T4_I1 |      1 |      5 |     1   (0)|      5 |00:00:00.01 |       |       |          |
------------------------------------------------------------------------------------------------------------------------

Notice how the optimizer completetely eliminated the DISTINCTs in the first and second queries, but not the GROUP BYs.

Unfortunately in these cases, since the DISTINCT elimination happens in subqueries ("views"), this information is not present in the optimizer trace, just like for the original queries in the question.

So now we know that DISTINCT and GROUP BY are handled indeed differently, lets go back to question 1.

To be continued in the next post...

(Both answers together exceed the 30000 characters limit.)