Oracle – Best Indexes for Three Columns Join

oracleoracle-11g-r2

I have a table which has the following columns:

CREATE TABLE example (
   exid INT,
   extype INT,
   studentid INT,
   (more columns)
)

The table will have hundreds of thousands rows, and each record will have distinct value of exid, extype and studentid. While studentid and exid will have plenty of distinct values, extype will have max. 6 values. The table will be joined in the following way:

SELECT ... 
FROM example ex 

LEFT JOIN otherTable1 ot1 ON ex.studentid = ot1.studentid
    AND ex.exid = ot1.exid 
    AND ex.extype = 1

LEFT JOIN otherTable2 ot2 ON ex.studentid = ot2.studentid
    AND ex.exid = ot2.exid 
    AND ex.extype = 2

Can you suggest me what index(es) on example table will be the most suitable in regards of performance here (I should also add other tables used for joins will go into millions of rows)?

Best Answer

You should benchmark it using autotrace (http://betteratoracle.com/posts/10-using-autotrace) for example.

If you expect millions of rows to be returned, then I do not think indexing will help at all. Simple full table scan and hash joining is the most efficient way of executing this kind of query (https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:9422487749968).

Putting the least selective column first and using key compression can decrease the index size and reduce IO and CPU consumption (http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:8806017524595).

The following index should be quite sufficient if you fetch only few rows:

create unique index test1 on example(extype,studentid,exid) compress 1;

Related Solutions

Should I group by and then join, or join and then group by

The optimizer always tries to reduce the amount of data as quickly as possible. If not, your statistics might not be good.

Your plan 1 shows that less rows are processed this is good. The optimizer was able to reduce the amount of data more quickly. The numbers might not be exactly true but it gives you an idea based on the optimizer statistics.

Plan 1:

---------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |      |     1 |    79 |     7  (15)| 00:00:01 |       |       |
|   1 |  HASH GROUP BY                         |      |     1 |    79 |     7  (15)| 00:00:01 |       |       |
|   2 |   NESTED LOOPS                         |      |       |       |            |          |       |       |
|   3 |    NESTED LOOPS                        |      |     1 |    79 |     6   (0)| 00:00:01 |       |       |
|   4 |     NESTED LOOPS                       |      |     1 |    63 |     5   (0)| 00:00:01 |       |       |
|   5 |      TABLE ACCESS BY GLOBAL INDEX ROWID| t1   |     1 |    22 |     3   (0)| 00:00:01 | ROWID | ROWID |
|*  6 |       INDEX RANGE SCAN                 | t1pk |     1 |       |     2   (0)| 00:00:01 |       |       |
|   7 |      PARTITION RANGE ITERATOR          |      |     1 |    41 |     2   (0)| 00:00:01 |   KEY |   KEY |
|*  8 |       TABLE ACCESS BY LOCAL INDEX ROWID| t3   |     1 |    41 |     2   (0)| 00:00:01 |   KEY |   KEY |
|*  9 |        INDEX RANGE SCAN                | t3fk |     1 |       |     1   (0)| 00:00:01 |   KEY |   KEY |
|* 10 |     INDEX UNIQUE SCAN                  | t2pk |     1 |       |     0   (0)| 00:00:01 |       |       |
|* 11 |    TABLE ACCESS BY GLOBAL INDEX ROWID  | t2   |     1 |    16 |     1   (0)| 00:00:01 | ROWID | ROWID |
---------------------------------------------------------------------------------------------------------------

It shows that Oracle did a HASH GROUP BY while plan 2 did a SORT GROUP BY.

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |      |     1 |   311 |     7  (15)| 00:00:01 |       |       |
|   1 |  NESTED LOOPS                            |      |       |       |            |          |       |       |
|   2 |   NESTED LOOPS                           |      |     1 |   311 |     7  (15)| 00:00:01 |       |       |
|   3 |    NESTED LOOPS                          |      |     1 |   253 |     6  (17)| 00:00:01 |       |       |
|   4 |     TABLE ACCESS BY GLOBAL INDEX ROWID   | t1   |     1 |   164 |     3   (0)| 00:00:01 | ROWID | ROWID |
|*  5 |      INDEX RANGE SCAN                    | t1pk |     1 |       |     2   (0)| 00:00:01 |       |       |
|   6 |     VIEW PUSHED PREDICATE                |      |     1 |    89 |     3  (34)| 00:00:01 |       |       |
|   7 |      SORT GROUP BY                       |      |     1 |    41 |     3  (34)| 00:00:01 |       |       |
|*  8 |       FILTER                             |      |       |       |            |          |       |       |
|   9 |        PARTITION RANGE SINGLE            |      |     1 |    41 |     2   (0)| 00:00:01 |   KEY |   KEY |
|* 10 |         TABLE ACCESS BY LOCAL INDEX ROWID| t3   |     1 |    41 |     2   (0)| 00:00:01 |   KEY |   KEY |
|* 11 |          INDEX RANGE SCAN                | t3fk |     1 |       |     1   (0)| 00:00:01 |   KEY |   KEY |
|* 12 |    INDEX UNIQUE SCAN                     | t2pk |     1 |       |     0   (0)| 00:00:01 |       |       |
|* 13 |   TABLE ACCESS BY GLOBAL INDEX ROWID     | t2   |     1 |    58 |     1   (0)| 00:00:01 | ROWID | ROWID |
-----------------------------------------------------------------------------------------------------------------

Conclusion: In most cases I would prefer a HASH GROUP BY. So in plan 1 Oracle does a better the GROUP BY much later with much less data. 2 reasons why plan 1 is better.

I assume that this is not the only SQL statement which will be using this join constellation. You will have to decide for each SQL which way is the best.

If you have a SQL similar to the following, the with clause can be much better:

with t3g as ( -- as in 't3 grouped'
    select id2, count(1) t3count
    from t3
    where id = :1 -- new line !!!!!!!!
    group by id2
)
select t1.id t1id, t2id, t3count
from t1 join t2  on (t1.p = t2.p  and t1.id = t2.id1)
        join t3g on (t2.p = t3g.p and t2.id = id2)

Oracle 11g R2 – Cumulative Sum with Reset Value

The following works but there's probably some more clever version. Explanation of the query's logic:

We first find how many "resets" have been done up to and including the current row by counting the not nulls of the reset_val column, so we can separate the rows into subgroups.

We also use another window function LAST_VALUE() with IGNORE NULLS, so we can find the last reset_value.

Note that both these window functions COUNT() and LAST_VALUE() have an ORDER BY, therefore the default window ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Omitted in the query, to make the code more clear.

Assuming that the val is not nullable, the other window function could also be shortened, from:

       COALESCE(SUM(val) OVER 
         (PARTITION BY group_id, reset_count 
          ORDER BY order_val 
          ROWS BETWEEN UNBOUNDED PRECEDING 
                   AND 1 PRECEDING), 0)

(avoiding the COALESCE() as well) to:

       SUM(val) OVER 
         (PARTITION BY group_id, reset_count 
          ORDER BY order_val)
       - val

Finally, in the second cte, we use the above found subgroups (using PARTITION BY group_id, reset_count) to find the cumulative sums.

WITH x AS
  ( SELECT temp.*, 
           COUNT(reset_val) OVER 
               (PARTITION BY group_id 
                ORDER BY order_val)
             AS reset_count,
           COALESCE(LAST_VALUE(reset_val IGNORE NULLS) OVER 
               (PARTITION BY group_id 
                ORDER BY order_val), 0)
             AS reset_value
    FROM temp
  ) ,
y AS 
  ( SELECT x.*,
           COALESCE(SUM(val) OVER 
             (PARTITION BY group_id, reset_count 
              ORDER BY order_val 
              ROWS BETWEEN UNBOUNDED PRECEDING 
                       AND 1 PRECEDING), 0)            
           + reset_value AS cumsum      
    FROM x
  )
SELECT *
FROM y ;

Test at SQLfiddle.

Another variation, based on @Chris' recursive answer. (slightly improved, works with non-consecutive order_val, avoids the final GROUP BY).
Also works in case the first row of a group has a reset_val:

WITH row_nums AS
  ( SELECT id, group_id, order_val, reset_val, val, 
           ROW_NUMBER() OVER (PARTITION BY group_id
                              ORDER BY order_val)
             AS rn
    FROM temp
  ) ,
updated_temp (id, group_id, order_val, reset_val, val, rn, cumsum) AS
  ( SELECT id, group_id, order_val, reset_val, val, rn, 
           COALESCE(reset_val, 0)
    FROM row_nums
    WHERE rn = 1
  UNION ALL
    SELECT curr.id, curr.group_id, curr.order_val, curr.reset_val, curr.val, curr.rn, 
           COALESCE(curr.reset_val, prev.val + prev.cumsum) 
    FROM row_nums  curr 
      JOIN updated_temp  prev 
        ON  curr.rn-1 = prev.rn 
        AND curr.group_id = prev.group_id
  )
SELECT id, group_id, order_val, reset_val, val, cumsum
FROM updated_temp
ORDER BY group_id, order_val ;

Test at SQLfiddle-2.

One more variation, using the older (proprietary) CONNECT BY syntax for recursive queries. More compact but I find it more difficult to write and read than the CTE version:

WITH row_nums AS
  ( SELECT id, group_id, order_val, reset_val, val, 
           ROW_NUMBER() OVER (PARTITION BY group_id
                              ORDER BY order_val)
             AS rn,
           COALESCE(reset_val, 0) AS cumsum
    FROM temp
  ) 
SELECT id, group_id, order_val, reset_val, val, rn,  
       COALESCE(reset_val, PRIOR val + PRIOR cumsum, 0) AS cumsum
FROM row_nums
START WITH rn = 1 OR reset_val IS NOT NULL
CONNECT BY  rn-1 = PRIOR rn 
        AND group_id = PRIOR group_id
        AND reset_val IS NULL 
ORDER BY group_id, order_val ;

Tested at SQLfiddle-3.

Best Answer

Related Solutions

Should I group by and then join, or join and then group by

Oracle 11g R2 – Cumulative Sum with Reset Value

Related Question