Mysql – Analyzing a query plan for a query with TWO predicates, but index only has ONE

database-theoryexecution-planMySQLoptimization

Note: This is from an introductory database systems course at my university so I'm not entirely sure if it matters, but for what it's worth we are being taught MySQL.

One homework question I'm doing right now is to analyze some query plans for a query with two predicates, specifically:

SELECT *
FROM Student
WHERE wam > 75 AND faculty = 'Arts';

however the plans provided assume an index only on one of the predicates, such as an unclustered B+ tree index on (wam).

One example part of the question is as follows:

Compute the estimated cost of the best plan assuming that an
unclustered B+ tree index on (wam) is the only index available.
Suppose there are 60 index pages. Discuss and calculate alternative
plans

How would I go about calculating the cost of a plan like this? Would I have to calculate costs for unclustered/clustered B+ tree and hash indexes for the other predicate, faculty, as well? I am aware that one alternative plan is also simply a full table scan.

"Student" has 800 pages, 64000 tuples. "faculty" can take 10 unique values. "wam" can take 100 unique values.

Thank you, if I'm not being specific enough feel free to point out anything lacking in this post.

Best Answer

The optimal index for that query is INDEX(faculty, wam), in that order.

MySQL does not (in most situations) use two different indexes. However, it will use a 'composite' index to good use in some situations, such as this.

Without knowing how many rows have Arts and home many have > 75, you cannot accurately conclude which of INDEX(wam) or INDEX(faculty) would be better. Nor can the Optimizer, since it's statistics gathering is not very good. SHOW INDEXES and EXPLAIN SELECT provide numbers that come from the same statistics, so you won't necessarily come to the optimal conclusion.

A "Hash" index (which InnoDB does not have) is essentially no better than a BTree index. A Hash index is useless for a range (eg, > 75). (Hence, MySQL uses BTrees, and does not provide Hash indexes.)

Another point: If more than a small percentage of the table has wam > 75, INDEX(wam) will be shunned; scanning the entire table is likely to be faster than bouncing between the index's BTree and the data BTree. Again > 75 and 60 pages and 10 and 100 give no good clues of which path the Optimizer should take. The answer to the quoted question is "it depends" or "not enough info".

You could tackle the question by presenting two answers: One where very few rows have wam > 75, and one where most meet that predicate.

Has your instructor had any experience using databases, or only textbooks?

More discussion on devising optimal indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Related Solutions

Execution plan using B+ tree index, but also sorts

Oracle says about Indexes and Index-Organized Tables under Full Index Scan: In a full index scan, the database reads the entire index in order.

Yet, unter Fast Full Index Scan, it reads: A fast full index scan is a full index scan in which the database accesses the data in the index itself without accessing the table, and the database reads the index blocks in no particular order. (Emphasis mine)

Now, probably, the question should be: why did the optimzier choose INDEX FAST FULL SCAN over INDEX FULL SCAN.

A hint to the answer of the latter question is given in under 11.2.3.7 Fast Full Index Scans: A fast full scan is faster than a normal full index scan because it can use multiblock I/O and can run in parallel just like a table scan.

If you insist that Oracle use a full index scan, you might want to try the /*+ index() */ hint:

create table tq84_foo (
  fooID number not null
);

create table tq84_bar (
  fooID number not null
);

create unique index ix_foo on tq84_foo(fooID);
create unique index ix_bar on tq84_bar(fooID);

explain plan for
 select /*+ index(f ix_foo) */fooID from tq84_foo f
        MINUS
select /*+ index(b ix_bar) */ fooID from tq84_bar b;

select * from table(dbms_xplan.display);

resulting in

------------------------------------------------------------------------------
| Id  | Operation           | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |        |     1 |    26 |     4  (75)| 00:00:01 |
|   1 |  MINUS              |        |       |       |            |          |
|   2 |   SORT UNIQUE NOSORT|        |     1 |    13 |     2  (50)| 00:00:01 |
|   3 |    INDEX FULL SCAN  | IX_FOO |     1 |    13 |     1   (0)| 00:00:01 |
|   4 |   SORT UNIQUE NOSORT|        |     1 |    13 |     2  (50)| 00:00:01 |
|   5 |    INDEX FULL SCAN  | IX_BAR |     1 |    13 |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------

Sql-server – Why does the estimated cost of (the same) 1000 seeks on a unique index differ in these plans

Anyone know why this task is costed at 0.172434 in the first plan but 3.01702 in the second?

Generally speaking, an inner side seek below a nested loops join is costed assuming a random I/O pattern. There is a simple replacement-based reduction for subsequent accesses, accounting for the chance that the required page has already been brought into memory by a previous iteration. This basic assessment produces the standard (higher) cost.

There is another costing input, Smart Seek Costing, about which little detail is known. My guess (and that is all it is at this stage) is that SSC attempts to assess inner side seek I/O cost in more detail, perhaps by considering local ordering and/or the range of values to fetch. Who knows.

For example, the first seeking operation brings in not just the requested row, but all rows on that page (in index order). Given the overall access pattern, fetching the 1000 rows in 1000 seeks requires only 2 physical reads, even with read-ahead and prefetching disabled. From that perspective, the default I/O costing represents a significant overestimate, and the SSC-adjusted cost is closer to reality.

It seems reasonable to expect that SSC would be most effective where the loop drives an index seek more or less directly, and the join outer reference is the basis of the seeking operation. From what I can tell, SSC is always attempted for suitable physical operations, but most often produces no downward adjustment when the seek is separated from the join by other operations. Simple filters are one exception to this, perhaps because SQL Server can often push these into the data access operator. In any case, the optimizer has pretty deep support for selections.

It is unfortunate that the Compute Scalar for the subquery outer projections seems to interfere with SSC here. Compute Scalars are usually relocated above the join, but these ones have to stay where they are. Even so, most normal Compute Scalars are pretty transparent to optimization, so this is a bit surprising.

Regardless, when the physical operation PhyOp_Range is produced from a simple selection on an index SelIdxToRng, SSC is effective. When the more complex SelToIdxStrategy (selection on a table to an index strategy) is employed, the resulting PhyOp_Range runs SSC but results in no reduction. Again, it seems that simpler, more direct operations work best with SSC.

I wish I could tell you exactly what SSC does, and show the exact calculations, but I don't know those details. If you want to explore the limited trace output available for yourself, you can employ undocumented trace flag 2398. An example output is:

Smart seek costing (7.1) :: 1.34078e+154 , 0.001

That example relates to memo group 7, alternative 1, showing a cost upper bound, and a factor of 0.001. To see cleaner factors, be sure to rebuild the tables without parallelism so the pages are as dense as possible. Without doing that, the factor is more like 0.000821 for your example Target table. There are some fairly obvious relationships there, of course.

SSC can also be disabled with undocumented trace flag 2399. With that flag active, both costs are the higher value.

Best Answer

Related Solutions

Execution plan using B+ tree index, but also sorts

Sql-server – Why does the estimated cost of (the same) 1000 seeks on a unique index differ in these plans

Related Question