Let's run two hypothetical databases (H1 and H2) in parallel, with the same total amount of RAM for bufferpools (R).
- Let H1 have a single bufferpool of size R.
- Let H2 have two bufferpools: one of size I for index pages, the other of size D for data pages. (And D + I == R, of course.)
The question is:
- How do you determine the right values for I and D to make H2 perform better than H1?
My answer is that you can't in general. The database engine for H1 has more room to optimize its cache than H2's does. If there are times of day when more index pages would lead to better performance, it can discard unused data pages and keep a "mostly index pages" cache. If data pages later become hotter, it can evict more index pages and keep a "mostly data pages" cache.
H2 can't do that. Once it has I pages' worth of index pages cached, it can't cache any more, even if that's what would be best right now. It's stuck with a sub-optimal use of RAM.
The only way for H2 to perform just as well as H1 is if the D/I split initially chosen is ideal and the workload is very stable. That can certainly happen, but I'd wager it is not a very common database scenario. If that's not your case, think of H1 as being the same as H2, but with dynamic partitioning of the cache between data and index pages, managed directly by the component that knows the most about how to optimize its I/O (i.e. the database engine).
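The intuition can be sketched with a toy LRU simulation (plain Python; this is not any real engine's cache, and the page counts and workload phases are invented for illustration). A single cache of size R adapts as the hot set shifts from index-heavy to data-heavy, while a fixed 50/50 split thrashes in whichever half is undersized:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU page cache; access() returns True on a hit."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def access(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)    # refresh recency
            return True
        self.pages[page] = None
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used
        return False

def make_workload():
    """Hot set shifts: index-heavy (80 idx / 20 data), then the reverse."""
    w = []
    for _ in range(10):
        w += [('idx', i) for i in range(80)] + [('data', i) for i in range(20)]
    for _ in range(10):
        w += [('idx', i) for i in range(20)] + [('data', i) for i in range(80)]
    return w

def hits_unified(workload, r):
    """H1: one bufferpool of size R shared by all pages."""
    cache = LRUCache(r)
    return sum(cache.access(p) for p in workload)

def hits_split(workload, i, d):
    """H2: fixed pools of size I (index pages) and D (data pages)."""
    pools = {'idx': LRUCache(i), 'data': LRUCache(d)}
    return sum(pools[kind].access((kind, n)) for kind, n in workload)

w = make_workload()
print(hits_unified(w, 100), hits_split(w, 50, 50))
```

In this made-up workload the unified cache ends up with several times more hits, because whichever fixed pool is too small for the current hot set degenerates into pure LRU thrashing.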
This is not to say that maintaining different bufferpools is never a good idea.
One scenario for isolating specific pages in a specific cache I've encountered is having a critical "report" that needed to run fast (obviously) at all times, and happened to use a few tables that were pretty much never used elsewhere. So those pages kept getting evicted, and that "report" had runtimes that varied a lot from run to run. Moving a specific set of tables (and their indexes) to a specific pool removed much of the "non-deterministic" performance part.
But that's sub-optimal for the database as a whole: more of a kludge, and closer to voodoo optimization IMO. (And that wasn't on DB2, but I believe that's irrelevant here.)
So my advice is: if you have X GB of RAM available for cache, use a single bufferpool, put everything in it, and let the database engine do its magic in there.
If you run into a corner case that appears to benefit from cache segregation, try it out, benchmark it, think about the overhead of maintaining the magic numbers for each cache size, and then go tune the queries, schema or disk layout instead :)
If you want to copy a complete PostgreSQL database within the same cluster, the fastest method is to use it as the TEMPLATE in a CREATE DATABASE statement. I quote the manual:
By default, the new database will be created by cloning the standard
system database template1. A different template can be specified by
writing TEMPLATE name. In particular, by writing TEMPLATE template0,
you can create a virgin database containing only the standard objects
predefined by your version of PostgreSQL. This is useful if you wish
to avoid copying any installation-local objects that might have been
added to template1.
CREATE DATABASE db_copy TEMPLATE db_org;
This effectively copies the underlying files around, like you tried to do manually, except that it sets everything up to work correctly.
You may want to clean up your original database before you do (it depends):
VACUUM FULL ANALYZE;
Best Answer
Update
This is now implemented in Azure SQL Database. It generates recommendations,
and index management can be configured to be automatic.
Original Answer
Some databases do already (kind of) create indexes automatically.
In SQL Server the execution plan can sometimes include an Index Spool operator, where the RDBMS dynamically creates an indexed copy of the data. However, this spool is not a persistent part of the database kept in sync with the source data, and it cannot be shared between query executions, so executing such plans may end up repeatedly creating and dropping temporary indexes on the same data.
Perhaps in the future RDBMSs will have the capacity to dynamically drop and create persistent indexes according to workload.
The process of index optimisation is, in the end, just a cost-benefit analysis. Whilst it is true that humans may have more information about the relative importance of queries in a workload, in principle there is no reason why this information could not be made available to the optimiser. SQL Server already has a Resource Governor that allows sessions to be classified into different workload groups with different resource allocations according to priority.
The missing index DMVs mentioned by Kenneth are not intended to be implemented blindly, as they only consider the benefit to a specific query and make no attempt to take account of the cost of the potential index to other queries. Nor do they consolidate similar missing indexes; e.g. the output of these DMVs may report missing indexes on both A,B,C and A,B INCLUDE(C).
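The kind of consolidation the DMV output lacks can be sketched in a few lines of Python (the suggestion representation here is invented for illustration; the real DMVs spread this information across several views). One suggestion is redundant if another suggested index has its key list as a leading prefix and covers its included columns:

```python
# A suggestion is (key_columns, include_columns), both tuples of column names.
def subsumes(a, b):
    """True if suggested index `a` can serve the queries behind suggestion `b`."""
    keys_a, inc_a = a
    keys_b, inc_b = b
    return (keys_a[:len(keys_b)] == keys_b                # b's keys are a prefix of a's
            and set(inc_b) <= set(keys_a) | set(inc_a))   # b's includes are covered

suggestions = [
    (('A', 'B', 'C'), ()),    # index on A,B,C
    (('A', 'B'), ('C',)),     # index on A,B INCLUDE(C)
]

# Keep only suggestions not subsumed by some other suggestion.
consolidated = [s for s in suggestions
                if not any(t != s and subsumes(t, s) for t in suggestions)]
print(consolidated)   # only the A,B,C index survives
```

Here the index on A,B,C subsumes A,B INCLUDE(C) (same key prefix, C covered), but not vice versa, so a consolidating tool would propose one index where the raw DMV output reports two.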
Some current issues with the idea are
It is probably reasonable to expect the accuracy of costing models to improve over time, but point 2 looks trickier to solve and point 3 is inherently insoluble.
Nevertheless, probably the vast majority of installs are not in this idealised situation, with skilled staff who continuously monitor, diagnose, and anticipate (or at least react to) changes in workloads.
The AutoAdmin project at Microsoft Research has been running since 1996.
The project home page lists several intriguing projects. One is particularly relevant to the question here.
The authors state
The paper introduces an algorithm
The implementation of the algorithm allows for throttling in response to changes in server load, and it can also abort index creation if the workload changes during creation and the expected benefit falls below the point at which it is deemed worthwhile.
The authors' conclusion on the topic of online versus traditional physical tuning:
The conclusions here are similar to those in another paper, Autonomous Query-driven Index Tuning.