Let's run two hypothetical databases (H1 and H2) in parallel, with the same total amount of RAM for bufferpools (R).
- Let H1 have a single bufferpool of size R.
- Let H2 have two bufferpools: one of size I for index pages, the other of size D for data pages. (And D + I == R, of course.)
The question is:
- How do you determine the right values for I and D to make H2 perform better than H1?
My answer is that you can't in general. The database engine for H1 has more room to optimize its cache than H2's does. If there are times of day when more index pages would lead to better performance, it can discard unused data pages and keep a "mostly index pages" cache. If data pages later become hotter, it can evict more index pages and keep a "mostly data pages" cache.
H2 can't do that. Once it has I pages' worth of index pages cached, it can't cache any more, even if that's what would be best right now. It's stuck with a sub-optimal use of RAM.
The only way for H2 to perform just as well as H1 is if the D/I split initially chosen is ideal and the workload is very stable. That can certainly happen, but I'd wager it is not a very common database scenario. If that's not your case, think of H1 as being the same as H2, but with dynamic partitioning of the cache between data and index pages, managed directly by the component that knows the most about how to optimize its I/O (i.e. the database engine).
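The intuition can be sketched with a toy LRU simulation (plain Python; this is not any real engine's cache, and the page counts and workload phases are invented for illustration). A single cache of size R adapts as the hot set shifts from index-heavy to data-heavy, while a fixed 50/50 split thrashes in whichever half is undersized:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU page cache; access() returns True on a hit."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def access(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)    # refresh recency
            return True
        self.pages[page] = None
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used
        return False

def make_workload():
    """Hot set shifts: index-heavy (80 idx / 20 data), then the reverse."""
    w = []
    for _ in range(10):
        w += [('idx', i) for i in range(80)] + [('data', i) for i in range(20)]
    for _ in range(10):
        w += [('idx', i) for i in range(20)] + [('data', i) for i in range(80)]
    return w

def hits_unified(workload, r):
    """H1: one bufferpool of size R shared by all pages."""
    cache = LRUCache(r)
    return sum(cache.access(p) for p in workload)

def hits_split(workload, i, d):
    """H2: fixed pools of size I (index pages) and D (data pages)."""
    pools = {'idx': LRUCache(i), 'data': LRUCache(d)}
    return sum(pools[kind].access((kind, n)) for kind, n in workload)

w = make_workload()
print(hits_unified(w, 100), hits_split(w, 50, 50))
```

In this made-up workload the unified cache ends up with several times more hits, because whichever fixed pool is too small for the current hot set degenerates into pure LRU thrashing.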
This is not to say that maintaining different bufferpools is never a good idea.
One scenario for isolating specific pages in a specific cache I've encountered is having a critical "report" that needed to run fast (obviously) at all times, and happened to use a few tables that were pretty much never used elsewhere. So those pages kept getting evicted, and that "report" had runtimes that varied a lot from run to run. Moving a specific set of tables (and their indexes) to a specific pool removed much of the "non-deterministic" performance part.
But that's sub-optimal for the database as a whole: more of a kludge, and closer to voodoo optimization IMO. (And that wasn't on DB2, but I believe that's irrelevant here.)
So my advice is: if you have X GB of RAM available for cache, use a single bufferpool, put everything in it, and let the database engine do its magic in there.
If you run into a corner case that appears to benefit from cache segregation, try it out, benchmark it, think about the overhead of maintaining the magic numbers for each cache size, and then go tune the queries, schema or disk layout instead :)
If you want to copy a complete PostgreSQL database within the same cluster, the fastest method is to use it as the TEMPLATE in a CREATE DATABASE statement. I quote the manual:
By default, the new database will be created by cloning the standard
system database template1. A different template can be specified by
writing TEMPLATE name. In particular, by writing TEMPLATE template0,
you can create a virgin database containing only the standard objects
predefined by your version of PostgreSQL. This is useful if you wish
to avoid copying any installation-local objects that might have been
added to template1.
CREATE DATABASE db_copy TEMPLATE db_org;
This effectively copies the underlying files around, like you tried to do manually, except that it sets everything up to work correctly.
You may want to clean up your original database before you do (it depends):
VACUUM FULL ANALYZE;
Best Answer
Update
This is now implemented in Azure SQL Database. It generates recommendations,
and index management can be configured to be automatic.
Original Answer
Some databases do already (kind of) create indexes automatically.
In SQL Server the execution plan can sometimes include an Index Spool operator, where the RDBMS dynamically creates an indexed copy of the data. However, this spool is not a persistent part of the database kept in sync with the source data, and it cannot be shared between query executions, so executing such plans may end up repeatedly creating and dropping temporary indexes on the same data.
Perhaps in the future RDBMSs will have the capacity to dynamically drop and create persistent indexes according to workload.
The process of index optimisation is, in the end, just a cost-benefit analysis. Whilst it is true that humans may have more information about the relative importance of queries in a workload, in principle there is no reason why this information could not be made available to the optimiser. SQL Server already has a Resource Governor that allows sessions to be classified into different workload groups with different resource allocations according to priority.
The missing index DMVs mentioned by Kenneth are not intended to be implemented blindly, as they only consider the benefit to a specific query and make no attempt to take account of the cost of the potential index to other queries. Nor do they consolidate similar missing indexes; e.g. the output of these DMVs may report missing indexes on both A,B,C and A,B INCLUDE(C).
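The kind of consolidation the DMV output lacks can be sketched in a few lines of Python (the suggestion representation here is invented for illustration; the real DMVs spread this information across several views). One suggestion is redundant if another suggested index has its key list as a leading prefix and covers its included columns:

```python
# A suggestion is (key_columns, include_columns), both tuples of column names.
def subsumes(a, b):
    """True if suggested index `a` can serve the queries behind suggestion `b`."""
    keys_a, inc_a = a
    keys_b, inc_b = b
    return (keys_a[:len(keys_b)] == keys_b                # b's keys are a prefix of a's
            and set(inc_b) <= set(keys_a) | set(inc_a))   # b's includes are covered

suggestions = [
    (('A', 'B', 'C'), ()),    # index on A,B,C
    (('A', 'B'), ('C',)),     # index on A,B INCLUDE(C)
]

# Keep only suggestions not subsumed by some other suggestion.
consolidated = [s for s in suggestions
                if not any(t != s and subsumes(t, s) for t in suggestions)]
print(consolidated)   # only the A,B,C index survives
```

Here the index on A,B,C subsumes A,B INCLUDE(C) (same key prefix, C covered), but not vice versa, so a consolidating tool would propose one index where the raw DMV output reports two.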
Some current issues with the idea are
It is probably reasonable to expect the accuracy of costing models to improve over time, but point 2 looks trickier to solve and point 3 is inherently insoluble.
Nevertheless, probably the vast majority of installs are not in this idealised situation, with skilled staff who continuously monitor, diagnose, and anticipate (or at least react to) changes in workloads.
The AutoAdmin project at Microsoft Research has been running since 1996.
The project home page lists several intriguing projects. One is particularly relevant to the question here.
The authors state
The paper introduces an algorithm
The implementation of the algorithm allows for throttling in response to changes in server load, and it can also abort index creation if the workload changes during creation and the expected benefit falls below the point at which it is deemed worthwhile.
The authors' conclusion on the topic of online versus traditional physical tuning:
The conclusions here are similar to those in another paper, Autonomous Query-driven Index Tuning.