Things are more complicated than that. Here are a few points of consideration.
First, this entire discussion assumes B-Trees or B+ Trees (hence the O(log(n))). There are other types of indexes, like hash indexes, where access is O(1). Your question suggests you're looking up values with an "equals" search (e.g. looking for X = 17). In that particular scenario, a hash index is preferable, when possible.
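To make the distinction concrete, here is a minimal sketch in PostgreSQL (one of the engines that exposes hash indexes directly; the table t and column x are hypothetical):

```sql
-- Hash indexes support only equality comparisons -- exactly the X = 17 case.
CREATE INDEX idx_t_x_hash ON t USING HASH (x);

-- Can use the hash index, O(1) expected:
SELECT * FROM t WHERE x = 17;

-- Cannot use it; a range predicate needs a B-Tree index instead:
SELECT * FROM t WHERE x > 17;
```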
I do agree though that most indexes you'll find today are B/B+ Trees, so let's continue with this assumption.
You've also implicitly indicated that there's always only one resulting row in a SELECT, which is hardly a representative case; we often look for 1,000 rows at a time. But let's continue with the assumption of a single matching row.
Your next assumption is that searches are always done by the indexed column. This is fine, but I'm just noting that DELETEing a record by some unindexed column Y turns out to be more expensive: you spend O(n) time finding the record, and then pay an additional O(log(n)) for updating the index.
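A quick sketch of the difference, with a hypothetical table where x is indexed and y is not:

```sql
CREATE TABLE t (x INT, y INT);
CREATE INDEX idx_t_x ON t (x);

-- O(log(n)) to locate the row via idx_t_x, plus O(log(n)) to update the index:
DELETE FROM t WHERE x = 17;

-- O(n) full scan to locate the row, and you still pay O(log(n))
-- to remove its entry from idx_t_x afterwards:
DELETE FROM t WHERE y = 17;
```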
But let us continue with the assumption that we're only discussing queries that look up indexed columns.
Some tables use an unclustered index format (which fits your calculations): the table is one entity and the index is another. Others use a clustered index format: table rows (or rather blocks of rows, or rather yet pages of rows) are actually stored as leaves inside the clustering index. In such a scenario, you will pay O(log(n)) for finding a record even in an INSERT command. An optimization exists for the case where you're inserting at the end of the table: a decent implementation holds a pointer to the last record/page in the index. (Oh, and yes, you should note the possibility that your record gets INSERTed into the middle of the table.)
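For illustration, here is how the two insert patterns look in SQL Server, where a PRIMARY KEY is clustered by default (both tables are hypothetical):

```sql
-- Monotonically increasing clustering key: new rows always land at the
-- end of the table, so a pointer to the last page makes INSERT cheap.
CREATE TABLE orders (
    id        INT IDENTITY PRIMARY KEY,
    placed_on DATETIME2,
    customer  VARCHAR(50)
);

-- Non-monotonic clustering key: new rows land in the middle of the
-- table, which can force page splits.
CREATE TABLE orders_by_customer (
    customer VARCHAR(50),
    id       INT IDENTITY,
    CONSTRAINT pk_obc PRIMARY KEY CLUSTERED (customer, id)
);
```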
Actually, records can get inserted into the middle of the table even in unclustered tables; it makes sense to spend more search time so as to avoid fragmentation, and at least one implementation that I know of does that. I'm assuming others may, too.
Deleting/inserting an index entry in a B-Tree is O(1) on average, but may cost up to O(log(n)) operations in the case of propagated page merges/splits.
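If page splits become a problem, one common mitigation (shown here in SQL Server syntax, against the hypothetical t/x from earlier) is to leave free space in each leaf page when building the index:

```sql
-- Leave 20% of each leaf page empty so mid-page INSERTs have room
-- to land before a split is required.
CREATE INDEX idx_t_x_padded ON t (x) WITH (FILLFACTOR = 80);
```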
Also, a fact that always comes as a surprise to many: sometimes a table scan is faster than an index lookup. This is particularly true for queries returning multiple rows. It turns out that going through the index adds overhead; compared to a full table scan, that overhead can actually make the total cost higher. For single-row lookups, the vast majority of index lookups should be faster than a full table scan.
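You can observe this yourself by forcing each access path and comparing the timings; in SQL Server syntax, against the hypothetical t/idx_t_x from earlier:

```sql
-- Force a scan of the base table:
SELECT * FROM t WITH (INDEX(0)) WHERE x < 1000;

-- Force use of the nonclustered index (one lookup per qualifying row):
SELECT * FROM t WITH (INDEX(idx_t_x)) WHERE x < 1000;
```

For a predicate matching many rows, the forced scan often comes out ahead, because the index path pays a lookup per row.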
But do consider the following general convention: you pay in dollars for any action that accesses disk, and in nickels for actions that work on in-memory data. This is actually at the heart of database disk I/O optimization. If the index pages are on disk, and the table pages happen to be in memory, a table scan may well be the cheaper option.
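In SQL Server you can even peek at which side of that divide a table currently sits on; a rough sketch (dbo.t is the hypothetical table from earlier):

```sql
-- Count how many of the table's data pages are currently sitting in
-- the buffer pool (i.e. would be paid for in nickels, not dollars).
SELECT COUNT(*) AS cached_pages
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
    ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p
    ON p.hobt_id = au.container_id
WHERE bd.database_id = DB_ID()
  AND p.object_id = OBJECT_ID('dbo.t');
```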
And that's where havoc comes in: it really all depends on your workloads, on your memory size, on your dataset size.
Did you ever take a math class where you had to solve some complex integral? It took you hours to solve, and you got points off for missing some tiny minus sign somewhere?
Did you ever take a physics class where you had to solve some complex integral? The professor would just throw away chunks of the equation, saying "this is negligible", and you would tear your hair out? WHY is it negligible? Why not other things?
Computer science is based on math. Computers are based on physics. They are physical objects. They need to spin disks, access a memory bus, manage billions of transistors... You just can't actually anticipate what will happen and put it all under some equation.
It may just turn out that for some particular dataset your entire equation does not hold water. In other times, it may be just fine.
Well, the issue is now resolved:
While it would seem logical to use filtered indexes (filtered on NOT NULL) to reduce database size and, as so many sources on the web claim, increase performance, the reality seems to be something else entirely.
In layman's terms: the SQL Server query planner resolves even your basic inner joins without making any assumptions about the content of the columns. Even though NULL values can never match in a join, they must be included in the column's index for the query planner to use it, unless otherwise specified with a predicate such as WHERE joinCol_ID IS NOT NULL. Basically, SQL Server does not use filtered indexes for joins at all, unless the queries themselves are modified to account for the filter condition. Instead, it will create new statistics on these columns and/or use a clustered index scan or another index that includes the column, whichever it deems most effective. Using filtered indexes on foreign keys is therefore an absolutely horrid idea.
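A sketch of the behavior in SQL Server syntax (the parent/child tables are hypothetical):

```sql
-- Filtered index on a nullable foreign key column:
CREATE NONCLUSTERED INDEX ix_child_parent
    ON child (parent_id)
    WHERE parent_id IS NOT NULL;

-- The planner will NOT consider ix_child_parent here, even though
-- NULLs can never satisfy the join:
SELECT *
FROM child AS c
JOIN parent AS p ON p.id = c.parent_id;

-- It only becomes a candidate once the query repeats the filter:
SELECT *
FROM child AS c
JOIN parent AS p ON p.id = c.parent_id
WHERE c.parent_id IS NOT NULL;
```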
We still have no idea how months' worth of testing this in multiple other environments never produced the same results outside of this one single DB, but this is the way it's supposed to work. Apparently something that, as far as we know, is not related to cache, statistics or configuration caused the production DB to behave differently and correctly detect and use the filtered indexes, while all of the testing environments simply used the old indexes (seeing as the indexes were dropped and recreated with the same name, this seems a valid theory even if there is no real proof).
So the lesson of the story: the web is filled with examples of how underused filtered indexes are and how awesome they can be. But this serious downside never popped up, except as a nagging thought in the back of my head saying "if these are so great, then why aren't NULL values filtered out of indexes by default, since they only take up space and only serve a purpose in special circumstances?" Well, now I know why. :)
Best Answer
No hard-and-fast rules like that are generally applicable. If you have more than ten columns (or combinations of columns, as an index can cover several) that are regularly searched by, then you probably need more than ten indexes.
The key problems with too many indexes which you need to consider are:

- Every INSERT, UPDATE and DELETE has to maintain each affected index, so write performance degrades as the index count grows.
- Each index consumes disk space, and buffer-pool memory when it is in use.
- More candidate indexes and their statistics give the query planner more work to do when choosing a plan.
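A rough way to put numbers on the first point in SQL Server (the DMV counters reset on instance restart, so judge them over a representative uptime window):

```sql
-- Indexes that are being maintained on every write but have served
-- no seeks, scans or lookups since the last restart:
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_updates           AS writes_paid,
       s.user_seeks + s.user_scans + s.user_lookups AS reads_earned
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id
   AND i.index_id  = s.index_id
WHERE s.database_id = DB_ID()
  AND s.user_seeks + s.user_scans + s.user_lookups = 0
  AND s.user_updates > 0;
```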
Nothing beats application-specific benchmarking for that sort of thing. You know how your application and its users hit the database (if you don't, you need to study this, or apply some careful thought if the app isn't live yet and so has no real users), so build some benchmarks based on that activity and see what difference the indexes make. Remember to test queries/updates that don't happen very often as well as those that do; sometimes infrequent reports are the most important. For example, in the app that I manage there are reports the users access only once or twice per month, but while they aren't used often they are vital to the business and need to respond in reasonable time, so you don't want them timing out or simply taking hours to produce.
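As a starting point for such a benchmark, SQL Server will report per-query I/O and CPU cost for you; a minimal sketch using a hypothetical monthly report query:

```sql
SET STATISTICS IO, TIME ON;

-- Run a statement your users actually execute, before and after the
-- index change, and compare logical reads and CPU time in the output.
SELECT customer, SUM(total) AS monthly_total
FROM orders
WHERE placed_on >= '2024-01-01' AND placed_on < '2024-02-01'
GROUP BY customer;

SET STATISTICS IO, TIME OFF;
```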