Low Selectivity
Here is a good quote from SQLServerCentral:
In general, a nonclustered index should be selective. That is, the
values in the column should be fairly unique and queries that filter
on it should return small portions of the table.
The reason for this is that key/RID lookups are expensive operations
and if a nonclustered index is to be used to evaluate a query it needs
to be covering or sufficiently selective that the costs of the lookups
aren’t deemed to be too high.
If SQL considers the index (or the subset of the index keys that the
query would be seeking on) insufficiently selective then it is very
likely that the index will be ignored and the query executed as a
clustered index (table) scan.
It is important to note that this does not just apply to the leading
column. There are scenarios where a very unselective column can be
used as the leading column, with the other columns in the index making
it selective enough to be used.
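A quick sketch of that last scenario (hypothetical table and column names, SQL Server syntax): status alone has few distinct values, but paired with a selective column the index can still be seeked on.

```sql
-- status alone is unselective (few distinct values), but combined with
-- customer_id the index key becomes selective enough to be used.
CREATE NONCLUSTERED INDEX IX_orders_status_customer
    ON dbo.orders (status, customer_id);

-- A query filtering on both columns can seek on this index, even though
-- the leading column by itself would not justify one:
SELECT order_id, order_date
FROM   dbo.orders
WHERE  status = 'OPEN' AND customer_id = 42;
```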
A couple of things to consider when indexing:
When To Avoid Indexing
- Indexes should not be used on small tables.
- Avoid indexing tables that have frequent, large batch update or insert operations.
- Indexes should not be used on columns that contain a high number of NULL values.
- Columns that are frequently manipulated should not be indexed.
- Do not over-index a table when it already has a lot of indexes.
- Avoid nonclustered, unfiltered indexes on columns that have few distinct values (low cardinality).
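For that last point, a filtered index is often the workable alternative; a hedged sketch (hypothetical table and column, SQL Server syntax):

```sql
-- Instead of indexing a mostly-NULL column outright, a filtered index
-- keeps only the interesting rows, so the index stays small and selective.
CREATE NONCLUSTERED INDEX IX_orders_shipped_date
    ON dbo.orders (shipped_date)
    WHERE shipped_date IS NOT NULL;
```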
Effectiveness of Indexing
I'm going to quote another Stack Exchange post:
There are a few concepts and terms that are important to understand
when dealing with indexes. Seeks, scans, and lookups are some of the
ways that indexes will be utilized through select statements.
Selectivity of key columns is integral to determining how effective an
index can be.
A clustered index's key columns are
called a clustering key. This is how records are made unique in the
context of a clustered index. All nonclustered indexes will include
the clustered key by default, in order to perform lookups when
necessary. Every index is modified by each corresponding DML statement (insert, update, or delete). That having been said, it is
best to balance performance gains in select statements against
performance hits in insert, delete, and update statements.
In order to determine how effective an index is, you must determine
the selectivity of your index keys. Selectivity can be defined as a
percentage of distinct records to total records. If I have a [person]
table with 100 total records and the [first_name] column contains 90
distinct values, we can say that the [first_name] column is 90%
selective. The higher the selectivity, the more efficient the index
key. Keeping selectivity in mind, it is best to put your most
selective columns first in your index key. Using my previous [person]
example, what if we had a [last_name] column that was 95% selective?
We would want to create an index with [last_name], [first_name] as the
index key.
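The selectivity computation and the resulting index from the [person] example can be expressed directly in SQL (a sketch; the numbers come from the quoted example):

```sql
-- Selectivity = distinct values / total rows, per the definition above.
-- With 100 rows, 95 distinct last names and 90 distinct first names,
-- these would come out to 0.95 and 0.90 respectively.
SELECT COUNT(DISTINCT last_name)  * 1.0 / COUNT(*) AS last_name_selectivity,
       COUNT(DISTINCT first_name) * 1.0 / COUNT(*) AS first_name_selectivity
FROM   person;

-- Most selective column first in the index key:
CREATE NONCLUSTERED INDEX IX_person_last_first
    ON person (last_name, first_name);
```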
Personally
I index based on how the table is being used. This includes a holistic view of all queries on the table, as well as which queries are not taking advantage of indexes. As Jeremiah Peschka said in the same thread, if the percentage of missing indexes is high, then an index based on how the table is being used is likely needed.
Key Takeaway
The SQL query engine loves highly selective key columns. Index on how a table is being used. Keep It Simple Stupid (KISS).
It's hard to tell whether this will hurt or improve performance, but adding a composite index on start_date and end_date shouldn't make your query worse: the index will simply be used or not used. You didn't mention your RDBMS, but it's unlikely your engine's optimizer will pick a worse plan (though there is always a possibility).
That being said, there might be a slight impact on plan generation time, since your optimizer has an extra index to consider, and unused indexes can hurt overall performance because they must be maintained on every update and insert.
Whether the index will help your query is hard to tell; it depends on the selectivity of your query and on cardinality statistics. Your RDBMS's optimizer should pick the fastest way to get your data, and even if an index exists on the columns your query filters on, a full scan might still be the better option, especially since you are selecting all fields.
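As a sketch, the composite index in question would look something like this (the table name is an assumption):

```sql
-- Hypothetical table name; start_date leads since range filters on it
-- are presumably the common case.
CREATE INDEX ix_events_start_end
    ON events (start_date, end_date);
```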
Why is selectivity important?
When an index is used, it points to the actual record, which may then need to be fetched via that pointer. Most RDBMSs work that way.
See for example the Oracle documentation
In general, index access paths should be used for statements that
retrieve a small subset of table rows, while full scans are more
efficient when accessing a large portion of the table
or the SQL Server documentation
In many cases, the optimizer forces a table scan for queries with
result sets of approximately 5 percent, although the table scan
becomes more efficient than index access at selectivities of 8 to 10
percent.
Why is select * important?
If you fetch all fields from a table, your database engine will have to fetch the actual record using the pointer stored in the index, and you lose the option of using included columns (if your RDBMS supports them) or covering indexes.
See for example this article for SQL server (but the principle goes for a lot of vendors): Using Covering Indexes to Improve Query Performance
However, under some circumstances, the overhead associated with
nonclustered indexes may be deemed too great by the query optimizer
and SQL Server will resort to a table scan to resolve the query.
and
The observed improvement is due to the fact that the nonclustered index
contained all of the required information to resolve the query. No Key
Lookups were required.
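A covering index of the kind the article describes might be sketched like this (hypothetical table and columns, SQL Server syntax):

```sql
-- All columns the query touches live in the index, so the query is
-- resolved from the index alone and no key lookups are required.
CREATE NONCLUSTERED INDEX IX_orders_customer_covering
    ON dbo.orders (customer_id)
    INCLUDE (order_date, total_amount);

-- This query can be answered entirely from the index:
SELECT order_date, total_amount
FROM   dbo.orders
WHERE  customer_id = 42;
```

A select * against the same table would defeat this: any column not in the index forces a lookup back to the base table.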
Best Answer
Yes, there can be downsides. If another query looks at a different slice of the data, one not determined by the date, it might take a performance hit if its rows are now spread out over more data pages, in just the same way your first query benefits. That depends entirely on information not in your question.
That could be anything. It depends on what you have and what you query exactly. Querying a single row is not affected either way, but multiple rows might be.
Be aware that CLUSTER rewrites the table in pristine condition like VACUUM FULL does (removes dead tuples, compacts the physical size of the table, rewrites indexes). So you might see an immediate positive effect on read performance independent of the sort order, much like you would get with VACUUM FULL.

After CLUSTER, you may want to run a plain VACUUM on the table to update the visibility map, too, which may allow index-only scans.

All benefits of CLUSTER shrink with the write frequency. Also, if you have many updates to the table, CLUSTER can actually hurt write performance by removing "wiggle room" for HOT updates on the same data page. You might be able to counter that effect with a FILLFACTOR setting below 100. Again, that depends on the locality of updated rows, etc. Related:
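In Postgres, that FILLFACTOR counter-measure could look like this (table and index names are assumptions):

```sql
-- Leave 10% free space per page so HOT updates have room to land on the
-- same page; CLUSTER honors the storage parameter when it rewrites.
ALTER TABLE my_table SET (fillfactor = 90);
CLUSTER my_table USING my_table_my_timestamp_idx;
```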
Either way, I would probably not index and cluster on my_timestamp::date, but on my_timestamp directly. Nothing lost, something gained. The cast is very cheap, but it's still cheaper not to cast at all. And the index can support more queries.

Even though a date occupies only 4 bytes on disk and a timestamp occupies 8 bytes, the difference is typically lost to alignment padding in your case, and both indexes have exactly the same size.

The order of multiple rows on the same day resulting from your expression index is arbitrary. There can still be two identical timestamps, but with 6 fractional digits that's normally very unlikely. Aside from that, you get a deterministic order of rows, which can have various advantages.

I also dropped the DESC keyword, since Postgres can read indexes backwards virtually as fast as forwards. (Sort order matters for multicolumn indexes, though!) More:

Instead of:

You would now use:

Same performance.
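The two omitted snippets were presumably the date-cast filter versus the equivalent plain range filter; a sketch (table name and dates assumed):

```sql
-- Instead of (needs the expression index on my_timestamp::date):
SELECT * FROM my_table
WHERE  my_timestamp::date = '2024-05-01';

-- You would now use (sargable on a plain index on my_timestamp):
SELECT * FROM my_table
WHERE  my_timestamp >= '2024-05-01'
AND    my_timestamp <  '2024-05-02';
```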
If you don't need the time component of the column at all, convert the column to date ...

How to roll back CLUSTER?

CLUSTER on a single table can be rolled back with ROLLBACK like any other regular command, as long as the transaction has not been committed. However, I quote the manual:

You can always run CLUSTER with a different index to change the physical order of rows once more.
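A minimal sketch of the rollback behavior (table and index names assumed):

```sql
-- CLUSTER writes a new copy of the table; until commit, the old copy
-- is still the visible one, so ROLLBACK simply discards the rewrite.
BEGIN;
CLUSTER my_table USING my_table_my_timestamp_idx;
ROLLBACK;  -- the previous physical order remains in effect
```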