Sql-server – What index should I use for a read-only database

sql server

The database looks like this:

Parameters has 126000 rows, which link an ExperimentID to the starting parameters for that ExperimentID. Simulations has ~107M rows, but I really only query against Xa (a floating point value with mostly unique entries). I'm interested in the other data, but for now it's just there. The current indices I have are the Primary Key index on Simulations, which is clustered, and a nonclustered columnstore index on the following columns:

Here's an example query I might be running against this database:

USE IMTsimulations
SELECT DISTINCT p.ControlRange, p.SensorError,
    PERCENTILE_DISC(0) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_MIN,
    PERCENTILE_DISC(.25) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Quartile25,
    PERCENTILE_DISC(.5) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Median,
    PERCENTILE_DISC(.75) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Quartile75,
    PERCENTILE_DISC(1) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_MAX,
    AVG(s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Mean,
    STDEV(s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_STDEV,
    COUNT(s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as TotalEvents

FROM dbo.Simulations s JOIN Parameters p ON s.ExperimentID = p.ExperimentID
WHERE s.TimeElapsed <= 7200 and p.ExogDexCurve =  4
ORDER BY p.ControlRange, p.SensorError

As you can see, I'll be mostly looking at aggregates of Xa according to certain starting parameters. I feel like my primary key shouldn't be cluster, and the columnstore index should be clustered, and it should only have Xa, but I don't know for sure. The dataset is complete, so there will be no INSERTs, UPDATEs, DELETEs, or anything like that on either table.

What indexes should I be building for this thing? What's going to let me efficiently query aggregates on Xa that are partitioned according to things in the Parameters table?

I'd appreciate any depth you go into, as I'm doing this on my own, and don't have much technical background with databases.

Best Answer

You said, that TimeElapsed is not really filtering much, so there is no point in indexing it.

Just looking at the query structure, I'd add an index to Parameters table on (ExogDexCurve, ControlRange, SensorError). The order of columns is important.

I would also make sure that all indexes have Fill Factor=100, because the data is static.

But, overall, the main bottleneck is reading all Xa values for the needed ControlRange and SensorError.

The server has to read all this data. Storing it in columnstore index should help.

I would also consider denormalizing the data and making just one table (persisting the join). In this case I'd create a covering index. The order of columns is important.

(ExogDexCurve, ControlRange, SensorError, Xa) INCLUDE (TimeElapsed)

Xa is part of an index rather than included, because you order by Xa.

With one joined table and this covering index the query should scan the index without extra sorts.

Also, to "hide" the persisted joined table it may be possible to create an indexed view of the join instead of explicit table, but I'm not sure.

Related Solutions

Sql-server – Partition Key questions in SQL Server 2008

Assuming that you have the primary key on a clustered index then the partitioning key needs to be part of the primary key.

You will not loose the benefit of partitioning by joining to non-partitioned tables, providing that the queries are designed to make use of the partitioned table, for example the following query WILL benefit from partitioning

SELECT F.Col1, F.Col2, D.Col3
FROM Fact_Partitioned F
    INNER JOIN Dim_MyDim D ON F.Col1 = D.Col1
WHERE F.Col1 = 5

But the following query WILL NOT benefit from partition elimination

SELECT F.Col1, F.Col2, D.Col3
FROM Fact_Partitioned F
    INNER JOIN Dim_MyDim D ON F.Col1 = D.Col1
WHERE D.Col1 = 5

It is a subtle difference, but in the first query, the join key is filtered in the partitioned table, taking advantage of elimination and then joined to the dimension. In the second query, the key is filtered in the dimension and then joined against the whole of the fact table, rather than just required partitions.

It goes without saying that the partitioning key needs to be in the WHERE clause for elimination to work, otherwise SQL Server does not know which partition(s) the data is in.

Adding a filter criteria on the JOIN clause will not help you. It needs to be in the WHERE clause to benefit from elimination.

The Partition Key does not need to be part of a non-clustered index (NCI) but if the NCI is unique, then it needs to contain the partitioning key in order to align the index. This is where the NCI is built on the same partition scheme as the table. NCIs should also be partition aligned unless there is an exceedingly good reason not to. I have never come across a good enough reason!

SQL Server – Aligned Index Column Order

You need to think about your partitions as if they are separate tables.

A clustered index on a partitioned table effectively has the partition number as its first key. For you, with a partition per unique PartitionKey, the clustered index will essentially have that column in the front of it. Or as good as...

I'd expect having Id first would feel better, in case you don't know the month and need to search all partitions. It'll seek on each partition rather than scan them.

Best Answer

Related Solutions

Sql-server – Partition Key questions in SQL Server 2008

SQL Server – Aligned Index Column Order

Related Question