The database looks like this:
Parameters has 126000 rows, which link an ExperimentID to the starting parameters for that ExperimentID. Simulations has ~107M rows, but I really only query against Xa (a floating point value with mostly unique entries). I'm interested in the other data, but for now it's just there. The current indices I have are the Primary Key index on Simulations, which is clustered, and a nonclustered columnstore index on the following columns:
Here's an example query I might be running against this database:
USE IMTsimulations
SELECT DISTINCT p.ControlRange, p.SensorError,
PERCENTILE_DISC(0) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_MIN,
PERCENTILE_DISC(.25) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Quartile25,
PERCENTILE_DISC(.5) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Median,
PERCENTILE_DISC(.75) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Quartile75,
PERCENTILE_DISC(1) WITHIN GROUP( ORDER BY s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_MAX,
AVG(s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_Mean,
STDEV(s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as Xa_STDEV,
COUNT(s.Xa) OVER (PARTITION BY p.ControlRange, p.SensorError) as TotalEvents
FROM dbo.Simulations s JOIN Parameters p ON s.ExperimentID = p.ExperimentID
WHERE s.TimeElapsed <= 7200 and p.ExogDexCurve = 4
ORDER BY p.ControlRange, p.SensorError
As you can see, I'll be mostly looking at aggregates of Xa according to certain starting parameters. I feel like my primary key shouldn't be cluster, and the columnstore index should be clustered, and it should only have Xa, but I don't know for sure. The dataset is complete, so there will be no INSERTs, UPDATEs, DELETEs, or anything like that on either table.
What indexes should I be building for this thing? What's going to let me efficiently query aggregates on Xa that are partitioned according to things in the Parameters table?
I'd appreciate any depth you go into, as I'm doing this on my own, and don't have much technical background with databases.
Best Answer
You said, that
TimeElapsed
is not really filtering much, so there is no point in indexing it.Just looking at the query structure, I'd add an index to
Parameters
table on(ExogDexCurve, ControlRange, SensorError)
. The order of columns is important.I would also make sure that all indexes have Fill Factor=100, because the data is static.
But, overall, the main bottleneck is reading all
Xa
values for the neededControlRange
andSensorError
.The server has to read all this data. Storing it in columnstore index should help.
I would also consider denormalizing the data and making just one table (persisting the join). In this case I'd create a covering index. The order of columns is important.
Xa
is part of an index rather than included, because you order byXa
.With one joined table and this covering index the query should scan the index without extra sorts.
Also, to "hide" the persisted joined table it may be possible to create an indexed view of the join instead of explicit table, but I'm not sure.