SQL Server – Store Permutations of Simulation Parameters Efficiently

sql server

I'm going to run a bunch of simulations in the near future using an application I wrote, and I'm having trouble designing my database around this.

Here's how it works. Each simulation has 6 parameters that can vary, and each parameter has ~10 options, so there are about 1 million permutations to somehow store in the DB. The simplest way I can think of is to have a column for each parameter, plus an additional column for my primary key (ExperimentID). I could then access the parameters for a given experiment directly by doing a simple

SELECT * FROM Parameters WHERE ExperimentID = (givenID)

Each parameter will be stored in the DB as an int, which will be interpreted by my application as the necessary parameters at runtime.

A more complex, but I think faster, way to do it is to take advantage of the speed of binary shifting. Instead of storing the parameters directly, I'd store the information about each parameter in 4 bits of an unsigned int. I don't actually know how ints are stored (binary-wise) in the DB, so that's the major hold-up.

Any advice on what to do in this situation? Also, I don't yet have these columns, so if I decide to go with the first option, how would I fill a table with all of the permutations of a set of parameters?

Best Answer

Why do you think that obscuring real data into a bit-masked field will be faster? You are only talking about 1 million rows at 28 bytes per row (six INT parameters plus the INT key). That really isn't much, especially since you are only doing singleton / seek operations.

A better optimization would be to use TINYINT fields (range 0 - 255; a byte in .NET). Using TINYINT instead of INT (and assuming a FILLFACTOR** of 100) would result in:

  • the table would effectively be 10 MB instead of 28 MB (1 byte for each of the 6 parameter fields + 4 bytes for the INT [ID] field = 10 bytes per row).
  • you would get 806 rows per 8060-byte data page instead of 287 with six INT fields
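
Those two figures fall out of simple arithmetic; a quick Python check (using the simplified view that a data page holds 8060 bytes of row data):

```python
ROWS = 1_000_000
PAGE_BYTES = 8060             # usable row bytes per data page (simplified)

tinyint_row = 6 * 1 + 4       # six TINYINT params + 4-byte INT key = 10 bytes
int_row     = 6 * 4 + 4       # six INT params + 4-byte INT key = 28 bytes

print(ROWS * tinyint_row // 1_000_000, "MB table with TINYINT")  # 10 MB
print(ROWS * int_row // 1_000_000, "MB table with INT")          # 28 MB
print(PAGE_BYTES // tinyint_row, "rows/page with TINYINT")       # 806
print(PAGE_BYTES // int_row, "rows/page with INT")               # 287
```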

The main efficiencies gained here are:

  • quicker initial lookup time for 2244 records. How so? With 1 million rows, 287 rows per 8k data page equates to 3484 pages, while 806 rows per 8k data page equates to only 1240 pages. Once a page is read to get the first row's values, it should stay in the buffer pool (i.e. memory) for subsequent lookups (i.e. the other rows on that same data page). More rows per page means fewer pages to read from disk: 2244 fewer pages in this case. But once all of the pages are in the buffer pool, the lookups should be the same speed whether there are 806 rows or 287 rows per page. So only 2244 of the 1 million lookup operations are slightly faster. Certainly not bad since there was no cost associated with this optimization, but I just wanted to put it into perspective.

  • less memory is required to hold the entire table in the buffer pool, making it more likely to remain in memory.
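
The page math in the first bullet can be sanity-checked in a few lines of Python (integer division here matches the figures above; the final partial page adds one to each count but cancels out of the difference):

```python
ROWS = 1_000_000

pages_int     = ROWS // 287   # six INT parameter columns
pages_tinyint = ROWS // 806   # six TINYINT parameter columns

print(pages_int)                  # 3484 pages
print(pages_tinyint)              # 1240 pages
print(pages_int - pages_tinyint)  # 2244 fewer pages to read
```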

Lastly, it is definitely possible to squeeze this data into an even smaller representation. But at this point it is 10 bytes per row with no complication added to the processing / coding and all of the values are human readable. There isn't a lot of room to go below 10 bytes per row, but even if it got down to 7 bytes per row (INT IDENTITY + 3 bytes for Parameters, which should be the minimum required to handle these ranges), I am not sure the added complexity would be worth it (i.e. diseconomies of scale).
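
For what it's worth, if you did want the packed representation anyway, the bit work would live in the application rather than the database. A hypothetical Python sketch of the question's 4-bits-per-parameter layout (the function names are mine, purely illustrative):

```python
# Hypothetical sketch of the 4-bits-per-parameter packing from the question.
# Function names are illustrative; nothing here is SQL Server specific.

def pack_params(params):
    """Pack six values (each 0-15) into one int, 4 bits per parameter."""
    value = 0
    for i, p in enumerate(params):
        if not 0 <= p <= 15:
            raise ValueError("each parameter must fit in 4 bits")
        value |= p << (4 * i)
    return value

def unpack_params(value, count=6):
    """Extract `count` 4-bit parameters from a packed int."""
    return [(value >> (4 * i)) & 0xF for i in range(count)]

packed = pack_params([3, 7, 0, 9, 1, 5])
assert unpack_params(packed) == [3, 7, 0, 9, 1, 5]
```

Even so, as noted above, the savings over six TINYINT columns are marginal, and the stored values stop being human-readable in the table.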


CREATE TABLE dbo.Experiment
(
  ExperimentID INT NOT NULL IDENTITY(1, 1),
  Parameter1  TINYINT NOT NULL,
  Parameter2  TINYINT NOT NULL,
  Parameter3  TINYINT NOT NULL,
  Parameter4  TINYINT NOT NULL,
  Parameter5  TINYINT NOT NULL,
  Parameter6  TINYINT NOT NULL,
  CONSTRAINT [PK_Experiment] PRIMARY KEY (ExperimentID) WITH (FILLFACTOR = 100) ON [Tables]
)
ON [Tables];

And you can populate it using the following (which assumes two parameters of 0 - 4, two parameters of 0 - 7, and two parameters of 0 - 9 ; based on comments on the Question):

;WITH Five(num) AS
(
  SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
), Eight(num) AS
(
  SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
  UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7
), Ten(num) AS
(
  SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
  UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
  UNION ALL SELECT 9
)
--INSERT INTO dbo.Experiment
--     (Parameter1, Parameter2, Parameter3, Parameter4, Parameter5, Parameter6)
SELECT f1.[num] AS [Parameter1],
       f2.[num] AS [Parameter2],
       e1.[num] AS [Parameter3],
       e2.[num] AS [Parameter4],
       t1.[num] AS [Parameter5],
       t2.[num] AS [Parameter6]
FROM   Five f1
CROSS JOIN Five f2
CROSS JOIN Eight e1
CROSS JOIN Eight e2
CROSS JOIN Ten t1
CROSS JOIN Ten t2;

The query above only generates 160k permutations, not 1 million, again based on comments on the Question. But it should be fairly easy to adjust it for different ranges of values for any combination of the six parameters.
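
If it's ever more convenient to generate the permutations in the application and bulk-insert them instead, the same cross product is a few lines in, say, Python (a sketch under the same assumed ranges):

```python
from itertools import product

# Same ranges as the CTEs above: two parameters of 0-4,
# two of 0-7, and two of 0-9.
permutations = list(product(range(5), range(5),
                            range(8), range(8),
                            range(10), range(10)))

assert len(permutations) == 5 * 5 * 8 * 8 * 10 * 10  # 160,000 rows
```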


** FILLFACTOR: Rows are stored on data pages that hold roughly 8060 bytes of row data. Smaller rows = more rows that fit into that fixed-size container. FILLFACTOR is how much space to reserve (when creating / rebuilding an index) for new rows, or for updates that take up more space due to variable-length fields requiring more bytes for the new version of the row. A value of 100 means that there is no reserved space: 100% of the 8060 bytes should be taken up by current rows. In this scenario there are only fixed-length fields and no out-of-sequence inserts, so a FILLFACTOR of 100 is appropriate.