SQL Server – Store Permutations of Simulation Parameters Efficiently

sql server

I'm going to run a bunch of simulations in the near future using an application I wrote, and I'm having trouble designing my database around this.

Here's how it works. Each simulation has 6 parameters that can vary, and each parameter has ~10 options, so there are about 1 million permutations to somehow store in the DB. The simplest way I can think of is to have a column for each parameter, plus an additional column for my primary key (ExperimentID). I could then access the parameters for a given experiment directly by doing a simple

SELECT * FROM Parameters WHERE ExperimentID = (givenID)

Each parameter will be stored in the DB as an int, which will be interpreted by my application as the necessary parameters at runtime.

A more complex, but I think faster, way to do it is to take advantage of the speed of binary shifting. Instead of storing the parameters directly, I'd store the information about each parameter in 4 bits of an unsigned int. I don't actually know how ints are stored (binary-wise) in the DB, so that's the major hold-up.

Any advice on what to do in this situation? Also, I don't yet have these columns, so if I decide to go with the first option, how would I fill a table with all of the permutations of a set of parameters?

Best Answer

Why do you think that obscuring real data into a bit-masked field will be faster? You are only talking about 1 million rows at 28 bytes per row (six INT parameters plus the INT key). That really isn't much, especially since you are only doing singleton / seek operations.

A better optimization would be to use TINYINT fields (range 0 - 255; a byte in .NET). Using TINYINT instead of INT (and assuming a FILLFACTOR** of 100) would result in:

  • the table would effectively be 10 MB instead of 28 MB (1 byte for each of the 6 parameter fields + 4 bytes for the INT [ID] field = 10 bytes per row).
  • you would get 806 rows per 8060-byte data page instead of 287 with six INT fields
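
Those two figures fall out of simple arithmetic; a quick Python check (using the simplified view that a data page holds 8060 bytes of row data):

```python
ROWS = 1_000_000
PAGE_BYTES = 8060             # usable row bytes per data page (simplified)

tinyint_row = 6 * 1 + 4       # six TINYINT params + 4-byte INT key = 10 bytes
int_row     = 6 * 4 + 4       # six INT params + 4-byte INT key = 28 bytes

print(ROWS * tinyint_row // 1_000_000, "MB table with TINYINT")  # 10 MB
print(ROWS * int_row // 1_000_000, "MB table with INT")          # 28 MB
print(PAGE_BYTES // tinyint_row, "rows/page with TINYINT")       # 806
print(PAGE_BYTES // int_row, "rows/page with INT")               # 287
```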

The main efficiencies gained here are:

  • quicker initial lookup time for 2244 records. How so? With 1 million rows, 287 rows per 8k data page equates to 3484 pages, while 806 rows per 8k data page equates to only 1240 pages. Once a page is read to get the first row's values, it should stay in the buffer pool (i.e. memory) for subsequent lookups (i.e. the other rows on that same data page). More rows per page means fewer pages to read from disk: 2244 fewer pages in this case. But once all of the pages are in the buffer pool, the lookups should be the same speed whether there are 806 rows or 287 rows per page. So only 2244 of the 1 million lookup operations are slightly faster. Certainly not bad since there was no cost associated with this optimization, but I just wanted to put it into perspective.

  • less memory is required to hold the entire table in the buffer pool, making it more likely to remain in memory.
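
The page math in the first bullet can be sanity-checked in a few lines of Python (integer division here matches the figures above; the final partial page adds one to each count but cancels out of the difference):

```python
ROWS = 1_000_000

pages_int     = ROWS // 287   # six INT parameter columns
pages_tinyint = ROWS // 806   # six TINYINT parameter columns

print(pages_int)                  # 3484 pages
print(pages_tinyint)              # 1240 pages
print(pages_int - pages_tinyint)  # 2244 fewer pages to read
```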

Lastly, it is definitely possible to squeeze this data into an even smaller representation. But at this point it is 10 bytes per row with no complication added to the processing / coding and all of the values are human readable. There isn't a lot of room to go below 10 bytes per row, but even if it got down to 7 bytes per row (INT IDENTITY + 3 bytes for Parameters, which should be the minimum required to handle these ranges), I am not sure the added complexity would be worth it (i.e. diseconomies of scale).
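
For what it's worth, if you did want the packed representation anyway, the bit work would live in the application rather than the database. A hypothetical Python sketch of the question's 4-bits-per-parameter layout (the function names are mine, purely illustrative):

```python
# Hypothetical sketch of the 4-bits-per-parameter packing from the question.
# Function names are illustrative; nothing here is SQL Server specific.

def pack_params(params):
    """Pack six values (each 0-15) into one int, 4 bits per parameter."""
    value = 0
    for i, p in enumerate(params):
        if not 0 <= p <= 15:
            raise ValueError("each parameter must fit in 4 bits")
        value |= p << (4 * i)
    return value

def unpack_params(value, count=6):
    """Extract `count` 4-bit parameters from a packed int."""
    return [(value >> (4 * i)) & 0xF for i in range(count)]

packed = pack_params([3, 7, 0, 9, 1, 5])
assert unpack_params(packed) == [3, 7, 0, 9, 1, 5]
```

Even so, as noted above, the savings over six TINYINT columns are marginal, and the stored values stop being human-readable in the table.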


CREATE TABLE dbo.Experiment
(
  ExperimentID INT NOT NULL IDENTITY(1, 1),
  Parameter1  TINYINT NOT NULL,
  Parameter2  TINYINT NOT NULL,
  Parameter3  TINYINT NOT NULL,
  Parameter4  TINYINT NOT NULL,
  Parameter5  TINYINT NOT NULL,
  Parameter6  TINYINT NOT NULL,
  CONSTRAINT [PK_Experiment] PRIMARY KEY (ExperimentID) WITH (FILLFACTOR = 100) ON [Tables]
)
ON [Tables];

And you can populate it using the following (which assumes two parameters of 0 - 4, two parameters of 0 - 7, and two parameters of 0 - 9 ; based on comments on the Question):

;WITH Five(num) AS
(
  SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
), Eight(num) AS
(
  SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
  UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7
), Ten(num) AS
(
  SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
  UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
  UNION ALL SELECT 9
)
--INSERT INTO dbo.Experiment
--     (Parameter1, Parameter2, Parameter3, Parameter4, Parameter5, Parameter6)
SELECT f1.[num] AS [Parameter1],
       f2.[num] AS [Parameter2],
       e1.[num] AS [Parameter3],
       e2.[num] AS [Parameter4],
       t1.[num] AS [Parameter5],
       t2.[num] AS [Parameter6]
FROM   Five f1
CROSS JOIN Five f2
CROSS JOIN Eight e1
CROSS JOIN Eight e2
CROSS JOIN Ten t1
CROSS JOIN Ten t2;

The query above only generates 160k permutations, not 1 million, again based on comments on the Question. But it should be fairly easy to adjust it for different ranges of values for any combination of the six parameters.
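
If it's ever more convenient to generate the permutations in the application and bulk-insert them instead, the same cross product is a few lines in, say, Python (a sketch under the same assumed ranges):

```python
from itertools import product

# Same ranges as the CTEs above: two parameters of 0-4,
# two of 0-7, and two of 0-9.
permutations = list(product(range(5), range(5),
                            range(8), range(8),
                            range(10), range(10)))

assert len(permutations) == 5 * 5 * 8 * 8 * 10 * 10  # 160,000 rows
```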


** FILLFACTOR: Rows are stored on data pages that hold roughly 8060 bytes of row data. Smaller rows = more rows that fit into that fixed-size container. FILLFACTOR is how much space to reserve (when creating / rebuilding an index) for new rows, or for updates that take up more space due to variable-length fields requiring more bytes for the new version of the row. A value of 100 means that there is no reserved space: 100% of the 8060 bytes should be taken up by current rows. In this scenario there are only fixed-length fields and no out-of-sequence inserts, so a FILLFACTOR of 100 is appropriate.