Sql-server – Columnstore index for Key/Value Pair table on SQL Server

columnstorejsonperformancesql server

I need to build a database for storing assessments.

Each assessment can have an indefinite number of questions, and an indefinite number of responses.

Responses for each assessment can grow up to 500k.

Questions for each assessment can be from 10 to 200.

While the Assessments and AssessmentResponses tables will be designed with the "normal" relational table, Questions and Answers should be stored as Key/Value pair.

Assessments
|AssessmentID|Name|JsonSchema|

Questions
|QuestionID|AssessmentID|QuestionValue|QuestionType|

AssessmentResponses
|ResponseID|AssessmentID|RespondentName|Date|JsonResponse|

Answers
|ResponseID|QuestionID|AnswerValueText|AnswerValueDecimal|

As you can see I store also the assessment schema and the responses in JSON format as I found it useful to quickly visualize on a web UI. JsonResponse, in particular, contains all the answers given in the related response, like the following:

{
  "interviewDate":"2001/12/28",
  "city":"Mombasa",
  "phone":"123456789",
  "name":"Marco",
  "age":16
}

Typical queries can be:

  1. Extract all the answers (paged) of a particular AssesmentID.
  2. Calculate the average on all the answers of a particular
    AssessmentID of the age key.
  3. Extract all the responses where the key "age" has a value greater
    than 25 and the key "city" is equal to "New York"

Please note that for query No.2, the age values will be stored in the column AnswerValueDecimal, where I will store all the numeric values.

I'm wondering if the column store index will increase the performance in this structure?

Please note that I know that an ElasticSearch instance will help me a lot but for a budget issue we can't implement at this stage.

Some more details about data

Answers to the questions in the assessments are generally chosen from a limited number of choices. In example:

  • Are you married? > [YES,NO]
  • Type of building? > [Concrete,Skeleton,Eternite Sheets,Iron Sheets,Other]
  • City Area? > [North,South,East,West,center]

The queries can be built by the end-user depending on the information they need from a particular assessment. Giving the above example, they want to know how many Eternite buildings are in the North and West area of the city, a count in the end.

But for example there is also a question "How old are you?", they want to filter the previous query to know how many Eternite buildings are in the North and West area where there's a person with an age lower than 25 years…

USING JSON FIELD TO STORE DATA

I've tried using the new feature in SQL Server 2016/2017 to store JSON document inside an NVARCHAR(MAX) field. It seems to work pretty well but maybe only once the whole table is loaded in memory.

Here is the dbfiddle: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=2ce135dc37d72f9db951bbb4e2708baa

Best Answer

What better way than to test?

TL:DR up front. The way you structure your data makes it awkward when there are multiple conditions, as you have to check different rows for the same response. Columnstore works better simply for using batch mode in the self join of my test query.

I would expect you to gain far more from structuring your data in a different way than by looking at storage options.

Here's a setup with 5m rows, and an example query at the end looking for average age of people in the city center for a range of responses (assuming one particular survey). Dbfiddle link as well

DROP TABLE IF EXISTS dbo.AnswersRow

CREATE TABLE AnswersRow (
ResponseID INT NOT NULL,
QuestionID INT NOT NULL,
AnswerValueText VARCHAR(20) NULL,
AnswerValueDecimal DECIMAL(15,2) NULL,
CONSTRAINT PK_R_Q PRIMARY KEY CLUSTERED (ResponseID,QuestionID)
)

INSERT dbo.AnswersRow
SELECT TOP 5000000
FLOOR((-1+ROW_NUMBER() OVER(ORDER BY(SELECT 'Joe')))/5+1) AS ResponseID,
(-1+ROW_NUMBER() OVER(ORDER BY(SELECT 'Joe')))%5+1 AS QuestionID,
NULL AS AnswerValueText,
NULL AS AnswerValueDecimal
FROM master.dbo.spt_values a
CROSS JOIN master.dbo.spt_values b

--5 questions
--1) Married 0 or 1
--2) BuildingType
--3) CityArea
--4) Age
--5) Income

--randomize answers
UPDATE ar
SET AnswerValueText = CASE
WHEN QuestionID = 2 THEN (CASE x2.seed
                            WHEN 0 THEN 'Concrete'
                            WHEN 1 THEN 'Skeleton'
                            WHEN 2 THEN 'Eternite Sheets'
                            WHEN 3 THEN 'Iron Sheets'
                            WHEN 4 THEN 'Other'
                            END)
WHEN QuestionID = 3 THEN (CASE x3.seed
                            WHEN 0 THEN 'North'
                            WHEN 1 THEN 'South'
                            WHEN 2 THEN 'East'
                            WHEN 3 THEN 'West'
                            WHEN 4 THEN 'Center'
                            END)
ELSE AnswerValueText END,
ar.AnswerValueDecimal = CASE
WHEN QuestionID = 1 THEN (FLOOR(RAND(CONVERT(BINARY(8),NEWID()))*2))
WHEN QuestionID = 4 THEN (FLOOR(RAND(CONVERT(BINARY(8),NEWID()))*60+17))
WHEN QuestionID = 5 THEN (FLOOR(RAND(CONVERT(BINARY(8),NEWID()))*200000))
ELSE AnswerValueDecimal END
FROM dbo.AnswersRow ar
CROSS APPLY (SELECT FLOOR(RAND(CONVERT(BINARY(8),NEWID()))*5) AS seed) x2
CROSS APPLY (SELECT FLOOR(RAND(CONVERT(BINARY(8),NEWID()))*5) AS seed) x3

SELECT *
INTO dbo.AnswersCol
FROM dbo.AnswersRow

CREATE CLUSTERED COLUMNSTORE INDEX CX_Answers ON dbo.AnswersCol
CREATE NONCLUSTERED INDEX IX_Answer_text ON dbo.AnswersRow(AnswerValueText)
CREATE NONCLUSTERED INDEX IX_Answer_decimal ON dbo.AnswersRow(AnswerValueDecimal)
CREATE NONCLUSTERED INDEX IX_Answer_question ON dbo.AnswersRow(QuestionID) INCLUDE(AnswerValueText,AnswerValueDecimal)


--average age of particular survey, city center
SELECT AVG(AnswerValueDecimal), COUNT(*)
FROM dbo.AnswersRow ar1
WHERE 1=1
AND ar1.ResponseID >= 100000 AND ar1.ResponseID < 200000
AND ar1.QuestionID = 4
AND EXISTS (SELECT 1
            FROM dbo.AnswersRow ar2
            WHERE ar1.ResponseID = ar2.ResponseID
            AND ar2.QuestionID = 3
            AND ar2.AnswerValueText = 'Center'
            )


SELECT AVG(AnswerValueDecimal), COUNT(*)
FROM dbo.AnswersCol ar1
WHERE 1=1
AND ar1.ResponseID >= 100000 AND ar1.ResponseID < 200000
AND ar1.QuestionID = 4
AND EXISTS (SELECT 1
            FROM dbo.AnswersCol ar2
            WHERE ar1.ResponseID = ar2.ResponseID
            AND ar2.QuestionID = 3
            AND ar2.AnswerValueText = 'Center'
            )