Sql-server – Performance Implications of encrypted data in a Clustered Columnstore Index

columnstoreencryptionsql serversql-server-2016

I am using SQL Server 2016 and expecting some details on below points regarding performance point of view:

Is there any performance gain or additional benefit on creating a primary key – unique constraint on same table on which a clustered columnstore index is already created?
If my table contains encrypted columns (out of 10 columns, 9 columns are encrypted), is there any benefit in case of creating clustered columnstore index? I am using SYMMETRIC KEY for encryption.

Best Answer

It's best to avoid asking multiple, unrelated questions in a single question when possible. Your first question is very broad. In general, creating an index is always a trade-off and some queries will benefit from that index. The same is true for NCIs on columnstore tables. I'll focus on the second question instead.

As far as I can tell, encryption by SYMMETRIC KEY can be a very poor match for columnstore. Inserting encrypted data into a CCI means that you're inserting long, relatively unique strings into varbinary columns. CCIs don't do well with those data types and data distributions. You're likely to hit dictionary pressure which will limit the size of your rowgroups and you'll miss out on most of the features of columnstore, except for column elimination.

I found the code at How to use SQL Server Encryption with Symmetric Keys helpful in developing a simple example. I don't know what your data looks like, so I'll just insert unique integers into a table and insert the same unique integers with encryption into a different table. Definition of the key:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'you will never guess this';

CREATE CERTIFICATE MyFirstCertificate
WITH SUBJECT = 'no u';

CREATE SYMMETRIC KEY key_to_answering_217242 WITH
IDENTITY_VALUE = 'yeah nah',
ALGORITHM = AES_256,
KEY_SOURCE = 'closet full of jandals'
ENCRYPTION BY CERTIFICATE MyFirstCertificate;

T-SQL to create two tables and insert data into both of them:

DROP TABLE IF EXISTS dbo.CCI_NOT_ENCRYPTED;

CREATE TABLE dbo.CCI_NOT_ENCRYPTED (
ID BIGINT,
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_NOT_ENCRYPTED WITH (TABLOCK)
SELECT TOP (1048576) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

GO

USE master;
GO

OPEN SYMMETRIC KEY key_to_answering_217242
DECRYPTION BY CERTIFICATE MyFirstCertificate


DROP TABLE IF EXISTS [TEST].dbo.CCI_ENCRYPTED;

CREATE TABLE [TEST].dbo.CCI_ENCRYPTED (
ID VARBINARY(256),
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO [TEST].dbo.CCI_ENCRYPTED WITH (TABLOCK)
SELECT EncryptByKey(Key_GUID('key_to_answering_217242'), CAST(RN AS VARCHAR(7)))
FROM (
    SELECT TOP (1048576) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);

CLOSE SYMMETRIC KEY key_to_answering_217242;

GO

There's a clear difference in total table size. The table with encrypted data is over 20X bigger:

Looking at sys.dm_db_column_store_row_group_physical_stats, we can see that the table with encrypted data had its rowgroup size limited by dictionary pressure:

Query performance will of course be greatly impacted as well. The following query takes approximately 0 ms of CPU time:

SELECT COUNT(*)
FROM dbo.CCI_NOT_ENCRYPTED
WHERE ID = 1;

If I run that same query on the encrypted table:

USE master;
GO

OPEN SYMMETRIC KEY key_to_answering_217242
DECRYPTION BY CERTIFICATE MyFirstCertificate

SELECT COUNT(*)
FROM [TEST].dbo.CCI_ENCRYPTED
WHERE CAST(CAST(DecryptByKey(ID) AS VARCHAR(7)) AS BIGINT) = 1;

CLOSE SYMMETRIC KEY key_to_answering_217242;

GO

It takes over 600 ms of CPU time. With all of that said, there appears to be a large overhead in general in working with encrypted data. I cannot say if a columnstore will offer any benefit for your exact data and scenario, only that you definitely won't see the usual benefits of columnstore.

Related Solutions

Sql-server – Clustered columnstore index performance SQL Server 2014

If you can post the specific data and queries you are using, that is probably the only way we can help answer the question in the context of your specific case. You can use a script that generates anonymous data in roughly the same scale as your real example.

However, I went ahead and created a similar type of script myself. For the sake of simplicity, I am using fewer than 225 columns. But I am using the same number of rows and random data (which is unfavorable for columnstore) and I saw results that are much different than yours. So my initial thought is that yes, you do have some sort of problem with either your configuration or your test queries.

A few of the key takeaways:

Columnstore has dramatically faster performance than rowstore for simple aggregations across all rows in a column
If loaded carefully, columnstore can perform surprisingly well for singleton seeks. There is an I/O hit, but with a warm cache performance was very good. But not as good as rowstore for this use case, of course.
If you need to be able to perform both singleton seeks and large aggregation queries, you might consider using a non-clustered columnstore index on top of a standard b-tree table.
You mention that you have 225 columns, but an average row is just 181 bytes. This seems a little unusual; is your table mostly BIT columns? That might be something to look into further. I did see very good compression ratios on a simple BIT column columnstore (over 99%), but it may be the case that much of that is due to the absence of row overhead and this advantage would disappear with many BIT columns on a single row.
If you want to learn (a lot) more about columnstore, Niko's 66-part (and counting) blog series has been the most valuable reference that I've come across.

And now on to the details:

Create rowstore data set

Nothing too exciting here; we create 40MM rows of pseudo-random data.

SELECT @@VERSION
--Microsoft SQL Server 2014 - 12.0.4213.0 (X64) 
--  Jun  9 2015 12:06:16 
--  Copyright (c) Microsoft Corporation
--  Developer Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)
GO

-- Create a rowstore table with 40MM rows of pseudorandom data
;WITH E1(N) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 
    UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
)
, E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b)
, E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)
, E8(N) AS (SELECT 1 FROM E4 a CROSS JOIN E4 b)
SELECT TOP 40000000 ISNULL(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 0) AS id
    , ISNULL((ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) % 5) + 1, 0) AS col1
    , ISNULL(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) * RAND(), 0) AS col2
    , ISNULL(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) * RAND(), 0) AS col3
    , ISNULL(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) * RAND(), 0) AS col4
    , ISNULL(ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) * RAND(), 0) AS col5
INTO dbo.test_row
FROM E8
GO
ALTER TABLE test_row
ADD CONSTRAINT PK_test_row PRIMARY KEY (id)
GO

Create columnstore data set

Let's create the same data set as a CLUSTERED COLUMNSTORE, using the techniques described to load data for better segment elimination on Niko's blog.

-- Create a columnstore table with the same 40MM rows
-- The data is first ordered by id and then a single thread
-- use to build the columnstore for optimal segment elimination
SELECT *
INTO dbo.test_column
FROM dbo.test_row
GO
CREATE CLUSTERED INDEX cs_test_column
ON dbo.test_column (id)
GO
CREATE CLUSTERED COLUMNSTORE INDEX cs_test_column 
ON dbo.test_column WITH (DROP_EXISTING = ON, MAXDOP = 1)
GO

Size comparison

Because we are loading random data, columnstore achieves only a modest reduction in table size. If the data was not as random, the columnstore compression would dramatically decrease the size of the columnstore index. This particular test case is actually quite unfavorable for columnstore, but it's still nice to see that we get a little bit of compression.

-- Check the sizes of the two tables
SELECT t.name, ps.row_count, (ps.reserved_page_count*8.0) / (1024.0) AS sizeMb
FROM sys.tables t WITH (NOLOCK)
JOIN sys.dm_db_partition_stats ps WITH (NOLOCK)
    ON ps.object_id = t.object_id
WHERE t.name IN ('test_row','test_column')
--name          row_count   sizeMb
--test_row      40000000    2060.6328125
--test_column   40000000    1352.2734375
GO

Performance comparison

In the following two test cases, I try two very different use cases.

The first is the singleton seek mentioned in your question. As commenters point out, this is not at all the use case for columnstore. Because an entire segment has to be read for each column, we see a much greater number of reads and slower performance from a cold cache (0ms rowstore vs. 273ms columnstore). However, columnstore is down to 2ms with a warm cache; that's actually quite an impressive result given that there is no b-tree to seek into!

In the second test, we compute an aggregate for two columns across all rows. This is more along the lines of what columnstore is designed for, and we can see that columnstore has fewer reads (due to compression and not needing to access all columns) and dramatically faster performance (primarily due to batch mode execution). From a cold cache, columnstore executes in 4s vs 15s for rowstore. With a warm cache, the difference is a full order of magnitude at 282ms vs 2.8s.

SET STATISTICS TIME, IO ON
GO

-- Clear cache; don't do this in production!
-- I ran this statement between each set of trials to get a fresh read
--CHECKPOINT
--DBCC DROPCLEANBUFFERS
GO

-- Trial 1: CPU time = 0 ms,  elapsed time = 0 ms.
    -- logical reads 4, physical reads 4, read-ahead reads 0
-- Trial 2: CPU time = 0 ms,  elapsed time = 0 ms
    -- logical reads 4, physical reads 0, read-ahead reads 0
SELECT *
FROM dbo.test_row
WHERE id = 12345678
GO 2
-- Trial 1: CPU time = 15 ms,  elapsed time = 273 ms..
    -- lob logical reads 9101, lob physical reads 1, lob read-ahead reads 25756
-- Trial 2: CPU time = 0 ms,  elapsed time = 2 ms.  
    -- lob logical reads 9101, lob physical reads 0, lob read-ahead reads 0
SELECT *
FROM dbo.test_column
WHERE id = 12345678
GO 2

-- Trial 1: CPU time = 8441 ms,  elapsed time = 14985 ms.
    -- logical reads 264733, physical reads 3, read-ahead reads 263720
-- Trial 2: CPU time = 9733 ms,  elapsed time = 2776 ms.
    -- logical reads 264883, physical reads 0, read-ahead reads 0
SELECT AVG(id), SUM(col3)
FROM dbo.test_row
GO 2
-- Trial 1: CPU time = 1233 ms,  elapsed time = 3992 ms.
    -- lob logical reads 207778, lob physical reads 1, lob read-ahead reads 341196
-- Trial 2: CPU time = 1030 ms,  elapsed time = 282 ms. 
    -- lob logical reads 207778, lob physical reads 0, lob read-ahead reads 0
SELECT AVG(id), SUM(col3)
FROM dbo.test_column
GO 2

Sql-server – Decrypting Previously Encrypted Data Returns NULL

This sounds like a case where the key_guid in the encrypted data does not match the key_guid for the symmetric key used to encrypt it.

When data is encrypted using a symmetric key and is then saved in a varbinary column, the first 16 bytes are the Key_guid of the symmetric key used to encrypt the data, which explains how DECRYPTBYKEYAUTOCERT can automatically locate the correct symmetric key.

Can you run the script below and verify that this is the case? And, did you encrypt and decrypt the data using the same database?

select 
    name,cast(key_guid as varbinary(max))
from sys.symmetric_keys
where
    name ='key_dbKeys'

select distinct table.data from table
where 
    table.data is not null

After attempting to recreate the problem, there could be a problem opening the symmetric key prior to inserting the data or, after reviewing the documentation for DECRYPTBYKEYAUTOCERT, you need to cast the return, which is varbinary, to varchar. I'm also posting the working script that is based on what you have developed. I had to use a different password.

declare  @clean_data varchar(100)

set  @clean_data='Hello World'

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'P@ssword!';

CREATE CERTIFICATE cert_dbKeys
   ENCRYPTION BY PASSWORD = 'P@ssword!'
   WITH SUBJECT = 'Database encryption key', 
   EXPIRY_DATE = '20201031';

CREATE SYMMETRIC KEY key_dbKeys 
   WITH ALGORITHM = AES_128
   ENCRYPTION BY CERTIFICATE cert_dbKeys;
create table EncryptedData
( secretmessage varbinary(max))

OPEN SYMMETRIC KEY key_dbKeys
DECRYPTION BY CERTIFICATE cert_dbKeys WITH PASSWORD = 'P@ssword!'

insert into EncryptedData (secretmessage) values(EncryptByKey(Key_GUID('key_dbKeys'), @clean_data))

CLOSE SYMMETRIC KEY key_dbKeys;

SELECT cast(DecryptByKeyAutoCert(cert_id('cert_dbKeys'), N'P@ssword!', secretmessage) as varchar(100)) FROM EncryptedData

Best Answer

Related Solutions

Sql-server – Clustered columnstore index performance SQL Server 2014

Sql-server – Decrypting Previously Encrypted Data Returns NULL

Related Question