SQL Server – Do BIT columns offer any performance advantages for CCIs?

columnstore sql-server sql-server-2016

Do BIT columns offer any performance advantages when used in Clustered Columnstore Indexes? I am interested in any performance benefits gained from defining a column in a CCI as BIT instead of BIGINT, for example. I'm working with SQL Server 2016.

I have a very limited understanding of how CCI compression works, but based on what I've read and some testing it seems like the data type (limited to exact numerics that store whole numbers) really shouldn't matter when it comes to columnstore compression. For example, if I insert 10 full rowgroups into tables with BIGINT columns as opposed to BIT columns I don't see a size difference between the compressed rowgroups. Here is the source data for one test:

DROP TABLE IF EXISTS dbo.CCI_BIT_TEST_SOURCE;

CREATE TABLE dbo.CCI_BIT_TEST_SOURCE (
    ID1 BIGINT NOT NULL,
    ID2 BIGINT NOT NULL,
    ID_BIT BIT NOT NULL,
    ID_BIGINT BIGINT NOT NULL,
    INDEX CCI__CCI_BIT_TEST_SOURCE CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_BIT_TEST_SOURCE WITH (TABLOCK)
SELECT
  t.RN
, t.RN
, t.RN % 2
, t.RN % 2
FROM
(
    SELECT TOP (10485760) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
    CROSS JOIN master..spt_values t3
) t
OPTION (MAXDOP 1);

Inserting into a CCI with 8 BIT columns took an average of 18729 ms of CPU time. The table has 56960 KB of reserved space:

DROP TABLE IF EXISTS dbo.CCI_BIT;

CREATE TABLE dbo.CCI_BIT (
    ID1 BIGINT NOT NULL,
    ID2 BIGINT NOT NULL,
    ID_BOOL_1 BIT NOT NULL,
    ID_BOOL_2 BIT NOT NULL,
    ID_BOOL_3 BIT NOT NULL,
    ID_BOOL_4 BIT NOT NULL,
    ID_BOOL_5 BIT NOT NULL,
    ID_BOOL_6 BIT NOT NULL,
    ID_BOOL_7 BIT NOT NULL,
    ID_BOOL_8 BIT NOT NULL,
    INDEX CCI__CCI_BIT CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_BIT WITH (TABLOCK)
SELECT
  ID1
, ID2
, ID_BIT
, ID_BIT
, ID_BIT
, ID_BIT
, ID_BIT
, ID_BIT
, ID_BIT
, ID_BIT
FROM dbo.CCI_BIT_TEST_SOURCE
OPTION (MAXDOP 1);

Inserting into a CCI with 8 BIGINT columns took an average of 18531 ms of CPU time. The table has 56960 KB of reserved space, the same as before:

DROP TABLE IF EXISTS dbo.CCI_NO_BIT;

CREATE TABLE dbo.CCI_NO_BIT (
    ID1 BIGINT NOT NULL,
    ID2 BIGINT NOT NULL,
    ID_BOOL_1 BIGINT NOT NULL,
    ID_BOOL_2 BIGINT NOT NULL,
    ID_BOOL_3 BIGINT NOT NULL,
    ID_BOOL_4 BIGINT NOT NULL,
    ID_BOOL_5 BIGINT NOT NULL,
    ID_BOOL_6 BIGINT NOT NULL,
    ID_BOOL_7 BIGINT NOT NULL,
    ID_BOOL_8 BIGINT NOT NULL,
    INDEX CCI__CCI_NO_BIT CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.CCI_NO_BIT WITH (TABLOCK)
SELECT
  ID1
, ID2
, ID_BIGINT
, ID_BIGINT
, ID_BIGINT
, ID_BIGINT
, ID_BIGINT
, ID_BIGINT
, ID_BIGINT
, ID_BIGINT
FROM dbo.CCI_BIT_TEST_SOURCE
OPTION (MAXDOP 1);

We can see this in the columnstore DMVs as well:

(screenshot of the columnstore DMV output, showing identical compressed rowgroup sizes for both tables)
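Output like the screenshot's can be reproduced with a query against the rowgroup physical stats DMV; this is a sketch that assumes the two test tables above exist in the current database:

```sql
-- Compare rowgroup sizes for the BIT and BIGINT test tables.
-- After columnstore compression, size_in_bytes comes out the same for both.
SELECT
    OBJECT_NAME(object_id) AS table_name
  , row_group_id
  , state_desc           -- COMPRESSED, OPEN (delta store), etc.
  , total_rows
  , size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id IN (OBJECT_ID('dbo.CCI_BIT'), OBJECT_ID('dbo.CCI_NO_BIT'))
ORDER BY table_name, row_group_id;
```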

There are some advantages to using BIT columns in CCIs. For example, data loaded into delta stores will take up less space with BIT columns because delta stores are basically uncompressed heaps. In query plans, the formula for the estimated data size is based on the data types of the columns as opposed to the size of the table on disk. The table with BIT columns has a total data size of 250 MB and the table with BIGINT columns has a total data size of 880 MB. In some cases the estimated size of 250 MB could lead to a better plan.
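The delta store point can be checked directly. As a sketch (the small batch size and the missing TABLOCK hint are deliberate, so the rows land in an open delta rowgroup instead of being bulk-loaded into a compressed one):

```sql
-- A small insert without TABLOCK goes to a delta store, which is stored as an
-- uncompressed heap; there the BIT columns really are narrower than BIGINT.
INSERT INTO dbo.CCI_BIT
SELECT TOP (1000)
    ID1, ID2, ID_BIT, ID_BIT, ID_BIT, ID_BIT, ID_BIT, ID_BIT, ID_BIT, ID_BIT
FROM dbo.CCI_BIT_TEST_SOURCE;

-- OPEN rowgroups are delta stores; COMPRESSED rowgroups are columnstore segments.
SELECT row_group_id, state_desc, total_rows, size_in_bytes
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.CCI_BIT');
```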

Are there any other performance benefits of BIT columns for CCIs? Or does the data type not really matter as long as you're using an exact numeric that stores whole numbers (BIT, TINYINT, SMALLINT, INT, or BIGINT)?

Best Answer

First, we are not comparing like data types.

Bit is defined as:

An integer data type that can take a value of 1, 0, or NULL.

(from the bit (Transact-SQL) documentation)

BIGINT, by contrast, is a large-range integer that always consumes 8 bytes of storage.

That also means that with separate BIT columns, SQL Server can maintain statistics on each individual flag, whereas packing them all into one BIGINT leaves you with only a single statistics object.

As you rightly noted, storage for BIT columns is optimized:

The SQL Server Database Engine optimizes storage of bit columns. If there are 8 or less bit columns in a table, the columns are stored as 1 byte. If there are from 9 up to 16 bit columns, the columns are stored as 2 bytes, and so on.

(from the int, bigint, smallint, and tinyint (Transact-SQL) documentation)

There must be a reason you are even considering taking at least 8 to 10 binary values and packing them into a single numeric column.

After all, why not just use INT and save half the space? Last I checked, INT's maximum of 2,147,483,647 is 10 digits in just 4 bytes, and BIGINT holds about 19 digits in 8 bytes, which is technically less space than separate BIT columns if you split the value back out into columns.

But this loses sight of what your data is. How well does a packed value answer questions about what it represents? 10010 is just a number greater than ten thousand, and yet in binary it actually represents something. If "saving" space forces a transformation before you can even use the data, is it still efficient?
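To illustrate that transformation cost, here is a hypothetical example of eight flags packed into one TINYINT; the value and flag positions below are made up, and any query touching a single flag has to do this bitwise unpacking first:

```sql
-- Hypothetical packed-flags value: binary 00010010, i.e. bits 1 and 4 set (0-based).
DECLARE @flags TINYINT = 18;

-- Each flag must be masked out with bitwise AND before it can be
-- filtered on, joined on, or even read.
SELECT
    CASE WHEN @flags & 2  > 0 THEN 1 ELSE 0 END AS flag_1   -- 2  = 2^1
  , CASE WHEN @flags & 16 > 0 THEN 1 ELSE 0 END AS flag_4;  -- 16 = 2^4
```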

But please, do not conflate BIT with a general numeric data type like TINYINT or BIGINT. They serve two different purposes.