Sql-server – Clustered columnstore index space usage

columnstoresql serversql-server-2017

I have a simple table with a clustered columnstore index:

ID INT NOT NULL,
Hash BINARY(20) NOT NULL

This table has some billion rows and according to sp_spaceused, sys.allocation_units and SSMS reports, its size is about 25GB.

My problem is I can't account for all of this space. Querying sys.column_store_row_groups and sys.column_store_segments only gives me about 7,8GB. The index uses no dictionary: primary_dictionary_id and secondary_dictionary_id are -1 for all segments. Querying sys.column_store_dictionaries returns no rows at all.

The tuple mover has done its job and all row groups are in the compressed state. I already tried an ALTER INDEX REORGANIZE just in case.

My only idea for the size difference is some dictionary-like stuff I'm not accounting for. Any ideas on what I might be missing?

I'm running SQL Server 2017 (RTM-CU4).


EDIT 1:

This is the output from sp_spaceused for the table in question:

+--------+------------+-------------+-------------+------------+----------+
|  name  |    rows    |  reserved   |    data     | index_size |  unused  |
+--------+------------+-------------+-------------+------------+----------+
| IdsBin | 1073741824 | 25028112 KB | 25007432 KB | 16 KB      | 20664 KB |
+--------+------------+-------------+-------------+------------+----------+

EDIT 2:

This is a repro script with 1 million rows. It runs in about 1 minute on my machine. warning: it drops and recreates a new database

USE master;
GO
DROP DATABASE IF EXISTS MyDbWeirdTest;
GO
CREATE DATABASE MyDbWeirdTest;
GO
USE MyDbWeirdTest;
GO

CREATE TABLE IdsBin (
    ID INT NOT NULL,
    Hash BINARY(20) NOT NULL
);
CREATE CLUSTERED COLUMNSTORE INDEX ix1 ON IdsBin
GO

CREATE TYPE tBin AS TABLE (
    ID INT,
    Hash BINARY(20)
);
GO

CREATE OR ALTER PROCEDURE pBin (
    @ids AS dbo.tBin READONLY
)
AS
BEGIN
    SET NOCOUNT ON;

    INSERT dbo.IdsBin
    SELECT ID, Hash FROM @ids;
END;
GO

SET NOCOUNT ON;
DECLARE @i INT = 1, @t INT = 1;
DECLARE @tvp dbo.tBin;

WHILE @t <= 1000000
BEGIN
    DELETE @tvp;
    BEGIN TRAN;
    WHILE @i <= 1000
    BEGIN
        INSERT @tvp VALUES (@t, HASHBYTES('SHA1', CAST(@t AS BINARY(4))));
        SET @i = @i + 1;
        SET @t = @t + 1;
    END;

    EXEC pBin @tvp;
    COMMIT;
    SET @i = 1;
END;
GO

ALTER INDEX ix1 on IdsBin REBUILD;
GO

For this repro, sp_spaceused shows:

+--------+----------------------+----------+----------+------------+--------+
|  name  |         rows         | reserved |   data   | index_size | unused |
+--------+----------------------+----------+----------+------------+--------+
| IdsBin | 1000000              | 22728 KB | 22640 KB | 0 KB       | 88 KB  |
+--------+----------------------+----------+----------+------------+--------+

sys.column_store_row_groups:

+-----------+----------+------------------+--------------+---------------------+-------+-------------------+------------+--------------+---------------+
| object_id | index_id | partition_number | row_group_id | delta_store_hobt_id | state | state_description | total_rows | deleted_rows | size_in_bytes |
+-----------+----------+------------------+--------------+---------------------+-------+-------------------+------------+--------------+---------------+
| 901578250 |        1 |                1 |            0 | NULL                |     3 | COMPRESSED        |    1000000 |            0 |       5896938 |
+-----------+----------+------------------+--------------+---------------------+-------+-------------------+------------+--------------+---------------+

So sp_spaceused gives me about 22MB and sys.allocation_units (not shown) agrees. But no columnstore DMVs seems to agree on that number and they say the index is less than 6MB in size.

Best Answer

sys.column_store_segments and sys.column_store_row_groups store some of the metadata information about the ColumnStore data, but I believe that ends up representing the compressed size. There are LOB structures that are allocated as well, and you can see the uncompressed size in the allocation unit / partition DMVs (and if you get to the pages somehow with DBCC PAGE, I bet you would see that they're relatively empty). In other words, sys.column_store_row_groups shows you how much data is stored on those pages, but doesn't add in the free space on those pages (which still take up space in the data file and in memory, like a fragmented index or an index with a really low fill factor).

I ran your repro and here is what I saw:

SELECT 
  a.[type_desc], 
  p.[rows],
  a.total_pages, reserved_kb = a.total_pages * 8, 
  a.used_pages,  data_kb     = a.used_pages  * 8
FROM sys.allocation_units AS a
INNER JOIN sys.partitions AS p 
   ON a.container_id = p.[partition_id]
WHERE p.[object_id] = OBJECT_ID(N'dbo.IdsBin');

Results:

AU/partition query

Other than unused, which I didn't bother calculating, the results match up with sp_spaceused:

sp_spaceused

And you can confirm these numbers in sys.dm_db_partition_stats as well:

SELECT 
  lob_reserved_page_count, reserved_kb = lob_reserved_page_count * 8,
  lob_used_page_count,     data_kb     = lob_used_page_count     * 8
FROM sys.dm_db_partition_stats 
WHERE [object_id] = OBJECT_ID(N'dbo.IdsBin');

Results:

dm_db_partition_stats

You mentioned this but I just wanted to call out explicitly that sp_spaceused and allocation_units reflect the page count, regardless of how full or empty any page might be. The row groups DMV is reflecting just the actual data. The documentation states (emphasis mine):

Size in bytes of all the data in this row group (not including metadata or shared dictionaries)

Whereas sys.dm_db_partition_stats, for example, explicitly states pages throughout, not data, although I would argue that they should specify that each LOB is an 8K page here:

Total number of LOBs used to store and manage columnstore index in the partition.

Which number you want to trust, well, that's up to you.

As an aside, Niko Neugebauer talked about the fact that only the compressed size of the dictionaries is exposed in the columnstore DMVs here, and raised a feedback item about it. Seems to me there is other information that could be exposed in the columnstore DMVs too.