Sql-server – storing as smallint, querying as int

computed-columnoptimizationsql serversql-server-2012t-sql

A multi-billion-row fact table in our database has 10 measures stored as int columns. The value ranges for some of these columns won't ever be above the +/-32K range of a smallint. To save I/O, we're investigating whether it's practical to store these columns as smallint instead of int.

But we're concerned about what problems might crop up from doing this, including:

  • SUM(SomeSmallInt) will frequently overflow, so many queries would have to be rewritten to cast these smallints to ints before aggregating, e.g. SUM(CAST(SomeSmallInt as int)).
  • This table is read in many places, so changing the data type of these columns might involve a lot of review, change, and testing.

So we're wondering if there's a lower-cost solution that would store as smallint but expose the colunms as ints to readers. Like this:

  • create smallint "storage" columns that the table's (only) writer will use, but otherwise no other client will care about.
  • to support existing readers, create int-typed computed columns that are the same names as the original int columns.

What are the pros and cons of this approach? What can go wrong? Is there a better way to reduce storage & I/O without causing problems with overflow and requiring existing readers to be rewritten?

Best Answer

To prove the benefit of data compression for this question, I quickly built the following test:

USE tempdb;

CREATE TABLE dbo.IntTest
(
    IntTest_ID INT NOT NULL
        CONSTRAINT PK_IntTest
        PRIMARY KEY CLUSTERED
)
WITH (DATA_COMPRESSION = PAGE);

CREATE TABLE dbo.SmallIntTest
(
    SmallIntTest_ID SMALLINT NOT NULL
        CONSTRAINT PK_SmallIntTest
        PRIMARY KEY CLUSTERED
)
WITH (DATA_COMPRESSION = NONE);
GO

INSERT INTO dbo.IntTest (IntTest_ID)
SELECT TOP(65535) 
    CONVERT(SMALLINT, ROW_NUMBER() OVER (ORDER BY o.object_id, o1.object_id, o2.object_id) - 32768)
FROM sys.objects o
    , sys.objects o1
    , sys.objects o2
ORDER BY o.object_id, o1.object_id, o2.object_id;

INSERT INTO dbo.SmallIntTest (SmallIntTest_ID)
SELECT TOP(65535) 
    CONVERT(SMALLINT, ROW_NUMBER() OVER (ORDER BY o.object_id, o1.object_id, o2.object_id) - 32768)
FROM sys.objects o
    , sys.objects o1
    , sys.objects o2
ORDER BY o.object_id, o1.object_id, o2.object_id;


SELECT o.name
    , i.name
    , p.partition_number
    , p.rows
    , au.used_pages
    , au.total_pages
    , AvgRowsPerPage = p.rows / CONVERT(DECIMAL(10,2), au.used_pages) 
FROM sys.allocation_units au WITH (NOLOCK)
    INNER JOIN sys.partitions p WITH (NOLOCK) ON ((au.type = 1 OR au.type = 3) AND au.container_id = p.hobt_id) OR (au.type = 2 AND au.container_id = p.partition_id)
    INNER JOIN sys.indexes i ON p.object_id = i.object_id AND p.index_id = i.index_id
    INNER JOIN sys.objects o WITH (NOLOCK) ON p.object_id = o.object_id
    INNER JOIN sys.schemas s WITH (NOLOCK) ON o.schema_id = s.schema_id
WHERE s.name = 'dbo'
    AND o.name IN (
        'IntTest'
        , 'SmallIntTest'
    );

The output from the last query, showing actual pages used by each table, is:

+-----------------+-----------+-------+------------+-------------+-------------+
|      NAME       | PARTITION | ROWS  | USED_PAGES | TOTAL_PAGES | AVG ROWS    |
|                 | NUMBER    |       |            |             | PER PAGE    |
+-----------------+-----------+-------+------------+-------------+-------------+
| PK_IntTest      | 1         | 65535 | 92         | 97          | 712.3369573 |
| PK_SmallIntTest | 1         | 65535 | 92         | 97          | 712.3369573 |
+-----------------+-----------+-------+------------+-------------+-------------+

As you can see from the above results, compressed INTs containing SMALLINTs actually consume precisely the same amount of space.