Sql-server – Are string columns with very few different entries automatically compressed

compressionsql server

We have a table which contains a column with only a dozen different strings in it. Is there anything I need to do in order to get the DB to compress this information? We don't have too much disk space available and it seems like a waste to store all these long stings over and over again.

I would like the DB to store this information internally in a compressed form. For example, the current data looks like this:

Column "TypeInfo"
---------------------------------------------------------
very_long_descriptor_someone_came_up_with_a_long_time_ago
this_desciptor_is_also_very_long
nobody_knows_why_this_descriptor_is_so_long
very_long_descriptor_someone_came_up_with_a_long_time_ago
very_long_descriptor_someone_came_up_with_a_long_time_ago
nobody_knows_why_this_descriptor_is_so_long

I'd like to see SQL Server store some kind of small key instead that represents those longer strings that are repeated:

Column "TypeInfo"
-----------------
1
2
3
1
1
3
(+ mapping information)

Best Answer

No, SQL Server does not do anything like this automatically. You can achieve some of what you're after with data compression, but its highest granularity is at the page level. So you'll get some good compression in the index pages if you have an index on only this column (or at least with this column as the leading key column), but assuming the clustered index is not on this column, you'll get much less benefit from it when the data and other index pages contain many variations of the string.

Data compression uses multiple algorithms, including dictionary, which would have the best impact if all of the values on a page are the same. Even at its best, though, this is not going to be extremely valuable on a large table - imagine you have a million pages and each page has two copies of one of those values. Sure, compression buys you some savings by storing the value only once instead of twice on each page (plus some negligible pointer overhead), but SQL Server still stores a million copies - one per page!

What I suggest, instead of storing the same dozen strings over and over again, is to create a lookup table with a TINYINT key, allowing you to only store each description string once, no matter how many times it's used. You can always retrieve the description at query time without ever having to store it with the data, and you can even create views to make this mostly transparent to queries, applications and users. This can also be a more attractive option if your workload is CPU-bound; while data compression saves you on storage and memory, there is some CPU overhead involved in compressing and decompressing each page.

For example:

CREATE TABLE dbo.TypeInfo
(
  TypeInfoID TINYINT PRIMARY KEY,
  Description VARCHAR(64) NOT NULL UNIQUE
);

INSERT dbo.TypeInfo(TypeInfoID, Description) VALUES
  (1,'very_long_descriptor_someone_came_up_with_a_long_time_ago'),
  (2,'this_desciptor_is_also_very_long'),
  (3,'nobody_knows_why_this_descriptor_is_so_long');
-- ... 9 other values ...

Now you need to add the TypeInfoID column to the original table:

ALTER TABLE dbo.OtherTable 
  ADD TypeInfoID TINYINT;

Then you can update the existing data as follows:

UPDATE o
  SET TypeInfoID = t.TypeInfoID
  FROM dbo.OtherTable AS o
  INNER JOIN dbo.TypeInfo AS t
  ON o.TypeInfo = t.Description;

(And process this in batches, if the table is large, and you want to minimize impact to the log and blocked operations.)

Then you can add a foreign key constraint:

ALTER TABLE dbo.OtherTable 
  ADD CONSTRAINT fk_TypeInfo 
  FOREIGN KEY (TypeInfoID) REFERENCES dbo.TypeInfo(TypeInfoID);

Once all of the data is verified to be correct, you can drop the column (first drop any constraints and indexes that reference it):

ALTER TABLE dbo.OtherTable 
  DROP COLUMN TypeInfo;

Then you can create a view that your queries can use to keep the semantics the same:

CREATE VIEW dbo.vOtherTable
AS
  SELECT /* o.columns */, TypeInfo = t.Description
    FROM dbo.OtherTable AS o
    INNER JOIN dbo.TypeInfo AS t
    ON o.TypeInfoID = t.TypeInfoID;

(Or in some scenarios you could even rename the table, and give the view the name the table used to have, making it completely transparent.)

And finally, if you don't want to allow NULLs in this column, and if you are on SQL Server 2012 or above, you can alter the column to NOT NULL without incurring updates and logging:

ALTER TABLE dbo.OtherTable 
  ALTER COLUMN TypeInfo TINYINT NOT NULL;

If you are on an earlier version, you might consider adding a CHECK CONSTRAINT instead:

ALTER TABLE dbo.OtherTable 
  WITH NOCHECK ADD CONSTRAINT ck_TypeInfoNotNULL
  CHECK (TypeInfoID IS NOT NULL);

But this may not be ideal, as an untrusted constraint can affect the ability of the optimizer to use it... in a lot of cases it is better to just take the one-time performance hit up front.

I think much of this is way beyond the scope of your original question, and you may not even consider any of this solution compelling, but if you do, you may want to do your own research and testing to ensure the least intrusive method of making this change. You can start with the following questions here and on Stack Overflow: