Sql-server – Are string columns with very few different entries automatically compressed

compressionsql server

We have a table which contains a column with only a dozen different strings in it. Is there anything I need to do in order to get the DB to compress this information? We don't have too much disk space available and it seems like a waste to store all these long stings over and over again.

I would like the DB to store this information internally in a compressed form. For example, the current data looks like this:

Column "TypeInfo"
---------------------------------------------------------
very_long_descriptor_someone_came_up_with_a_long_time_ago
this_desciptor_is_also_very_long
nobody_knows_why_this_descriptor_is_so_long
very_long_descriptor_someone_came_up_with_a_long_time_ago
very_long_descriptor_someone_came_up_with_a_long_time_ago
nobody_knows_why_this_descriptor_is_so_long

I'd like to see SQL Server store some kind of small key instead that represents those longer strings that are repeated:

Column "TypeInfo"
-----------------
1
2
3
1
1
3
(+ mapping information)

Best Answer

No, SQL Server does not do anything like this automatically. You can achieve some of what you're after with data compression, but its highest granularity is at the page level. So you'll get some good compression in the index pages if you have an index on only this column (or at least with this column as the leading key column), but assuming the clustered index is not on this column, you'll get much less benefit from it when the data and other index pages contain many variations of the string.

Data compression uses multiple algorithms, including dictionary, which would have the best impact if all of the values on a page are the same. Even at its best, though, this is not going to be extremely valuable on a large table - imagine you have a million pages and each page has two copies of one of those values. Sure, compression buys you some savings by storing the value only once instead of twice on each page (plus some negligible pointer overhead), but SQL Server still stores a million copies - one per page!

What I suggest, instead of storing the same dozen strings over and over again, is to create a lookup table with a TINYINT key, allowing you to only store each description string once, no matter how many times it's used. You can always retrieve the description at query time without ever having to store it with the data, and you can even create views to make this mostly transparent to queries, applications and users. This can also be a more attractive option if your workload is CPU-bound; while data compression saves you on storage and memory, there is some CPU overhead involved in compressing and decompressing each page.

For example:

CREATE TABLE dbo.TypeInfo
(
  TypeInfoID TINYINT PRIMARY KEY,
  Description VARCHAR(64) NOT NULL UNIQUE
);

INSERT dbo.TypeInfo(TypeInfoID, Description) VALUES
  (1,'very_long_descriptor_someone_came_up_with_a_long_time_ago'),
  (2,'this_desciptor_is_also_very_long'),
  (3,'nobody_knows_why_this_descriptor_is_so_long');
-- ... 9 other values ...

Now you need to add the TypeInfoID column to the original table:

ALTER TABLE dbo.OtherTable 
  ADD TypeInfoID TINYINT;

Then you can update the existing data as follows:

UPDATE o
  SET TypeInfoID = t.TypeInfoID
  FROM dbo.OtherTable AS o
  INNER JOIN dbo.TypeInfo AS t
  ON o.TypeInfo = t.Description;

(And process this in batches, if the table is large, and you want to minimize impact to the log and blocked operations.)

Then you can add a foreign key constraint:

ALTER TABLE dbo.OtherTable 
  ADD CONSTRAINT fk_TypeInfo 
  FOREIGN KEY (TypeInfoID) REFERENCES dbo.TypeInfo(TypeInfoID);

Once all of the data is verified to be correct, you can drop the column (first drop any constraints and indexes that reference it):

ALTER TABLE dbo.OtherTable 
  DROP COLUMN TypeInfo;

Then you can create a view that your queries can use to keep the semantics the same:

CREATE VIEW dbo.vOtherTable
AS
  SELECT /* o.columns */, TypeInfo = t.Description
    FROM dbo.OtherTable AS o
    INNER JOIN dbo.TypeInfo AS t
    ON o.TypeInfoID = t.TypeInfoID;

(Or in some scenarios you could even rename the table, and give the view the name the table used to have, making it completely transparent.)

And finally, if you don't want to allow NULLs in this column, and if you are on SQL Server 2012 or above, you can alter the column to NOT NULL without incurring updates and logging:

ALTER TABLE dbo.OtherTable 
  ALTER COLUMN TypeInfo TINYINT NOT NULL;

If you are on an earlier version, you might consider adding a CHECK CONSTRAINT instead:

ALTER TABLE dbo.OtherTable 
  WITH NOCHECK ADD CONSTRAINT ck_TypeInfoNotNULL
  CHECK (TypeInfoID IS NOT NULL);

But this may not be ideal, as an untrusted constraint can affect the ability of the optimizer to use it... in a lot of cases it is better to just take the one-time performance hit up front.

I think much of this is way beyond the scope of your original question, and you may not even consider any of this solution compelling, but if you do, you may want to do your own research and testing to ensure the least intrusive method of making this change. You can start with the following questions here and on Stack Overflow:

Related Solutions

Sql-server – the best practice for storing large data when nvarchar(4000) is sufficient 99% of the time

as soon as DataValue is over nvarchar(4000), internally, sql will store the data as a blob

This is incorrect. SQL Server will use off-row storage for both max and non-max types. In other words your NVARCHAR(4000) may also be stored off-row. The actual storage location will depend on whether the row fits or not on a page and on the exact combination of sp_tableoption values for the large value types out of row and text in row settings.

Your best bet is to change it to max. SQL will store the value in-row whenever possible, offering fast access. This makes application programing much simpler as you don't have to deal with the actual location (Column) based on size. You should only be concerned if you have an objection on adding the first max type column to a table (eg. it will prevent online rebuild operations), but you are already willing to add one so that should not be an issue.

Sql-server – How to avoid duplicating information in a series of nested SELECT statements

;WITH x AS 
(
  SELECT g.guestid, e.[year], e.run, c = COUNT(*)
  FROM dbo.guest AS g
  INNER JOIN dbo.[order] AS o
  ON g.guestid = o.guestid
  INNER JOIN dbo.eventseat AS es
  ON o.orderid = es.orderid
  INNER JOIN dbo.event2 AS e
  ON e.eventid = es.eventid
  WHERE e.incometype = 'T'
  AND e.eventtype in ('SS', 'BMF')
  AND es.status in ('2','4')
  AND es.price NOT IN ( '0.00', '5.00')
  AND e.[year] = '2010'
  AND e.run IN ('SS-DANCE', 'SS-FILM', 'SS-OPERA')
  GROUP BY g.guestid, e.[Year], e.run
),
y AS 
(
  SELECT * FROM x PIVOT (MAX(c) FOR run IN 
  ([SS-DANCE],[SS-FILM],[SS-OPERA])) AS z
)
SELECT y.guestid, y.[year], g.[first], g.[last], 
  [SS-DANCE] = COALESCE(y.[SS-DANCE], 0), 
  [SS-FILM]  = COALESCE(y.[SS-FILM], 0), 
  [SS-OPERA] = COALESCE(y.[SS-OPERA], 0)
FROM y 
INNER JOIN dbo.guest AS g
ON y.guestid = g.guestid;

I'm not sure if my where clause is going to work for you as written. This currently assumes you only care about these three types (dance, film and opera) and you only care about the year 2010. If you need to cross years you can remove that WHERE clause. If you only care about dance in 2010 and film in any year you'll need to write a more complex set of where clauses.

Some comments:

try to avoid reserved words like order, year, first and last. These only complicate matters because they need to be enclosed in square brackets (which are preferable over "double quotes" mostly for readability and unambiguity).
always use dbo. (or whatever schema prefix is appropriate).
as I suggested above, stick to explicit INNER JOIN syntax. FROM x, y, z is for the birds.

Best Answer

Related Solutions

Sql-server – the best practice for storing large data when nvarchar(4000) is sufficient 99% of the time

Sql-server – How to avoid duplicating information in a series of nested SELECT statements

Related Question