SQL Server – SQL Server 2012: compress column for whole table

Tags: compression, sql-server, sql-server-2012

I was wondering if SQL Server supports a kind of dictionary compression for a whole table, and not just a page.

The system I'm working on wasn't initially created for the large amounts of data it processes today. The problem I'm having at the moment is the following:

The application allows users to create legal contracts. These contracts are standardized; however, users are allowed to adjust the contents (text) of a contract if they wish.

To facilitate this, each contract makes a copy of the standardized contract text. In practice we've discovered that users almost never edit the contract texts, so we've ended up with a table with a lot of duplicate data.

Usually I would change the DB model to fit the use case; however, this is a legacy system and such a change is pretty expensive. A replacement is being worked on, so investments like this aren't made as readily.

Is it possible to have column dictionary compression for the whole table, and not just 1 page?

We're using SQL Server 2012 in an on-premises SQL cluster.

The problem is that the table is 80 GB in size, while the whole DB is 180 GB. This table takes up a lot of space, and we don't have enough memory, so SQL Server keeps unloading the data from memory.

The data of this table is used to generate a PDF which represents the contract. Each time a user modifies the state of a contract, a new PDF is generated and stored for auditing purposes, which generates a lot of reads on this table.

The reads go down to disk (since SQL Server keeps unloading the table from memory). This is creating a huge IO strain on our SAN.

The memory problems are being worked on, but that will take a couple of weeks. Suffice it to say that simply plugging more memory in isn't an option at the moment.

My thinking is: as a short-term solution, compress the data, which will drastically reduce the size of the table, make it possible for SQL Server to keep the table in memory, and relieve the IO strain on the SAN.

Best Answer

Is it possible to have column dictionary compression for the whole table, and not just 1 page?

For the situation you are describing, SQL Server's built-in compression probably won't do anything at all. I assume that the documents are stored in NVARCHAR(MAX) columns to allow for non-ASCII characters and for values longer than 4,000 characters. Off-row values are not compressed by any of SQL Server's methods (row compression, page compression, or the Unicode compression that comes with them), so most likely none of your documents will be touched.

So in short: no, there is nothing SQL Server can do automatically to help you here.
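If you want to confirm that for your own table, the built-in estimation procedure will show how little row or page compression would buy you. A minimal sketch, assuming a hypothetical dbo.Contract table (and note that row/page compression itself requires Enterprise Edition on SQL Server 2012):

    -- Estimate what PAGE compression would save on the contract table.
    -- When most of the data lives off-row, the estimate shows little or no gain.
    EXEC sys.sp_estimate_data_compression_savings
         @schema_name      = N'dbo',
         @object_name      = N'Contract',
         @index_id         = NULL,
         @partition_number = NULL,
         @data_compression = N'PAGE';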

Possible options (these are a bit more manual and hacky):

If you are using 2016 or later, or Azure SQL[1][2], then you could use the COMPRESS/DECOMPRESS functions to store the data as smaller blobs in VARBINARY columns instead. To do this in a way that is transparent to your application(s), you could create a backing table that stores the data, replacing the real table with a view that selects from it and has INSTEAD OF triggers to compress the incoming data. If you already have a lot of data, this could take a long time to roll out. To remove the need for a many-hour maintenance window to compress the existing content, have both NVARCHAR(MAX) and VARBINARY(MAX) columns in the backing table, have the view decide with DocumentText = ISNULL(DECOMPRESS(CompressedText), UncompressedText), and then have a process slowly work through the table compressing a batch of rows at a time, keeping the batches small enough to avoid locking issues so you can let the process trundle along in production. A sketch of this is given below.
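As a rough illustration of that approach, here is a minimal sketch assuming SQL Server 2016+ and hypothetical table, view, and trigger names (dbo.ContractText_Backing, dbo.ContractText, and so on). One detail worth noting: DECOMPRESS returns VARBINARY(MAX), so the result has to be cast back to NVARCHAR(MAX) in the view.

    -- Backing table holds both forms while existing rows are migrated.
    CREATE TABLE dbo.ContractText_Backing
    (
        ContractId       INT            NOT NULL PRIMARY KEY,
        UncompressedText NVARCHAR(MAX)  NULL,   -- legacy plain text
        CompressedText   VARBINARY(MAX) NULL    -- gzip blob produced by COMPRESS()
    );
    GO

    -- The view keeps the original name and shape so the application is unaware.
    CREATE VIEW dbo.ContractText
    AS
    SELECT  ContractId,
            DocumentText = ISNULL(
                CAST(DECOMPRESS(CompressedText) AS NVARCHAR(MAX)),
                UncompressedText)
    FROM    dbo.ContractText_Backing;
    GO

    -- Incoming writes are compressed transparently.
    -- (An INSTEAD OF UPDATE trigger would be needed too if the application updates text.)
    CREATE TRIGGER dbo.trg_ContractText_Insert
    ON dbo.ContractText
    INSTEAD OF INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT dbo.ContractText_Backing (ContractId, CompressedText)
        SELECT ContractId, COMPRESS(DocumentText)
        FROM   inserted;
    END;
    GO

    -- Background job: compress existing rows in small batches to avoid
    -- long blocking; let it trundle along until nothing is left to do.
    WHILE 1 = 1
    BEGIN
        UPDATE TOP (500) dbo.ContractText_Backing
        SET    CompressedText   = COMPRESS(UncompressedText),
               UncompressedText = NULL
        WHERE  CompressedText IS NULL
          AND  UncompressedText IS NOT NULL;

        IF @@ROWCOUNT = 0 BREAK;
        WAITFOR DELAY '00:00:05';
    END;

The batch loop at the end is the slow background migration mentioned above; tune the batch size and delay to whatever your production workload tolerates.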

Of course, if you can change the application(s), not just the data layer, then it would be more efficient to implement the compression of the documents there instead, because you save network transfer as well as storage and memory on the database servers.

If a lot of your documents are truly identical then you may see greater gains from deduplication than from compressing each individually, which you can do using similar techniques: store the documents in their own table with a hash of them as a key/index, and store a reference to them in the main table, either via the backing-table-plus-view-and-trigger method or elsewhere in the application's data manipulation layer (see the sketch after this paragraph). In 2016/Azure (or if using CLR is an option) you can compress the stored documents too for extra space saving.
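A rough sketch of that deduplication layout, again with hypothetical names. One caveat: on SQL Server 2012, HASHBYTES only accepts up to 8,000 bytes of input (the limit was removed in 2016), so for long documents the hash would need to be computed in the application or in a CLR function instead.

    -- Each distinct document is stored exactly once, keyed by its hash.
    CREATE TABLE dbo.ContractDocument
    (
        DocumentHash BINARY(32)    NOT NULL PRIMARY KEY,  -- SHA2_256 of the text
        DocumentText NVARCHAR(MAX) NOT NULL
    );

    -- The contract row only carries a reference to the shared document.
    CREATE TABLE dbo.Contract
    (
        ContractId   INT        NOT NULL PRIMARY KEY,
        DocumentHash BINARY(32) NOT NULL
            REFERENCES dbo.ContractDocument (DocumentHash)
    );
    GO

    -- On insert (e.g. inside an INSTEAD OF trigger or the data layer):
    -- store the text only if this exact document has not been seen before.
    DECLARE @Text NVARCHAR(MAX) = N'...contract text...';
    DECLARE @Hash BINARY(32)    = HASHBYTES('SHA2_256', @Text);

    IF NOT EXISTS (SELECT 1 FROM dbo.ContractDocument WHERE DocumentHash = @Hash)
        INSERT dbo.ContractDocument (DocumentHash, DocumentText)
        VALUES (@Hash, @Text);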

Another option, if many of your documents are almost identical because they are constructed from templates, is to store templates and diffs, though this would likely need to be implemented in the application layer or via CLR, as I can't imagine string diffs and patches in TSQL being anywhere near efficient enough. This will only work if the documents are stored in an uncompressed form (i.e. RTF files, old Office formats, markdown text; not recent MS Office formats or PDFs), because with compressed formats everything is likely to be different after the first change.

These ideas add complexity and reduce write performance, so you'd need to judge them with those negatives in mind. I'm playing with the ideas myself for a similar situation where we have a large collection of pickled data (several tens of GB in total, a few thousand characters on average per item) stored as text in a legacy application. I'll see if I can get permission to share the results when I'm done.

--

[1] The question was later updated to note that SQL Server 2012 is being used in this case. The deduplication method will still work, and I've left the compression method in the answer as it could be helpful for others in a similar situation. In fact it would be possible to implement the compression in SQL Server 2012 with a custom CLR module; http://aboutsqlserver.com/2015/04/07/compressing-lob-xml-data-in-the-database/ is the first good-looking example from a quick search, but there are many more out there.

[2] As noted by Paul in the comments below, with 2017, LOB support is coming to compressed columnstore indexes, so that would be another avenue to research for people using a recent enough version of SQL Server.
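For completeness, that route is a one-liner once you are on a suitable version; a rough illustration only, assuming the same hypothetical dbo.Contract table (SQL Server 2017+ for LOB column support):

    -- A clustered columnstore index brings its own dictionary/run-length compression.
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_Contract ON dbo.Contract;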