Rebuild the index to write out all pages freshly in the optimal way.
If you want to absolutely minimize space usage, specify ONLINE = OFF. Online operations add a small amount of per-row space overhead.
MAXDOP = 1 can be good to reduce fragmentation, since a single thread writes the pages out in order. I'm not sure that helps with space usage, though; I can't think of anything a DOP of 1 might save.
If the target file of the rebuild already contains fragmented free space, the newly built index/partition can turn out to be fragmented immediately. Consider building into an empty target filegroup that is pre-sized to approximately the right size.
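A minimal sketch of those options (dbo.YourTable, IX_YourIndex, Id, and FG_Rebuild are all placeholder names; DROP_EXISTING = ON is one way to move an index to a new filegroup in a single rebuild):
/* offline, single-threaded rebuild: no per-row online overhead, pages written in order */
ALTER INDEX IX_YourIndex ON dbo.YourTable
REBUILD WITH (ONLINE = OFF, MAXDOP = 1);
/* or rebuild onto an empty, pre-sized filegroup in one step */
CREATE UNIQUE CLUSTERED INDEX IX_YourIndex
ON dbo.YourTable (Id)
WITH (DROP_EXISTING = ON, ONLINE = OFF, MAXDOP = 1)
ON FG_Rebuild;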
Add a persisted computed column that contains a CHECKSUM of the 5 fields, and use that to perform the comparisons. The CHECKSUM value identifies that specific combination of fields (though CHECKSUM can collide, so it is not strictly guaranteed to be unique) and is stored as an INT, which makes for a much easier comparison target in a WHERE clause.
USE tempdb; /* create this in tempdb since it is just a demo */
CREATE TABLE dbo.t1
(
Id bigint identity(1,1) constraint PK_t1 primary key clustered
, Sequence int
, Parent int not null constraint df_T1_Parent DEFAULT ((0))
, Data1 varchar(20)
, Data2 varchar(20)
, Data3 varchar(20)
, Data4 varchar(20)
, Data5 varchar(20)
, CK AS CHECKSUM(Data1, Data2, Data3, Data4, Data5) PERSISTED
);
GO
INSERT INTO dbo.t1 (Sequence, Parent, Data1, Data2, Data3, Data4, Data5)
VALUES (1,1,'test','test2','test3','test4','test5');
SELECT *
FROM dbo.t1;
GO
/* this row will NOT get inserted since it already exists in dbo.t1 */
INSERT INTO dbo.t1 (Sequence, Parent, Data1, Data2, Data3, Data4, Data5)
SELECT 2, 3, 'test', 'test2', 'test3', 'test4', 'test5'
WHERE CHECKSUM('test','test2','test3','test4','test5') NOT IN (SELECT CK FROM dbo.t1);
/* still only shows the original row, since the checksum for the row already
exists in dbo.t1 */
SELECT *
FROM dbo.t1;
In order to support a large number of rows, you'd want to create a non-unique index on the CK field.
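For example, against the demo table above:
/* non-unique, since CHECKSUM values can collide */
CREATE NONCLUSTERED INDEX IX_t1_CK ON dbo.t1 (CK);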
By the way, you neglected to mention the number of rows you are expecting in this table; that information would be instrumental in making great recommendations.
In-row data is limited to a maximum of 8060 bytes, which is the size of a single page of data, less the required overhead for each page. Any single row larger than that will result in some off-page storage of row data. I'm certain other contributors to http://dba.stackexchange.com can give you a much more concise definition of the engine internals regarding storage of large rows. How big is your largest row, presently?
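One way to check is a sketch like this (substitute your table for dbo.t1; sys.dm_db_index_physical_stats reports max_record_size_in_bytes when run in DETAILED mode):
SELECT index_id, max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats
     (DB_ID(), OBJECT_ID('dbo.t1'), NULL, NULL, 'DETAILED');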
If items in Data1, Data2, Data3... have the same values occurring in a different order, the checksum will be different, so you may want to take that into consideration.
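A quick demonstration, reusing the literal values from above:
/* same five values, first two swapped - the checksums differ */
SELECT CHECKSUM('test','test2','test3','test4','test5') AS InOrder
     , CHECKSUM('test2','test','test3','test4','test5') AS Swapped;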
Following a brief discussion with the fantastic Mark Storey-Smith on The Heap, I'd like to offer a similar, although potentially better, choice for calculating a hash on the fields in question. You could alternately use the HASHBYTES() function in the calculated column. HASHBYTES() has some gotchas, such as the necessity to concatenate your fields together, including some type of delimiter between the field values, in order to pass HASHBYTES() a single value. For more information about HASHBYTES(), Mark recommended this site. Clearly, MSDN also has some great info at http://msdn.microsoft.com/en-us/library/ms174415.aspx
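As a sketch of that approach (my choices here: a '|' delimiter, a column named HB, and the SHA2_256 algorithm; CONCAT and SHA2_256 both require SQL Server 2012+, and HASHBYTES input is capped at 8000 bytes before SQL Server 2016):
ALTER TABLE dbo.t1 ADD HB AS
    HASHBYTES('SHA2_256', CONCAT(Data1,'|',Data2,'|',Data3,'|',Data4,'|',Data5))
    PERSISTED; /* the delimiter keeps ('ab','c') and ('a','bc') from hashing identically */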
Best Answer
The data you are looking to compress is that sent over the wire via TDS. There is some minor compression here but nowhere near the type of compression you get with page/row compression, backup compression or ColumnStore compression.
It has been asked for before:
http://connect.microsoft.com/SQLServer/feedback/details/412131/enable-network-compression-compress-tds-stream
http://connect.microsoft.com/SQLServer/feedback/details/377479/wan-compression-option
The items are still open, so maybe there is some hope. There is no way to control this via the connection string that I've ever seen.
In the meantime there are some products that claim to do this, e.g.
http://www.nitrosphere.com/products/nitroaccelerator/
http://toonel.net/tcpany.htm
You can also potentially configure the network between your SQL Server and the application servers to support compression (and other things, like encryption), but that's beyond my scope here, and I'm not sure it would be supported by every single feature of SQL Server.
And to be honest, I'm not convinced this is the place you want to focus on optimizing. Compressing this stream might actually slow things down and outweigh the benefit of sending fewer bytes. I'd rather plunk the money down on better network connectivity between server and client(s) than spend time investing in this type of work and testing whether it has any actual benefit - something you can't know until afterward. Going from 10/100 to gigabit fiber has a known and predictable impact on network I/O.
I am not sure about the format of the bytes sent over the wire; you will have to set up some kind of packet sniffer for that (or maybe someone has already done that and will chime in).
As for the impact of compression, unless you are on Fusion-IO or other high-end SSD-type solutions, you are almost certainly I/O-bound currently, not CPU-bound. So as long as you have CPU headroom, you should see faster performance with compression enabled (but this won't change network performance, since the data is decompressed before transmission). I say that knowing nothing about your servers, your application, your data, or your usage patterns - you could very well have an edge case where compression actually hurts performance, or where the data just isn't a good candidate for good compression ratios.
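If you do decide to test data compression, here is a sketch (dbo.t1 stands in for your table; note that before SQL Server 2016 SP1, row/page compression requires Enterprise Edition):
/* estimate the compression ratio before committing to it */
EXEC sys.sp_estimate_data_compression_savings
     @schema_name = 'dbo'
   , @object_name = 't1'
   , @index_id = NULL
   , @partition_number = NULL
   , @data_compression = 'PAGE';
/* enable PAGE compression if the estimate looks worthwhile */
ALTER TABLE dbo.t1 REBUILD WITH (DATA_COMPRESSION = PAGE);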