SQL Server – Find uncompressed size of all tables in a database

cache; compression; size; sql-server

In Dynamics AX there is a caching mechanism where tables can be configured to be loaded into memory and cached. This cache is limited to a certain number of KB to prevent memory issues. The setting I'm talking about is called entiretablecache and loads the whole table into memory as soon as a single record is requested.

Until recently we relied on some scripts to check the size of the tables that have this setting and see whether the table size is above this limit.

Now, however, compression comes into play, and things like sp_spaceused or sys.allocation_units seem to report the space actually used by the compressed data.
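
For example (a minimal sketch; dbo.MyTable is a placeholder name), both of the following report the on-disk, i.e. compressed, size:

EXEC sys.sp_spaceused N'dbo.MyTable';  -- reserved / data / index sizes as stored on disk

SELECT p.data_compression_desc,
       SUM(au.used_pages) * 8 AS used_kb      -- pages are 8 KB
FROM sys.partitions AS p
JOIN sys.allocation_units AS au
  ON au.container_id = p.partition_id         -- the usual join used in sizing scripts
WHERE p.[object_id] = OBJECT_ID(N'dbo.MyTable')
GROUP BY p.data_compression_desc;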

Obviously, the application server is working with uncompressed data so the data size on disk in SQL Server is irrelevant. I need the actual size the uncompressed data will have.

I know of sp_estimate_data_compression_savings but, as the name says, this is just an estimate. I would prefer to have the size as correct as possible.
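
For reference, that estimate can be requested like this (placeholder names again); passing NONE asks what the size would be without compression:

EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'MyTable',
     @index_id         = NULL,      -- NULL = all indexes
     @partition_number = NULL,      -- NULL = all partitions
     @data_compression = N'NONE';   -- estimate the size with no compression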

The only way I could think of was some convoluted dynamic SQL that creates uncompressed tables with the same structure as the compressed tables, inserts the compressed data into those shadow tables, and then checks the size of the shadow tables.
Needless to say, this is a bit tedious and takes a while to run on a database of several hundred GB.
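
Roughly something like this per table (a sketch with placeholder names; the real version builds these statements dynamically for every table):

-- copy the data into a new, uncompressed heap and measure that copy
SELECT * INTO dbo.MyTable_Shadow FROM dbo.MyTable;  -- SELECT ... INTO creates an uncompressed heap
EXEC sys.sp_spaceused N'dbo.MyTable_Shadow';        -- now reflects the uncompressed size
DROP TABLE dbo.MyTable_Shadow;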

PowerShell could be an option, but I wouldn't like to iterate over all tables and perform a select * on them just to check the size in the script, as that would flood the cache and would probably take a long time too.

In short, I need a way to get the size for each table as it will be once uncompressed, with fragmentation out of the equation, as presented to the application, if that's possible. I'm open to different approaches; T-SQL is preferred, but I'm not opposed to PowerShell or other creative approaches.

Assume the buffer in the application is the size of the data. A bigint is always the size of a bigint, and a character data type is 2 bytes per character (Unicode). BLOB data takes the size of the data too, an enum is basically an int, numeric data is numeric(38,12), and datetime is the size of a datetime. Also, there are no NULL values; they are stored as an empty string, 1900-01-01, or zero.
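
As a small worked example of these assumptions (the row layout here is hypothetical), a row consisting of a bigint, a numeric, a datetime and a 30-character string field works out as follows:

SELECT 8        -- bigint
     + 16       -- numeric(38,12)
     + 8        -- datetime
     + 30 * 2   -- 30-character string field, 2 bytes per character (Unicode)
       AS BytesPerRow;  -- 92 bytes for this hypothetical row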

There is no documentation on how this is implemented, but the assumptions are based on some testing and on the scripts used by PFEs and the support team to check table sizes (which apparently also ignore compression, since the check is built into the application and the app can't tell whether the underlying data is compressed). This link, for example, states:

Avoid using EntireTable caches for large tables (in AX 2009 over 128 KB or 16 pages, in AX 2012 over ‘entire table cache size’ application setting [default: 32 KB, or 4 pages]) – move to record caching instead.

Best Answer

I need the actual size the uncompressed data will have.
...
I would prefer to have size as correct as possible.

While the desire for this information is certainly understandable, getting this information, especially in the context of "correct as possible", is trickier than everyone expects due to faulty assumptions. Whether doing the uncompressed shadow table idea mentioned in the question, or @sp_BlitzErik's suggestion in a comment about restoring the DB and uncompressing there to check, it should not be assumed that the size of the uncompressed table == the size of said data in memory on the app server:

  1. Are all rows in the table being cached? Or just within a range? The assumption here is that it is all, and that might be correct, but I figured it should at least be mentioned that this might not be the case (unless documentation states otherwise, but this is a minor point anyway; I just didn't want it to go unmentioned).

    Question was updated to state: yes, all rows are being cached.

  2. Structure overhead

    1. On the DB side:
      Page and row overhead on the DB side: how many rows fit on a page is determined by many factors that could throw off estimates. Even with a FILLFACTOR of 100 (or 0), there is still likely to be some unused space left over on the page because it isn't enough for an entire row. And that is in addition to the page header. Also, if any Snapshot Isolation functionality is enabled, there will be, I believe, an extra 14 bytes per row taken up by the row-versioning information, and that will throw off estimates. There are other minutiae related to the actual size of the row (NULL bitmap, variable-length columns, etc.), but the items mentioned thus far should alone make the point.
    2. On the app server side:
      What type of collection is being used to store the cached results? I assume this is a .NET app, so is it a DataTable? A generic list? A SortedDictionary? Each type of collection has a different amount of overhead. I would not expect any of the options to necessarily mirror the Page and Row overheads on the DB side, especially at scale (I'm sure a small number of rows might not have enough variation to matter, but you aren't looking for differences in hundreds of bytes or just a few kB).
  3. Datatypes
    1. On the DB side:
      CHAR / VARCHAR data is stored at 1 byte per character (ignoring double-byte characters for the moment). XML is optimized to not take up nearly as much space as the text representation would imply. This datatype creates a dictionary of element and attribute names and replaces the actual references to them in the document with their respective IDs (kinda nice, actually). Otherwise, the string values are all UTF-16 (2 or 4 bytes per "character"), just like NCHAR / NVARCHAR. DATETIME2 is between 6 and 8 bytes. DECIMAL is between 5 and 17 bytes (depending on the precision). A quick demonstration of several of these sizes follows after this list.
    2. On the app server side:
      Strings (again, assuming .NET) are always UTF-16. There is no optimization for 8-bit strings such as what VARCHAR holds. BUT, strings can also be "interned" which is a shared copy that can be referenced many times (but I don't know if this works for strings in collections, or if so, if it works for all types of collections). XML may or may not be stored the same way in memory (I will have to look that up). DateTime is always 8 bytes (like T-SQL DATETIME, but not like DATE, TIME, or DATETIME2). Decimal is always 16 bytes.
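
As a quick illustration of the DB-side datatype sizes just mentioned (a minimal sketch; the literal values are only examples), DATALENGTH shows the stored size per datatype:

SELECT DATALENGTH(CONVERT(VARCHAR(10), 'abcde'))        AS varchar_bytes,     -- 5  (1 byte per character)
       DATALENGTH(CONVERT(NVARCHAR(10), N'abcde'))      AS nvarchar_bytes,    -- 10 (2 bytes per character)
       DATALENGTH(CONVERT(DATETIME2(0), SYSDATETIME())) AS datetime2_0_bytes, -- 6
       DATALENGTH(CONVERT(DATETIME2(7), SYSDATETIME())) AS datetime2_7_bytes, -- 8
       DATALENGTH(CONVERT(DECIMAL(9, 2), 1.23))         AS decimal_p9_bytes,  -- 5
       DATALENGTH(CONVERT(DECIMAL(38, 12), 1.23))       AS decimal_p38_bytes; -- 17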

All of that to say: there is pretty much nothing you can do on the DB side to get an even fairly accurate memory-footprint size on the app server side. You need to find a way to interrogate the app server itself, after it has been loaded with a particular table, to know how big it is. And I am not sure if a debugger would let you see the runtime size of a filled collection. If not, then the only way to get close would be to go through all rows of a table, multiplying each column by the appropriate .NET size (e.g. INT = * 4, VARCHAR = DATALENGTH() * 2, NVARCHAR = DATALENGTH(), XML = ?, etc.), but that still leaves the question of the overhead of the collection plus each element of the collection.

Given the new definitions in the question, one could probably do the following query to get rather close. And it doesn't matter whether the table is compressed or not, though it's up to each person to determine whether scanning all rows is appropriate on Production (maybe do it from a restore, or during off-peak hours):

SELECT
   -- NVARCHAR / NCHAR: DATALENGTH already returns 2 bytes per character
   SUM( DATALENGTH([NVarcharColumn_1]) + DATALENGTH([NVarcharColumn_N]) ) +
   -- VARCHAR / CHAR: 1 byte per character on disk, 2 bytes per character in a .NET string
   SUM( (DATALENGTH([VarcharColumn_1]) + DATALENGTH([VarcharColumn_N])) * 2 ) +
   SUM(4 * [number_of_INT_columns]) +
   SUM(8 * [number_of_BIGINT_and_DATETIME_columns]) +
   SUM(16 * [number_of_DECIMAL/NUMERIC_and_UNIQUEIDENTIFIER_columns])
   -- ...and so on for the remaining fixed-width datatypes
FROM [SchemaName].[TableName] WITH (NOLOCK) -- assuming no Snapshot Isolation

But remember, this doesn't account for collection or collection-element overhead. And I am not sure if we can get that value without a debugger (or possibly something like ILSpy, but I am not recommending that as it might violate the EULA depending on local laws).
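
If hand-writing that expression for every table is too tedious, one could generate it from the catalog views. The following is a rough sketch under the sizing assumptions stated in the question; the schema and table names are placeholders, and the type-to-size mapping is mine, not anything official:

DECLARE @schema SYSNAME = N'dbo',      -- placeholder schema
        @table  SYSNAME = N'MyTable',  -- placeholder table
        @expr   NVARCHAR(MAX),
        @sql    NVARCHAR(MAX);

-- build one "SUM(...)" or "COUNT_BIG(*) * n" term per column
SELECT @expr = STUFF((
    SELECT N' + ' +
           CASE
               -- already 2 bytes per character, or raw BLOB bytes (XML is only a rough proxy here)
               WHEN t.name IN (N'nvarchar', N'nchar', N'xml', N'varbinary', N'binary', N'image', N'ntext')
                   THEN N'SUM(CONVERT(BIGINT, DATALENGTH(' + QUOTENAME(c.name) + N')))'
               -- 1 byte per character on disk, 2 bytes per character in the app
               WHEN t.name IN (N'varchar', N'char', N'text')
                   THEN N'SUM(CONVERT(BIGINT, DATALENGTH(' + QUOTENAME(c.name) + N')) * 2)'
               WHEN t.name IN (N'decimal', N'numeric', N'uniqueidentifier')
                   THEN N'COUNT_BIG(*) * 16'
               WHEN t.name IN (N'bigint', N'datetime', N'datetime2', N'money')
                   THEN N'COUNT_BIG(*) * 8'
               WHEN t.name = N'int'
                   THEN N'COUNT_BIG(*) * 4'
               ELSE N'COUNT_BIG(*) * ' + CONVERT(NVARCHAR(10), c.max_length)  -- fallback: declared size
           END
    FROM sys.columns AS c
    JOIN sys.types   AS t ON t.user_type_id = c.user_type_id
    WHERE c.[object_id] = OBJECT_ID(QUOTENAME(@schema) + N'.' + QUOTENAME(@table))
    FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 3, N'');

SET @sql = N'SELECT ' + @expr + N' AS EstimatedAppBytes FROM '
         + QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N' WITH (NOLOCK);';

EXEC sys.sp_executesql @sql;

To cover every table, one could wrap this in a loop over sys.tables and log the results; whether scanning all rows of a multi-hundred-GB database is acceptable on Production is, as noted above, a separate decision.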