Sql-server – Database size – MDF too large

shrink, sql-server, sql-server-2005

I'm maintaining a SQL Server 2005 database which hosts approximately 2.9 TB of data (2 × 1.45 TB – I have a RAW schema and an ANALYSIS schema, so essentially two copies of the ingested data). The recovery model is SIMPLE and the .ldf is at 6 GB.

For whatever reason, the .mdf is 7.5 TB. Now, there are only maybe 2–3 additional columns in the ANALYSIS tables, and not many NVARCHAR(MAX) columns, which (as I may have mistakenly understood – please correct me if I'm wrong) could be causing additional space allocation. That's after shrinking the database just now – it was at ~9 TB before that. Any thoughts?

And, please, let me know if you have additional questions – I'm very new to database administration and optimization efforts (I usually don't do this side of the job :)).

Many thanks!

Andrija

Best Answer

In your size estimates, have you taken into account the space taken by indexes? Also, if you have text fields declared as multi-byte (N[VAR]CHAR rather than [VAR]CHAR) while the input files are UTF-8 or plain one-byte-per-character, that will push your storage requirements up by up to a factor of two. Furthermore, remember that if a table has a clustered key/index, its size affects every other index on the table, because they all include the clustered key value for every row. To give an extreme example: if a table has an NCHAR(10) clustered key where an INT would do, you are not only using an extra 16 bytes per row in the data pages, you are also wasting 16 bytes per row in every other index on that table.

Also, some space will be allocated but unused, either because the DB engine left it allocated after deletes so it can be reused quickly for new data in that table, or because the pattern of inserts and deletes has left many pages only partly full.

You can run:

SELECT o.name
     , SUM(ps.reserved_page_count)/128.0 AS ReservedMB
     , SUM(ps.used_page_count)/128.0 AS UsedMB
     , SUM(ps.reserved_page_count-ps.used_page_count)/128.0 AS DiffMB
FROM sys.objects o  
JOIN sys.dm_db_partition_stats ps ON o.object_id = ps.object_id  
WHERE OBJECTPROPERTYEX(o.object_id, 'IsMSShipped') = 0  
GROUP BY o.name  
ORDER BY SUM(ps.reserved_page_count) DESC

to get a quick look at what tables are taking up space.

Also EXEC sp_spaceused run within that DB will return two result sets. The first lists the total space allocated in the filesystem for the data files and how much of that is unallocated, the second lists how much of the allocated space is used for data pages, for index pages, or is currently unused.
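For example (dbo.RawReadings below is a placeholder table name, not one from your database):

```sql
-- Database-level summary: two result sets
-- (file allocation, then data/index/unused page breakdown).
EXEC sp_spaceused;

-- Per-object: pass a table name to get rows / reserved / data /
-- index_size / unused for just that table.
EXEC sp_spaceused @objname = N'dbo.RawReadings';
```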

sp_spaceused will return the space used by a given object too, so you can loop over all user tables to build a work table for analysis:

-- TEMP TABLES FOR ANALYSIS
CREATE TABLE #tTables (sName NVARCHAR(MAX), iRows BIGINT, iReservedKB BIGINT, iDataKB BIGINT, iIndexKB BIGINT, iUnusedKB BIGINT)
CREATE TABLE #tTmp (sName NVARCHAR(MAX), iRows BIGINT, sReservedKB NVARCHAR(MAX), sDataKB NVARCHAR(MAX), sIndexKB NVARCHAR(MAX), sUnusedKB NVARCHAR(MAX))
-- COLLECT SPACE USE PER TABLE (? expands to each table's qualified name)
EXEC sp_msforeachtable 'INSERT #tTmp EXEC sp_spaceused ''?'';'
-- CONVERT NUMBER-AS-TEXT COLUMNS (e.g. "1234 KB") TO NUMBER TYPES FOR EASIER ANALYSIS
INSERT #tTables SELECT sName, iRows
                     , CAST(REPLACE(sReservedKB, ' KB', '') AS BIGINT)
                     , CAST(REPLACE(sDataKB    , ' KB', '') AS BIGINT)
                     , CAST(REPLACE(sIndexKB   , ' KB', '') AS BIGINT)
                     , CAST(REPLACE(sUnusedKB  , ' KB', '') AS BIGINT)
                FROM #tTmp
DROP TABLE #tTmp
-- DO SOME ANALYSIS
SELECT sName = 'TOTALS', iRows = SUM(iRows), iReservedKB = SUM(iReservedKB)
     , iDataKB = SUM(iDataKB), iIndexKB = SUM(iIndexKB), iUnusedKB = SUM(iUnusedKB)
FROM #tTables
SELECT * FROM #tTables ORDER BY iReservedKB DESC
-- CLEAN UP
DROP TABLE #tTables

The above code will output all the table sizes in one list, plus a single row for the totals. If needed, you can use the various system views (like sys.objects and sys.dm_db_partition_stats used in the first query above; see http://technet.microsoft.com/en-us/library/ms177862.aspx for much more detail) to drill down further, such as into the space used by each index.
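For instance, a sketch of a per-index breakdown using those same system views (run in the target database; adjust filtering to taste):

```sql
-- Space per index, largest first; heaps show up with index_id = 0.
SELECT o.name AS TableName
     , ISNULL(i.name, '(heap)') AS IndexName
     , SUM(ps.reserved_page_count) / 128.0 AS ReservedMB
     , SUM(ps.used_page_count) / 128.0 AS UsedMB
FROM sys.objects o
JOIN sys.indexes i ON i.object_id = o.object_id
JOIN sys.dm_db_partition_stats ps
  ON ps.object_id = i.object_id AND ps.index_id = i.index_id
WHERE OBJECTPROPERTYEX(o.object_id, 'IsMSShipped') = 0
GROUP BY o.name, i.name
ORDER BY SUM(ps.reserved_page_count) DESC
```

On a database like the one described, this quickly shows whether the bulk of the 7.5 TB is data pages or index pages, and which indexes are the heavy ones.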


There are three classes of unused space in a data file:

  1. Space that is not allocated to anything (this shows in the first result set from sp_spaceused when no object is specified).
  2. Space that is allocated to an object (reserved) but not currently used (this shows in the "unused" count in sp_spaceused's output).
  3. Space locked in part-used pages (this will look used, as everything is allocated in single-page chunks, one page being 8,192 bytes). This is harder to detect/calculate, and is due to a mix of two factors:
    • Split pages. As data gets added you often end up with part-empty pages (the storage engine could always normalise page contents, but this would be very inefficient), and as rows are deleted page contents are not automatically packed (again they could be, but the extra I/O load is generally far from worth it).
    • The storage engine won't split a row over multiple pages (this, together with the page size, is where the row-size limit comes from). If your rows are fixed size and take 1,100 bytes each, then you are going to "waste" at least 492 bytes of each data page allocated to that table (7 rows take 7,700 bytes and an 8th won't fit, so the remaining bytes go unused). The wider the rows, the worse this may be. Tables/indexes with variable-length rows (which are far more common than completely fixed-length ones) generally fare better, but the wasted space is harder to calculate for them.
      Another caveat here is large objects (TEXT columns, [N]VARCHAR(MAX) values above a certain size, and so on): they are stored off-page, with just a small pointer in the main row data pointing to the data held elsewhere, so they can exceed the normal row-size limit.
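One way to measure the third class (part-used pages) is sys.dm_db_index_physical_stats; a sketch, assuming you run it in the target database (use 'SAMPLED' or 'DETAILED' mode, as 'LIMITED' leaves the page-density column NULL):

```sql
-- Average page fullness per index; a low avg_page_space_used_in_percent
-- means many part-empty pages (class 3 above).
SELECT OBJECT_NAME(ips.object_id) AS TableName
     , ips.index_id
     , ips.avg_page_space_used_in_percent
     , ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') ips
WHERE ips.page_count > 0
ORDER BY ips.avg_page_space_used_in_percent ASC
```

Note this scans or samples the pages, so on a multi-terabyte database it can take a while and generate significant I/O; run it off-peak.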

tl;dr: Estimating expected database sizes can be a lot more involved than it is natural to initially assume.