Sql-server – Why Azure SQL Server table used space is too high

azure-sql-databasesql server

I have two tables in my SQLServer database (azure managed), and the structure of two tables as below (for easy reference renamed them)

dbo.TABLE1

CREATE TABLE [dbo].[TABLE1] (
  [ID]             VARCHAR (50)  NOT NULL,  -- PK index
  [S_NUMBER]       VARCHAR (50)  NOT NULL,
  [MAKE]           VARCHAR (12)  NULL,
  [MODEL]          VARCHAR (40)  NULL,
  [C_ID]          VARCHAR (50)  NOT NULL,
  [C_CODE]         REAL          NULL,
  [C_DESCRIPTION]  VARCHAR (50)  NULL,
  [S_CODE]         REAL          NULL,
  [S_DESCRIPTION]  VARCHAR (50)  NULL,
  [LS_TIME]       DATETIME2 (7) NOT NULL,
  [LE_TIME]        DATETIME2 (7) NOT NULL,
  [YEAR]           DATE          NULL,
  [MONTH]          DATE          NULL,
  [WEEK]           DATE          NULL,
  [DAY]            DATE          NULL,
  [HOUR]           DATETIME2 (7) NULL,
  [WEEK_DAY]       VARCHAR (12)  NULL,
  [F_USED]         REAL          NULL,
  [F_USED_UOM]     VARCHAR (12)  NULL,
  [DIST_TRAVE]     REAL          NULL,
  [DIST_TRAV_UOM]  VARCHAR (12)  NULL,
  [DURATION]      REAL          NULL,
  [DURATION_UOM]   VARCHAR (12)  NULL,
  [WEIGHT_MOVED]   REAL          NULL,
  [WEIGHT_STORED]  REAL          NULL,
  [WEIGHT_UOM]     VARCHAR (12)  NULL,
  [CREATED_TIME]   DATETIME      NULL,
  [MODIFIED_TIME]  DATETIME      NULL,
  [OPER_ID]        VARCHAR (12)  NULL,
  [OPER_TYPE]      VARCHAR (12)  NULL,
  [START_TIME]     DATETIME2 (7) NULL,
  [END_TIME]       DATETIME2 (7) NULL,
  [UTC_DAY]        DATE          NULL,
  [UTC_WEEK]       DATE          NULL,
  [UTC_MONTH]      DATE          NULL,
  [UTC_YEAR]       SMALLINT      NULL,
  [UTC_HOUR]       TINYINT       NULL,
  [LOCAL_HOUR]     TINYINT       NULL,
  [EFFICIENCY]     DECIMAL (5)   NULL,
  [JS_ID]          BIGINT        NOT NULL,
  [L_ZONE]         NVARCHAR (50) NULL,
  [D_ZONE]         NVARCHAR (50) NULL,
  [SHIFT_TYPE]     NVARCHAR (12) NULL,
  [SHIFT_DATE]     DATE          NULL,
  [D_CATEGORY]     NVARCHAR (50) NULL,
  [REC_CATEGORY]   NVARCHAR (12) NULL,
  [LSERIAL_NUMBER] VARCHAR (50)  NULL,
  [GPS_TIME]       DATETIME      NULL,
  [GPS_STATUS]     VARCHAR (12)  NULL,
  [A_ID]           VARCHAR (50)  NULL,
  [E_ID]           VARCHAR (50)  NULL,
  [TAG_GUID]       VARCHAR (64)  NULL,
  [LZ_ID]          VARCHAR (64)  NULL,
  [DZ_ID]          VARCHAR (64)  NULL,
  [LZ_VERSION]     INT           NULL,
  [DZ_VERSION]     INT           NULL,
  [Z_ID]           BIGINT        NULL,
  [SHIFT_ID]       NVARCHAR (64) NULL,
  [LI_TIME]        DATETIME2 (3) CONSTRAINT [S_CREATED_TIMESTAMP] DEFAULT (sysdatetime()) NULL,
  [PC_ID] INT NOT NULL DEFAULT 0,
  [L_MAKE]               VARCHAR (32)    NULL, 
  [TV_MOVED]        DECIMAL (18, 2) NULL
);

dbo.TABLE2

CREATE TABLE [dbo].[TABLE2]( 
  [ID] NVARCHAR(32) NOT NULL,  -- PK index
  [C_ID] NVARCHAR(64),
  [DURATION] DECIMAL(18,2),
  [P_COUNT] BIGINT,
  [LS_TIME] DATETIME,
  [LE_TIME] DATETIME,
  [S_AGG_ID] NVARCHAR(32)   -- FK index
)

dbo.TABLE2 is subset of dbo.TABLE1 with only 5 columns, and the row count is same. We capture only 5 columns from dbo.TABLE1 and insert in dbo.TABLE2. But the size(space used) of dbo.TABLE2 is almost 3 times the dbo.TABLE1.

Indexes: Apart from PK, FK indexes we have one additional index on dbo.TABLE1 with include columns [JS_ID, MAKE, S_NUMBER, TAG_GUID, LE_TIME]

Why the space used is too high for smaller table even though row count is same as its source table?

Best Answer

Using bigger data types

These columns in dbo.Table1 are varchar(50):

[ID]
[C_ID]

In dbo.Table2, they are nvarchar(32) and nvarchar(64).

The added "n" makes each character take up twice as much space as the varchar version. It's possible the same thing is going on with the FK column and P_COUNT, but I don't see those defined on the original table, so I can't confirm.

There are a couple of datetime2(7) columns changed to datetime, but these have the same storage cost, so that shouldn't make a difference.

If you get your data types back to what they were in the original table, you should see a pretty big reduction in space used. Note that you will need to REBUILD the clustered index (or just drop and reload the table) after making the data types changes.

Fragmentation?

It's possible that fragmentation is contributing to the additional size of dbo.Table2. If the size is still significantly different after fixing the data types and recreating the table, you could try loading the table with ORDER BY [ID] OPTION (MAXDOP 1) to limit fragmentation in the target table.

Related Solutions

Sql-server – How to track why the SQL Azure occupied space pulsates

I'd have you look at tables and indexes with the following code:

SELECT sys.objects.name, SUM(reserved_page_count) * 8.0 / 1024 as [SizeinMB]
FROM sys.dm_db_partition_stats, sys.objects
WHERE sys.dm_db_partition_stats.object_id = sys.objects.object_id
GROUP BY sys.objects.name
ORDER BY [SizeinMB] DESC

You will want to check the size of the objects at regular intervals to see if you can determine which ones are in flux.

Database Design for Handling 1 Billion Rows in SQL Server

5000 inserts per minute are about 83 inserts per second. With 5 indexes that's 400 physical rows inserted per second. If the workload was in-memory this would not pose a problem even to the smallest of servers. Even if this was a row-by-row insert using the most inefficient way I can think of. 83 trivial queries per second are just not interesting from a CPU standpoint.

Probably, you are disk-bound. You can verify this by looking at wait stats or STATISTICS IO.

Your queries probably touch a lot of different pages so that the buffer pool does not have space for all of them. This causes frequent page reads and probably random disk writes as well.

Imagine a table where you only physically insert at the end because of an ever-increasing key. The working set would be one page: the last one. This would generate sequential IO as well wen the lazy writer or checkpoint process writes the "end" of the table to disk.

Imagine a table with randomly-placed inserts (classic example: a guid key). Here, all pages are the working set because a random page will be touched for each insert. IOs are random. This is the worst case when it comes to working set.

You're in the middle. Your indexes are of the structure (SomeValue, SequentialDateTime). The first component partially randomizes the sequentiality provided by the second. I guess there are quite a few possible values for "SomeValue" so that you have many randomly-placed insert-points in your indexes.

You say that data is split into 10GB tables per week. That's a good starting point because the working set is now bounded by 10GB (disregarding any reads you might do). With 12GB of server memory it is unlikely, though, that all relevant pages can stay in memory.

If you could reduce the size of the weekly "partitions" or increase server memory by a bit you are probably fine.

I'd expect that inserts at the beginning of the week are faster then at the end. You can test this theory on a dev server by running a benchmark with a certain data size and gradually reducing server memory until you see performance tank.

Now even if all reads and writes fit into memory you might still have random dirty page flushing IO. The only way to get rid of that is to write into co-located positions in your indexes. If you can at all convert your indexes to use (more) sequential keys that would help a lot.

As a quick solution I'd add a buffering layer between the clients and the main table. Maybe accumulate 15min of writes into a staging table and periodically flush it. That takes away the load spikes and uses a more efficient plan to write to the big table.