Sql-server – FORMAT returns large row size and data size

sql-server sql-server-2012

I am surprised by one of my findings: using FORMAT() has a very big impact on the row size and data size. The data size is almost 250x what it is without FORMAT().

My questions are:
1) Why does using FORMAT() have such a big impact on the size? To me, it is just the difference between 1.23 and $1.23, which is probably one character. And does such a huge size matter?

2) Why are we still encouraged to use FORMAT() in SQL Server instead of string concatenation as below? The query below has only 2x the data size, while FORMAT() returns 250x the data size. Or is data size not a critical measurement?

SELECT '$' + CONVERT(varchar(10), UnitPrice) FROM Sales.SalesOrderDetail;

3) Does a data size of 464MB mean I will be returning 464MB of data to the client?

========================================================

Below are my findings with the AdventureWorks2012 database.

SELECT UnitPrice FROM Sales.SalesOrderDetail;

Actual Number of Rows: 121317
Estimated Number of Rows: 121317
Estimated Row Size: 15B
Estimated Data Size: 1777KB

SELECT '$' + CONVERT(varchar (10), UnitPrice) FROM Sales.SalesOrderDetail;

Actual Number of Rows: 121317
Estimated Row Size: 26B
Estimated Data Size: 3060KB

SELECT FORMAT(UnitPrice, 'c') FROM Sales.SalesOrderDetail;

Actual Number of Rows: 121317
Estimated Row Size: 4011B
Estimated Data Size: 464MB

Best Answer

FORMAT() has an (admittedly undocumented) output of nvarchar(4000), at least in the cases of converting ints and dates to strings. The documentation simply says...

The length of the return value is determined by the format.

But then doesn't explain or provide any examples. You can see what I'm describing, though, with:

SELECT TOP (1) object_id, x = FORMAT(object_id, 'en-us') 
  INTO #blat FROM sys.all_objects;

EXEC tempdb.sys.sp_help N'#blat';

Result is that x is an nvarchar with a length of 8,000 (this is the number of bytes, not the number of characters).
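The same check can be made without creating a temp table. This is a minimal sketch, assuming sys.dm_exec_describe_first_result_set (available from SQL Server 2012) reports the same type metadata that sp_help shows; the alias x and the literal 1.23 are only for illustration:

```sql
-- Ask SQL Server to describe the result set of a FORMAT() call
-- without executing it or materializing any table.
SELECT name,
       system_type_name,   -- expected to show nvarchar(4000)
       max_length          -- reported in bytes, so expected to be 8000
FROM sys.dm_exec_describe_first_result_set(
         N'SELECT x = FORMAT(1.23, ''c'');', NULL, 0);
```

If the declared type is the same as in the temp table test, max_length should again be the byte count rather than the character count.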

Estimated row size is based on an assumption that variable width values will be half-populated. So, it expects 2,000 characters (4,000 bytes) on each row (even if the particular parameters you supply can't possibly result in that many characters). I demonstrate this (but not with FORMAT() specifically) in another answer, "Would using varchar(5000) be bad compared to varchar(255)?"
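The numbers in the question line up with this. The quick check below assumes that the plan's Estimated Data Size is simply the estimated row count multiplied by the estimated row size, and that the 11 bytes above 4,000 in the FORMAT() plan are per-row overhead; both are assumptions on my part rather than documented behavior:

```sql
-- Row count and row sizes are taken from the plans in the question.
SELECT
    unitprice_kb = 121317 * 15   / 1024.0,      -- roughly 1777 KB
    format_mb    = 121317 * 4011 / 1048576.0;   -- roughly 464 MB
```

Both results agree with the figures the plans reported, which suggests the 464MB is an estimate derived from the declared nvarchar(4000) type, not a measurement of the strings actually produced.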

This is one reason I prefer to use CONVERT() and TRY_CONVERT() equivalents instead of FORMAT(), in spite of its syntactic sugar. At least with those you can convert to a defined width instead of relying on it "being determined by the format." This may or may not help estimated size, depending on the query. Another example that demonstrates the benefit here (even though it requires uglier code):

DECLARE @m float = 32.74532323;

SELECT 
    a = @m, 
    b = FORMAT(@m, 'c'), 
    c = '$' + CONVERT(varchar(12), CONVERT(decimal(8,2),@m))
 INTO #splunge 
 FROM sys.all_objects;

EXEC tempdb.sys.sp_help N'#splunge';

Results:

a    float
b    nvarchar(4000)
c    varchar(13)

Another reason I prefer to use CONVERT() and TRY_CONVERT() is that FORMAT() sucks from a performance perspective (see "FORMAT() is nice and all, but…").

Also please don't ever use variable-width types like varchar without also specifying a length.
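A short illustration of why the missing length matters. The variable name and values are made up for the example; the point is that the default length depends on context, and truncation is silent in both cases:

```sql
-- In a declaration, varchar with no length means varchar(1).
DECLARE @v varchar = 'FORMAT';

SELECT
    declared  = @v,                                         -- only the first character survives
    converted = LEN(CONVERT(varchar, REPLICATE('x', 40)));  -- CONVERT defaults to varchar(30)
```

Neither statement raises an error or a warning, which is what makes the omission easy to miss.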

Related Solutions

Sql-server – Adding around 200 rows in a table grows size of the table by 400kb where avg row size is 0.2KB

Check whether you have indexes (clustered/nonclustered, full-text). Use sp_spaceused 'your_table_name' to see whether index space can be ruled out as the cause.
Check what type of table is used. In your example of 0.2KB = 205 bytes, you will have 38 rows per data page if your table is a heap and 39 rows per data page if it is a clustered table.

See the example below:

IF EXISTS (SELECT * FROM sys.tables
            WHERE name = 'sparse_pages')
    DROP TABLE sparse_pages;
GO
CREATE TABLE sparse_pages 
(
KeyField SMALLINT --IDENTITY (1,1) PRIMARY KEY
, Filler VARCHAR(8000) null 
)
GO

SET NOCOUNT ON
INSERT INTO sparse_pages( Filler) values ( REPLICATE('a', 192))
GO 39

-- Average row size now 205 Bytes
WAITFOR DELAY '00:00:03';
GO
SELECT 'This is Heap. Note Data Space'
GO
-- Check table size
sp_spaceused 'sparse_pages'
GO


IF EXISTS (SELECT * FROM sys.tables
            WHERE name = 'sparse_pages')
    DROP TABLE sparse_pages;
GO
CREATE TABLE sparse_pages 
(
KeyField SMALLINT IDENTITY (1,1) PRIMARY KEY
, Filler VARCHAR(8000) null 
)
GO

SET NOCOUNT ON
INSERT INTO sparse_pages( Filler) values ( REPLICATE('a', 192))
GO 39

-- Average row size now 205 Bytes
WAITFOR DELAY '00:00:03';
GO
SELECT 'This is Clustered Index. Note Data Space'
GO
-- Check table size
sp_spaceused 'sparse_pages'
GO
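To see the page and row figures directly instead of inferring them from sp_spaceused, something like the following can be run after either version of the script. It is a sketch that assumes the sparse_pages table from the example exists in the current database; DETAILED mode is used because the record-level columns are not populated in the default LIMITED mode:

```sql
-- Leaf-level page count, row count, and average record size for sparse_pages.
SELECT index_type_desc,                 -- HEAP or CLUSTERED INDEX
       page_count,
       record_count,
       avg_record_size_in_bytes,
       avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'sparse_pages'), NULL, NULL, 'DETAILED')
WHERE index_level = 0;                  -- leaf level only
GO
```

Dividing record_count by page_count gives the rows per page for the heap and clustered cases.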

Check for random inserts/updates/deletes in your table. These may be an issue because the freed space is not reclaimed. Space may be wasted during page splits too.

See the example below, with an average row size of 205 bytes and 200 rows (just like in your case). The table data size is 1.57 MB:

IF EXISTS (SELECT * FROM sys.tables
            WHERE name = 'sparse_pages')
    DROP TABLE sparse_pages;
GO
CREATE TABLE sparse_pages 
(
KeyField SMALLINT IDENTITY (1,1) PRIMARY KEY
, Filler VARCHAR(8000) null 
)
GO

Enter the data

SET NOCOUNT ON
INSERT INTO sparse_pages( Filler) values ( REPLICATE('a', 8700))
INSERT INTO sparse_pages( Filler) values ( REPLICATE('a', 192))
GO 200

DELETE FROM sparse_pages 
WHERE LEN(Filler)>300
GO
-- Average row size now 205 Bytes
WAITFOR DELAY '00:00:03';
GO
-- Check table size
sp_spaceused 'sparse_pages'
GO
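If sparsely filled pages like these turn out to be the cause, rebuilding the index is one way to compact them. This is a sketch against the sparse_pages example only, assuming a rebuild is acceptable for the real table (it takes locks and uses log space, so that is not a given):

```sql
-- Rebuild all indexes on the table, which rewrites the clustered index
-- and packs the remaining rows onto as few pages as the fill factor allows.
ALTER INDEX ALL ON sparse_pages REBUILD;
GO
-- Check table size again; the data figure should drop sharply.
EXEC sp_spaceused 'sparse_pages';
GO
```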

Sql-server – Merge row size overflow in SQL Server – “Cannot create a row of size..”

Why did the second attempt to merge the same row, which had already been inserted, result in an error? If this row exceeded the maximum row size, I would expect it not to be possible to insert it in the first place.

First, thank you for the reproduction script.

The problem is not that SQL Server cannot insert or update a particular user-visible row. As you noted, a row that has already been inserted to a table certainly cannot be fundamentally too large for SQL Server to handle.

The problem occurs because the SQL Server MERGE implementation adds computed information (as extra columns) during intermediate steps in the execution plan. This extra information is needed for technical reasons, to keep track of whether each row should result in an insert, update, or delete; and also related to the way SQL Server generically avoids transient key violations during changes to indexes.

The SQL Server Storage Engine requires indexes to be unique (internally, including any hidden uniquifier) at all times - as each row is processed - rather than at the start and end of the complete transaction. In more complex MERGE scenarios, this requires a Split (converting an update to a separate delete and insert), Sort, and an optional Collapse (turning adjacent inserts and updates on the same key into an update).

As an aside, note that the issue does not occur if the target table is a heap (drop the clustered index to see this). I am not recommending this as a fix, just mentioning it to highlight the connection between maintaining index uniqueness at all times (clustered in the present case), and the Split-Sort-Collapse.

In simple MERGE queries, with suitable unique indexes, and a straightforward relationship between source and target rows (typically matching using an ON clause that features all key columns), the query optimizer can simplify much of the generic logic away, resulting in comparatively simple plans that do not require a Split-Sort-Collapse, or Segment-Sequence Project to check that target rows are only touched once.

In complex MERGE queries, with more opaque logic, the optimizer is usually unable to apply these simplifications, exposing much more of the fundamentally complex logic required for correct processing (product bugs notwithstanding, and there have been plenty).

Your query certainly qualifies as complex. The ON clause does not match the index keys (and I understand why), and the 'source table' is a self-join involving a ranking window function (again, with reasons):

MERGE MERGE_REPRO_TARGET AS targetTable
USING
(
    SELECT * FROM 
    (
        SELECT 
            *, 
            ROW_NUMBER() OVER (
                PARTITION BY ww,id, tenant 
                ORDER BY 
                (
                    SELECT COUNT(1) 
                    FROM MERGE_REPRO_SOURCE AS targetTable
                    WHERE 
                        targetTable.[ibi_bulk_id] = sourceTable.[ibi_bulk_id] 
                        AND targetTable.[ibi_row_id] <> sourceTable.[ibi_row_id] 
                        AND 
                        (
                            (targetTable.[ww] = sourceTable.[ww]) 
                            AND (targetTable.[id] = sourceTable.[id]) 
                            AND (targetTable.[tenant] = sourceTable.[tenant])
                        ) 
                        AND NOT ((targetTable.[sampletime] <= sourceTable.[sampletime]))
                ),
                sourceTable.ibi_row_id DESC
            ) AS idx
        FROM MERGE_REPRO_SOURCE sourceTable 
        WHERE [ibi_bulk_id] in (20150803110418887)
    ) AS bulkData
    where idx = 1
) AS sourceTable 
ON 
    (targetTable.[ww] = sourceTable.[ww]) 
    AND (targetTable.[id] = sourceTable.[id]) 
    AND (targetTable.[tenant] = sourceTable.[tenant])
...

This results in many extra computed columns, primarily associated with the Split and the data needed when an update is converted to an insert/update pair. These extra columns result in an intermediate row exceeding the allowed 8060 bytes at an earlier Sort - the one just after a Filter:

[Image: The Problem Sort]

Note that the Filter has 1,319 columns (expressions and base columns) in its Output List. Attaching a debugger shows the call stack at the point the fatal exception is raised:

Note in passing that the problem is not at the Spool - the exception there is converted to a warning about the potential for a row to be too large.

Why does an update using MERGE not succeed, while an insert does, and a direct update also does?

A direct update does not have the same internal complexity as the MERGE. It is a fundamentally simpler operation that tends to simplify and optimize better. Removing the NOT MATCHED clause may also remove enough of the complexity such that the error is not generated in some cases. That does not happen with the repro, however.

Ultimately, my advice is to avoid MERGE for larger or more complex tasks. My experience is that separate insert/update/delete statements tend to optimize better, are simpler to understand, and also often perform better overall, compared with MERGE.
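As a rough illustration of what the separate statements could look like for this case: the WHEN clauses of the original MERGE are not shown, so the column list and update logic below are placeholders, and #source_rows is a hypothetical temp table assumed to already hold the de-duplicated source rows (the ones with idx = 1). Only columns named in the question are used.

```sql
-- Sketch only: #source_rows is hypothetical and holds one row
-- per (ww, id, tenant) from MERGE_REPRO_SOURCE.
BEGIN TRANSACTION;

-- 1) Update rows that already exist in the target.
UPDATE t
SET    t.sampletime = s.sampletime          -- placeholder; repeat for the other columns
FROM   MERGE_REPRO_TARGET AS t
JOIN   #source_rows       AS s
  ON   s.ww = t.ww AND s.id = t.id AND s.tenant = t.tenant;

-- 2) Insert rows that are not in the target yet.
INSERT INTO MERGE_REPRO_TARGET (ww, id, tenant, sampletime)
SELECT s.ww, s.id, s.tenant, s.sampletime
FROM   #source_rows AS s
WHERE  NOT EXISTS
       (SELECT 1
        FROM   MERGE_REPRO_TARGET AS t
        WHERE  t.ww = s.ww AND t.id = s.id AND t.tenant = s.tenant);

COMMIT TRANSACTION;
```

The sketch leaves out any locking hints or isolation level changes, which would need thought if several sessions can write to the target at the same time.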
