Sql-server – Using CROSS APPLY OPENJSON causes Azure to hang

azure-sql-databasecross-applyperformancesql-server-2017

I have a table with around 8 million rows with a schema of:

CREATE TABLE [dbo].[Documents](
    [Id] [uniqueidentifier] NOT NULL,
    [RemoteId] [int] NOT NULL,
    [Json] [nvarchar](max) NULL,
    [WasSuccessful] [bit] NOT NULL,
    [StatusCode] [int] NULL,
    [Created] [datetime2](7) NULL,
 CONSTRAINT [PK_Documents] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[Documents] ADD  CONSTRAINT [DF_Documents_Id]  DEFAULT (newsequentialid()) FOR [Id]
GO

ALTER TABLE [dbo].[Documents] ADD  CONSTRAINT [DF_Documents_Created]  DEFAULT (getdate()) FOR [Created]
GO

The json documents are of a structure of

{
    "Id": 1,
    "Data": [
        {
            "Id": 99,
            "Name": "Person 1"
        },
        {
            "Id": 100,
            "Name": "Person 2"
        }
    ]
}

I am attempting to extract the JSON array from within the document and insert it into a new table using the following query:

;WITH CTE (Json) AS 
(
    SELECT TOP 10
        JSON_QUERY([Json], '$.Data')
    FROM 
        Documents
    WHERE 
        ISJSON([Json]) > 0
)
INSERT INTO [dbo].[ParsedDocuments] (Id, Name)
SELECT
    JSON_VALUE([Value], '$.Id') AS [Id],
    JSON_VALUE([Value], '$.Name') AS [Name],
FROM 
    CTE
CROSS APPLY 
    OPENJSON([Json]) as X

What I'm finding is that if I extract and query a sample of the data, say 1000 rows. Everything works as expected and the data is inserted into the destination table. However, when I query the main table the server appears to just hang and become unresponsive.

I suspect that it is attempting to run a cross apply across all the rows before inserting data. Is there any way to improve performance? Or allow the job to begin "streaming" results into the destination table rather than attempting to batch them up?

Finally, as you can see I am employing the use of "TOP 10" results. I am however still experiencing the performance hanging. And I am unsure as to why.

Query plans for large table are never generated how when running against a sample of the data the query plan can be found here.

Best Answer

The fact that you cannot obtain even an estimated execution plan for the query against the large table suggests that compilation is waiting on statistics creation (or update) for the nvarchar(max) column.

When running against the smaller sample table, statistics creation completes quickly, so an execution plan is quick to create.

I can reproduce your situation on a low-powered (S0) Azure SQL Database. On a slightly less anaemic S3 instance, the compilation still takes around 30 seconds due to the large amount of I/O involved in reading the nvarchar(max) column to generate statistics. The premium tier is needed to get good I/O performance - even a cold P1 can get these statistics created in 3 seconds or so.

You can confirm the cause by temporarily disabling automatic statistics creation:

ALTER DATABASE CURRENT SET AUTO_CREATE_STATISTICS OFF;

The SQL server will then produce a plan quickly, albeit with warnings about missing statistics:

plan with no stats

Be sure to re-enable this feature after testing because good statistics are vital to general execution plan quality. Or commit to creating necessary statistics manually.

If the issue is instead automatic refresh of stale statistics, you could choose to have these updated asynchronously:

ALTER DATABASE CURRENT SET AUTO_UPDATE_STATISTICS_ASYNC ON;

Alternatively, scale your Azure SQL Database up to a performance level that suits your workload and budget.