Slow JOIN on tables with millions of rows

azure-sql-database, performance, query-performance

In my application I have to join tables with millions of rows. I have a query like this:

SELECT DISTINCT "f"."id" AS "FileId"
    , "f"."name" AS "FileName"
    , "f"."year" AS "FileYear"
    , "vt"."value" AS "value"
FROM "files" "f"
JOIN "clients" "cl" ON ("f"."cid" = "cl"."id" AND "cl"."id" = 10)
LEFT JOIN "value_text" "vt" ON ("f"."id" = "vt"."id_file" AND "vt"."id_field" = 65739)
GROUP BY "f"."id", "f"."name", "f"."year", "vt"."value"

The table "files" has 10 million rows, and the table "value_text" has 40 million rows.

This query is too slow: it takes between 40 seconds (15,000 results) and 3 minutes (65,000 results) to execute.

I thought about splitting this into two queries, but I can't, because sometimes I need to order by the joined column (value)…

What can I do? I use SQL Server on Azure; specifically, Azure SQL Database on the "PRS1 PremiumRS (125 DTUs)" pricing tier.

The query returns a lot of data, but I don't think the internet connection is the bottleneck, because other queries that also return a lot of data are faster.
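
To check whether the database itself is the bottleneck rather than the network, I can look at recent DTU consumption using the standard Azure SQL Database DMV (a minimal check):

SELECT TOP (20)
    end_time,
    avg_cpu_percent,
    avg_data_io_percent,
    avg_log_write_percent
FROM sys.dm_db_resource_stats   -- one row per ~15-second interval
ORDER BY end_time DESC;

High avg_cpu_percent or avg_data_io_percent during the slow query points at the server, not the connection.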

I've also tried using the clients table as a subquery and removing DISTINCT, with the same results.
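
The subquery variant looked roughly like this (a sketch; I'm reconstructing it, so the details may have differed slightly):

-- Approximate reconstruction of the "clients as a subquery" attempt:
SELECT "f"."id" AS "FileId"
    , "f"."name" AS "FileName"
    , "f"."year" AS "FileYear"
    , "vt"."value" AS "value"
FROM "files" "f"
JOIN (SELECT "id" FROM "clients" WHERE "id" = 10) "cl"
    ON "f"."cid" = "cl"."id"
LEFT JOIN "value_text" "vt"
    ON ("f"."id" = "vt"."id_file" AND "vt"."id_field" = 65739)
GROUP BY "f"."id", "f"."name", "f"."year", "vt"."value"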

The clients table has 1,428 rows.

Additional info

clients table:

CREATE TABLE [dbo].[clients](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [code] [nvarchar](70) NOT NULL,
    [password] [nchar](40) NOT NULL,
    [name] [nvarchar](150) NOT NULL DEFAULT (N''),
    [email] [nvarchar](255) NULL DEFAULT (NULL),
    [entity] [int] NOT NULL DEFAULT ((0)),
    [users] [int] NOT NULL DEFAULT ((0)),
    [status] [varchar](8) NOT NULL DEFAULT ('inactive'),
    [created] [datetime2](7) NULL DEFAULT (getdate()),
    [activated] [datetime2](7) NULL DEFAULT (getdate()),
    [client_type] [varchar](10) NOT NULL DEFAULT ('normal'),
    [current_size] [bigint] NOT NULL DEFAULT ((0)),
CONSTRAINT [PK_clients_id] PRIMARY KEY CLUSTERED 
(
    [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON),
CONSTRAINT [clients$code] UNIQUE NONCLUSTERED 
(
    [code] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

files table:

CREATE TABLE [dbo].[files](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [cid] [bigint] NOT NULL DEFAULT ((0)),
    [eid] [bigint] NOT NULL DEFAULT ((0)),
    [year] [bigint] NOT NULL DEFAULT ((0)),
    [name] [nvarchar](255) NOT NULL DEFAULT (N''),
    [extension] [int] NOT NULL DEFAULT ((0)),
    [size] [bigint] NOT NULL DEFAULT ((0)),
    [id_doc] [bigint] NOT NULL DEFAULT ((0)),
    [created] [datetime2](7) NULL DEFAULT (getdate()),
CONSTRAINT [PK_files_id] PRIMARY KEY CLUSTERED 
(
    [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON),
 CONSTRAINT [files$estructure_unique] UNIQUE NONCLUSTERED 
(
    [year] ASC,
    [name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

GO

ALTER TABLE [dbo].[files]  WITH NOCHECK ADD  CONSTRAINT [FK_files_client] FOREIGN KEY([cid])
REFERENCES [dbo].[clients] ([id])
ON UPDATE CASCADE
ON DELETE CASCADE
GO

ALTER TABLE [dbo].[files] CHECK CONSTRAINT [FK_files_client]
GO

value_text table:

CREATE TABLE [dbo].[value_text](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [id_file] [bigint] NOT NULL DEFAULT ((0)),
    [id_field] [bigint] NOT NULL DEFAULT ((0)),
    [value] [nvarchar](255) NULL DEFAULT (NULL),
    [id_doc] [bigint] NULL DEFAULT (NULL),
 CONSTRAINT [PK_value_text_id] PRIMARY KEY CLUSTERED 
(
    [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

GO

ALTER TABLE [dbo].[value_text]  WITH NOCHECK ADD  CONSTRAINT [FK_valuesT_field] FOREIGN KEY([id_field])
REFERENCES [dbo].[fields] ([id])
ON UPDATE CASCADE
ON DELETE CASCADE
GO

ALTER TABLE [dbo].[value_text] CHECK CONSTRAINT [FK_valuesT_field]
GO

Execution plan:

[Execution plan image]

* I translated the table and field names in this question for general understanding. In this image, "archivos" is the equivalent of "files", "clientes" of "clients", and "valores_texto" of "value_text".

Execution plan without DISTINCT:

[Execution plan image]

Execution plan without DISTINCT and GROUP BY (query a little faster):

[Execution plan image]

Query test (Krismorte's answer)

This is the execution plan of the query that is slower than before. Here, the query returns over 400,000 rows, and even when I paginate the results, nothing changes.

More detailed execution plan: https://www.brentozar.com/pastetheplan/?id=By_UC2aBG

[Slow query execution plan image]

And this is the execution plan of the query that is faster than before. Here, the query returns over 65,000 rows.

More detailed execution plan: https://www.brentozar.com/pastetheplan/?id=r116e6pSM

[Fast query execution plan image]

Best Answer

I think you need this index (as Krismorte suggested):

CREATE NONCLUSTERED INDEX [IX dbo.value_text id_file, id_field, value]
ON dbo.value_text (id_file, id_field, [value]);

The following index is probably not required as you appear to have a suitable existing index (not mentioned in the question) but I include it for completeness:

CREATE NONCLUSTERED INDEX [IX dbo.files cid (id, year, name)]
ON dbo.files (cid)
INCLUDE
(
    id,
    [year],
    [name]
);

Express the query as:

SELECT
    FileId = F.id,
    [FileName] = F.[name],
    FileYear = F.[year],
    V.[value]
FROM dbo.files AS F
JOIN dbo.clients AS C
    ON C.id = F.cid
OUTER APPLY
(
        SELECT DISTINCT
            VT.[value]
        FROM dbo.value_text AS VT
        WHERE
            VT.id_file = F.id
            AND VT.id_field = 65739
) AS V
WHERE 
    C.id = 10
OPTION (RECOMPILE);

This should give an execution plan like:

[Expected execution plan image]

The OPTION (RECOMPILE) hint is optional. Only add it if you find the ideal plan shape is different for different parameter values. There are other possible solutions to such "parameter-sniffing" issues.

With the new index, you may also find the original query text produces a very similar plan, also with good performance.
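
Since you sometimes need to order by the joined value column, the rewritten form supports that directly, including with paging; a sketch (the OFFSET/FETCH values are placeholders only):

SELECT
    FileId = F.id,
    [FileName] = F.[name],
    FileYear = F.[year],
    V.[value]
FROM dbo.files AS F
JOIN dbo.clients AS C
    ON C.id = F.cid
OUTER APPLY
(
    SELECT DISTINCT VT.[value]
    FROM dbo.value_text AS VT
    WHERE VT.id_file = F.id
        AND VT.id_field = 65739
) AS V
WHERE C.id = 10
ORDER BY V.[value], F.id
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY;  -- paging values are examples only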

You may also need to update the statistics on the files table, since the estimate in the supplied plan for cid = 19 is not accurate:

[Inaccurate cardinality estimate image]
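
A minimal statistics refresh looks like this (FULLSCAN is my suggestion here; the default sampling rate may well be enough):

-- Refresh all statistics on dbo.files from a full scan of the table.
UPDATE STATISTICS dbo.files WITH FULLSCAN;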


After updating the statistics on the files table, the query runs very fast in all cases. If I add more fields to the "files" table in the future, should I update the index or something?

If you add more columns to the files table (and use/return them in your query), you will need to add them to the index (minimally, as included columns) to keep the index "covering". Otherwise, the optimizer may choose to scan the files table rather than look up the columns not present in the index. You might also choose to make cid part of the clustered index key on that table instead. It depends. Ask a new question if you want clarification on these points.
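
For example, if a column such as [size] later ended up in the query ([size] is just a hypothetical choice here), the covering index could be rebuilt in place:

-- Hypothetical example: add a newly-referenced column to the covering index.
-- DROP_EXISTING rebuilds the existing index in a single statement.
CREATE NONCLUSTERED INDEX [IX dbo.files cid (id, year, name)]
ON dbo.files (cid)
INCLUDE
(
    id,
    [year],
    [name],
    [size]  -- example of a newly-used column
)
WITH (DROP_EXISTING = ON);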