In my application I have to join tables with millions of rows. I have a query like this:
SELECT DISTINCT "f"."id" AS "FileId"
, "f"."name" AS "FileName"
, "f"."year" AS "FileYear"
, "vt"."value" AS "value"
FROM files "f"
JOIN "clients" "cl" ON("f"."cid" = "cl"."id" AND "cl"."id" = 10)
LEFT JOIN "value_text" "vt" ON ("f"."id" = "vt"."id_file" AND "vt"."id_field" = 65739)
GROUP BY "f"."id", "f"."name", "f"."year", "vt"."value"
The table "files" has 10 million rows, and the table "value_text" has 40 million rows.
This query is too slow: it takes between 40 seconds (15,000 results) and 3 minutes (65,000 results) to execute.
I had thought about splitting it into two queries, but I can't, because sometimes I need to order by the joined column (value)…
What can I do? I use Azure SQL Database, pricing tier "PRS1 PremiumRS (125 DTUs)".
The query returns a lot of data, but I don't think the network connection is the bottleneck, because other queries that also return a lot of data run faster.
I've tried using the clients table as a subquery and removing DISTINCT, with the same results.
The clients table has 1,428 rows.
Additional info
clients table:
CREATE TABLE [dbo].[clients](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[code] [nvarchar](70) NOT NULL,
[password] [nchar](40) NOT NULL,
[name] [nvarchar](150) NOT NULL DEFAULT (N''),
[email] [nvarchar](255) NULL DEFAULT (NULL),
[entity] [int] NOT NULL DEFAULT ((0)),
[users] [int] NOT NULL DEFAULT ((0)),
[status] [varchar](8) NOT NULL DEFAULT ('inactive'),
[created] [datetime2](7) NULL DEFAULT (getdate()),
[activated] [datetime2](7) NULL DEFAULT (getdate()),
[client_type] [varchar](10) NOT NULL DEFAULT ('normal'),
[current_size] [bigint] NOT NULL DEFAULT ((0)),
CONSTRAINT [PK_clients_id] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON),
CONSTRAINT [clients$code] UNIQUE NONCLUSTERED
(
[code] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
files table:
CREATE TABLE [dbo].[files](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[cid] [bigint] NOT NULL DEFAULT ((0)),
[eid] [bigint] NOT NULL DEFAULT ((0)),
[year] [bigint] NOT NULL DEFAULT ((0)),
[name] [nvarchar](255) NOT NULL DEFAULT (N''),
[extension] [int] NOT NULL DEFAULT ((0)),
[size] [bigint] NOT NULL DEFAULT ((0)),
[id_doc] [bigint] NOT NULL DEFAULT ((0)),
[created] [datetime2](7) NULL DEFAULT (getdate()),
CONSTRAINT [PK_files_id] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON),
CONSTRAINT [files$estructure_unique] UNIQUE NONCLUSTERED
(
[year] ASC,
[name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
GO
ALTER TABLE [dbo].[files] WITH NOCHECK ADD CONSTRAINT [FK_files_client] FOREIGN KEY([cid])
REFERENCES [dbo].[clients] ([id])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
ALTER TABLE [dbo].[files] CHECK CONSTRAINT [FK_files_client]
GO
value_text table:
CREATE TABLE [dbo].[value_text](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[id_file] [bigint] NOT NULL DEFAULT ((0)),
[id_field] [bigint] NOT NULL DEFAULT ((0)),
[value] [nvarchar](255) NULL DEFAULT (NULL),
[id_doc] [bigint] NULL DEFAULT (NULL),
CONSTRAINT [PK_value_text_id] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
GO
ALTER TABLE [dbo].[value_text] WITH NOCHECK ADD CONSTRAINT [FK_valuesT_field] FOREIGN KEY([id_field])
REFERENCES [dbo].[fields] ([id])
ON UPDATE CASCADE
ON DELETE CASCADE
GO
ALTER TABLE [dbo].[value_text] CHECK CONSTRAINT [FK_valuesT_field]
GO
Execution plan:
*I translated the table and field names in this question for general understanding. In this image, "archivos" is the equivalent of "files", "clientes" of "clients", and "valores_texto" of "value_text".
Execution plan without DISTINCT:
Execution plan without DISTINCT and GROUP BY (query a little faster):
Query test (Krismorte's answer)
This is the execution plan of the query that is slower than before. Here the query returns over 400,000 rows, but even when paginating the results there is no change.
More detailed execution plan: https://www.brentozar.com/pastetheplan/?id=By_UC2aBG
And this is the execution plan of the query that is faster than before. Here the query returns over 65,000 rows.
More detailed execution plan: https://www.brentozar.com/pastetheplan/?id=r116e6pSM
Best Answer
I think you need this index (as Krismorte suggested):
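A sketch of such an index, assuming the seek should be on the join predicates (`id_field`, then `id_file`) with `value` covered; the index name is illustrative:

```sql
-- Seek on the filtered field and the join key; INCLUDE avoids key lookups for [value]
CREATE NONCLUSTERED INDEX IX_value_text_id_field_id_file
ON dbo.value_text (id_field, id_file)
INCLUDE ([value]);
```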
The following index is probably not required as you appear to have a suitable existing index (not mentioned in the question) but I include it for completeness:
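Along these lines, assuming `cid` as the seek key with the selected columns covered (again, the name is illustrative):

```sql
-- Covering index for the files side of the join; [name] and [year] are
-- included so the query never has to touch the clustered index
CREATE NONCLUSTERED INDEX IX_files_cid
ON dbo.files (cid)
INCLUDE ([name], [year]);
```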
Express the query as:
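A sketch of the rewrite, assuming the join to `clients` can be dropped because filtering directly on `f.cid = 10` is equivalent when every `files.cid` has a matching `clients` row (as the foreign key intends), and the redundant DISTINCT/GROUP BY can go since `f.id` is already unique per `value_text` match:

```sql
SELECT
    f.id     AS FileId,
    f.name   AS FileName,
    f.year   AS FileYear,
    vt.value AS [value]
FROM dbo.files AS f
LEFT JOIN dbo.value_text AS vt
    ON  vt.id_file  = f.id
    AND vt.id_field = 65739
WHERE f.cid = 10
OPTION (RECOMPILE);
```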
This should give an execution plan like:
The OPTION (RECOMPILE) is optional. Only add it if you find the ideal plan shape differs for different parameter values. There are other possible solutions to such "parameter-sniffing" issues. With the new index, you may also find the original query text produces a very similar plan, also with good performance.
You may also need to update the statistics on the files table, since the estimate in the supplied plan for cid = 19 is not accurate. If you add more columns to the files table (and use/return them in your query), you will need to add them to the index (minimally as included columns) to keep the index "covering". Otherwise, the optimizer may choose to scan the files table rather than look up the columns not present in the index. You might also choose to make cid part of a clustering index on that table instead. It depends. Ask a new question if you want clarification on these points.
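A minimal sketch of the statistics refresh; FULLSCAN is an assumption here, a sampled update may be enough:

```sql
-- Rebuild statistics on dbo.files so the optimizer's row estimates
-- for predicates like cid = 19 reflect the current data
UPDATE STATISTICS dbo.files WITH FULLSCAN;
```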