Slow JOIN on tables with millions of rows

azure-sql-database, performance, query-performance

In my application I have to join tables with millions of rows. I have a query like this:

SELECT DISTINCT "f"."id" AS "FileId"
    , "f"."name" AS "FileName"
    , "f"."year" AS "FileYear"
    , "vt"."value" AS "value"
FROM "files" "f"
JOIN "clients" "cl" ON ("f"."cid" = "cl"."id" AND "cl"."id" = 10)
LEFT JOIN "value_text" "vt" ON ("f"."id" = "vt"."id_file" AND "vt"."id_field" = 65739)
GROUP BY "f"."id", "f"."name", "f"."year", "vt"."value"

The table "files" has 10 million rows, and the table "value_text" has 40 million rows.

This query is too slow: it takes between 40 seconds (15,000 results) and 3 minutes (65,000 results) to execute.

I thought about splitting this into two queries, but I can't, because sometimes I need to order by the joined column (value)…

What can I do? I use SQL Server on Azure; specifically, Azure SQL Database on the "PRS1 PremiumRS (125 DTUs)" pricing tier.

The query returns a lot of data, but I don't think the internet connection is the bottleneck, because other queries that also return a lot of data are faster.
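
To check whether the database itself is the bottleneck rather than the network, I can look at recent DTU consumption using the standard Azure SQL Database DMV (a minimal check):

SELECT TOP (20)
    end_time,
    avg_cpu_percent,
    avg_data_io_percent,
    avg_log_write_percent
FROM sys.dm_db_resource_stats   -- one row per ~15-second interval
ORDER BY end_time DESC;

High avg_cpu_percent or avg_data_io_percent during the slow query points at the server, not the connection.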

I've also tried using the clients table as a subquery and removing DISTINCT, with the same results.
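
The subquery variant looked roughly like this (a sketch; I'm reconstructing it, so the details may have differed slightly):

-- Approximate reconstruction of the "clients as a subquery" attempt:
SELECT "f"."id" AS "FileId"
    , "f"."name" AS "FileName"
    , "f"."year" AS "FileYear"
    , "vt"."value" AS "value"
FROM "files" "f"
JOIN (SELECT "id" FROM "clients" WHERE "id" = 10) "cl"
    ON "f"."cid" = "cl"."id"
LEFT JOIN "value_text" "vt"
    ON ("f"."id" = "vt"."id_file" AND "vt"."id_field" = 65739)
GROUP BY "f"."id", "f"."name", "f"."year", "vt"."value"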

The clients table has 1,428 rows.

Additional info

clients table:

CREATE TABLE [dbo].[clients](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [code] [nvarchar](70) NOT NULL,
    [password] [nchar](40) NOT NULL,
    [name] [nvarchar](150) NOT NULL DEFAULT (N''),
    [email] [nvarchar](255) NULL DEFAULT (NULL),
    [entity] [int] NOT NULL DEFAULT ((0)),
    [users] [int] NOT NULL DEFAULT ((0)),
    [status] [varchar](8) NOT NULL DEFAULT ('inactive'),
    [created] [datetime2](7) NULL DEFAULT (getdate()),
    [activated] [datetime2](7) NULL DEFAULT (getdate()),
    [client_type] [varchar](10) NOT NULL DEFAULT ('normal'),
    [current_size] [bigint] NOT NULL DEFAULT ((0)),
CONSTRAINT [PK_clients_id] PRIMARY KEY CLUSTERED 
(
    [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON),
CONSTRAINT [clients$code] UNIQUE NONCLUSTERED 
(
    [code] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

files table:

CREATE TABLE [dbo].[files](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [cid] [bigint] NOT NULL DEFAULT ((0)),
    [eid] [bigint] NOT NULL DEFAULT ((0)),
    [year] [bigint] NOT NULL DEFAULT ((0)),
    [name] [nvarchar](255) NOT NULL DEFAULT (N''),
    [extension] [int] NOT NULL DEFAULT ((0)),
    [size] [bigint] NOT NULL DEFAULT ((0)),
    [id_doc] [bigint] NOT NULL DEFAULT ((0)),
    [created] [datetime2](7) NULL DEFAULT (getdate()),
CONSTRAINT [PK_files_id] PRIMARY KEY CLUSTERED 
(
    [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON),
 CONSTRAINT [files$estructure_unique] UNIQUE NONCLUSTERED 
(
    [year] ASC,
    [name] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

GO

ALTER TABLE [dbo].[files]  WITH NOCHECK ADD  CONSTRAINT [FK_files_client] FOREIGN KEY([cid])
REFERENCES [dbo].[clients] ([id])
ON UPDATE CASCADE
ON DELETE CASCADE
GO

ALTER TABLE [dbo].[files] CHECK CONSTRAINT [FK_files_client]
GO

value_text table:

CREATE TABLE [dbo].[value_text](
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [id_file] [bigint] NOT NULL DEFAULT ((0)),
    [id_field] [bigint] NOT NULL DEFAULT ((0)),
    [value] [nvarchar](255) NULL DEFAULT (NULL),
    [id_doc] [bigint] NULL DEFAULT (NULL),
 CONSTRAINT [PK_value_text_id] PRIMARY KEY CLUSTERED 
(
    [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

GO

ALTER TABLE [dbo].[value_text]  WITH NOCHECK ADD  CONSTRAINT [FK_valuesT_field] FOREIGN KEY([id_field])
REFERENCES [dbo].[fields] ([id])
ON UPDATE CASCADE
ON DELETE CASCADE
GO

ALTER TABLE [dbo].[value_text] CHECK CONSTRAINT [FK_valuesT_field]
GO

Execution plan:

[Execution plan image]

* I translated the table and field names in this question for general understanding. In this image, "archivos" is the equivalent of "files", "clientes" of "clients", and "valores_texto" of "value_text".

Execution plan without DISTINCT:

[Execution plan image]

Execution plan without DISTINCT and GROUP BY (query a little faster):

[Execution plan image]

Query test (Krismorte's answer)

This is the execution plan of the query that is slower than before. Here, the query returns over 400,000 rows, and even when I paginate the results, nothing changes.

More detailed execution plan: https://www.brentozar.com/pastetheplan/?id=By_UC2aBG

[Slow query execution plan image]

And this is the execution plan of the query that is faster than before. Here, the query returns over 65,000 rows.

More detailed execution plan: https://www.brentozar.com/pastetheplan/?id=r116e6pSM

[Fast query execution plan image]

Best Answer

I think you need this index (as Krismorte suggested):

CREATE NONCLUSTERED INDEX [IX dbo.value_text id_file, id_field, value]
ON dbo.value_text (id_file, id_field, [value]);

The following index is probably not required as you appear to have a suitable existing index (not mentioned in the question) but I include it for completeness:

CREATE NONCLUSTERED INDEX [IX dbo.files cid (id, year, name)]
ON dbo.files (cid)
INCLUDE
(
    id,
    [year],
    [name]
);

Express the query as:

SELECT
    FileId = F.id,
    [FileName] = F.[name],
    FileYear = F.[year],
    V.[value]
FROM dbo.files AS F
JOIN dbo.clients AS C
    ON C.id = F.cid
OUTER APPLY
(
        SELECT DISTINCT
            VT.[value]
        FROM dbo.value_text AS VT
        WHERE
            VT.id_file = F.id
            AND VT.id_field = 65739
) AS V
WHERE 
    C.id = 10
OPTION (RECOMPILE);

This should give an execution plan like:

[Expected execution plan image]

The OPTION (RECOMPILE) hint is optional. Only add it if you find the ideal plan shape is different for different parameter values. There are other possible solutions to such "parameter-sniffing" issues.

With the new index, you may also find the original query text produces a very similar plan, also with good performance.
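
Since you sometimes need to order by the joined value column, the rewritten form supports that directly, including with paging; a sketch (the OFFSET/FETCH values are placeholders only):

SELECT
    FileId = F.id,
    [FileName] = F.[name],
    FileYear = F.[year],
    V.[value]
FROM dbo.files AS F
JOIN dbo.clients AS C
    ON C.id = F.cid
OUTER APPLY
(
    SELECT DISTINCT VT.[value]
    FROM dbo.value_text AS VT
    WHERE VT.id_file = F.id
        AND VT.id_field = 65739
) AS V
WHERE C.id = 10
ORDER BY V.[value], F.id
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY;  -- paging values are examples only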

You may also need to update the statistics on the files table, since the estimate in the supplied plan for cid = 19 is not accurate:

[Inaccurate cardinality estimate image]
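
A minimal statistics refresh looks like this (FULLSCAN is my suggestion here; the default sampling rate may well be enough):

-- Refresh all statistics on dbo.files from a full scan of the table.
UPDATE STATISTICS dbo.files WITH FULLSCAN;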


After updating the statistics on the files table, the query runs very fast in all cases. If I add more fields to the "files" table in the future, should I update the index or something?

If you add more columns to the files table (and use/return them in your query), you will need to add them to the index (minimally, as included columns) to keep the index "covering". Otherwise, the optimizer may choose to scan the files table rather than look up the columns not present in the index. You might also choose to make cid part of the clustered index key on that table instead. It depends. Ask a new question if you want clarification on these points.
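
For example, if a column such as [size] later ended up in the query ([size] is just a hypothetical choice here), the covering index could be rebuilt in place:

-- Hypothetical example: add a newly-referenced column to the covering index.
-- DROP_EXISTING rebuilds the existing index in a single statement.
CREATE NONCLUSTERED INDEX [IX dbo.files cid (id, year, name)]
ON dbo.files (cid)
INCLUDE
(
    id,
    [year],
    [name],
    [size]  -- example of a newly-used column
)
WITH (DROP_EXISTING = ON);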