SQL Performance – How to Improve Sorted Join Performance

Tags: azure-sql-database, join, performance, query-performance

This feels like such a common question that I'll understand if it is closed, but if so, please suggest a better place to ask. I have the following two tables of interest:

CREATE TABLE [dbo].[Sessions]
(
    [Id] [int] PRIMARY KEY,
    [DateConnected] [datetime] NOT NULL,
    [Origin] [nvarchar](max) NULL,
    [TrackerId] [int] NULL,
    [Imei] [nvarchar](max) NULL,
    [Sim] [nvarchar](max) NULL,
    [ProtocolVersion] [tinyint] NULL
)

CREATE TABLE [dbo].[PacketTransmissions]
(
    [Id] [int] PRIMARY KEY,
    [RequestId] [int] NULL,
    [SessionId] [int] NOT NULL,
    [DateProcessed] [datetime] NOT NULL,
    [Direction] [int] NOT NULL,
    [Sequence] [int] NOT NULL,
    [Acknowledgement] [int] NOT NULL,
    [DateRecorded] [datetime] NOT NULL,
    [Version] [tinyint] NOT NULL,
    [Command] [tinyint] NOT NULL,
    [Flags] [tinyint] NOT NULL,
    [Checksum] [tinyint] NOT NULL,
    [Data] [varbinary](max) NULL
)

CREATE NONCLUSTERED INDEX [IX_TrackerId_DateConnected] ON [dbo].[Sessions]
(
    [TrackerId] ASC,
    [DateConnected] ASC
)

CREATE NONCLUSTERED INDEX [IX_SessionId_DateProcessed] ON [dbo].[PacketTransmissions]
(
    [SessionId] ASC,
    [DateProcessed] ASC
)
INCLUDE ([Direction], [Sequence], [Acknowledgement], [Command])

My most common and most expensive query (it now times out quite often) lists all packet transmissions for a particular tracker.

DECLARE @TrackerId INT = 10
DECLARE @StartDate DATETIME2 = '2018-03-10'
DECLARE @EndDate   DATETIME2 = '2018-03-12'

SELECT [PacketTransmissions].*
FROM [Sessions]
JOIN [PacketTransmissions] ON [PacketTransmissions].[SessionId] = [Sessions].[Id]
WHERE [Sessions].[TrackerId] = @TrackerId
AND [PacketTransmissions].[DateProcessed] > @StartDate
AND [PacketTransmissions].[DateProcessed] < @EndDate
ORDER BY [PacketTransmissions].[DateProcessed] DESC

This was fine at first, but now that there is a lot of data, it has slowed right down. Getting the query plan today took 2 minutes, and the plan shows a table scan rather than the index I created. Even when I force the index, it is still very slow.

In comparison, if I choose a session first, and search only for packet transmissions recorded within that session, the query uses the index and is incredibly fast.

My most successful attempt to speed up the query has been to order the results first by session id, then by date processed, to match the index order. While this is not technically always correct, it is acceptable. However, even this has started to time out, and I feel like there is something wrong with my understanding of how to make the JOIN faster.

What can I do to improve the performance of this query?

Querying with DATETIME variables instead of DATETIME2 (so that the variables match the type of the DateProcessed column) has simplified the query plan, but the query is still very slow.

Sessions has 265,929 rows
PacketTransmissions has 32,916,233 rows

That works out to about 123.8 packets per session on average.
Some of the sessions are for unregistered devices: they create a session, send between one and three packets, and then the session is rejected by the server.
I will normally be debugging a registered device, so the actual number of packets per session is considerably higher, between 300 and 5,000.
Some trackers may maintain the same session for a month at a time if they have connectivity.

I have in the past had a bad experience with changing the clustered index to use a non-sequential key. It results in a lot of out-of-order writes, and page splits, and the insert performance drops significantly.

The problem with the actual execution plans is that I don't want to run the database at max DTU for up to an hour, and potentially have inserts fail in the meantime.

Best Answer

Perhaps this is crazy, but I like to try a bit of blue-sky thinking every once in a while, so I'd consider adding the TrackerId column to the dbo.PacketTransmissions table to avoid the join completely. Obviously, this means you need to modify the row-insert procedure for the table, which may or may not be feasible.

However, this change, combined with a simple index:

CREATE INDEX IX_PacketTransmissions ON dbo.PacketTransmissions
(
    TrackerId ASC
    , DateProcessed ASC
) 
INCLUDE (Id); --not strictly required, since the clustered index key
              --(here the primary key) is carried in every non-clustered
              --index; I include it just to be explicit

creates a query plan using a run-of-the-mill index seek, combined with a key lookup for each row returned.

To test this, I created a minimal, complete, and verifiable example:

USE tempdb;

IF OBJECT_ID(N'dbo.Sessions', N'U') IS NOT NULL
DROP TABLE dbo.[Sessions];
IF OBJECT_ID(N'dbo.PacketTransmissions', N'U') IS NOT NULL
DROP TABLE dbo.PacketTransmissions;
GO

CREATE TABLE [dbo].[Sessions]
(
      [Id] int 
        CONSTRAINT PK_Sessions
        PRIMARY KEY CLUSTERED
    , [DateConnected] datetime NOT NULL
    , [Origin] nvarchar(max) NULL
    , [TrackerId] int NULL
    , [Imei] nvarchar(max) NULL
    , [Sim] nvarchar(max) NULL
    , [ProtocolVersion] tinyint NULL
)

CREATE TABLE [dbo].[PacketTransmissions]
(
      [Id] int 
        CONSTRAINT PK_PacketTransmissions 
        PRIMARY KEY CLUSTERED
    , [RequestId] int NULL
    , [SessionId] int NOT NULL
    , [DateProcessed] datetime NOT NULL
    , [Direction] int NOT NULL
    , [Sequence] int NOT NULL
    , [Acknowledgement] int NOT NULL
    , [DateRecorded] datetime NOT NULL
    , [Version] tinyint NOT NULL
    , [Command] tinyint NOT NULL
    , [Flags] tinyint NOT NULL
    , [Checksum] tinyint NOT NULL
    , [Data] varbinary(max) NULL
    , [TrackerId] int NULL
)
GO

INSERT INTO dbo.[Sessions] (Id, DateConnected, Origin, TrackerId, Imei, Sim, ProtocolVersion)
SELECT ROW_NUMBER() OVER (ORDER BY sc1.id)
    , DATEADD(DAY, CONVERT(int, CRYPT_GEN_RANDOM(1)), '2017-01-01 00:00:00')
    , CONVERT(nvarchar(max), CRYPT_GEN_RANDOM(128))
    , CONVERT(int, CRYPT_GEN_RANDOM(1))
    , CONVERT(nvarchar(40), CRYPT_GEN_RANDOM(38))
    , CONVERT(nvarchar(40), CRYPT_GEN_RANDOM(38))
    , CONVERT(tinyint, CRYPT_GEN_RANDOM(1))
FROM sys.syscolumns sc1
    CROSS JOIN sys.syscolumns sc2;

INSERT INTO dbo.PacketTransmissions (Id, RequestId, SessionId, DateProcessed, Direction, Sequence, Acknowledgement, DateRecorded, Version, Command, Flags, Checksum, Data, TrackerId)
SELECT ROW_NUMBER() OVER (ORDER BY s.Id)
    , CONVERT(int, CRYPT_GEN_RANDOM(1))
    , CONVERT(int, CRYPT_GEN_RANDOM(3))
    , DATEADD(DAY, CONVERT(int, CRYPT_GEN_RANDOM(1)), '2017-01-01 00:00:00')
    , CONVERT(int, CRYPT_GEN_RANDOM(1))
    , CONVERT(int, CRYPT_GEN_RANDOM(2))
    , CONVERT(int, CRYPT_GEN_RANDOM(1))
    , DATEADD(DAY, CONVERT(int, CRYPT_GEN_RANDOM(1)), '2017-01-01 00:00:00')
    , CONVERT(int, CRYPT_GEN_RANDOM(1))
    , CONVERT(int, CRYPT_GEN_RANDOM(1))
    , CONVERT(int, CRYPT_GEN_RANDOM(1))
    , CONVERT(int, CRYPT_GEN_RANDOM(1))
    , CRYPT_GEN_RANDOM(128)
    , s.TrackerId
FROM dbo.[Sessions] s
    CROSS JOIN (VALUES (0), (1)) AS v(n);
GO

On my system, this creates around 700,000 session rows, and double that number of transmission rows.

The query then becomes:

DECLARE @TrackerId int = 100;
DECLARE @StartDate datetime = '2017-03-10';
DECLARE @EndDate   datetime = '2017-03-12';

SELECT [PacketTransmissions].*
FROM [PacketTransmissions] 
WHERE [PacketTransmissions].[TrackerId] = @TrackerId
    AND [PacketTransmissions].[DateProcessed] > @StartDate
    AND [PacketTransmissions].[DateProcessed] < @EndDate
ORDER BY [PacketTransmissions].[DateProcessed] DESC;
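On a table that already holds tens of millions of rows, the new column has to be added and populated before the index above is useful. The following is only a sketch of one way that could be done, assuming positive integer Ids and a batch size picked arbitrarily; the batching loop and the variable names are my own illustration, not part of the tested example. The row-insert procedure would need to start writing TrackerId before the backfill runs, so that rows arriving in the meantime are not missed.

```sql
-- Sketch only: add the denormalized column, then backfill it in Id-range
-- batches so that no single statement touches the whole table.
ALTER TABLE dbo.PacketTransmissions ADD TrackerId int NULL;
GO

DECLARE @BatchSize int = 50000;   -- assumed value; tune to the available DTU budget
DECLARE @FromId int = (SELECT MIN(Id) - 1 FROM dbo.PacketTransmissions);
DECLARE @MaxId  int = (SELECT MAX(Id) FROM dbo.PacketTransmissions);

-- If the table is empty both variables are NULL and the loop never runs.
WHILE @FromId < @MaxId
BEGIN
    -- Each batch is a range seek on the clustered primary key.
    UPDATE pt
    SET pt.TrackerId = s.TrackerId
    FROM dbo.PacketTransmissions AS pt
        INNER JOIN dbo.[Sessions] AS s ON s.Id = pt.SessionId
    WHERE pt.Id > @FromId
        AND pt.Id <= @FromId + @BatchSize;

    SET @FromId = @FromId + @BatchSize;
END;
```

Creating the (TrackerId, DateProcessed) index after the backfill, rather than before, avoids maintaining that index for every updated row.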

Related Solutions

MySQL looking up more rows than needed (indexing issue)

Your indexes are fine for the two types of queries you mentioned.

This query will be satisfied by traversing the clustered index on the primary key...

[...] WHERE participant_id = x AND question_id = y AND given_answer_id = z;

...and this one is satisfied by the index on 'question_id':

[...] WHERE question_id = x;

The output of EXPLAIN SELECT is not telling you what you think it is telling you, because the value shown in rows is an estimate of the number of rows the server will need to consider, not the actual rows it will examine. For InnoDB these are based on index statistics.

From the MySQL manual's description of the rows column:

"The rows column indicates the number of rows MySQL believes it must examine to execute the query. For InnoDB tables, this number is an estimate, and may not always be exact."

Source: http://dev.mysql.com/doc/refman/5.5/en/explain-output.html#explain_rows

The optimizer gathers information about different possible query plans, and chooses the one with the lowest cost. The information shown in EXPLAIN is the information the optimizer gathered about the plan it selected.

When type is ref and key is not NULL, this means that the name listed in the key column is the name of the index that the optimizer has chosen to use to find the desired rows, so your query plan looks exactly as it should.

Note that you will sometimes see Using index in the Extra column. Many people assume that this means an index is being used, or that no index is being used when it doesn't appear, but neither is correct. Using index describes a special case called a "covering index", where the query can be answered from the index alone without reading the table rows. It does not indicate whether an index is being used to locate the rows of interest.

It's possible that running ANALYZE [LOCAL] TABLE would cause the numbers in rows shown by EXPLAIN to differ, but this is a simple query and selecting this index is an obvious choice for the optimizer to make, so ANALYZE TABLE is unlikely to make any actual difference in performance.

It is possible, however, that your overall performance might see some marginal improvement with an occasional OPTIMIZE [LOCAL] TABLE, because you are not inserting rows in primary key order (as would be the case with an auto_increment primary key). On large tables this can be time-consuming because it builds a new copy of the table, and, again, I wouldn't expect any significant change.
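For reference, the statements discussed above look roughly like this in MySQL. The table name answers is hypothetical, since the original table name is not shown here; question_id comes from the WHERE clauses above, and the literal 42 is an arbitrary value.

```sql
-- Hypothetical table name: answers

-- Inspect the chosen plan; "rows" is an estimate for InnoDB tables.
EXPLAIN SELECT * FROM answers WHERE question_id = 42;

-- Refresh the index statistics that the estimate is based on.
ANALYZE TABLE answers;

-- Rebuild the table; this can take a long time on large tables.
OPTIMIZE TABLE answers;

-- Compare the plan again; type, key, and possibly rows may differ.
EXPLAIN SELECT * FROM answers WHERE question_id = 42;
```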

Database Design for Handling 1 Billion Rows in SQL Server

5000 inserts per minute are about 83 inserts per second. With 5 indexes that's roughly 415 physical rows inserted per second. If the workload was in-memory, this would not pose a problem even to the smallest of servers, even if it was a row-by-row insert done in the most inefficient way I can think of. 83 trivial queries per second are just not interesting from a CPU standpoint.

Probably, you are disk-bound. You can verify this by looking at wait stats or STATISTICS IO.
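As a rough sketch of those two checks: sys.dm_os_wait_stats is the server-wide wait statistics view, and SET STATISTICS IO reports page reads per table for the statements that follow it. How many waits to list is an arbitrary choice on my part.

```sql
-- Top waits accumulated since the last restart (or since the stats were cleared).
-- Large PAGEIOLATCH_* or WRITELOG totals point towards disk I/O.
SELECT TOP (10)
      wait_type
    , waiting_tasks_count
    , wait_time_ms
    , signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- Per-statement I/O: logical reads come from the buffer pool,
-- physical and read-ahead reads come from disk.
SET STATISTICS IO ON;
-- run the statement under investigation here
SET STATISTICS IO OFF;
```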

Your queries probably touch a lot of different pages so that the buffer pool does not have space for all of them. This causes frequent page reads and probably random disk writes as well.

Imagine a table where you only physically insert at the end because of an ever-increasing key. The working set would be one page: the last one. This would generate sequential IO as well when the lazy writer or checkpoint process writes the "end" of the table to disk.

Imagine a table with randomly-placed inserts (classic example: a guid key). Here, all pages are the working set because a random page will be touched for each insert. IOs are random. This is the worst case when it comes to working set.

You're in the middle. Your indexes are of the structure (SomeValue, SequentialDateTime). The first component partially randomizes the sequentiality provided by the second. I guess there are quite a few possible values for "SomeValue" so that you have many randomly-placed insert-points in your indexes.

You say that data is split into 10GB tables per week. That's a good starting point because the working set is now bounded by 10GB (disregarding any reads you might do). With 12GB of server memory it is unlikely, though, that all relevant pages can stay in memory.

If you could reduce the size of the weekly "partitions" or increase server memory by a bit you are probably fine.

I'd expect that inserts at the beginning of the week are faster than at the end. You can test this theory on a dev server by running a benchmark with a certain data size and gradually reducing server memory until you see performance tank.

Now even if all reads and writes fit into memory you might still have random dirty page flushing IO. The only way to get rid of that is to write into co-located positions in your indexes. If you can at all convert your indexes to use (more) sequential keys that would help a lot.
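To make the key-order point concrete, here is a small sketch. The table, column, and index names are hypothetical, chosen only to mirror the (SomeValue, SequentialDateTime) shape described above.

```sql
-- Hypothetical table and columns, for illustration only.
CREATE TABLE dbo.Readings
(
      Id bigint IDENTITY(1, 1) NOT NULL
        CONSTRAINT PK_Readings PRIMARY KEY CLUSTERED
    , SomeValue int NOT NULL
    , RecordedAt datetime NOT NULL
);

-- New rows land at many different points in this index:
-- one insert point per distinct SomeValue.
CREATE INDEX IX_Readings_SomeValue_RecordedAt
    ON dbo.Readings (SomeValue, RecordedAt);

-- New rows land near the end of this index,
-- because RecordedAt keeps increasing.
CREATE INDEX IX_Readings_RecordedAt_SomeValue
    ON dbo.Readings (RecordedAt, SomeValue);
```

The trade-off is on the read side: the second index can no longer seek directly to one SomeValue, so whether the swap is acceptable depends on the queries the index has to serve.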

As a quick solution I'd add a buffering layer between the clients and the main table. Maybe accumulate 15min of writes into a staging table and periodically flush it. That takes away the load spikes and uses a more efficient plan to write to the big table.
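A minimal sketch of such a buffering layer follows, assuming a staging heap that clients insert into and a scheduled job that moves its contents across. All object names are hypothetical, and dbo.Readings stands in for the large indexed table, which is assumed to already exist with matching columns.

```sql
-- Hypothetical names throughout.
-- A heap without non-clustered indexes keeps the client-facing inserts cheap.
CREATE TABLE dbo.Readings_Staging
(
      SomeValue int NOT NULL
    , RecordedAt datetime NOT NULL
    , Payload varbinary(max) NULL
);
GO

-- Run periodically (for example every 15 minutes) from a scheduled job.
SET XACT_ABORT ON;
BEGIN TRANSACTION;

    -- TABLOCKX blocks writers to the staging table until COMMIT,
    -- so no row can arrive between the copy and the delete.
    INSERT INTO dbo.Readings (SomeValue, RecordedAt, Payload)
    SELECT SomeValue, RecordedAt, Payload
    FROM dbo.Readings_Staging WITH (TABLOCKX);

    DELETE FROM dbo.Readings_Staging;

COMMIT TRANSACTION;
```

Clients are blocked for the duration of the flush, so the interval is a balance between flush duration and how many rows each flush has to move.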
