Why is an incorrect JOIN using a correlated sub-query so much slower?

performance query-performance sqlite subquery

I'm doing some fairly lightweight data massaging/cleaning and ran into a problem where one version of a JOIN using a correlated sub-query (probably an erroneous one) ran much, much slower than what I believe is the correct version. I'm not asking how to do the query (I believe I've now got that right); I'd like to know why the slow version is so slow.

The Problem

The domain is a fairly simple database to manage a Lottery Syndicate (recording members' payments, games played and wins). In moving to a new engine (SQLite) I'm trying to clean the data and improve the tables' structure.

The existing _Winnings table records the amounts and dates won and the "game type" (there are multiple games that could be played):

CREATE TABLE [_Winnings](
    [ID]                integer primary key not null,
    [WinDate]           date,
    [Amount]            integer,
    [GameType]          integer references _Games(ID)
);
CREATE INDEX [_WinningsIndex] on _Winnings(GameType) ;

The main problem is that there is no link (other than the date of the win) to the actual game played. Those records have already been migrated and are now held in an EventHistory table:

CREATE TABLE [EventHistory](
    [ID]                integer primary key not null,
    [EventType]         integer references Events(ID),
    [GameType]          integer references Games(ID),
    [EventDate]         date
);
CREATE INDEX [EventHistoryEventIndex] on EventHistory(EventType) ;
CREATE INDEX [EventHistoryGameIndex]  on EventHistory(GameType) ;
CREATE INDEX [EventHistoryDateIndex]  on EventHistory(EventDate) ;

The three tables _Games, Games and Events hold the "type" of game/event and have essentially the following content:

_Games                  Games              Events
ID   GameType           ID   GameType      ID   Name
--   ---------          --   ---------     --   ----------
1    GameName1          1    GameName1     5    Dispersal
2    GameName2          2    GameName2     6    Withdrawal
3    GameName3          3    GameName3     7    GamePlayed
4    GameName4          4    GameName4     8    MissingGame
5    Dispersal
6    Withdrawal

where the new tables split the "real" and "pseudo" game-types into their own table.
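
For reference, a minimal sketch of the new lookup tables' DDL, inferred from the data above (the text column types are my assumption):

CREATE TABLE [Games](
    [ID]                integer primary key not null,
    [GameType]          text
);
CREATE TABLE [Events](
    [ID]                integer primary key not null,
    [Name]              text
);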

Sample data showing the requirements of the migration process:

_Winnings
ID  WinDate     Amount  GameType         (Notes)
--- ----------  ------  --------         -------------------------------
123 2016-04-20    1234  1                A. Ideal match to "game played" record
167 2017-08-20    1000  1                B. "Missing" game
189 2018-12-20     990  1                C. Matches two games
199 2019-02-01   -2000  6                D. A non-game event (withdrawal)

EventHistory
ID  EventType  GameType  EventDate       (Notes)
--- ---------  --------  ---------       -------------------------------
111 7 (Game)          1  2016-04-20      Perfect match for (A)
222 7 (Game)          1  2017-08-15      \ No entry matches (B)
223 7 (Game)          1  2017-08-25      /
333 7 (Game)          1  2018-12-20      \ Two matches for (C)
334 7 (Game)          1  2018-12-20      /

Case (A) is the "normal" case: a game has been played, and there was a win. I will want the new Winnings entry to refer directly to the matched event record.

Case (B) would indicate some error in the data (probably a mis-entered date-of-win), which I want to identify and deal with later by creating a "MissingGame" record in EventHistory.

Case (C) is valid, representing a double-entry on the same day. Matching either record in EventHistory to the new record in Winnings would be acceptable.

Case (D) is a "pseudo" game: winnings are either withdrawn or used to buy extra lines. Whether or not there is an existing, matching date-entry in EventHistory, a new event record will be created.
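
With the sample data, a quick way to see which of cases (A)-(C) a win falls into is to count same-date events. (This query is my addition; it ignores EventType, and case (D) is a property of the GameType rather than of the match count.)

select
    W.ID,
    W.WinDate,
    count(EH.ID) as Matches    -- 1 = case (A), 0 = case (B), 2+ = case (C)
from _Winnings as W
left join EventHistory as EH on EH.EventDate = W.WinDate
group by W.ID, W.WinDate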

My first attempt at finding matches used a left join on the dates (left-join, because there isn't guaranteed to be a date-match), but didn't take cases like (C) into account: multiple matching entries in EventHistory give rise to duplicated values for _Winnings.ID, which I mustn't have.

select
    W.*,
    EH.ID as EID,
    G.ID as GID
from        _Winnings as W
left join   EventHistory as EH      on W.WinDate = EH.EventDate
left join   Games as G              on W.GameType  = G.ID
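
With the sample data above, case (C) therefore comes back twice, once for each matching event, duplicating _Winnings.ID 189:

ID  WinDate     Amount  GameType  EID  GID
--- ----------  ------  --------  ---  ---
189 2018-12-20     990  1         333  1
189 2018-12-20     990  1         334  1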

I therefore changed it to use a correlated sub-query, ensuring only one record from EventHistory would be used (it doesn't really matter which record). In my first attempt, I mistakenly left in a reference to the main select's alias (EH.EventDate):

select
    W.*,
    EH.ID as EID,
    G.ID as GID
from _Winnings as W
left join EventHistory as EH on EH.ID = (
    select min(ID) from EventHistory where W.WinDate = EH.EventDate
)
left join Games as G on W.GameType = G.ID

This seemed to work, but exceedingly slowly. Replacing the alias with the full table-name (EventHistory.EventDate):

select
    W.*,
    EH.ID as EID,
    G.ID as GID
from _Winnings as W
left join EventHistory as EH on EH.ID = (
    select min(ID) from EventHistory where W.WinDate = EventHistory.EventDate
)
left join Games as G on W.GameType = G.ID

dramatically improved the speed. With 365 records in _Winnings, and starting with 494 records in EventHistory (rising to 581 as some new records were added), the overall speed (including performing some inserts) dropped from over 3 minutes to around 3 seconds.

The "fast" query plan:

QUERY PLAN
|--SCAN TABLE _Winnings AS W
|--SEARCH TABLE EventHistory AS EH USING INTEGER PRIMARY KEY (rowid=?)
|--CORRELATED SCALAR SUBQUERY 1
|  `--SEARCH TABLE EventHistory USING COVERING INDEX EventHistoryDateIndex (EventDate=?)
`--SEARCH TABLE Games AS G USING INTEGER PRIMARY KEY (rowid=?)

The "slow" query plan:

QUERY PLAN
|--SCAN TABLE _Winnings AS W
|--SCAN TABLE EventHistory AS EH USING COVERING INDEX EventHistoryDateIndex
|--CORRELATED SCALAR SUBQUERY 1
|  `--SEARCH TABLE EventHistory
`--SEARCH TABLE Games AS G USING INTEGER PRIMARY KEY (rowid=?)

Clearly, these are different, but I don't have the skill to understand what they are telling me.
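
(For reference, plans like the two above are produced by prefixing a query with EXPLAIN QUERY PLAN; e.g. for the fast version:)

EXPLAIN QUERY PLAN
select
    W.*,
    EH.ID as EID,
    G.ID as GID
from _Winnings as W
left join EventHistory as EH on EH.ID = (
    select min(ID) from EventHistory where W.WinDate = EventHistory.EventDate
)
left join Games as G on W.GameType = G.ID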


What I am actually doing is processing each row as it is returned by the query and sometimes creating a new record in the EventHistory table (and always creating a row in the migrated Winnings table). Roughly, the process is:

foreach row returned by the query
    if EID or GID is empty
        -- either there isn't an exact date match (EID="") or
        -- the "game-type" is a "pseudo" game (GID=""). In either
        -- case, I want to insert a new row in EventHistory.

        insert new row in EventHistory table
    endif

    insert new row in Winnings table
endfor
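
In SQL terms, the two inserts are roughly as follows. Note that the new Winnings table's columns (I've assumed EventID, WinDate and Amount) and the parameter names are my assumptions, not the real schema:

-- cases (B) and (D) only: create the missing/pseudo event first
insert into EventHistory (EventType, GameType, EventDate)
values (:EventType, :GameType, :WinDate);

-- every row: the migrated winnings record, pointing at the matched
-- EventHistory row, or at last_insert_rowid() when one was just created
insert into Winnings (EventID, WinDate, Amount)
values (:EID, :WinDate, :Amount);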

I originally thought that doing the inserts into EventHistory was what was affecting the speed, because when I timed just the raw query (doing nothing with the results) there was no appreciable difference in speed between the two versions.

However, in the light of CL.'s answer, which included "That you are inserting new rows into the tables has no effect on the speed", I investigated further, and it appears the version of SQLite used may be the biggest factor in the speed difference.

I am using Tcl to script my update process (including the inserts), and this was where I originally saw the dramatic difference in speed between the two versions of the query. Tcl bundles its own copy of SQLite, which in my case was somewhat old (3.8.7.1, from October 2014).

However, when I first timed just the queries, I used a newly-downloaded version of the standalone SQLite shell (3.27.2 from February 2019). With this version, both queries ran at essentially the same rate (sub-second).
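
(For anyone repeating the query-only test in the standalone shell: .timer on makes the shell report a "Run Time: real ... user ... sys ..." line after every statement.)

sqlite> .timer on
sqlite> select count(*) from _Winnings;
365
Run Time: real ... user ... sys ...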

When I repeated the "query only" test in Tcl, using the older version of SQLite, the difference in speed was, again, dramatic: 8ms vs 2 minutes according to Tcl's time command.

My conclusion is that the following, from CL.'s answer:

    The two values are constant (as far as the subquery is concerned), so either all rows of the table match, or none. But the query optimizer is not smart enough to recognize this, so it goes through all rows of the table and evaluates the WHERE clause each time.

does apply to SQLite 3.8.7.1, but no longer applies to SQLite 3.27.2.

(The EXPLAIN QUERY PLAN output for each query is the same in both versions of SQLite, but the VDBE steps shown by EXPLAIN do differ between versions.)
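
(A bare EXPLAIN, rather than EXPLAIN QUERY PLAN, is what dumps the VDBE program; for example:)

EXPLAIN
select min(ID) from EventHistory where EventDate = '2016-04-20'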

Best Answer

The difference is in how the correlated subquery does the search.

The fast subquery looks like this:

select min(ID)
from EventHistory
where EventHistory.EventDate = ?

-- SEARCH TABLE EventHistory USING COVERING INDEX EventHistoryDateIndex (EventDate=?)

There is an index on EventDate, so the database can look up the matching rows in that index, and then remember and return only the smallest ID value.

The slow subquery looks like this:

select min(ID)
from EventHistory
where ? = ?

-- SEARCH TABLE EventHistory

The two values are constant (as far as the subquery is concerned), so either all rows of the table match, or none. But the query optimizer is not smart enough to recognize this, so it goes through all rows of the table and evaluates the WHERE clause each time.

(There is the MIN/MAX optimization, but it works only when there is no WHERE clause.)
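
(For example, a query like this can be answered from the lowest rowid without scanning anything:)

-- MIN/MAX optimization: no WHERE clause, so SQLite just reads the
-- first (lowest) ID instead of visiting every row
select min(ID) from EventHistory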


That you are inserting new rows into the tables has no effect on the speed. However, SQLite computes result rows on demand, if possible, so modifying tables while you are reading them can make the results inconsistent. You should read all results of the query first, or use a temporary table.