Sqlite – very slow performance of WHERE EXISTS without index on EXISTS

I've got a strange case of (very) slow performance of an update query when I don't have an index on the table in the EXISTS clause.

For example:

UPDATE tbl1
   SET col1 = "Y"
 WHERE EXISTS (SELECT NULL FROM tbl2 WHERE tbl1.col2 = tbl2.col2)
   AND tbl1.col3 IN (1,2);

tbl1 has ~4.5M rows. When filtered above by col3, it comes down to ~400k rows.
tbl2 has ~100k rows. All rows are unique on col2, and every row in tbl2 has a match in tbl1.
tbl1 has an index on col2 and an index on col3.
I am using sqlite version 3.29.0 (from Python 3.7.2)

When I run this query without an index on tbl2.col2, the query never completes (I let it run for multiple minutes). If I add a unique index on tbl2(col2), the query finishes almost immediately.

This is confusing to me because I would think that the query should be constrained by the lookup into tbl1, since I'm using an EXISTS query for tbl2 (rather than IN()). The unique index on tbl2.col2 isn't adding any new information that a full table scan wouldn't give it, since every row in tbl2.col2 is unique.

Why is the performance so slow without that unique index?

Best Answer

The unique index on tbl2.col2 isn't adding any new information that a full table scan wouldn't give it

That is true; the question is, how many table scans do you think are there?

SQLite documentation will tell you that its optimizer is capable of only nested-loop joins, so regardless of whether it is smart enough to rewrite the EXISTS subquery into a join¹ or will naively evaluate the predicate for each row of the driving table, you will still end up doing 4.5 million table scans, each retrieving on average 50K rows.

In the presence of a unique index on tbl2.col2 this turns into 4.5 million index seeks using the B-tree, each requiring only 2-3 logical I/O operations and retrieving exactly one row, which is 11 orders of magnitude less I/O.

1 - Which would not make any difference anyway, since it would then rewrite that join into a combination of simple predicates against the two tables.

Best Answer

Related Solutions

Sql-server – Does having an index on a VARCHAR column with a lot of very similar starting values have bad performance

Mysql – Understanding multiple indexes with MySQL (and performance)

Related Question