SQLite Performance – Why Is This Query Slower with Indexed Columns?

countindexoptimizationperformancesqlite

I have a sqlite database with two tables, each with 50,000 rows in, containing names of (fake) people. I've constructed a simple query to find out how many names there are (given name, middle initial, surname) that are common to both tables:

select count(*) from fakenames_uk inner join fakenames_usa on fakenames_uk.givenname=fakenames_usa.givenname and fakenames_uk.surname=fakenames_usa.surname and fakenames_uk.middleinitial=fakenames_usa.middleinitial;

When there are no indexes except on the primary keys (irrelevant to this query), it runs quickly:

[james@marlon Downloads] $ time sqlite3 generic_data_no_indexes.sqlite "select count(*) from fakenames_uk inner join fakenames_usa on fakenames_uk.givenname=fakenames_usa.givenname and fakenames_uk.surname=fakenames_usa.surname and fakenames_uk.middleinitial=fakenames_usa.middleinitial;"
131

real    0m0.115s
user    0m0.111s
sys     0m0.004s

But if I add indexes to the three columns on each table (six indexes in all):

CREATE INDEX `idx_uk_givenname` ON `fakenames_uk` (`givenname` )
//etc.

then it runs painfully slowly:

[james@marlon Downloads] $ time sqlite3 generic_data.sqlite "select count(*) from fakenames_uk inner join fakenames_usa on fakenames_uk.givenname=fakenames_usa.givenname and fakenames_uk.surname=fakenames_usa.surname and fakenames_uk.middleinitial=fakenames_usa.middleinitial;"
131

real    1m43.102s
user    0m52.397s
sys     0m50.696s

Is there any rhyme or reason to this?

Here's the result of EXPLAIN QUERY PLAN for the version without indexes:

0|0|0|SCAN TABLE fakenames_uk
0|1|1|SEARCH TABLE fakenames_usa USING AUTOMATIC COVERING INDEX (middleinitial=? AND surname=? AND givenname=?)

This is with indexes:

0|0|0|SCAN TABLE fakenames_uk
0|1|1|SEARCH TABLE fakenames_usa USING INDEX idx_us_middleinitial (middleinitial=?)

Best Answer

In SQLite, joins are executed as nested loop joins, i.e., the database goes through one table, and for each row, searches matching rows from the other table.

If there is an index, the database can look up any matches in the index quickly, and then go to the corresponding table row to get the values of any other columns that are needed.

In this case, there are three possible indexes. Without any statistical information (which would be created by running ANALYZE), the database chooses the smallest one, to reduce I/O. However, the middleinitial index is useless because it does not greatly reduce the number of table rows that need to be fetched; and the additional step through the index actually increases the I/O needed because the table rows are no longer read in order, but randomly.

If there is no index, the lookup of matching rows would require a complete table scan of the second table for each row of the first table. This would be so bad that the database estimates that it is worthwhile to create and then drop a temporary index just for this query. This temporary ("AUTOMATIC") index is created on all colunms used for the search. The COUNT(*) operation does not need values from any other columns, so this index happens to be a covering index, which means that it is not necessary to actually look up the table row corresponding to an index entry, which saves even more I/O.

To speed up this query, create this index permanently, so that it is no longer necessary to construct a temporary one:

CREATE INDEX uk_all_names ON fakenames_uk(surname, givenname, middleinitial);

EXPLAIN QUERY PLAN
SELECT count(*)
FROM fakenames_uk
JOIN fakenames_usa USING (givenname, middleinitial, surname);

0|0|1|SCAN TABLE fakenames_usa
0|1|0|SEARCH TABLE fakenames_uk USING COVERING INDEX uk_all_names (surname=? AND givenname=? AND middleinitial=?)

The index on surname is no longer needed because the three-column index can be used for any lookups on this column.
The index on givenname might be useful if you will do lookups on this column only.
The index on middleinitial is always worthless: a query that searches for one of the 26 possible values is faster if it just scans the entire table.