SQLite Performance – Why Is This Query Slower with Indexed Columns?

countindexoptimizationperformancesqlite

I have a sqlite database with two tables, each with 50,000 rows in, containing names of (fake) people. I've constructed a simple query to find out how many names there are (given name, middle initial, surname) that are common to both tables:

select count(*) from fakenames_uk inner join fakenames_usa on fakenames_uk.givenname=fakenames_usa.givenname and fakenames_uk.surname=fakenames_usa.surname and fakenames_uk.middleinitial=fakenames_usa.middleinitial;

When there are no indexes except on the primary keys (irrelevant to this query), it runs quickly:

[james@marlon Downloads] $ time sqlite3 generic_data_no_indexes.sqlite "select count(*) from fakenames_uk inner join fakenames_usa on fakenames_uk.givenname=fakenames_usa.givenname and fakenames_uk.surname=fakenames_usa.surname and fakenames_uk.middleinitial=fakenames_usa.middleinitial;"
131

real    0m0.115s
user    0m0.111s
sys     0m0.004s

But if I add indexes to the three columns on each table (six indexes in all):

CREATE INDEX `idx_uk_givenname` ON `fakenames_uk` (`givenname` )
//etc.

then it runs painfully slowly:

[james@marlon Downloads] $ time sqlite3 generic_data.sqlite "select count(*) from fakenames_uk inner join fakenames_usa on fakenames_uk.givenname=fakenames_usa.givenname and fakenames_uk.surname=fakenames_usa.surname and fakenames_uk.middleinitial=fakenames_usa.middleinitial;"
131

real    1m43.102s
user    0m52.397s
sys     0m50.696s

Is there any rhyme or reason to this?

Here's the result of EXPLAIN QUERY PLAN for the version without indexes:

0|0|0|SCAN TABLE fakenames_uk
0|1|1|SEARCH TABLE fakenames_usa USING AUTOMATIC COVERING INDEX (middleinitial=? AND surname=? AND givenname=?)

This is with indexes:

0|0|0|SCAN TABLE fakenames_uk
0|1|1|SEARCH TABLE fakenames_usa USING INDEX idx_us_middleinitial (middleinitial=?)

Best Answer

In SQLite, joins are executed as nested loop joins, i.e., the database goes through one table, and for each row, searches matching rows from the other table.

If there is an index, the database can look up any matches in the index quickly, and then go to the corresponding table row to get the values of any other columns that are needed.

In this case, there are three possible indexes. Without any statistical information (which would be created by running ANALYZE), the database chooses the smallest one, to reduce I/O. However, the middleinitial index is useless because it does not greatly reduce the number of table rows that need to be fetched; and the additional step through the index actually increases the I/O needed because the table rows are no longer read in order, but randomly.

If there is no index, the lookup of matching rows would require a complete table scan of the second table for each row of the first table. This would be so bad that the database estimates that it is worthwhile to create and then drop a temporary index just for this query. This temporary ("AUTOMATIC") index is created on all colunms used for the search. The COUNT(*) operation does not need values from any other columns, so this index happens to be a covering index, which means that it is not necessary to actually look up the table row corresponding to an index entry, which saves even more I/O.

To speed up this query, create this index permanently, so that it is no longer necessary to construct a temporary one:

CREATE INDEX uk_all_names ON fakenames_uk(surname, givenname, middleinitial);

EXPLAIN QUERY PLAN
SELECT count(*)
FROM fakenames_uk
JOIN fakenames_usa USING (givenname, middleinitial, surname);

0|0|1|SCAN TABLE fakenames_usa
0|1|0|SEARCH TABLE fakenames_uk USING COVERING INDEX uk_all_names (surname=? AND givenname=? AND middleinitial=?)

The index on surname is no longer needed because the three-column index can be used for any lookups on this column.
The index on givenname might be useful if you will do lookups on this column only.
The index on middleinitial is always worthless: a query that searches for one of the 26 possible values is faster if it just scans the entire table.

Related Solutions

Sql-server – SQL Server Index Scan Actual Executions

Schema and indexes are only one aspect of query plan and performance. Your statement "but with different data" is likely the source of the difference. The number of rows and the distribution of data is essential to the query optimizer. If you have significantly more rows in D2, or if the data is of entirely different characteristics (wider or narrower range of values), then you should expect to see different performance and execution plans.

For each set of statistics, SQL Server keeps a maximum of 200 samples. As the rows in the tables grow and the more irregular the distribution of values the more likely it is that SQL Server will not have enough information to generate optimal execution plans. That's where the use of filtered indexes and statistics comes into play.

If this is a parameterized query you may also be running into a parameter sniffing problem. Note that if you're using local variables the calculation changes also.

SQL Server Query Performance – Optimizing Group By with Many Columns

The non-clustered index you have tested is not the best for this query. It can be used for the WHERE clause and for doing an index scan instead of a full table scan but it cannot be used for the GROUP BY.

The best possible index would have to be a partial index (to filter the unwanted rows from the WHERE clause), then have all the columns used in the GROUP BY and then INCLUDE all the other columns used in the SELECT:

CREATE INDEX special_ix 
  ON dbo.Commissions_Output
    ( company, location, account, 
      salesroute, employee, producttype, 
      item, loadjdate, commissionrate ) 
INCLUDE 
  ( [Extended Sales Price], [Delivered Qty] ) 
WHERE 
  ( [Extended Sales Price] <> 0 ) ;

Best Answer

Related Solutions

Sql-server – SQL Server Index Scan Actual Executions

SQL Server Query Performance – Optimizing Group By with Many Columns

Related Question