Multi-dimensional range queries with single column indexes

clustered-indexindex

I am reading a book that covers many database topics on a very high level, but doesn't dive deeper. It introduces multi-column indexes as a better way to query for 2 dimensional ranges. For example, this query is stated to be inefficient, if latitude and longitude each have single column indexes:

SELECT * FROM restaurants 
 WHERE latitude > 51.4946 
   AND latitude < 51.5079 
   AND longitude > -0.1162 
   AND longitude < -0.1004
;

A better way would be to use a multicolumn index using R-trees spanning over both latitude and longitude.

I don't understand why first extracting information of latitude > 51.4946 using a single columnd index and then extacting information longitude > -0.1162 AND longitude < -0.1004 using the other single column index is inefficient? Can you use only 1 index per table when performing a query or is there some other reason for inefficiency?

Best Answer

Answer for SQL Server

The short answer is that the one column index could work depending on your workload & data, but there are different options like included columns in sql server, to store an unordered column.

Two indexes could be used, but in most cases (for sql server) a key lookup operator will be applied to get the other column(s) needed. Accessing two indexes / adding a key lookup operator will be slower due to the added JOIN operator between the two.

An example of when it should go fast is when the optimizer thinks that almost no or no values in the specified range of latitute exist (latitude > 51.4946 AND latitude < 51.5079) while having the one column index, the seek should be pretty fast (keeping in mind that the statistics are correct & up to date). See Example 1 for this part.

Another important part of your example query is that with ranges & indexes, the secondary column will be added as a residual predicate in a simple seek even when both columns are key columns.

See Example 3 for the residual predicate explanation.

Start data

CREATE TABLE [dbo].[restaurants]
(
    [RestaurantId] INT NOT NULL PRIMARY KEY IDENTITY(1,1), 
    [Latitude] float NOT NULL, 
    [Longitude] float NOT NULL
    );


INSERT INTO restaurants([Latitude],[Longitude])
SELECT 
       ROW_NUMBER() OVER(ORDER BY(SELECT NULL)) * RAND() /10 as lat ,
       ROW_NUMBER() OVER(ORDER BY(SELECT NULL)) * RAND() /10 as long
FROM MASTER..spt_values

Example 1 One column index + (unused) key lookup

Single column indexes

CREATE INDEX IX_latitude on dbo.restaurants([Latitude]);
CREATE INDEX IX_longitude on dbo.restaurants([Longitude]);

Due to no values returned by the seek on latitude, no key lookup (to get the longitude column and apply a residual predicate on that column) operation happened:

enter image description here


Example 2 Removing the key lookup

When we include the longitude in the index, no key lookup operator is found (but there is a residual predicate on this column.

CREATE INDEX IX_latitude2 on dbo.restaurants([Latitude]) INCLUDE([Longitude]);

enter image description here

enter image description here

From these examples we can make up that, if SQL server cannot find a first selective filter on latitude or longitude one of these will be picked and the residual predicate / key lookup will be the operation that will hurt us when the data set increases.


Example 3 When adding the double key column indexes

Here the decision of selectivity becomes important. Will the first key column selected be latitude or longitude, which one will always be more selective? If that question cannot be answered you could opt to create two indexes:

CREATE INDEX IX_latitude_Longitude on dbo.restaurants([Latitude],[Longitude]);
CREATE INDEX IX_longitude_Latitude on dbo.restaurants([Longitude],[Latitude]);

But due to the fact that you are working with ranges <>, the secondary column will be added as a residual predicate:

enter image description here

The best indexes for the query

All this shows us, that for SQL Server & your particular query the better indexes would be

CREATE INDEX IX_latitude on dbo.restaurants([Latitude]) INCLUDE([Longitude]);
CREATE INDEX IX_longitude on dbo.restaurants([Longitude]) INCLUDE([Latitude]);

Where the included columns are not ordered since we are not getting around the residual predicate.

Unless we are also executing queries with different filters on Longitude or Latitude , such as = on both columns or = on one of the columns, right indexing could remove the residual predicate for these types of queries.

Some more information on indexing range queries can be found here.