Multi-dimensional range queries with single column indexes

clustered-indexindex

I am reading a book that covers many database topics on a very high level, but doesn't dive deeper. It introduces multi-column indexes as a better way to query for 2 dimensional ranges. For example, this query is stated to be inefficient, if latitude and longitude each have single column indexes:

SELECT * FROM restaurants 
 WHERE latitude > 51.4946 
   AND latitude < 51.5079 
   AND longitude > -0.1162 
   AND longitude < -0.1004
;

A better way would be to use a multicolumn index using R-trees spanning over both latitude and longitude.

I don't understand why first extracting information of latitude > 51.4946 using a single columnd index and then extacting information longitude > -0.1162 AND longitude < -0.1004 using the other single column index is inefficient? Can you use only 1 index per table when performing a query or is there some other reason for inefficiency?

Best Answer

Answer for SQL Server

The short answer is that the one column index could work depending on your workload & data, but there are different options like included columns in sql server, to store an unordered column.

Two indexes could be used, but in most cases (for sql server) a key lookup operator will be applied to get the other column(s) needed. Accessing two indexes / adding a key lookup operator will be slower due to the added JOIN operator between the two.

An example of when it should go fast is when the optimizer thinks that almost no or no values in the specified range of latitute exist (latitude > 51.4946 AND latitude < 51.5079) while having the one column index, the seek should be pretty fast (keeping in mind that the statistics are correct & up to date). See Example 1 for this part.

Another important part of your example query is that with ranges & indexes, the secondary column will be added as a residual predicate in a simple seek even when both columns are key columns.

See Example 3 for the residual predicate explanation.

Start data

CREATE TABLE [dbo].[restaurants]
(
    [RestaurantId] INT NOT NULL PRIMARY KEY IDENTITY(1,1), 
    [Latitude] float NOT NULL, 
    [Longitude] float NOT NULL
    );


INSERT INTO restaurants([Latitude],[Longitude])
SELECT 
       ROW_NUMBER() OVER(ORDER BY(SELECT NULL)) * RAND() /10 as lat ,
       ROW_NUMBER() OVER(ORDER BY(SELECT NULL)) * RAND() /10 as long
FROM MASTER..spt_values

Example 1 One column index + (unused) key lookup

Single column indexes

CREATE INDEX IX_latitude on dbo.restaurants([Latitude]);
CREATE INDEX IX_longitude on dbo.restaurants([Longitude]);

Due to no values returned by the seek on latitude, no key lookup (to get the longitude column and apply a residual predicate on that column) operation happened:

Example 2 Removing the key lookup

When we include the longitude in the index, no key lookup operator is found (but there is a residual predicate on this column.

CREATE INDEX IX_latitude2 on dbo.restaurants([Latitude]) INCLUDE([Longitude]);

From these examples we can make up that, if SQL server cannot find a first selective filter on latitude or longitude one of these will be picked and the residual predicate / key lookup will be the operation that will hurt us when the data set increases.

Example 3 When adding the double key column indexes

Here the decision of selectivity becomes important. Will the first key column selected be latitude or longitude, which one will always be more selective? If that question cannot be answered you could opt to create two indexes:

CREATE INDEX IX_latitude_Longitude on dbo.restaurants([Latitude],[Longitude]);
CREATE INDEX IX_longitude_Latitude on dbo.restaurants([Longitude],[Latitude]);

But due to the fact that you are working with ranges <>, the secondary column will be added as a residual predicate:

The best indexes for the query

All this shows us, that for SQL Server & your particular query the better indexes would be

CREATE INDEX IX_latitude on dbo.restaurants([Latitude]) INCLUDE([Longitude]);
CREATE INDEX IX_longitude on dbo.restaurants([Longitude]) INCLUDE([Latitude]);

Where the included columns are not ordered since we are not getting around the residual predicate.

Unless we are also executing queries with different filters on Longitude or Latitude , such as = on both columns or = on one of the columns, right indexing could remove the residual predicate for these types of queries.

Some more information on indexing range queries can be found here.

Related Solutions

SQL Server Indexing – Separate Indexes vs Multi-Column Index

You are right in that your example query would not use that index.

The query planner will consider using an index if:

all the fields contained in it are referenced in the query
some of the fields starting from the beginning are referenced

It will not be able to make use of indexes that start with a field not used by the query.

So for your example:

SELECT [id], [name], [customerId], [dateCreated]
   FROM Representatives WHERE customerId=1 
   ORDER BY dateCreated

it would consider indexes such as:

[customerId]
[customerId], [dateCreated]
[customerId], [dateCreated], [name]

but not:

[name], [customerId], [dateCreated]

If it found both [customerId] and [customerId], [dateCreated], [name] its decision to prefer one over the other would depend on the index stats which depend on estimates of the balance of data in the fields. If [customerId], [dateCreated] were defined it should prefer that over the other two unless you give a specific index hint to the contrary.

It is not uncommon to see one index defined for every field in my experience either, though this is rarely optimal as the extra management needed to update the indexes on insert/update, and the extra space needed to store them, is wasted when half of them may never get used - but unless your DB sees write-heavy loads the performance is not going to stink badly even with the excess indexes.

Specific indexes for frequent queries that would otherwise be slow due to table or index scanning is generally a good idea, though don't overdo it as you could be exchanging one performance issue for another. If you do define [customerId], [dateCreated] as an index, for example, remember that the query planner will be able to use that for queries that would use an index on just [customerId] if present. While using just [customerId] would be slightly more efficient than using the compound index this may be mitigated by ending up having two indexes competing for space in RAM instead of one (though if your entire normal working set fits easily into RAM this extra memory competition may not be an issue).

Mysql – How does compound index help speed up query

Indexes are in most cases a B-Tree structure (or some sort of). There are DBMS that supported different indexing types.

As you are talking about longitude/latitude PostgreSQL with it's GIST indexes come to mind. Oracle has Bitmap indexes in addition to the B-Tree indexes. I'm sure SQL Server and DB2 also have some special index types. And then there are full text indexes which make searching text very efficient.

A B-Tree index is very efficient in finding a specific value - think of a primary key index where all values are different. If the index only contains the PK column(s) (i.e. it is not a clustered index) then typically looking up a row by a specific PK values takes not more than (roughly) 3-4 IO operations (at least with Oracle). 2-3 to find the index block and an additional one to read the whole row. This gets more efficient if the index contains additional columns so that the lookup of the actual table row is not needed. The term for that is "covering index" or "index only retrieval".

Now for doing "range lookups" (e.g. where foo > 42) an index is very helpful as well as in most DBMS the index can also be scanned according to a predicate. Usually (again this highly depends on the DBMS) this is slightly less efficient than a direct lookup (again this also depends on the ability to do an "index only retrieval").

I don't know of any BMS which can not use more than one index in a query. Think a join on a PK and a FK column - depending on the data distribution the DBMS might use the index to find the parent rows (PK lookup) and the child rows (FK lookup).

But not all DBMS can use more than one index for the same table in a single query.

After all whether or not an index is being used or not depends on a lot of things.

I can highly recommend http://use-the-index-luke.com/ which is a very good introduction on indexing across all major DBMS.

DBMS specific information:

Oracle: http://docs.oracle.com/cd/E11882_01/server.112/e25789/indexiot.htm
PostgreSQL: http://www.postgresql.org/docs/current/static/indexes.html

Best Answer

Related Solutions

SQL Server Indexing – Separate Indexes vs Multi-Column Index

Mysql – How does compound index help speed up query

Related Question