Mysql – Does MySQL have a limitation of one index per query

indexMySQL

MySQL supports spatial index. However, all the spatial index can do is find points in a box. It can't do fancy stuff like find the nearest 1000 points, etc.

I think, and I don't know where I went wrong, that simply finding points in a box is easy – just do constraints:

x < xmax and x > xmin and y < ymax and y > ymin

Yes those are 4 constraints that use at least 2 columns.
The result is that things get too slow when there are 1.6 million rows.

But why? The x and y are sorted. The complexity should be log(n). Yet queries can take 60 seconds.

Using a spatial index shortens it to 1 second with some subquery trick.

What am I missing? Why does using a bunch of constraints in a query take time?

Even without a fancy algorithm it should be easy to quickly get the points in the box. Yet, I feel that the complexity must be O(n) at least.

Best Answer

There are multiple questions here.

To answer the title: MySQL can utilize more than one index per query. This is called index-merge.

However, it does not always happen that index-merge optimization is used.

"Normal" indexes are B/B+ Trees. Such trees provide with linear sorting. You can index x, and the index sorts x in ascending order. You can do the same for y. You can even index both (x, y) but you still get linear sorting -- not two dimensional one.

To get back to your other question, regarding the query

x < xmax and x > xmin and y < ymax and y > ymin

Let's assume then that we have an index on x and an index on y. The problem begins with the fact that these indexes do not share resources, nor information, nor at any way point to one another.

It's very easy to figure out all rows where x < xmax and x > xmin, using the index on x. But once you have these, you can't just take those rows to the y index and ask of it to "return rows from withing these rows I'm giving you and for which y < ymax and y > ymin. The reason is that the index onycan quickly lookup usingy` values. It does not know how to quickly lookup row #3984987.

The way index-merge works is by getting row ids from the x index, getting row ids from the y index (meanwhile the x rows are stored in a temporary buffer), and then intersect these. Note that intersection has nothing to do with indexes. It is a brute force operation (hopefully, achieved in linear time thanks to the fact that the two lists of rows are already sorted).

Now given such a query as yours, MySQL asks itself: what would be cheaper? Use just one index on x, then row by row filter those rows where y < ymax and y > ymin applies? Or perhaps use an index-merge, create a temporary buffer, work out a brute-force intersection?

The fact is that MySQL rarely chooses the index-merge optimization. I've been looking at many queries and their execution plans. Very rare. Perhaps there's someone there for whom this works on regular basis.

But looking at the X:Y coordinates problem, it's easy to see why this wouldn't work: there's just too many matching values for the x-range alone, and just too many values for the y-range alone for this to be a good plan.

Related Solutions

SQL Server Indexing – Separate Indexes vs Multi-Column Index

You are right in that your example query would not use that index.

The query planner will consider using an index if:

all the fields contained in it are referenced in the query
some of the fields starting from the beginning are referenced

It will not be able to make use of indexes that start with a field not used by the query.

So for your example:

SELECT [id], [name], [customerId], [dateCreated]
   FROM Representatives WHERE customerId=1 
   ORDER BY dateCreated

it would consider indexes such as:

[customerId]
[customerId], [dateCreated]
[customerId], [dateCreated], [name]

but not:

[name], [customerId], [dateCreated]

If it found both [customerId] and [customerId], [dateCreated], [name] its decision to prefer one over the other would depend on the index stats which depend on estimates of the balance of data in the fields. If [customerId], [dateCreated] were defined it should prefer that over the other two unless you give a specific index hint to the contrary.

It is not uncommon to see one index defined for every field in my experience either, though this is rarely optimal as the extra management needed to update the indexes on insert/update, and the extra space needed to store them, is wasted when half of them may never get used - but unless your DB sees write-heavy loads the performance is not going to stink badly even with the excess indexes.

Specific indexes for frequent queries that would otherwise be slow due to table or index scanning is generally a good idea, though don't overdo it as you could be exchanging one performance issue for another. If you do define [customerId], [dateCreated] as an index, for example, remember that the query planner will be able to use that for queries that would use an index on just [customerId] if present. While using just [customerId] would be slightly more efficient than using the compound index this may be mitigated by ending up having two indexes competing for space in RAM instead of one (though if your entire normal working set fits easily into RAM this extra memory competition may not be an issue).

MySQL Spatial – How R-Tree Outperforms B-Tree for Point-in-Rectangle Checks

R-tree structure works in a way that two nearby points are "closer" in the R-tree index - because both coordinates and both with same weight are used to decide where (in the index) a new point is to be placed.

So, it's easy to identify points that are "near" a fixed point - meaning points that have both coordinates near the fixed point coordinates.

B-tree indexes on the other way work differently. Even if you have a compound index, say (Latitude, Longitude), where both coordinates are included the index, they do not have the same weight. In other words, two points with different Latitude and the same exact Longitude will be very far apart (in the index structure). Much more apart than two points with same Latitude and different Longitude.

So, searching for nearby points (with the above (Lat, Long) index) could possibly find fast nearby points with the exact same Latitude. But for all the other points, it will not be very efficient.

As an example, consider you are looking for places nearby a village in Africa - a village that has the same Latitude with London. Your query would spend lots of time calculating the distances between your village and all the (millions) points in UK that have the same (or almost the same) latitude with it. And then reject those points because they are too far apart.

You might think then that two B-tree indices, one on (Latitude) and one on (Longitude) would be fine for such queries. After all, why shouldn't both indices be used in a query that has a:

WHERE Latitude  BETWEEN @LatStart  AND @LatEnd 
  AND Longitude BETWEEN @LongStart AND @LongEnd

I think that DBMS cannot use two (B-tree) indices like that in a query (MySQL cannot AFAIK). Even if it could though, it wouldn't be very good, preformance-wise, as both conditions may not narrow the search much. So, after these two index scans, a million rows found that match one condition would have to be merged with a million rows from the other one. And this merge will probably require sorting as well - which will kill performance if the subresults are big.

In general, any query that has more than one range condition on the same table is difficult to be optimized by B-tree indices. (for other DBMS that have other types of indices, like Hash or Bitmap, I can't speak about).

That's why R-trees and other indices that perform well in spatial searches were invented in the first place. They kind of "interleave" both coordinates (or all 3 or 4 in higher dimensional indices) in the index structure and no coordinate has higher weight than the others.

Best Answer

Related Solutions

SQL Server Indexing – Separate Indexes vs Multi-Column Index

MySQL Spatial – How R-Tree Outperforms B-Tree for Point-in-Rectangle Checks

Related Question