MySQL Spatial – How R-Tree Outperforms B-Tree for Point-in-Rectangle Checks

MySQLspatial

For things like find nearest neighbors, I can undertand that R=Tree can outperform B-Tree. R-Tree can kick out more obviously false points.

However, for simple check of a point in a rectangle, it's effectively

min

If x and y is indexed by b tree, they too can kick out a lot of false points.

So I did an experiment with spatial index

SELECT DISTINCT
  TB.ID,
  TB.Latitude,
  TB.Longitude,
  111151.29341326*SQRT(pow(-6.185-TB.Latitude,2)+pow(106.773-TB.Longitude,2)*0.98839228980165) AS Distance
FROM
  `tablebusiness` AS TB
   join tableauxiliary as TA on TA.BusinessID=TB.ID
WHERE
MBRContains(
   GeomFromText ('MULTIPOINT(-6.2317830813328 106.72621691867,-6.1382169186672  106.81978308133)'),
   TA.Latlong
   )
   AND 
MATCH (FullTextSearch) AGAINST ('kucing*' IN BOOLEAN MODE)
ORDER BY
  Distance
LIMIT
  0, 20

This is pretty fast. .24 seconds

And then I do

SELECT DISTINCT
  TB.ID,
  TB.Latitude,
  TB.Longitude,
  111151.29341326*SQRT(pow(-6.185-TB.Latitude,2)+pow(106.773-TB.Longitude,2)*0.98839228980165) AS Distance
FROM
  `tablebusiness` AS TB
   join tableauxiliary as TA
WHERE
  -6.2317830813328 < TB.Latitude
AND
  TB.Latitude < -6.1382169186672
AND
  106.72621691867 < TB.Longitude
AND
  TB.Longitude <106.81978308133
AND 
   MATCH (TA.FullTextSearch) AGAINST ('kucing*' IN BOOLEAN MODE)
ORDER BY
  Distance
LIMIT
  0, 20

This is very slow.

I wonder why. It doesn't make sense.

Best Answer

R-tree structure works in a way that two nearby points are "closer" in the R-tree index - because both coordinates and both with same weight are used to decide where (in the index) a new point is to be placed.

So, it's easy to identify points that are "near" a fixed point - meaning points that have both coordinates near the fixed point coordinates.


B-tree indexes on the other way work differently. Even if you have a compound index, say (Latitude, Longitude), where both coordinates are included the index, they do not have the same weight. In other words, two points with different Latitude and the same exact Longitude will be very far apart (in the index structure). Much more apart than two points with same Latitude and different Longitude.

So, searching for nearby points (with the above (Lat, Long) index) could possibly find fast nearby points with the exact same Latitude. But for all the other points, it will not be very efficient.

As an example, consider you are looking for places nearby a village in Africa - a village that has the same Latitude with London. Your query would spend lots of time calculating the distances between your village and all the (millions) points in UK that have the same (or almost the same) latitude with it. And then reject those points because they are too far apart.


You might think then that two B-tree indices, one on (Latitude) and one on (Longitude) would be fine for such queries. After all, why shouldn't both indices be used in a query that has a:

WHERE Latitude  BETWEEN @LatStart  AND @LatEnd 
  AND Longitude BETWEEN @LongStart AND @LongEnd

I think that DBMS cannot use two (B-tree) indices like that in a query (MySQL cannot AFAIK). Even if it could though, it wouldn't be very good, preformance-wise, as both conditions may not narrow the search much. So, after these two index scans, a million rows found that match one condition would have to be merged with a million rows from the other one. And this merge will probably require sorting as well - which will kill performance if the subresults are big.

In general, any query that has more than one range condition on the same table is difficult to be optimized by B-tree indices. (for other DBMS that have other types of indices, like Hash or Bitmap, I can't speak about).

That's why R-trees and other indices that perform well in spatial searches were invented in the first place. They kind of "interleave" both coordinates (or all 3 or 4 in higher dimensional indices) in the index structure and no coordinate has higher weight than the others.