MySQL Spatial – How R-Tree Outperforms B-Tree for Point-in-Rectangle Checks

MySQLspatial

For things like find nearest neighbors, I can undertand that R=Tree can outperform B-Tree. R-Tree can kick out more obviously false points.

However, for simple check of a point in a rectangle, it's effectively

min

If x and y is indexed by b tree, they too can kick out a lot of false points.

So I did an experiment with spatial index

SELECT DISTINCT
  TB.ID,
  TB.Latitude,
  TB.Longitude,
  111151.29341326*SQRT(pow(-6.185-TB.Latitude,2)+pow(106.773-TB.Longitude,2)*0.98839228980165) AS Distance
FROM
  `tablebusiness` AS TB
   join tableauxiliary as TA on TA.BusinessID=TB.ID
WHERE
MBRContains(
   GeomFromText ('MULTIPOINT(-6.2317830813328 106.72621691867,-6.1382169186672  106.81978308133)'),
   TA.Latlong
   )
   AND 
MATCH (FullTextSearch) AGAINST ('kucing*' IN BOOLEAN MODE)
ORDER BY
  Distance
LIMIT
  0, 20

This is pretty fast. .24 seconds

And then I do

SELECT DISTINCT
  TB.ID,
  TB.Latitude,
  TB.Longitude,
  111151.29341326*SQRT(pow(-6.185-TB.Latitude,2)+pow(106.773-TB.Longitude,2)*0.98839228980165) AS Distance
FROM
  `tablebusiness` AS TB
   join tableauxiliary as TA
WHERE
  -6.2317830813328 < TB.Latitude
AND
  TB.Latitude < -6.1382169186672
AND
  106.72621691867 < TB.Longitude
AND
  TB.Longitude <106.81978308133
AND 
   MATCH (TA.FullTextSearch) AGAINST ('kucing*' IN BOOLEAN MODE)
ORDER BY
  Distance
LIMIT
  0, 20

This is very slow.

I wonder why. It doesn't make sense.

Best Answer

R-tree structure works in a way that two nearby points are "closer" in the R-tree index - because both coordinates and both with same weight are used to decide where (in the index) a new point is to be placed.

So, it's easy to identify points that are "near" a fixed point - meaning points that have both coordinates near the fixed point coordinates.

B-tree indexes on the other way work differently. Even if you have a compound index, say (Latitude, Longitude), where both coordinates are included the index, they do not have the same weight. In other words, two points with different Latitude and the same exact Longitude will be very far apart (in the index structure). Much more apart than two points with same Latitude and different Longitude.

So, searching for nearby points (with the above (Lat, Long) index) could possibly find fast nearby points with the exact same Latitude. But for all the other points, it will not be very efficient.

As an example, consider you are looking for places nearby a village in Africa - a village that has the same Latitude with London. Your query would spend lots of time calculating the distances between your village and all the (millions) points in UK that have the same (or almost the same) latitude with it. And then reject those points because they are too far apart.

You might think then that two B-tree indices, one on (Latitude) and one on (Longitude) would be fine for such queries. After all, why shouldn't both indices be used in a query that has a:

WHERE Latitude  BETWEEN @LatStart  AND @LatEnd 
  AND Longitude BETWEEN @LongStart AND @LongEnd

I think that DBMS cannot use two (B-tree) indices like that in a query (MySQL cannot AFAIK). Even if it could though, it wouldn't be very good, preformance-wise, as both conditions may not narrow the search much. So, after these two index scans, a million rows found that match one condition would have to be merged with a million rows from the other one. And this merge will probably require sorting as well - which will kill performance if the subresults are big.

In general, any query that has more than one range condition on the same table is difficult to be optimized by B-tree indices. (for other DBMS that have other types of indices, like Hash or Bitmap, I can't speak about).

That's why R-trees and other indices that perform well in spatial searches were invented in the first place. They kind of "interleave" both coordinates (or all 3 or 4 in higher dimensional indices) in the index structure and no coordinate has higher weight than the others.

Related Solutions

MySQL 5.5 – Less Restrictive Query Returns Fewer Results Due to Removing One Constraint

If the second (indeed less restrictive) query returns zero rows while the first returns more than zero rows, then this is a bug.

First check if you can repoduce the error with only table or not. If the error stays while you remove the DISTINCT and/or the ORDER BY ... LIMIT.

Then try to write the set of statements (CREATE tables, INSERT rows, and the 2 queries, as simplified as possible) that reproduces the error, so others can test.

Then you can submit it to MySQL.

Connecting to MySQL on an EC2 Instance

Since Zapier is coming in from outside your instance, a possible cause of your problem is the security group for your instance. Most ports on EC2 instances (except the SSH port) are normally blocked by default unless you explicitly open them.

So you would need to open the mysql port 3306 on your instance up to the ip address Zapier is coming in from. (preferably you would not open it up to everyone in the world.)

See Using Amazon RDS/EC2 on the Zapier site, starting from the section where it says Can't Connect to My Database. They continue on to provide specific instructions for the security group settings.

You will also probably need to grant permission on the database using the mysql GRANT command to the user coming in from the Zapier ip address.

Something like this should work :

GRANT ALL PRIVILEGES
ON mydb.*
TO 'user'@'54.86.9.50'
IDENTIFIED BY 'newpassword';

Or you can allow that user to come in from any ip adress like this:

GRANT ALL PRIVILEGES
ON mydb.*
TO 'user'@'%'
IDENTIFIED BY 'newpassword';

(Incoming requests would still be limited to Zapier ip addresses due to the security group settings.)

Best Answer

Related Solutions

MySQL 5.5 – Less Restrictive Query Returns Fewer Results Due to Removing One Constraint

Connecting to MySQL on an EC2 Instance

Related Question