Mysql – How to Best Implement nearest neighbour search in thesql

MySQL

So, in short,

  1. What should be the data type of latitude and longitude?
  2. What SQL command I should call to get the first 100 nearest restaurants for example?

Detail:

I have 100k biz record each with lattitude and longitude. I see that MySQL actually support a data type called point. Should I use that instead?

Does MySQL support KDTree storage system http://en.wikipedia.org/wiki/File:KDTree-animation.gif

Is it best to use point data type rather than regular float data type to store latitutude and longitude?

Eventually I want to find things like the first 100 restaurants closest to points 105,6 for example and my databases contains a lot of biz and points. Obviously computing the distance one by one for every records and for every points would be O(n) and hence sucks.

Notice that I am aware of a simpler solution described in How do Application Like Yelp Retrieve distance information from data base efficiently and will implement that my self too for a start. That's a good answer.

However, I think there is one creme of the crop answer that should outperform that right? In fact, storing location based on latitude and longitude and finding stuffs nearest to it is a very common problem I expect mysql to have a special design pattern for that. Does it have that?

Where can I learn more about it? Thanks.

Best Answer

As far as design patterns, the Yelp question is pretty standard stuff.

For a more complex answer, you will probably need the geospatial distance. Here is a fascinating powerpoint about that topic (and here is a pdf version of that as well). However, the math involved is quite ugly.

From their slide:

set @orig_lat=122.4058; set @orig_lon=37.7907;
set @dist=10;

SELECT *, 3956 * 2 * ASIN(SQRT(
POWER(SIN((@orig_lat - abs(dest.lat)) * pi()/180 / 2), 2) +  COS(@orig_lat * pi()/180 ) * COS(abs(dest.lat) * pi()/180) *  POWER(SIN((@orig_lon – dest.lon) * pi()/180 / 2), 2) )) as  distance
FROM hotels dest 
having distance < @dist
ORDER BY distance limit 10

There's a longer, more in-depth answer about geospatial distance on Stack Overflow.

But you still want to limit the results by latitude and longitude.

Ultimately, I would avoid the POINT datatype and go with latitude/longitude. There's currently no way to determine the distance between two POINTs, so you're going to have to store latitude/longitude for that calculation anyways.

One last link: you may also want to check out this SO thread regarding speeding up the queries using spatial indexes.