Sql-server – Query Spatial Data for Nearest Neighbor for each group

nearest neighborspatialsql serversql server 2014sql-server-2012

I want to query spatial data for nearest neighbor. I am using this article and the following query works perfectly:

SELECT TOP 1 [Location].STDistance(@Location)
FROM [DS1]
WHERE [Location].STDistance(@Location) IS NOT NULL          
ORDER BY [Location].STDistance(@Location);

The issue is I need to calculate this distance for each MSPID. I have try several things (using cross apply, creating separate function, etc) but nothing worked. So, I decided to create a loop which will calculate the distance for each MSPID.

SELECT TOP 1 [Location].STDistance(@Location)
FROM [DS1]
WHERE [Location].STDistance(@Location) IS NOT NULL 
    AND @CurrentMSPID = [MSPID]         
ORDER BY [Location].STDistance(@Location);

The problem is that the above statement does not use the index. If I replace the @CurrentMSPID with a number the index is used, but when a variable is used, the clustered index is used instead the spatial one.

I have try a lot of options and the only one that works is:

OPTION (OPTIMIZE FOR ( @CurrentMSPID  = 1001 ))

or when the value is hard-coded. I cannot leave it that way because in the future this id could even not exists and it is sure that the number of rows which it matches will be changed.

The [DS1] table has primary key on [MSPID] column and spatial index on the [Location] column.

How can I help the engine to generate better execution plan and use the spatial index without using the OPTIMIZE FOR option?

It seems when other data is extracted from the table (we are filtering by MSPID now) the engine is not able to use the spatial index and performs clustered index seek instead.

Best Answer

I'm going to go out on a bit of a limb here and guess what you are trying to achieve.

I suspect that you want to find the nearest neighbour from within the DS1 table for all the rows in DS1.

For a test set I created the following randomly filled table.

-- Test table
CREATE TABLE DS1 ( 
  MSPID INTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY,
  Location Geometry NOT NULL
);

-- 1 million random points
INSERT INTO DS1 (Location)
SELECT TOP 1000000 
  Geometry::Point(
    RAND(CAST(NEWID() AS VARBINARY))*100000,
    RAND(CAST(NEWID() AS VARBINARY))*100000,
    0) Location
FROM TALLY;

CREATE SPATIAL INDEX DS1_SIDX ON DS1 (Location) 
USING GEOMETRY_AUTO_GRID 
WITH (BOUNDING_BOX =(-15000, -15000, 2015000, 2015000));

I a version of your initial query for a quick test, but unfortunately on my desktop the optimiser decide that parallel was better and ignored the spatial index resulting in a 3 second query. Restricting it to a single core caused it to use the index and returned in milliseconds.

-- Nearest neighbour test
DECLARE @Location Geometry = Geometry::Point(50000,50000,0);

/* Ignored index
 SQL Server Execution Times:
   CPU time = 21781 ms,  elapsed time = 3805 ms.
*/
SELECT TOP 1  
  MSPID, 
  Location.STDistance(@location) Distance
FROM DS1
WHERE Location.STDistance(@location) IS NOT NULL
ORDER BY Location.STDistance(@location)

/* Used index
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 10 ms.
*/
SELECT TOP 1  
  MSPID, 
  Location.STDistance(@location) Distance
FROM DS1
WHERE Location.STDistance(@location) IS NOT NULL
ORDER BY Location.STDistance(@location)
OPTION (MAXDOP 1);

To achieve what you want for a single row from DS1 and assuming that you want the nearest row from DS1 to the one you've picked, the following uses the index for me and again works in milliseconds.

-- For a specific row using cross apply
DECLARE @MSPID INTEGER = 500000;

SELECT 
  a.MSPID FromMSPID,
  b.MSPID ToMSPID,
  b.Distance
FROM DS1 a
  CROSS APPLY (
    SELECT TOP 1  
      MSPID, 
      c.Location.STDistance(a.Location) Distance
    FROM DS1 c
    WHERE c.Location.STDistance(a.Location) IS NOT NULL AND
      c.MSPID <> a.MSPID
    ORDER BY c.Location.STDistance(a.Location)
    ) b
WHERE a.MSPID = @MSPID
OPTION (MAXDOP 1);

This can be made to go through the entire table, but it will take some time to complete as it needs to find the nearest neighbour for each one. ~~It took ~1 minute for 10,000, so for the full million that would take almost 2 hours.~~ I have found that restricting this query to a single core isn't required and allowing parallism will improve performance. For me it halved the time. I would also recommend that you have a look at this question and it's answer for other performance considerations.

/* For all (well 10,000) using cross apply
 SQL Server Execution Times:
   CPU time = 60641 ms,  elapsed time = 64545 ms.
*/
SELECT TOP 10000
  a.MSPID FromMSPID,
  b.MSPID ToMSPID,
  b.Distance
FROM DS1 a
  CROSS APPLY (
    SELECT TOP 1  
      MSPID, 
      c.Location.STDistance(a.Location) Distance
    FROM DS1 c
    WHERE c.Location.STDistance(a.Location) IS NOT NULL AND
      c.MSPID <> a.MSPID
    ORDER BY c.Location.STDistance(a.Location)
    ) b
OPTION (MAXDOP 1);

Related Solutions

Sql-server – Spatial index not used

Correct. Spatial indexes don't get leveraged in that situation, sadly.

Spatial indexes provide a set of grids, allow the system to identify geometries (or geographies) that overlap these grids.

Your best bet is to set a threshold of acceptable closeness, and try that, using something like STBuffer. STIntersects works well, and you can increase this threshold if nothing is found (eg, using a second OUTER APPLY if nothing is found.

Edit: They do now! Check my post on it... http://blogs.lobsterpot.com.au/2014/08/14/sql-spatial-getting-nearest-calculations-working-properly/

In summary, to get your index to be used, you need to make sure that your ORDER BY value can't be NULL - just add a WHERE...IS NOT NULL clause and it should work.

Spatialite: use of virtual spatial index for distances

I know it is late but I recently stumbled accross the same problem. It is not working in your example because the search_frame parameter act as a filter and is equivalent to the function RTreeIntersect.

Here since you use a point it will only look for point wich intersect togetether (when using their bounding box). The idea is to use a bouding box centered on the pp1 point, with side the length of the desired distance.

SELECT pp1.name AS "PopulatedPlace #1",
  GeodesicLength(MakeLine(pp1.geometry, pp2.geometry)) AS "Distance (meters)",
  pp2.name AS "PopulatedPlace #2"
FROM populated_places AS pp1,
populated_places AS pp2
WHERE GeodesicLength(
  MakeLine(pp1.geometry, pp2.geometry)) < 1000.0
  AND pp1.id <> pp2.id
AND pp2.ROWID IN (
  SELECT ROWID
  FROM SpatialIndex
  WHERE
   f_table_name ='populated_places'
   AND search_frame = ST_Expand(pp1.geometry, 500.0)
ORDER BY 2;

To do that we simply use the function ST_Expand which return the bounding box envelopping the geometry (here it is the point) and expand it uniformly.

So we filter the search as follow : only pp2 point inside the square centered on pp1 with 1000 meter side length will be considered.

Best Answer

Related Solutions

Sql-server – Spatial index not used

Spatialite: use of virtual spatial index for distances

Related Question