Sql-server – Query Spatial Data for Nearest Neighbor for each group

nearest neighborspatialsql serversql server 2014sql-server-2012

I want to query spatial data for nearest neighbor. I am using this article and the following query works perfectly:

SELECT TOP 1 [Location].STDistance(@Location)
FROM [DS1]
WHERE [Location].STDistance(@Location) IS NOT NULL          
ORDER BY [Location].STDistance(@Location);

The issue is I need to calculate this distance for each MSPID. I have try several things (using cross apply, creating separate function, etc) but nothing worked. So, I decided to create a loop which will calculate the distance for each MSPID.

SELECT TOP 1 [Location].STDistance(@Location)
FROM [DS1]
WHERE [Location].STDistance(@Location) IS NOT NULL 
    AND @CurrentMSPID = [MSPID]         
ORDER BY [Location].STDistance(@Location);

The problem is that the above statement does not use the index. If I replace the @CurrentMSPID with a number the index is used, but when a variable is used, the clustered index is used instead the spatial one.

I have try a lot of options and the only one that works is:

OPTION (OPTIMIZE FOR ( @CurrentMSPID  = 1001 ))

or when the value is hard-coded. I cannot leave it that way because in the future this id could even not exists and it is sure that the number of rows which it matches will be changed.

The [DS1] table has primary key on [MSPID] column and spatial index on the [Location] column.

How can I help the engine to generate better execution plan and use the spatial index without using the OPTIMIZE FOR option?


It seems when other data is extracted from the table (we are filtering by MSPID now) the engine is not able to use the spatial index and performs clustered index seek instead.

Best Answer

I'm going to go out on a bit of a limb here and guess what you are trying to achieve.

I suspect that you want to find the nearest neighbour from within the DS1 table for all the rows in DS1.

For a test set I created the following randomly filled table.

-- Test table
CREATE TABLE DS1 ( 
  MSPID INTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY,
  Location Geometry NOT NULL
);

-- 1 million random points
INSERT INTO DS1 (Location)
SELECT TOP 1000000 
  Geometry::Point(
    RAND(CAST(NEWID() AS VARBINARY))*100000,
    RAND(CAST(NEWID() AS VARBINARY))*100000,
    0) Location
FROM TALLY;

CREATE SPATIAL INDEX DS1_SIDX ON DS1 (Location) 
USING GEOMETRY_AUTO_GRID 
WITH (BOUNDING_BOX =(-15000, -15000, 2015000, 2015000));

I a version of your initial query for a quick test, but unfortunately on my desktop the optimiser decide that parallel was better and ignored the spatial index resulting in a 3 second query. Restricting it to a single core caused it to use the index and returned in milliseconds.

-- Nearest neighbour test
DECLARE @Location Geometry = Geometry::Point(50000,50000,0);

/* Ignored index
 SQL Server Execution Times:
   CPU time = 21781 ms,  elapsed time = 3805 ms.
*/
SELECT TOP 1  
  MSPID, 
  Location.STDistance(@location) Distance
FROM DS1
WHERE Location.STDistance(@location) IS NOT NULL
ORDER BY Location.STDistance(@location)

/* Used index
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 10 ms.
*/
SELECT TOP 1  
  MSPID, 
  Location.STDistance(@location) Distance
FROM DS1
WHERE Location.STDistance(@location) IS NOT NULL
ORDER BY Location.STDistance(@location)
OPTION (MAXDOP 1);

To achieve what you want for a single row from DS1 and assuming that you want the nearest row from DS1 to the one you've picked, the following uses the index for me and again works in milliseconds.

-- For a specific row using cross apply
DECLARE @MSPID INTEGER = 500000;

SELECT 
  a.MSPID FromMSPID,
  b.MSPID ToMSPID,
  b.Distance
FROM DS1 a
  CROSS APPLY (
    SELECT TOP 1  
      MSPID, 
      c.Location.STDistance(a.Location) Distance
    FROM DS1 c
    WHERE c.Location.STDistance(a.Location) IS NOT NULL AND
      c.MSPID <> a.MSPID
    ORDER BY c.Location.STDistance(a.Location)
    ) b
WHERE a.MSPID = @MSPID
OPTION (MAXDOP 1);

This can be made to go through the entire table, but it will take some time to complete as it needs to find the nearest neighbour for each one. It took ~1 minute for 10,000, so for the full million that would take almost 2 hours. I have found that restricting this query to a single core isn't required and allowing parallism will improve performance. For me it halved the time. I would also recommend that you have a look at this question and it's answer for other performance considerations.

/* For all (well 10,000) using cross apply
 SQL Server Execution Times:
   CPU time = 60641 ms,  elapsed time = 64545 ms.
*/
SELECT TOP 10000
  a.MSPID FromMSPID,
  b.MSPID ToMSPID,
  b.Distance
FROM DS1 a
  CROSS APPLY (
    SELECT TOP 1  
      MSPID, 
      c.Location.STDistance(a.Location) Distance
    FROM DS1 c
    WHERE c.Location.STDistance(a.Location) IS NOT NULL AND
      c.MSPID <> a.MSPID
    ORDER BY c.Location.STDistance(a.Location)
    ) b
OPTION (MAXDOP 1);