SQL Server – Slow Geospatial Queries

spatialsql server

We have a simple SQL Server table with geospatial data that looks like this:

CREATE TABLE [dbo].[Factors](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [StateCode] [nvarchar](2) NOT NULL,
    [GeoLocation] [geography] NULL,
    [Factor] [decimal](18, 6) NOT NULL,
 CONSTRAINT [PK_dbo.Factors] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)

We have something like 100k+ rows in it right now, but that's expected to grow to millions.

We run queries on it that look like this:

declare @state nvarchar(2) = 'AL'
declare @point geography = geography::STGeomFromText('POINT(-86.19146040 32.38225770)', 4326)

select top 3 
    Lat,
    Lon,
    Factor, 
    GeoLocation.STDistance(@point) as Distance
from dbo.Factors
where StateCode = @state and GeoLocation.STDistance(@point) is not null
order by Distance

Here's the bit that's weird though. The data in that table is spotty: we've got it for the southern parts of a state, for instance, but not for the whole state. If the point we search for is within a few hundred meters of points we've got data for (for instance, from the southern part of the state), the query returns subsecond. But if it's, say, 100 kilometers from the nearest data point (for instance, if the target point is from the northern part of the state), then the query will take up to 3 minutes or so to return. In both instances, query plans indicate that they're starting with a scan of the geospatial index, so it's not the problem that sometimes happens, that SQL Server can't figure out it's supposed to use the index in question.

My assumption is that it has something to do with how the geospatial index is laid out.

CREATE SPATIAL INDEX IX_Factors_Spatial 
ON [dbo].[Factors] (GeoLocation)
USING GEOGRAPHY_AUTO_GRID 
    WITH (
            CELLS_PER_OBJECT = 16, 
            PAD_INDEX = OFF, 
            STATISTICS_NORECOMPUTE = OFF, 
            SORT_IN_TEMPDB = OFF, 
            DROP_EXISTING = OFF, 
            ONLINE = OFF, 
            ALLOW_ROW_LOCKS = ON, 
            ALLOW_PAGE_LOCKS = ON);

But I don't know that I grok the details well enough to put my finger on the problem.

Any suggestions for how to approach troubleshooting this?

Best Answer

Short answer: Compare actual execution plans for the fast and slow variants and you'll see yourself.

When the given @point is close to the points in the table, the tessellations used in the spatial index actually help to dismiss most of the rows and only few seeks of the index are necessary.

When the given @point is far from any point in the table, the engine effectively has to read all rows. It seeks an index 100K times, which is slow.

If you disable the spatial index you'll see that performance of the query becomes the same for any given @point. It will be slower than your fast variant when index is useful, but it will be faster than your slow variant when the index is harmful.

See Spatial Indexes Overview if you haven't already for the basic details about the internal structures of such index.

Sample test data

CREATE TABLE [dbo].[Factors](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [GeoLocation] [geography] NULL,
    [Factor] [decimal](18, 6) NOT NULL,
 CONSTRAINT [PK_Factors] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

Generate ~100K rows within ~20km by ~20km area.

DECLARE @MinLat float = -38.180184;
DECLARE @MaxLat float = -38.000000;
DECLARE @MinLon float = 145.000000;
DECLARE @MaxLon float = 145.227707;
DECLARE @PointCount int = 317;

WITH 
x AS
(
    SELECT TOP (@PointCount)
        ROW_NUMBER() OVER (ORDER BY [object_id]) AS rn
    FROM sys.all_objects
)
INSERT INTO [dbo].[Factors]
    ([GeoLocation]
    ,[Factor])
SELECT
    geography::Point(
         @MinLat + (TLat.rn-1) * (@MaxLat - @MinLat) / (@PointCount-1)
        ,@MinLon + (TLon.rn-1) * (@MaxLon - @MinLon) / (@PointCount-1)
        ,4326) AS GeoLocation
    ,0 AS Factor
FROM
    x AS TLat CROSS JOIN x AS TLon
ORDER BY TLat.rn, TLon.rn;

Create default spatial index

CREATE SPATIAL INDEX [IX_GeoLocation] ON [dbo].[Factors]
(
    [GeoLocation]
)USING  GEOGRAPHY_AUTO_GRID 
WITH (
CELLS_PER_OBJECT = 16, 
PAD_INDEX = OFF, 
STATISTICS_NORECOMPUTE = OFF, 
SORT_IN_TEMPDB = OFF, 
DROP_EXISTING = OFF, 
ONLINE = OFF, 
ALLOW_ROW_LOCKS = ON, 
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

Test queries

@point1 is close to other points in the table. @point2 is far away from other points in the table.

declare @point1 geography = geography::Point(-38.000000, 145.000000, 4326);
declare @point2 geography = geography::Point(+38.000000, 145.000000, 4326);

select top 3 
    Factor, 
    GeoLocation.STDistance(@point1) as Distance
from dbo.Factors
where GeoLocation.STDistance(@point1) is not null
order by Distance
option(recompile);


select top 3 
    Factor, 
    GeoLocation.STDistance(@point2) as Distance
from dbo.Factors
where GeoLocation.STDistance(@point2) is not null
order by Distance
option(recompile);

Execution plans and IO

Index enabled

IO. Top result is fast (7ms, 171 reads). Bottom result is slow (5,693ms, 234,662 reads).

index enabled IO

Fast.

index enabled fast

Slow.

index enabled slow

Index disabled

IO. Both queries have the same number of reads (601) and same duration (~1700ms).

index disabled IO

Plan is the same for both queries:

index disabled plan

It is faster to scan 100K rows, than seek 100K times.


I don't know how to solve the problem, if there is a way to get the best of the both worlds and somehow automatically decide whether to use the index or not.

You can try to calculate the bounding box (min/max lat/lon) and change the logic based on whether the given point is within the bounding box.

The most interesting thing happens in that built-in Geodetic Tesselation table-valued function and I don't see how to fine-tune it.

With spatial indexes a lot depends on your data distribution.

In some cases you may be better off with two separate simple standard indexes on latitude and longitude if you know that your data is dense and you can limit the search to a narrow stripe or small area (given point +- few km).