We have a simple SQL Server table with geospatial data that looks like this:
CREATE TABLE [dbo].[Factors](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [StateCode] [nvarchar](2) NOT NULL,
    [GeoLocation] [geography] NULL,
    [Factor] [decimal](18, 6) NOT NULL,
    CONSTRAINT [PK_dbo.Factors] PRIMARY KEY CLUSTERED
    (
        [Id] ASC
    )
);
We have something like 100k+ rows in it right now, but that's expected to grow to millions.
We run queries on it that look like this:
declare @state nvarchar(2) = 'AL'
declare @point geography = geography::STGeomFromText('POINT(-86.19146040 32.38225770)', 4326)

select top 3
    GeoLocation.Lat as Lat,
    GeoLocation.Long as Lon,
    Factor,
    GeoLocation.STDistance(@point) as Distance
from dbo.Factors
where StateCode = @state and GeoLocation.STDistance(@point) is not null
order by Distance
Here's the weird bit, though. The data in that table is spotty: we have it for the southern part of a state, for instance, but not for the whole state. If the point we search for is within a few hundred meters of points we have data for (say, in the southern part of the state), the query returns subsecond. But if it's, say, 100 kilometers from the nearest data point (say, the target point is in the northern part of the state), the query can take up to 3 minutes to return. In both cases the query plans show that they start with a scan of the spatial index, so this isn't the common problem where SQL Server fails to use the spatial index at all.
My assumption is that it has something to do with how the geospatial index is laid out.
CREATE SPATIAL INDEX IX_Factors_Spatial
ON [dbo].[Factors] (GeoLocation)
USING GEOGRAPHY_AUTO_GRID
WITH (
CELLS_PER_OBJECT = 16,
PAD_INDEX = OFF,
STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
ONLINE = OFF,
ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON);
But I don't know that I grok the details well enough to put my finger on the problem.
Any suggestions for how to approach troubleshooting this?
Best Answer
Short answer: compare the actual execution plans for the fast and slow variants and you'll see it yourself.

When the given @point is close to the points in the table, the tessellations used in the spatial index help to dismiss most of the rows, and only a few seeks of the index are necessary. When the given @point is far from any point in the table, the engine effectively has to read all rows. It seeks the index 100K times, which is slow.

If you disable the spatial index you'll see that performance of the query becomes the same for any given @point. It will be slower than your fast variant when the index is useful, but faster than your slow variant when the index is harmful.

See Spatial Indexes Overview, if you haven't already, for the basic details about the internal structures of such an index.
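To confirm that the index itself is what hurts, you can force the optimizer to bypass it with an index hint on the clustered primary key and compare timings. A sketch against the question's table (the hint syntax is standard T-SQL; everything else is copied from the question):

```sql
-- Same query, but forced through the clustered PK instead of the spatial index.
declare @state nvarchar(2) = 'AL'
declare @point geography = geography::STGeomFromText('POINT(-86.19146040 32.38225770)', 4326)

select top 3
    Factor,
    GeoLocation.STDistance(@point) as Distance
from dbo.Factors with (index([PK_dbo.Factors]))
where StateCode = @state and GeoLocation.STDistance(@point) is not null
order by Distance
```

If this runs in roughly the same time for both near and far points, the spatial index is the culprit in the far case.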
Sample test data
Generate ~100K rows within ~20km by ~20km area.
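The original generation script isn't reproduced here; a minimal sketch of the idea (the box around 32°N, 86°W and the column values are assumptions — a 0.18° span of latitude is roughly 20 km):

```sql
-- Insert ~100K random points inside a roughly 20km x 20km box, set-based.
with n as (
    select top (100000) row_number() over (order by (select null)) as rn
    from sys.all_objects a cross join sys.all_objects b
)
insert into dbo.Factors (StateCode, GeoLocation, Factor)
select 'AL',
       geography::Point(32.0 + 0.18 * rand(checksum(newid())),
                        -86.0 + 0.18 * rand(checksum(newid())),
                        4326),
       rand(checksum(newid()))
from n
```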
Create default spatial index
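Something in the same shape as the question's index, with default options, e.g.:

```sql
CREATE SPATIAL INDEX IX_Factors_Spatial
ON dbo.Factors (GeoLocation)
USING GEOGRAPHY_AUTO_GRID
```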
Test queries
@point1 is close to other points in the table. @point2 is far away from other points in the table.
Execution plans and IO
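The two queries follow the question's pattern; a sketch (the exact coordinates are assumptions: @point1 sits inside the generated box, @point2 roughly 150 km north of it):

```sql
declare @point1 geography = geography::Point(32.09, -85.91, 4326)  -- inside the data
declare @point2 geography = geography::Point(33.50, -85.91, 4326)  -- ~150 km away

select top 3 Factor, GeoLocation.STDistance(@point1) as Distance
from dbo.Factors
where GeoLocation.STDistance(@point1) is not null
order by Distance

select top 3 Factor, GeoLocation.STDistance(@point2) as Distance
from dbo.Factors
where GeoLocation.STDistance(@point2) is not null
order by Distance
```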
Index enabled
IO. Top result is fast (7ms, 171 reads). Bottom result is slow (5,693ms, 234,662 reads).
Fast query plan (screenshot).
Slow query plan (screenshot).
Index disabled
IO. Both queries have the same number of reads (601) and same duration (~1700ms).
Plan is the same for both queries (screenshot).
It is faster to scan 100K rows, than seek 100K times.
I don't know a clean way to solve the problem — that is, whether there is a way to get the best of both worlds and somehow automatically decide whether to use the index or not.
You can try to calculate the bounding box (min/max lat/lon) and change the logic based on whether the given point is within the bounding box.
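That branching logic might look like this sketch (the bounding-box lookup and the index hint for the "outside" branch are my assumptions, and @state, @point, @lat, @lon are assumed to be declared as in the question's query):

```sql
-- Bounding box of the data we actually have for this state.
declare @minLat float, @maxLat float, @minLon float, @maxLon float
select @minLat = min(GeoLocation.Lat),  @maxLat = max(GeoLocation.Lat),
       @minLon = min(GeoLocation.Long), @maxLon = max(GeoLocation.Long)
from dbo.Factors
where StateCode = @state

if @lat between @minLat and @maxLat and @lon between @minLon and @maxLon
    -- Inside the box: the spatial index is likely to help.
    select top 3 Factor, GeoLocation.STDistance(@point) as Distance
    from dbo.Factors
    where StateCode = @state and GeoLocation.STDistance(@point) is not null
    order by Distance
else
    -- Far from the data: a plain scan avoids ~100K index seeks.
    select top 3 Factor, GeoLocation.STDistance(@point) as Distance
    from dbo.Factors with (index([PK_dbo.Factors]))
    where StateCode = @state and GeoLocation.STDistance(@point) is not null
    order by Distance
```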
The most interesting things happen in the built-in Geodetic Tessellation table-valued function, and I don't see how to fine-tune it.
With spatial indexes a lot depends on your data distribution.
In some cases you may be better off with two separate simple standard indexes on latitude and longitude if you know that your data is dense and you can limit the search to a narrow stripe or small area (given point +- few km).
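As a sketch of that alternative (whether the computed columns can be persisted and indexed depends on SQL Server's determinism rules for geography methods, and the ±0.1° stripe — roughly 11 km north-south — is an assumption):

```sql
-- Expose latitude/longitude as computed columns so ordinary B-tree indexes apply.
alter table dbo.Factors add Lat as GeoLocation.Lat persisted
alter table dbo.Factors add Lon as GeoLocation.Long persisted

create index IX_Factors_State_Lat_Lon on dbo.Factors (StateCode, Lat, Lon) include (Factor)

-- Search only a narrow box around the target, then order survivors by true distance.
select top 3 Factor, GeoLocation.STDistance(@point) as Distance
from dbo.Factors
where StateCode = @state
  and Lat between @lat - 0.1 and @lat + 0.1
  and Lon between @lon - 0.1 and @lon + 0.1
order by Distance
```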