We have a simple SQL Server table with geospatial data that looks like this:
CREATE TABLE [dbo].[Factors](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [StateCode] [nvarchar](2) NOT NULL,
    [GeoLocation] [geography] NULL,
    [Factor] [decimal](18, 6) NOT NULL,
    CONSTRAINT [PK_dbo.Factors] PRIMARY KEY CLUSTERED
    (
        [Id] ASC
    )
);
We have something like 100k+ rows in it right now, but that's expected to grow to millions.
We run queries on it that look like this:
declare @state nvarchar(2) = 'AL'
declare @point geography = geography::STGeomFromText('POINT(-86.19146040 32.38225770)', 4326)

select top 3
    GeoLocation.Lat as Lat,
    GeoLocation.Long as Lon,
    Factor,
    GeoLocation.STDistance(@point) as Distance
from dbo.Factors
where StateCode = @state and GeoLocation.STDistance(@point) is not null
order by Distance
Here's the weird bit, though. The data in that table is spotty: we have it for the southern part of a state, for instance, but not for the whole state. If the point we search for is within a few hundred meters of points we have data for (say, in the southern part of the state), the query returns subsecond. But if it's, say, 100 kilometers from the nearest data point (say, the target point is in the northern part of the state), the query can take up to 3 minutes to return. In both cases the query plans show that they start with a scan of the spatial index, so this isn't the common problem where SQL Server fails to use the spatial index at all.
My assumption is that it has something to do with how the geospatial index is laid out.
CREATE SPATIAL INDEX IX_Factors_Spatial
ON [dbo].[Factors] (GeoLocation)
USING GEOGRAPHY_AUTO_GRID
WITH (
CELLS_PER_OBJECT = 16,
PAD_INDEX = OFF,
STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
ONLINE = OFF,
ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON);
But I don't know that I grok the details well enough to put my finger on the problem.
Any suggestions for how to approach troubleshooting this?
Best Answer
Short answer: compare the actual execution plans for the fast and slow variants and you'll see it yourself.

When the given @point is close to the points in the table, the tessellations used in the spatial index help to dismiss most of the rows, and only a few seeks of the index are necessary. When the given @point is far from any point in the table, the engine effectively has to read all rows. It seeks the index 100K times, which is slow.

If you disable the spatial index you'll see that performance of the query becomes the same for any given @point. It will be slower than your fast variant when the index is useful, but faster than your slow variant when the index is harmful.

See Spatial Indexes Overview, if you haven't already, for the basic details about the internal structures of such an index.
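To confirm that the index itself is what hurts, you can force the optimizer to bypass it with an index hint on the clustered primary key and compare timings. A sketch against the question's table (the hint syntax is standard T-SQL; everything else is copied from the question):

```sql
-- Same query, but forced through the clustered PK instead of the spatial index.
declare @state nvarchar(2) = 'AL'
declare @point geography = geography::STGeomFromText('POINT(-86.19146040 32.38225770)', 4326)

select top 3
    Factor,
    GeoLocation.STDistance(@point) as Distance
from dbo.Factors with (index([PK_dbo.Factors]))
where StateCode = @state and GeoLocation.STDistance(@point) is not null
order by Distance
```

If this runs in roughly the same time for both near and far points, the spatial index is the culprit in the far case.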
Sample test data
Generate ~100K rows within ~20km by ~20km area.
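The original generation script isn't reproduced here; a minimal sketch of the idea (the box around 32°N, 86°W and the column values are assumptions — a 0.18° span of latitude is roughly 20 km):

```sql
-- Insert ~100K random points inside a roughly 20km x 20km box, set-based.
with n as (
    select top (100000) row_number() over (order by (select null)) as rn
    from sys.all_objects a cross join sys.all_objects b
)
insert into dbo.Factors (StateCode, GeoLocation, Factor)
select 'AL',
       geography::Point(32.0 + 0.18 * rand(checksum(newid())),
                        -86.0 + 0.18 * rand(checksum(newid())),
                        4326),
       rand(checksum(newid()))
from n
```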
Create default spatial index
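Something in the same shape as the question's index, with default options, e.g.:

```sql
CREATE SPATIAL INDEX IX_Factors_Spatial
ON dbo.Factors (GeoLocation)
USING GEOGRAPHY_AUTO_GRID
```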
Test queries
@point1 is close to other points in the table. @point2 is far away from other points in the table.
Execution plans and IO
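The two queries follow the question's pattern; a sketch (the exact coordinates are assumptions: @point1 sits inside the generated box, @point2 roughly 150 km north of it):

```sql
declare @point1 geography = geography::Point(32.09, -85.91, 4326)  -- inside the data
declare @point2 geography = geography::Point(33.50, -85.91, 4326)  -- ~150 km away

select top 3 Factor, GeoLocation.STDistance(@point1) as Distance
from dbo.Factors
where GeoLocation.STDistance(@point1) is not null
order by Distance

select top 3 Factor, GeoLocation.STDistance(@point2) as Distance
from dbo.Factors
where GeoLocation.STDistance(@point2) is not null
order by Distance
```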
Index enabled
IO. Top result is fast (7ms, 171 reads). Bottom result is slow (5,693ms, 234,662 reads).
Fast query plan (screenshot).
Slow query plan (screenshot).
Index disabled
IO. Both queries have the same number of reads (601) and same duration (~1700ms).
Plan is the same for both queries (screenshot).
It is faster to scan 100K rows, than seek 100K times.
I don't know a clean way to solve the problem — that is, whether there is a way to get the best of both worlds and somehow automatically decide whether to use the index or not.
You can try to calculate the bounding box (min/max lat/lon) and change the logic based on whether the given point is within the bounding box.
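That branching logic might look like this sketch (the bounding-box lookup and the index hint for the "outside" branch are my assumptions, and @state, @point, @lat, @lon are assumed to be declared as in the question's query):

```sql
-- Bounding box of the data we actually have for this state.
declare @minLat float, @maxLat float, @minLon float, @maxLon float
select @minLat = min(GeoLocation.Lat),  @maxLat = max(GeoLocation.Lat),
       @minLon = min(GeoLocation.Long), @maxLon = max(GeoLocation.Long)
from dbo.Factors
where StateCode = @state

if @lat between @minLat and @maxLat and @lon between @minLon and @maxLon
    -- Inside the box: the spatial index is likely to help.
    select top 3 Factor, GeoLocation.STDistance(@point) as Distance
    from dbo.Factors
    where StateCode = @state and GeoLocation.STDistance(@point) is not null
    order by Distance
else
    -- Far from the data: a plain scan avoids ~100K index seeks.
    select top 3 Factor, GeoLocation.STDistance(@point) as Distance
    from dbo.Factors with (index([PK_dbo.Factors]))
    where StateCode = @state and GeoLocation.STDistance(@point) is not null
    order by Distance
```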
The most interesting things happen in the built-in Geodetic Tessellation table-valued function, and I don't see how to fine-tune it.
With spatial indexes a lot depends on your data distribution.
In some cases you may be better off with two separate simple standard indexes on latitude and longitude if you know that your data is dense and you can limit the search to a narrow stripe or small area (given point +- few km).
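As a sketch of that alternative (whether the computed columns can be persisted and indexed depends on SQL Server's determinism rules for geography methods, and the ±0.1° stripe — roughly 11 km north-south — is an assumption):

```sql
-- Expose latitude/longitude as computed columns so ordinary B-tree indexes apply.
alter table dbo.Factors add Lat as GeoLocation.Lat persisted
alter table dbo.Factors add Lon as GeoLocation.Long persisted

create index IX_Factors_State_Lat_Lon on dbo.Factors (StateCode, Lat, Lon) include (Factor)

-- Search only a narrow box around the target, then order survivors by true distance.
select top 3 Factor, GeoLocation.STDistance(@point) as Distance
from dbo.Factors
where StateCode = @state
  and Lat between @lat - 0.1 and @lat + 0.1
  and Lon between @lon - 0.1 and @lon + 0.1
order by Distance
```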