Sql-server – How to optimize a query so that it seeks on one index first, and then another index after that

indexjoin;performancequery-performancespatialsql server

I have two sets of earth measurements from satellite data, each with time fields (mjd for mean julian date) and geography positions (GeoPoint, spacial) and I'm looking for coincidences between the two sets such that their times match to a threshold of 3hrs (or .125 days) and their distances to within 200 km of each other.

I have made indexes for both the mjd fields on both tables and spatial tables.

When I just join on the time constraint, the database calculates 100,000 matches in 8 seconds and computes the distances for all 100,000 matches in that time. The query looks like this:

select top 100000 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0
from L2V5.dbo.header h join L2.dbo.MLS_Header m
on h.mjd between m.mjd-.125 and m.mjd+.125
option( table hint ( h, index(ix_MJD) ), table hint( m, index(ix_MJD) ) )

And the executed plan is:

Only mjd constraint

When sorted, 9 of the distances were under 200km, so there are matches. The trouble is, when I add the distance constraint and run this instead,

select top 10 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0
from L2V5.dbo.header h join L2.dbo.MLS_Header m
on h.mjd between m.mjd-.125 and m.mjd+.125
and h.GeoPoint.STDistance(m.GeoPoint)<200000
option( table hint ( h, index(ix_MJD) ), table hint( m, index(ix_MJD) ) )

it goes away for a long time. Obviously, in 8 seconds, it could find 100,000 time matches, 9 of which were under 200km, so the optimizer must be trying something sub-optimal. The plan looks similar to above with a filter on the distances (I'm guessing).

with spatial constrant, no spatial filter

I can force the use of the spatial index with this:

select top 5 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0 
from L2V5.dbo.header h join L2.dbo.MLS_Header m 
on h.GeoPoint.STDistance(m.GeoPoint)<200000
and h.mjd between m.mjd-.125 and m.mjd+.125 
option( table hint ( h, index(ix_MJD), index(ix_GeoPoint) ), table hint( m, index(ix_MJD) ) )

both constraints with both indexes

which then takes 3 minutes to find 5 matches.

How do I tell the query optimizer to use the MJD index seek first, and then the spatial index second (or is that what it is doing already) and is there any way I can help it out by telling it how many matches to expect? If it can compute 100,000 matches with distances in 8 seconds that has 9 under 200km, shouldn't the addition of the spatial index make it faster not slower?

Thanks for any other tips or ideas.

EDIT: To answer the question what the plan looks like without the hints, this (and it takes forever):

no hints

It maybe also worth mentioning that there are almost 1M records in the one table and 8M in the other

Best Answer

The problem is that it might (and knowing spatial indexes, probably will) assume that the spatial filter will be a lot more selective than the time filter.

But if you have a few million records within 200km, then it could be significantly worse.

You're asking it to find records within 200km, which returns data ordered by some spatial order. Finding the records in there that are close in time means checking each one.

Or else you're finding records by time, and you're getting results in time order. Then, filtering this list to the 200km radius is a matter of checking each one.

If you filter the data in two ranges like this, it becomes hard to apply the second filter using an index. You may be better off telling it not to use the spatial index if the time filter is the tighter one.

If both are large individually, and it's only together that they are tight, then you have a more complex issue, one that people have tried to solve for a long time, and which could be nicely resolved by indexes that cover 3D (and beyond) space. Except that SQL Server doesn't have them.

Sorry.

Edit: more info...

This is a similar problem to finding time ranges that cover a particular point in time. When you search for the records that start before that point, you then have an unordered mess of end times - and vice-versa. If you look for people in the phone book whose surnames start with F, you can't hope to find the people whose first names start with R very easily. And an index on first name doesn't help either for the same reason. Finding things in that next index is hard when your first index is not an equality.

Now, if you could change your date filter into an equality filter (or series of equality filters), then you could stand a chance, except that a spatial index is a special kind of index and can't be used as the second level in a composite index.

So you're left with an awkward situation, I'm afraid. :(

Edit: Try:

select top 100000 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0
from L2V5.dbo.header h join L2.dbo.MLS_Header m
on h.mjd between m.mjd-.125 and m.mjd+.125
where h.GeoPoint.STDistance(m.GeoPoint)/1000.0 < 200
option( table hint ( h, index(ix_MJD) ) );

Notice that I'm deliberately breaking the sargability by dividing by 1000 before comparing to 200. I want this work to be done in the Key Lookup.

Mind you, you could avoid the need for the lookups (and the hints) by INCLUDEing GeoPoint and Time in both ix_MJD indexes. That'll certainly take some of the heat out of the query plan.