Sql-server – How to optimize a query so that it seeks on one index first, and then another index after that

indexjoin;performancequery-performancespatialsql server

I have two sets of earth measurements from satellite data, each with time fields (mjd for mean julian date) and geography positions (GeoPoint, spacial) and I'm looking for coincidences between the two sets such that their times match to a threshold of 3hrs (or .125 days) and their distances to within 200 km of each other.

I have made indexes for both the mjd fields on both tables and spatial tables.

When I just join on the time constraint, the database calculates 100,000 matches in 8 seconds and computes the distances for all 100,000 matches in that time. The query looks like this:

select top 100000 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0
from L2V5.dbo.header h join L2.dbo.MLS_Header m
on h.mjd between m.mjd-.125 and m.mjd+.125
option( table hint ( h, index(ix_MJD) ), table hint( m, index(ix_MJD) ) )

And the executed plan is:

Only mjd constraint

When sorted, 9 of the distances were under 200km, so there are matches. The trouble is, when I add the distance constraint and run this instead,

select top 10 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0
from L2V5.dbo.header h join L2.dbo.MLS_Header m
on h.mjd between m.mjd-.125 and m.mjd+.125
and h.GeoPoint.STDistance(m.GeoPoint)<200000
option( table hint ( h, index(ix_MJD) ), table hint( m, index(ix_MJD) ) )

it goes away for a long time. Obviously, in 8 seconds, it could find 100,000 time matches, 9 of which were under 200km, so the optimizer must be trying something sub-optimal. The plan looks similar to above with a filter on the distances (I'm guessing).

with spatial constrant, no spatial filter

I can force the use of the spatial index with this:

select top 5 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0 
from L2V5.dbo.header h join L2.dbo.MLS_Header m 
on h.GeoPoint.STDistance(m.GeoPoint)<200000
and h.mjd between m.mjd-.125 and m.mjd+.125 
option( table hint ( h, index(ix_MJD), index(ix_GeoPoint) ), table hint( m, index(ix_MJD) ) )

both constraints with both indexes

which then takes 3 minutes to find 5 matches.

How do I tell the query optimizer to use the MJD index seek first, and then the spatial index second (or is that what it is doing already) and is there any way I can help it out by telling it how many matches to expect? If it can compute 100,000 matches with distances in 8 seconds that has 9 under 200km, shouldn't the addition of the spatial index make it faster not slower?

Thanks for any other tips or ideas.

EDIT: To answer the question what the plan looks like without the hints, this (and it takes forever):

no hints

It maybe also worth mentioning that there are almost 1M records in the one table and 8M in the other

Best Answer

The problem is that it might (and knowing spatial indexes, probably will) assume that the spatial filter will be a lot more selective than the time filter.

But if you have a few million records within 200km, then it could be significantly worse.

You're asking it to find records within 200km, which returns data ordered by some spatial order. Finding the records in there that are close in time means checking each one.

Or else you're finding records by time, and you're getting results in time order. Then, filtering this list to the 200km radius is a matter of checking each one.

If you filter the data in two ranges like this, it becomes hard to apply the second filter using an index. You may be better off telling it not to use the spatial index if the time filter is the tighter one.

If both are large individually, and it's only together that they are tight, then you have a more complex issue, one that people have tried to solve for a long time, and which could be nicely resolved by indexes that cover 3D (and beyond) space. Except that SQL Server doesn't have them.

Sorry.

Edit: more info...

This is a similar problem to finding time ranges that cover a particular point in time. When you search for the records that start before that point, you then have an unordered mess of end times - and vice-versa. If you look for people in the phone book whose surnames start with F, you can't hope to find the people whose first names start with R very easily. And an index on first name doesn't help either for the same reason. Finding things in that next index is hard when your first index is not an equality.

Now, if you could change your date filter into an equality filter (or series of equality filters), then you could stand a chance, except that a spatial index is a special kind of index and can't be used as the second level in a composite index.

So you're left with an awkward situation, I'm afraid. :(

Edit: Try:

select top 100000 h.Time, m.Time, h.GeoPoint.STDistance(m.GeoPoint)/1000.0
from L2V5.dbo.header h join L2.dbo.MLS_Header m
on h.mjd between m.mjd-.125 and m.mjd+.125
where h.GeoPoint.STDistance(m.GeoPoint)/1000.0 < 200
option( table hint ( h, index(ix_MJD) ) );

Notice that I'm deliberately breaking the sargability by dividing by 1000 before comparing to 200. I want this work to be done in the Key Lookup.

Mind you, you could avoid the need for the lookups (and the hints) by INCLUDEing GeoPoint and Time in both ix_MJD indexes. That'll certainly take some of the heat out of the query plan.

Related Solutions

Mysql – Why does MySQL (InnoDB) table get faster after OPTIMIZE TABLE, but then don’t work

OPTIMIZE rebuilds the table. This (for InnoDB) squeezes out some of the fragmentation and wasted space. This is unlikely to make a noticeable difference in any query.

Also, OPTIMIZE does an ANALYZE. This has a chance of changing the statistics, thereby leading to a different (better or worse) EXPLAIN plan.

Since ANALYZE is much faster (on InnoDB) than OPTIMIZE, just do the ANALYZE.

Various non-ANALYZE actions cause an ANALYZE to be done.

ANALYZE randomly probes the BTrees, gathering stats. Sometimes the resulting stats are poor. There is effectively no way to prevent this from happening. Several partial hacks have been created over the years; 5.6.7 gets close to eliminating this problem with ANALYZE. Here's one of them: http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_stats_persistent_sample_pages

Mysql – How to optimize a very slow query with joins and group by

OK, I ended up adding another lookup table:

CREATE TABLE IF NOT EXISTS `stops_routes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `stop_id` varchar(100) NOT NULL,
  `route_id` varchar(100) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `stop_route` (`stop_id`,`route_id`),
  KEY `stop_id` (`stop_id`),
  KEY `route_id` (`route_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

Filling it was fairly fast:

mysql> insert into stops_routes (stop_id, route_id)
    ->
    -> select
    ->     s.stop_id,
    ->     r.route_id as from_route_id
    ->
    -> from routes r
    -> left join trips t on t.route_id = r.route_id
    -> left join stop_times st on st.trip_id = t.trip_id
    -> left join stops s on s.stop_id = st.stop_id
    -> group by s.stop_id, r.route_id;
Query OK, 3496 rows affected (8.38 sec)
Records: 3496  Duplicates: 0  Warnings: 0

Using it is blazingly fast:

mysql> select
    ->     r.route_id as from_route_id,
    ->     c_sr.route_id as to_route_id
    ->
    -> from routes r
    ->
    -> left join stops_routes sr on sr.route_id = r.route_id
    -> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
    -> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
    ->
    -> where r.route_id <> c_sr.route_id
    -> group by r.route_id, c_sr.route_id
    -> limit 10;
+---------------+-------------+
| from_route_id | to_route_id |
+---------------+-------------+
| 0001          | 0002        |
| 0001          | 0003        |
| 0001          | 0004        |
| 0001          | 0005        |
| 0001          | 0006        |
| 0001          | 0008        |
| 0001          | 0009        |
| 0001          | 0011        |
| 0001          | 0014        |
| 0001          | 0031        |
+---------------+-------------+
10 rows in set (0.63 sec)

Now I can fill my last lookup table (set of connections between every routes on my GTFS network):

mysql> insert into route_connections (from_route_id, to_route_id)
    -> select
    ->     r.route_id as from_route_id,
    ->     c_sr.route_id as to_route_id
    ->
    -> from routes r
    ->
    -> left join stops_routes sr on sr.route_id = r.route_id
    -> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
    -> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
    ->
    -> where r.route_id <> c_sr.route_id
    -> group by r.route_id, c_sr.route_id;
Query OK, 2848 rows affected (0.31 sec)
Records: 2848  Duplicates: 0  Warnings: 0

Amazingly fast. I guess the engine couldn't break up the steps to optimize this.
I'd still be interested to know if it would be possible to get the same result (from route to route connections table) using only one sub-second or sub-minute query.

Best Answer

Related Solutions

Mysql – Why does MySQL (InnoDB) table get faster after OPTIMIZE TABLE, but then don’t work

Mysql – How to optimize a very slow query with joins and group by

Related Question