Sql-server – Are these two queries logically equivalent

performancequeryquery-performancequery-refactorsql server

Are these two queries logically equivalent?

DECLARE @DateTime DATETIME = GETDATE()

Query 1

SELECT *
FROM   MyTable
WHERE  Datediff(DAY, LogInsertTime, @DateTime) > 7

Query 2

SELECT *
FROM   MyTable
WHERE  LogInsertTime < @DateTime - 7

If they are not logically equivalent, can you give me the logical equivalent of the first query so that the WHERE clause can effectively use an index (i.e. eliminate function wrapping)?

Best Answer

Whether the two queries you posted are logically equivalent is irrelevant; you shouldn't use either of them. I will try to steer you away from a couple of things:

Whenever possible, try to avoid applying functions to columns. It's always as good, and mostly better, to keep those calculations against constants and not columns - this can destroy SARGability and render indexes on those columns useless. In this case, I much prefer query 2, especially if LogDateTime is indexed (or might ever be).
I don't like the shorthand date math and I recommend against it. Sure, it's faster to type, but try that with a DATE data type and you will get an ugly error. Much better to spell it out, e.g.:
```
WHERE LogInsertTime < DATEADD(DAY, -7, @DateTime);
```

Related Solutions

Sql-server – Large time difference between two almost identical simple queries…why

I am willing to bet the issue is that the first does a scan of the table, while the second can look things up from an index. This happens all the time. The reason, as Phil has said, is that there are an infinite number of possibilities.

If you want to make the first approach work better with more flexibility you need to use a recursive query structure. Something like:

WITH RECURSIVE sparse_scan AS (
    SELECT m.Metadata, COUNT(*) As Count
      FROM TWG10MinData twg, Metadata m
     WHERE twg.metadata = m.Metadata and m.Metadata = 1100
  GROUP BY m.Metadata
 UNION ALL
    SELECT m.Metadata, COUNT(*) As Count
      FROM TWG10MinData twg, Metadata m
      JOIN sparse_scan s ON (m.Metadata = s.Metadata + 1000)
     WHERE twg.metadata = m.Metadata 
  GROUP BY m.Metadata
)
SELECT * FROM sparse_scan order by m.Metadata;

This works differently, running the query for 1000, 2100, 3100, 4100, etc, until a row is not returned. You can be a little fancier to skip cases where a row does not exist but the next one does, but the basic approach is something I use frequently, albeit on PostgreSQL.

Sql-server – Comparing two queries in SQL Server 2012

I love your approach to careful consideration to query tuning and reviewing options and plans. I wish more developers did this. One caution would be - always test with a lot of rows, looking at the logical reads, this is a smallish table. Try and generate a sample load and run the query again. One small issue - in your top query you are not asking for an order by, in your bottom query you are. You should compare and contrast them each with ordering.

I just quickly created a SalesOrders table with 200,000 sales orders in it - still not huge by any stretch of the imagination. And ran the queries with the ORDER BY in each. I also played with indexes a bit.

With no clustered index on OrderID, just a non-clustered index on CustID The second query outperformed. Especially with the order by included in each. There was twice as many reads on the first query than the second query, and the cost percentages were 67% / 33% between the queries.

With a clustered index on OrderID and a non-clustered index just on CustID They performed in a similar speed and the exact same number of reads.

So I would suggest you increase the number of rows and do some more testing. But my final analysis on your queries -

You may find them behaving more similarly than you realize when you increase the rows, so keep that caveat in mind and test that way.

If all you ever want to return is the maximum OrderID for each Customer, and you want to determine that by the OrderID being the greatest OrderID then the second query out of these two is the best way to go from my mindset - it is a bit simpler and while ever so slightly more expensive based on subtree cost it is a quicker and easier to decipher statement. If you intend on adding other columns into your result set someday? Then the first query allows you do to do that.

Updated: One of your comments under your question was:

Please keep in mind, that finding the best query in this question is a means of refining the techniques used for comparing them.

But best takeaway for doing that- test with more data - always makes sure you have data consistent with production and expected future production. Query plans start looking data when you give more rows to the tables, and try and keep the distribution what you'd expect in production. And pay attention to things like including Order By or not, here I don't think it makes a terrible bit of difference in the end, but still worth digging into.

Your approach of comparing this level of detail and data is a good one. Subtree costs are arbitrary and meaningless mostly, but still worth at least looking at for comparison between edits/changes or even between queries. Looking at the time statistics and the IO are quite important, as is looking at the plan for anything that feels out of place for the size of the data you are working with and what you are trying to do.

Best Answer

Related Solutions

Sql-server – Large time difference between two almost identical simple queries…why

Sql-server – Comparing two queries in SQL Server 2012

Related Question