I'm just curious why an aggregate query runs so much faster with a GROUP BY
clause than without one.
For example, this query takes almost 10 seconds to run
SELECT MIN(CreatedDate)
FROM MyTable
WHERE SomeIndexedValue = 1
While this one takes less than a second
SELECT MIN(CreatedDate)
FROM MyTable
WHERE SomeIndexedValue = 1
GROUP BY CreatedDate
There is only one CreatedDate
in this case, so the grouped query returns the same results as the ungrouped one.
I noticed the execution plans for the two queries are different – The second query uses Parallelism while the first query does not.
Is it normal for SQL server to evaluate an aggregate query differently if it doesn't have a GROUP BY clause? And is there something I can do to improve the performance of the 1st query without using a GROUP BY
clause?
Edit
I just learned I can use OPTION(querytraceon 8649)
to set the cost overhead of parallelism to 0, which makes makes the query use some parallelism and reduces the runtime to 2 seconds, although I don't know if there's any downsides to using this query hint.
SELECT MIN(CreatedDate)
FROM MyTable
WHERE SomeIndexedValue = 1
OPTION(querytraceon 8649)
I'd still prefer a shorter runtime since the query is meant to populate a value upon user selection, so should ideally be instantaneous like the grouped query is. Right now I'm just wrapping my query, but I know that's not really an ideal solution.
SELECT Min(CreatedDate)
FROM
(
SELECT Min(CreatedDate) as CreatedDate
FROM MyTable WITH (NOLOCK)
WHERE SomeIndexedValue = 1
GROUP BY CreatedDate
) as T
Edit #2
In response to Martin's request for more info:
Both CreatedDate
and SomeIndexedValue
have a separate non-unique, non-clustered index on them. SomeIndexedValue
is actually a varchar(7) field, even though it stores a numeric value that points to the PK (int) of another table. The relationship between the two tables is not defined in the database. I am not supposed to change the database at all, and can only write queries that query data.
MyTable
contains over 3 million records, and each record is assigned a group it belongs to (SomeIndexedValue
). The groups can be anywhere from 1 to 200,000 records
Best Answer
It looks like it is probably following an index on
CreatedDate
in order from lowest to highest and doing lookups to evaluate theSomeIndexedValue = 1
predicate.When it finds the first matching row it is done, but it may well be doing many more lookups than it expects before it finds such a row (it assumes the rows matching the predicate are randomly distributed according to date.)
See my answer here for a similar issue
The ideal index for this query would be one on
SomeIndexedValue, CreatedDate
. Assuming that you can't add that or at least make your existing index onSomeIndexedValue
coverCreatedDate
as an included column then you could try rewriting the query as followsto prevent it from using that particular plan.