Given the following tables:
CREATE TABLE FeeTestClient (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, Name VARCHAR(16))
INSERT INTO FeeTestClient (Name)
VALUES ('Test'), ('Test 2'), ('Test 3')
CREATE TABLE FeeTest (FeeId INT IDENTITY(1,1) NOT NULL PRIMARY KEY, ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16))
INSERT INTO FeeTest (ClientId, Fee, Val, Val2)
VALUES (1, 15, 'Default', 'Default'),
(1, 10, 'Default', 'asdf'),
(2, 15, 'Default', 'Default'),
(2, 20, 'Default', 'qwer'),
(2, 10, 'zxcv', 'asdf'),
(3, 20, 'Default', 'Default')
My goal is to select all FeeTestClient rows and pick the fee that is least default. The rules on a default fee are pretty simple: if Val2 is 'Default', then Val cannot be anything except 'Default'. For each client we want the first fee where Val is not 'Default', or else the first one where Val2 is not 'Default'; otherwise we are guaranteed a Val = 'Default' AND Val2 = 'Default' match.
The customer will only ever have one item that matches 'Default'/'Default', one item that matches 'Default'/____, and one item that matches ____/____. (Though the last two rows may not exist.) If they have a ____/____ then they'll always have a 'Default'/____, and every customer will have a 'Default'/'Default'. They can never have a ____/'Default' (this is an invalid state in the application), and they can never have more than one of the same x/y; this is enforced by a UNIQUE constraint on the table.
It is possible (in the database) for the customer to have both a 'Default'/a and a 'Default'/b, but that is considered an invalid state in the application, and there is a test for that. (The user must delete one of the two.)
This is similar to a previous question of mine (Select all records, join with table A if join exists, table B if not), but much less pleasant. Because the fees are INT (actually FLOAT in the database, but the same issue applies), they're being aggregated together in a way I don't want.
I want to get a result of:
Id  Name    (No column name)
1   Test    10
2   Test 2  10
3   Test 3  20
I've tried:
SELECT Id, Name, COALESCE(f1.Fee, f2.Fee, f3.Fee)
FROM FeeTestClient
LEFT OUTER JOIN FeeTest AS f1 ON f1.ClientId = Id AND f1.Val <> 'Default' AND f1.Val2 <> 'Default'
LEFT OUTER JOIN FeeTest AS f2 ON f2.ClientId = Id AND f2.Val = 'Default' AND f2.Val2 <> 'Default'
LEFT OUTER JOIN FeeTest AS f3 ON f3.ClientId = Id AND f3.Val = 'Default' AND f3.Val2 = 'Default'
Which is ungodly slow on the live data set but returns the proper result (roughly 15 seconds, versus 7 for the basic select on the data without this set of joins; plan is here). I've also tried (by suggestion of Joe Obbish):
SELECT Id, Name,
    MAX(COALESCE(CASE WHEN Val <> 'Default' AND Val2 <> 'Default' THEN Fee END,
                 CASE WHEN Val = 'Default' AND Val2 <> 'Default' THEN Fee END,
                 CASE WHEN Val = 'Default' AND Val2 = 'Default' THEN Fee END))
FROM FeeTestClient
LEFT OUTER JOIN FeeTest ON FeeTest.ClientId = Id
GROUP BY Id, Name
Which is just as slow (plan is here), maybe worse, and produces the wrong output (though if you take a COALESCE of the separate MAXes then it works correctly).
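For reference, that corrected variant (COALESCE over separate MAXes, rather than MAX over a COALESCE) would look something like this:

```sql
-- Each MAX collapses one Val/Val2 class; the UNIQUE constraint means each
-- class has at most one row per client, so MAX just picks that row's fee.
-- COALESCE then applies the preference order across the three classes.
SELECT Id, Name,
    COALESCE(
        MAX(CASE WHEN Val <> 'Default' AND Val2 <> 'Default' THEN Fee END),
        MAX(CASE WHEN Val = 'Default'  AND Val2 <> 'Default' THEN Fee END),
        MAX(CASE WHEN Val = 'Default'  AND Val2 = 'Default'  THEN Fee END))
FROM FeeTestClient
LEFT OUTER JOIN FeeTest ON FeeTest.ClientId = Id
GROUP BY Id, Name;
```

On the sample data this produces 10, 10, 20 for the three clients, matching the desired output.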
I'm at a loss, and writing these queries has been very painful, so any advice on building the desired output is appreciated.
I've provided the actual plans, but they seem to differ very heavily from the test ones (this MCVE is to demonstrate the result I want; answers are not obligated, or even expected, to make any performance claims). Please disregard the extra joins shown in the plans above; they are unimportant to the example.
As far as distribution goes, in the live data set 1.5% have only a 'Default'/'Default', 44.9% have a 'Default'/____ and a 'Default'/'Default', and 53.6% have all three.
Best Answer
I'm not sure that your MCVE fully represents the problem that you have, but I'll answer the question as given. This question is about performance, so having just a few rows in the table won't cut it. I duplicated your sample data a million times, for a total of 6 million rows for FeeTest and 3 million rows for FeeTestClient
. Code to do that is below.

Using GROUP BY and only keeping the relevant aggregate could be a good approach, depending on the nature of the data in the table and the indexes that are defined against it. The query below finishes in 2 seconds on my machine, and the plan is as I expect: a hash join, with a hash aggregate at the end.
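The answer's original code blocks were not preserved in this copy. As a sketch only: the data multiplication might have looked something like the following (the exact IDENTITY alignment between the generated clients and fee rows is an assumption here; any script that produces matching ClientIds will do), followed by a GROUP BY query of the shape described, which is the corrected COALESCE-of-MAX form from the question:

```sql
-- Sketch: replicate the 3 clients and 6 fee rows ~1M times, driven by a
-- numbers list built from system views.
WITH nums AS (
    SELECT TOP (999999) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_columns AS c1
    CROSS JOIN sys.all_columns AS c2
)
INSERT INTO dbo.FeeTestClient WITH (TABLOCK) (Name)
SELECT c.Name
FROM dbo.FeeTestClient AS c
CROSS JOIN nums;

-- Fee rows for the new clients. ClientId + 3 * n lines up with the IDENTITY
-- values the client copies receive, assuming insertion order is preserved.
WITH nums AS (
    SELECT TOP (999999) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_columns AS c1
    CROSS JOIN sys.all_columns AS c2
)
INSERT INTO dbo.FeeTest WITH (TABLOCK) (ClientId, Fee, Val, Val2)
SELECT f.ClientId + 3 * nums.n, f.Fee, f.Val, f.Val2
FROM dbo.FeeTest AS f
CROSS JOIN nums;

-- GROUP BY, keeping only the relevant aggregate per Val/Val2 class and
-- coalescing in preference order.
SELECT Id, Name,
    COALESCE(
        MAX(CASE WHEN Val <> 'Default' AND Val2 <> 'Default' THEN Fee END),
        MAX(CASE WHEN Val = 'Default'  AND Val2 <> 'Default' THEN Fee END),
        MAX(CASE WHEN Val = 'Default'  AND Val2 = 'Default'  THEN Fee END))
FROM dbo.FeeTestClient
LEFT OUTER JOIN dbo.FeeTest ON FeeTest.ClientId = Id
GROUP BY Id, Name;
```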
However, you have other options, because you have a FeeTestClient table. Another strategy is to calculate what you're looking for per row with OUTER APPLY. One way to do that is below.
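The APPLY query itself did not survive extraction; a sketch of the shape described, with the question's preference order expressed as a sort key (the CASE ordering is my own reconstruction):

```sql
-- Per client, take the single "least default" fee: prefer a non-default Val,
-- then a non-default Val2, and fall back to the Default/Default row.
SELECT c.Id, c.Name, f.Fee
FROM dbo.FeeTestClient AS c
OUTER APPLY (
    SELECT TOP (1) ft.Fee
    FROM dbo.FeeTest AS ft
    WHERE ft.ClientId = c.Id
    ORDER BY CASE
        WHEN ft.Val <> 'Default' AND ft.Val2 <> 'Default' THEN 0
        WHEN ft.Val = 'Default'  AND ft.Val2 <> 'Default' THEN 1
        ELSE 2
    END
) AS f;
```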
With APPLY and TOP you almost always want a good index on the inner table. Here the query optimizer builds a temporary index for us via an index spool, and the query takes 14 seconds to run on my machine. We're searching on
ClientId
so let's try an index on that. The index isn't covering, so the optimizer doesn't want to use it; using it would require a lot of key lookups to get the columns that aren't in the index. I can force the index to be used with a lazy, undocumented trick that you should not use in production.
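Neither the index definition nor the forcing trick was preserved here. As a stand-in sketch: a plain index on ClientId, plus a documented INDEX table hint to force its use (the answer's actual "undocumented trick" is not reproduced; the index name is one I've chosen for illustration):

```sql
-- Non-covering index on the search column only.
CREATE INDEX IX_FeeTest_ClientId ON dbo.FeeTest (ClientId);

-- Forcing the index with a hint (stand-in for the original trick).
SELECT c.Id, c.Name, f.Fee
FROM dbo.FeeTestClient AS c
OUTER APPLY (
    SELECT TOP (1) ft.Fee
    FROM dbo.FeeTest AS ft WITH (INDEX (IX_FeeTest_ClientId))
    WHERE ft.ClientId = c.Id
    ORDER BY CASE
        WHEN ft.Val <> 'Default' AND ft.Val2 <> 'Default' THEN 0
        WHEN ft.Val = 'Default'  AND ft.Val2 <> 'Default' THEN 1
        ELSE 2
    END
) AS f;
```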
Now the query runs in 5 seconds; we can see the key lookups in the plan.
Our final attempt will be with a covering index:
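The covering index definition wasn't preserved either; given the columns the query touches, it presumably looked something like this (the name is mine):

```sql
-- Key on the search column; INCLUDE the columns the query reads,
-- so the seek satisfies the query with no key lookups.
CREATE INDEX IX_FeeTest_ClientId_Covering
    ON dbo.FeeTest (ClientId)
    INCLUDE (Fee, Val, Val2);
```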
The query from before now runs in four seconds.
How can you apply this to your giant, anonymized query? You have a lot of key lookups and an index spool. Try defining covering indexes so that you can access the data you need more efficiently. I can't make any kind of guarantee about the overall runtime but it should help to some degree. Please note that I did not have time to take a very careful look at the plans that you posted.
In case anyone is following along at home, there's a bizarre edge case that can cause the covering index to still not be used. Here's one way to work around it:
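The workaround code did not survive in this copy. Purely as an illustration, and not as the answer's original fix: one generic way to steer the optimizer onto a covering index in situations like this is an explicit FORCESEEK hint naming the index (assuming a covering index named IX_FeeTest_ClientId_Covering, a name chosen for illustration):

```sql
-- Illustrative only: force a seek on the (assumed) covering index.
SELECT c.Id, c.Name, f.Fee
FROM dbo.FeeTestClient AS c
OUTER APPLY (
    SELECT TOP (1) ft.Fee
    FROM dbo.FeeTest AS ft
        WITH (FORCESEEK (IX_FeeTest_ClientId_Covering (ClientId)))
    WHERE ft.ClientId = c.Id
    ORDER BY CASE
        WHEN ft.Val <> 'Default' AND ft.Val2 <> 'Default' THEN 0
        WHEN ft.Val = 'Default'  AND ft.Val2 <> 'Default' THEN 1
        ELSE 2
    END
) AS f;
```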