Sql-server – Pick the most non-default value from a selection of values

coalescejoin;sql serversql server 2014

Given the following tables:

CREATE TABLE FeeTestClient (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, Name VARCHAR(16))
INSERT INTO FeeTestClient (Name)
VALUES ('Test'), ('Test 2'), ('Test 3')

CREATE TABLE FeeTest (FeeId INT IDENTITY(1,1) NOT NULL PRIMARY KEY, ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16))
INSERT INTO FeeTest (ClientId, Fee, Val, Val2)
VALUES (1, 15, 'Default', 'Default'),
       (1, 10, 'Default', 'asdf'),
       (2, 15, 'Default', 'Default'),
       (2, 20, 'Default', 'qwer'),
       (2, 10, 'zxcv', 'asdf'),
       (3, 20, 'Default', 'Default')

My goal is to select all FeeTestClient elements, and pick the fee that is least default. The rules on a default fee are pretty simple: if Val2 is 'Default', then Val cannot be anything except 'Default', and for each fee, we want the first one where Val is not 'Default', or the first one where Val2 is not 'Default', otherwise we are guaranteed a Val = 'Default' AND Val2 = 'Default' match.

The customer will only ever have one item that matches 'Default'/'Default', one item that matches 'Default'/____ and one item that matches ____/____. (Though the last two rows may not exist.) If they have a ____/____ then they'll always have a 'Default'/____, every customer will have a 'Default'/'Default'. They can never have a ____/'Default' — this is an invalid state on the application, and they can never have more than one of the same x/y, this is enforced by a UNIQUE constraint on the table.

It is possible (in the database) for the customer to have a 'Default'/a, and a 'Default'/b, but that is considered an invalid state in the application, and there is a test for that. (The user must delete one of the two.)

This is similar to a previous question of mine (Select all records, join with table A if join exists, table B if not), but much less pleasant. Because they're INT (actually FLOAT in the database, but the same issue applies) they're being aggregated together like I don't want.

I want to get a result of:

Id  Name    (No column name)
1   Test    10
2   Test 2  10
3   Test 3  20

I've tried:

SELECT Id, Name, COALESCE(f1.Fee, f2.Fee, f3.Fee)
FROM FeeTestClient
LEFT OUTER JOIN FeeTest AS f1 ON f1.ClientId = Id AND f1.Val <> 'Default' AND f1.Val2 <> 'Default'
LEFT OUTER JOIN FeeTest AS f2 ON f2.ClientId = Id AND f2.Val = 'Default' AND f2.Val2 <> 'Default'
LEFT OUTER JOIN FeeTest AS f3 ON f3.ClientId = Id AND f3.Val = 'Default' AND f3.Val2 = 'Default'

Which is ungodly slow on the live data-set, but returns the proper result (roughly 15 seconds, running the basic select on the data, without this selection of joins, is 7, plan is here), and I've also tried (by suggestion of Joe Obbish):

SELECT Id, Name, MAX(COALESCE(CASE WHEN Val <> 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 = 'Default' THEN Fee END))
FROM FeeTestClient
LEFT OUTER JOIN FeeTest ON FeeTest.ClientId = Id
GROUP BY Id, Name

Which is just as slow (plan is here), and produces the wrong output (though if you COALESCE the MAXs then it works correctly). (Maybe worse.)

I'm at a loss, and writing these queries has been very painful, so any advice on building the desired output is appreciated.

I've provided the actual plans, but they differ very heavily from the test ones (this MCVE is to demonstrate the result I want, answers are not obligated or even expected to make any performance claims), it seems. Please disregard the extra joins shown in the plans above – they are unimportant to the example.

As far as distribution, in the live data-set, 1.5% only have a 'Default'/'Default', 44.9% have 'Default'/___ and 'Default'/'Default', and 53.6% have all three.

Best Answer

I'm not sure that your MCVE fully represents the problem that you have, but I'll answer the question as given. This question is about performance, so having just a few rows in the table won't cut it. I duplicated your sample data a million times for a total of 6 million rows for FeeTest and 3 million rows for FeeTestClient. Code to do that below:

DROP TABLE IF EXISTS FeeTestClient;

CREATE TABLE FeeTestClient (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, [Name] VARCHAR(16));

INSERT INTO FeeTestClient WITH (TABLOCK)
([Name])
SELECT 'ZZZZZZZ' + CAST(RN AS VARCHAR(7))
FROM
(
    SELECT TOP (3000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);


DROP TABLE IF EXISTS FeeTest_source;

CREATE TABLE FeeTest_source (ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16));
INSERT INTO FeeTest_source (ClientId, Fee, Val, Val2)
VALUES (1, 15, 'Default', 'Default'),
       (1, 10, 'Default', 'asdf'),
       (2, 15, 'Default', 'Default'),
       (2, 20, 'Default', 'qwer'),
       (2, 10, 'zxcv', 'asdf'),
       (3, 20, 'Default', 'Default');


SELECT 3 * client.Id - 2 + client.Id % 3 AS ClientId
, src.Fee
, src.Val
, src.Val2 into #t
FROM FeeTest_source src
INNER JOIN FeeTestClient client ON src.ClientId = 1 + client.Id % 3;    


DROP TABLE IF EXISTS FeeTest;

CREATE TABLE FeeTest (FeeId INT IDENTITY(1,1) NOT NULL PRIMARY KEY, ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16));

INSERT INTO FeeTest WITH (TABLOCK)
(ClientId, Fee, Val, Val2)
SELECT * FROM #t;

DROP TABLE #t;

Using GROUP BY and only keeping the relevant aggregate could be a good approach depending on the nature of the data in the table and the indexes that are defined against the table. The query below finishes in 2 seconds on my machine:

SELECT Id, Name, MAX(COALESCE(CASE WHEN Val <> 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 = 'Default' THEN Fee END))
FROM FeeTestClient
LEFT OUTER JOIN FeeTest ON FeeTest.ClientId = Id
GROUP BY Id, Name

plan 1

The plan is as I expect. There's a hash join along with a hash aggregate at the end.

However, you have other options because you have a FeeTestClient table. Another strategy is to calculate what you're looking for per row with OUTER APPLY. One way to do that is below:

SELECT Id, [Name], oa.Fee
FROM FeeTestClient ftc
OUTER APPLY (
    SELECT TOP 1 ft.Fee
    FROM FeeTest ft
    WHERE ft.ClientId = ftc.Id
    ORDER BY
        CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END
        + CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END
    DESC, ft.ClientId
) oa;

With APPLY and TOP you almost always want a good index on the inner table. The query optimizer builds a temporary index for us via the spool, and the query takes 14 seconds to run on my machine:

query 2

We're searching on ClientId so let's try an index on that:

CREATE INDEX NOT_COVERING ON FeeTest (ClientId);

The index isn't covering so the optimizer doesn't want to use it. Using it would require a lot of key lookups to get the columns that aren't on the index. I can force the index to be used with a lazy, undocumented trick that you should not use in production:

SELECT Id, [Name], oa.Fee
FROM FeeTestClient ftc
OUTER APPLY (
    SELECT TOP 1 ft.Fee
    FROM FeeTest ft
    WHERE ft.ClientId = ftc.Id
    ORDER BY
        CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END
        + CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END
    DESC, ft.ClientId
) oa
OPTION (QueryRuleOff BuildSpool);

Now the query runs in 5 seconds. We can see the key lookups in the plan:

query 3

Our final attempt with be with a covering index:

CREATE INDEX COVERING ON FeeTest (ClientId) INCLUDE (Val, Val2, Fee);

The query from before now runs in four seconds:

query 4

How can you apply this to your giant, anonymized query? You have a lot of key lookups and an index spool. Try defining covering indexes so that you can access the data you need more efficiently. I can't make any kind of guarantee about the overall runtime but it should help to some degree. Please note that I did not have time to take a very careful look at the plans that you posted.

In case anyone is following along at home, there's a bizarre edge case that can cause the covering index to still not be used. Here's one way to work around it:

ALTER TABLE FeeTest
ADD MAGIC_COLUMN AS CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END + CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END;

CREATE INDEX COVERING_2 ON FeeTest (ClientId) INCLUDE (MAGIC_COLUMN, Fee);