Sql-server – Pick the most non-default value from a selection of values

coalescejoin;sql serversql server 2014

Given the following tables:

CREATE TABLE FeeTestClient (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, Name VARCHAR(16))
INSERT INTO FeeTestClient (Name)
VALUES ('Test'), ('Test 2'), ('Test 3')

CREATE TABLE FeeTest (FeeId INT IDENTITY(1,1) NOT NULL PRIMARY KEY, ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16))
INSERT INTO FeeTest (ClientId, Fee, Val, Val2)
VALUES (1, 15, 'Default', 'Default'),
       (1, 10, 'Default', 'asdf'),
       (2, 15, 'Default', 'Default'),
       (2, 20, 'Default', 'qwer'),
       (2, 10, 'zxcv', 'asdf'),
       (3, 20, 'Default', 'Default')

My goal is to select all FeeTestClient elements, and pick the fee that is least default. The rules on a default fee are pretty simple: if Val2 is 'Default', then Val cannot be anything except 'Default', and for each fee, we want the first one where Val is not 'Default', or the first one where Val2 is not 'Default', otherwise we are guaranteed a Val = 'Default' AND Val2 = 'Default' match.

The customer will only ever have one item that matches 'Default'/'Default', one item that matches 'Default'/____ and one item that matches ____/____. (Though the last two rows may not exist.) If they have a ____/____ then they'll always have a 'Default'/____, every customer will have a 'Default'/'Default'. They can never have a ____/'Default' — this is an invalid state on the application, and they can never have more than one of the same x/y, this is enforced by a UNIQUE constraint on the table.

It is possible (in the database) for the customer to have a 'Default'/a, and a 'Default'/b, but that is considered an invalid state in the application, and there is a test for that. (The user must delete one of the two.)

This is similar to a previous question of mine (Select all records, join with table A if join exists, table B if not), but much less pleasant. Because they're INT (actually FLOAT in the database, but the same issue applies) they're being aggregated together like I don't want.

I want to get a result of:

Id  Name    (No column name)
1   Test    10
2   Test 2  10
3   Test 3  20

I've tried:

SELECT Id, Name, COALESCE(f1.Fee, f2.Fee, f3.Fee)
FROM FeeTestClient
LEFT OUTER JOIN FeeTest AS f1 ON f1.ClientId = Id AND f1.Val <> 'Default' AND f1.Val2 <> 'Default'
LEFT OUTER JOIN FeeTest AS f2 ON f2.ClientId = Id AND f2.Val = 'Default' AND f2.Val2 <> 'Default'
LEFT OUTER JOIN FeeTest AS f3 ON f3.ClientId = Id AND f3.Val = 'Default' AND f3.Val2 = 'Default'

Which is ungodly slow on the live data-set, but returns the proper result (roughly 15 seconds, running the basic select on the data, without this selection of joins, is 7, plan is here), and I've also tried (by suggestion of Joe Obbish):

SELECT Id, Name, MAX(COALESCE(CASE WHEN Val <> 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 = 'Default' THEN Fee END))
FROM FeeTestClient
LEFT OUTER JOIN FeeTest ON FeeTest.ClientId = Id
GROUP BY Id, Name

Which is just as slow (plan is here), and produces the wrong output (though if you COALESCE the MAXs then it works correctly). (Maybe worse.)

I'm at a loss, and writing these queries has been very painful, so any advice on building the desired output is appreciated.

I've provided the actual plans, but they differ very heavily from the test ones (this MCVE is to demonstrate the result I want, answers are not obligated or even expected to make any performance claims), it seems. Please disregard the extra joins shown in the plans above – they are unimportant to the example.

As far as distribution, in the live data-set, 1.5% only have a 'Default'/'Default', 44.9% have 'Default'/___ and 'Default'/'Default', and 53.6% have all three.

Best Answer

I'm not sure that your MCVE fully represents the problem that you have, but I'll answer the question as given. This question is about performance, so having just a few rows in the table won't cut it. I duplicated your sample data a million times for a total of 6 million rows for FeeTest and 3 million rows for FeeTestClient. Code to do that below:

DROP TABLE IF EXISTS FeeTestClient;

CREATE TABLE FeeTestClient (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, [Name] VARCHAR(16));

INSERT INTO FeeTestClient WITH (TABLOCK)
([Name])
SELECT 'ZZZZZZZ' + CAST(RN AS VARCHAR(7))
FROM
(
    SELECT TOP (3000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
    FROM master..spt_values t1
    CROSS JOIN master..spt_values t2
) t
OPTION (MAXDOP 1);


DROP TABLE IF EXISTS FeeTest_source;

CREATE TABLE FeeTest_source (ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16));
INSERT INTO FeeTest_source (ClientId, Fee, Val, Val2)
VALUES (1, 15, 'Default', 'Default'),
       (1, 10, 'Default', 'asdf'),
       (2, 15, 'Default', 'Default'),
       (2, 20, 'Default', 'qwer'),
       (2, 10, 'zxcv', 'asdf'),
       (3, 20, 'Default', 'Default');


SELECT 3 * client.Id - 2 + client.Id % 3 AS ClientId
, src.Fee
, src.Val
, src.Val2 into #t
FROM FeeTest_source src
INNER JOIN FeeTestClient client ON src.ClientId = 1 + client.Id % 3;    


DROP TABLE IF EXISTS FeeTest;

CREATE TABLE FeeTest (FeeId INT IDENTITY(1,1) NOT NULL PRIMARY KEY, ClientId INT, Fee INT, Val VARCHAR(16), Val2 VARCHAR(16));

INSERT INTO FeeTest WITH (TABLOCK)
(ClientId, Fee, Val, Val2)
SELECT * FROM #t;

DROP TABLE #t;

Using GROUP BY and only keeping the relevant aggregate could be a good approach depending on the nature of the data in the table and the indexes that are defined against the table. The query below finishes in 2 seconds on my machine:

SELECT Id, Name, MAX(COALESCE(CASE WHEN Val <> 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 <> 'Default' THEN Fee END, CASE WHEN Val = 'Default' AND Val2 = 'Default' THEN Fee END))
FROM FeeTestClient
LEFT OUTER JOIN FeeTest ON FeeTest.ClientId = Id
GROUP BY Id, Name

The plan is as I expect. There's a hash join along with a hash aggregate at the end.

However, you have other options because you have a FeeTestClient table. Another strategy is to calculate what you're looking for per row with OUTER APPLY. One way to do that is below:

SELECT Id, [Name], oa.Fee
FROM FeeTestClient ftc
OUTER APPLY (
    SELECT TOP 1 ft.Fee
    FROM FeeTest ft
    WHERE ft.ClientId = ftc.Id
    ORDER BY
        CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END
        + CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END
    DESC, ft.ClientId
) oa;

With APPLY and TOP you almost always want a good index on the inner table. The query optimizer builds a temporary index for us via the spool, and the query takes 14 seconds to run on my machine:

We're searching on ClientId so let's try an index on that:

CREATE INDEX NOT_COVERING ON FeeTest (ClientId);

The index isn't covering so the optimizer doesn't want to use it. Using it would require a lot of key lookups to get the columns that aren't on the index. I can force the index to be used with a lazy, undocumented trick that you should not use in production:

SELECT Id, [Name], oa.Fee
FROM FeeTestClient ftc
OUTER APPLY (
    SELECT TOP 1 ft.Fee
    FROM FeeTest ft
    WHERE ft.ClientId = ftc.Id
    ORDER BY
        CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END
        + CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END
    DESC, ft.ClientId
) oa
OPTION (QueryRuleOff BuildSpool);

Now the query runs in 5 seconds. We can see the key lookups in the plan:

Our final attempt with be with a covering index:

CREATE INDEX COVERING ON FeeTest (ClientId) INCLUDE (Val, Val2, Fee);

The query from before now runs in four seconds:

How can you apply this to your giant, anonymized query? You have a lot of key lookups and an index spool. Try defining covering indexes so that you can access the data you need more efficiently. I can't make any kind of guarantee about the overall runtime but it should help to some degree. Please note that I did not have time to take a very careful look at the plans that you posted.

In case anyone is following along at home, there's a bizarre edge case that can cause the covering index to still not be used. Here's one way to work around it:

ALTER TABLE FeeTest
ADD MAGIC_COLUMN AS CASE WHEN Val <> 'Default' THEN 2 ELSE 0 END + CASE WHEN Val2 <> 'Default' THEN 1 ELSE 0 END;

CREATE INDEX COVERING_2 ON FeeTest (ClientId) INCLUDE (MAGIC_COLUMN, Fee);

Related Solutions

Mysql – get column from too many tables in thesql

If all the tables use the MyISAM Storage Engine and have the same table structure, I have some good news for you.

You can create a single table that consumes no additional space except a .frm file and some mapping info. The key is to take advantage of the MERGE (MRG_MyISAM) Storage Engine.

Here is how you can do this:

CREATE TABLE XMerge LIKE X1;
ALTER TABLE XMerge ENGINE=MRG_MYISAM
UNION=(X1,X2,X3,X4) INSERT_METHOD=LAST;

Using this method, you can query the 4 tables at the same time like this:

SELECT ReqF FROM XMerge WHERE EmpName='John';

Was that simple, or what ???

In your case, you have 75 tables. You would do this:

CREATE TABLE XMerge LIKE X1;
ALTER TABLE XMerge ENGINE=MRG_MYISAM
UNION=(X1,X2,X3,X4,X5,X6,X7,X8,X9,
X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,
X20,X21,X22,X23,X24,X25,X26,X27,X28,X29,
X30,X31,X32,X33,X34,X35,X36,X37,X38,X39,
X40,X41,X42,X43,X44,X45,X46,X47,X48,X49,
X50,X51,X52,X53,X54,X55,X56,X57,X58,X59,
X60,X61,X62,X63,X64,X65,X66,X67,X68,X69,
X70,X71,X72,X73,X74,X75) INSERT_METHOD=LAST;
SELECT ReqF FROM XMerge WHERE EmpName='John';

The beauty of this is that creating a MERGE table takes milliseconds. Just make sure every table has an index on EmpName. Better to do 75 indexed lookups that 75 full table scans. If there is no index on EmpName, you need to do this:

ALTER TABLE X1 ADD UNIQUE KEY (EmpName);
ALTER TABLE X2 ADD UNIQUE KEY (EmpName);
.
.
.
ALTER TABLE X75 ADD UNIQUE KEY (EmpName);
CREATE TABLE XMerge LIKE X1;
ALTER TABLE XMerge ENGINE=MRG_MYISAM
UNION=(X1,X2,X3,X4,X5,X6,X7,X8,X9,
X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,
X20,X21,X22,X23,X24,X25,X26,X27,X28,X29,
X30,X31,X32,X33,X34,X35,X36,X37,X38,X39,
X40,X41,X42,X43,X44,X45,X46,X47,X48,X49,
X50,X51,X52,X53,X54,X55,X56,X57,X58,X59,
X60,X61,X62,X63,X64,X65,X66,X67,X68,X69,
X70,X71,X72,X73,X74,X75) INSERT_METHOD=LAST;
SELECT ReqF FROM XMerge WHERE EmpName='John';

Give it a Try !!!

Postgresql – provide a default for a left outer join

SELECT a.a1,b.b1,  
    CASE WHEN b.b1 is NULL THEN 5 ELSE b.b2 END AS b2  
FROM a LEFT OUTER JOIN b  
ON a.a1 = b.b1

Best Answer

Related Solutions

Mysql – get column from too many tables in thesql

Postgresql – provide a default for a left outer join

Related Question