SQL Server – How to efficiently recompute the data in a rather large X table

database-design, performance, sql-server, sql-server-2017

I have recently noticed performance issues for the main application I am developing right now (it is a legacy one). A job triggers a stored procedure that computes the security for all the documents and all the application users in the database. The table looks like the following:

CREATE TABLE dbo.DocumentSecurity (
    Id INT IDENTITY(1, 1) CONSTRAINT PK_DocumentSecurity PRIMARY KEY CLUSTERED,
    DocumentId INT CONSTRAINT FK_DocumentSecurity_Document FOREIGN KEY REFERENCES Document,
    UserId INT CONSTRAINT FK_DocumentSecurity_ApplicationUser FOREIGN KEY REFERENCES ApplicationUser,
    CanWrite BIT NOT NULL
);

CREATE INDEX IX_DocumentSecurity_DocumentId_UserId
    ON dbo.DocumentSecurity (DocumentId, UserId);

This table has about 3.6M records in production, but this keeps growing due to the increasing number of documents and users.

This happens very often (once a minute), and I have noticed using sp_who2 that it sometimes blocks other SPIDs. The profiler also shows a very large number of reads and high CPU, so this process is clearly very heavy.

Sometimes, on the test environments, the computation seems to take forever, probably due to a very inefficient query plan that happens from time to time (test servers are significantly lower on resources than the production one).

I have checked the code and the current algorithm looks like the following:

  • compute a hash over all the records in the tables involved in the computation
  • if the hash is the same as before, do nothing
  • otherwise, insert the result of a big and complicated SELECT statement (details below) into a persistent table with the exact schema shown above
  • when done, alter the view used for joining to apply the security so it points at the freshly computed table
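The hash check in the first two steps can be sketched roughly like this (a minimal sketch only; the SecurityHash table and the use of CHECKSUM_AGG/BINARY_CHECKSUM are assumptions, not the actual procedure):

```sql
-- Hypothetical change-detection step: compare a hash of the source tables
-- against the value stored after the last recomputation.
DECLARE @currentHash INT;

SELECT @currentHash = CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM dbo.Document;  -- in reality, combined across all tables involved

IF EXISTS (SELECT 1 FROM dbo.SecurityHash WHERE LastHash = @currentHash)
    RETURN;  -- nothing changed since the last run, skip the recomputation

UPDATE dbo.SecurityHash SET LastHash = @currentHash;
-- ...continue with the full recomputation...
```

Note that BINARY_CHECKSUM is not collision-free, so a scheme like this trades some accuracy for speed.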

The actual computation looks like the following (actual details removed for brevity):

SELECT DISTINCT ISNULL(ds.UserId, 0), ISNULL(ds.DocumentId, 0)
FROM (
    SELECT ... FROM Document ... CROSS JOIN ApplicationUser ...
    UNION
    SELECT ... FROM Document ... CROSS JOIN ApplicationUserRole ...
    UNION
    SELECT ... FROM Document ...
    UNION
    SELECT DISTINCT ... FROM DocumentDetail ...
) ds
JOIN (
    SELECT ... FROM Document ...
    WHERE (... OR ... OR ... ) AND ...
    UNION
    SELECT ...
) dsp ON dsp.DocumentId = ds.DocumentId AND dsp.UserId = ds.UserId

From my perspective, there are two big problems with this way of computation:

  • maintenance: it is becoming harder and harder to introduce new security rules
  • performance: ORs and DISTINCTs are killing the performance

I have already created a story to refactor this code, but I would like to validate that my reasoning is correct (I have an OOP bias, as I am mainly a .NET developer) before starting work on it:

  • refactor the security rules (to be validated by the business, since they certainly look different from what is already there, which also might not be 100% correct for all cases) so that they all look like the following (precedence matters here):

    • if the role is admin, allow access to all documents
    • if the role is sales and the user has a read role for the document, insert with CanWrite = 0
    • if the role is sales and the user has a write role for the document, insert with CanWrite = 1

  • create a temporary table with the target schema

  • create an index for DocumentId and UserId

  • apply each security rule:

    INSERT INTO #buffer (DocumentId, UserId, CanWrite)
    SELECT ...
    FROM Document D ...
      JOIN ApplicationUser U ...
    WHERE NOT EXISTS (SELECT 1 FROM #buffer B WHERE B.DocumentId = D.DocumentId AND B.UserId = U.UserId)
    
  • truncate the target table

  • INSERT INTO DocumentSecurity SELECT * FROM #buffer

The last two steps might need to be optimized if they take too long (perhaps replaced with the swap-the-table-under-the-view mechanism already in place).
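The swap-the-table-under-the-view mechanism could be sketched as follows (table and view names are hypothetical; the idea is that readers never see a half-built table):

```sql
-- Rebuild the inactive copy while readers keep using the active one.
TRUNCATE TABLE dbo.DocumentSecurity_B;

INSERT INTO dbo.DocumentSecurity_B (DocumentId, UserId, CanWrite)
SELECT DocumentId, UserId, CanWrite
FROM #buffer;

-- Repoint the view in one quick metadata operation.
EXEC ('ALTER VIEW dbo.vDocumentSecurity AS
       SELECT DocumentId, UserId, CanWrite
       FROM dbo.DocumentSecurity_B;');
```

On the next run, the roles of the _A and _B tables are reversed.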

This should make the code more maintainable, and I assume performance will improve once all the ORs and DISTINCTs are removed, but I am interested in a DBA's perspective on such an approach. Is it suboptimal?

Note: I cannot have duplicates in the security table because there are currently many places (Entity Framework used in application layer + stored procedures) that simply JOIN with the security view (as opposed to check for existence).

Best Answer

The amount of data we are talking about here is not nearly enough to be this big of a problem. Odds are, a bit of optimization will do you quite nicely.

So, opening your query plan, as I go along:

  • Fix your WHERE clauses. Your current query forces table scans just about everywhere because you use functions. IsNull(COALESCE(linkedlot.IsPrivate, c.IsPrivate), 0) = 0 should be ((ll.IsPrivate = 0) Or (ll.IsPrivate Is Null And c.IsPrivate = 0) Or (ll.IsPrivate Is Null And c.IsPrivate Is Null)). Side note: are these fields actually nullable? Why? These queries would be MUCH simpler if they were not.

  • Change UNION to UNION ALL. UNION does an implicit DISTINCT, but you're doing one in the main query already. Depending on your data, this may help a lot or hurt a lot.

  • Make sure your temporary tables have indexes/primary keys.

  • If the query actually takes a while, load it into a copy of the table, then MERGE it into the live copy afterwards.
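The last bullet can be sketched with a MERGE (the staging table name is hypothetical; the key columns come from the question's schema):

```sql
-- Load the recomputed rows into a staging table first,
-- then apply only the delta to the live table.
MERGE dbo.DocumentSecurity AS target
USING dbo.DocumentSecurity_Staging AS source
    ON  target.DocumentId = source.DocumentId
    AND target.UserId     = source.UserId
WHEN MATCHED AND target.CanWrite <> source.CanWrite THEN
    UPDATE SET CanWrite = source.CanWrite
WHEN NOT MATCHED BY TARGET THEN
    INSERT (DocumentId, UserId, CanWrite)
    VALUES (source.DocumentId, source.UserId, source.CanWrite)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```

This touches only rows that actually changed, which keeps blocking much shorter than a truncate-and-reload.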

Wait, never mind... very simply, your CROSS JOINs are killing you.

Look through your query plan (or rather, save it and open it up in SQL Server Management Studio), and hover over some of those thick arrows. You're exploding Contract times ApplicationUserRole five times (8 million rows every time), Contract times ApplicationUser twice (6 million rows every time), then slapping all those together (on several levels) to get distinct values (sorting 8-16 million rows every time).

But "explode all possible combinations for every possible way rights can be affected, whether they exist or not, then combine, then sort, then sort again" is not what you want. You want users versus documents, and look up their rights.

So let's do that.

Your query should look like this:

Select
  UserId = U.UserId,
  ContractId = C.ContractId,
  -- If there are explicit denials, MAX should be MIN
  CanRead = Max(Case When <<highest priority condition>> Then 1
                     When <<down the list...>>
                     Else 0 End),
  -- If there are explicit denials, MAX should be MIN
  CanWrite = Max(Case When <<highest priority condition>> Then 1
                      When <<down the list...>>
                      Else 0 End)
From
  dbo.User As U
  Cross Join dbo.Contract As C
  Left Outer Join (<<Look up first thing...>>)
  Left Outer Join (<<down the list...>>)
Group By
  U.UserId, C.ContractId

So, grabbing a few from your plan:

Select
  UserId = U.UserId,
  ContractId = C.ContractId,
  CanRead = Max(Case When ContractOwner.UserId Is Not Null Then 1
                     When <<down the list...>>
                     Else 0 End),
  CanWrite = Max(Case When ContractOwner.UserId Is Not Null Then 1
                      When <<down the list...>>
                      Else 0 End)
From
  dbo.User As U
  Cross Join dbo.Contract As C
  Left Outer Join dbo.LotSupplier As LotSupplier
    On LotSupplier.ContractId = C.Id
  Left Outer Join dbo.ContractBuyer As ContractBuyer 
    On ContractBuyer.ContractId = C.Id
  Left Outer Join dbo.Lot As LinkedLot 
    On LotSupplier.LotId = LinkedLot.Id
  Left Outer Join dbo.ApplicationUserRole As ContractOwner
    On ContractOwner.UserId = U.UserId
    And ContractOwner.BUid = C.SignatoryBUId
    And ((C.IsPrivate Is Null) Or (C.IsPrivate = Cast(0 As Bit)))  -- NO ISNULL!!
    And ContractOwner.RoleId In (16, 19, 20)
  Left Outer Join dbo.ApplicationUserRole As Buyer
    On ((C.IsPrivate = Cast(1 As Bit))
        Or (LinkedLot.IsPrivate = Cast(1 As Bit)))
    And ContractBuyer.BuyerId = Buyer.UserId
    And ...on and on...
  Left Outer Join (<<down the list...>>)
Group By
  U.UserId, C.ContractId

Unless you're running this on a brutally underspecced server, the entire query can be made to run in a few seconds, trust me.

Since you're also obviously working in what I call a many-hats model (a person can be a user, a buyer, an owner, et cetera), I highly recommend you consider normalizing out a Person (or Party, or whatever tickles your fancy) table.

  • Information would have to be updated in one place only
  • You would not have to do that awful join on e-mail
  • That join would not fail on updating info in only one place :-)
  • You would be much better set up for sub-contracting, multi-owner contracting, contract transfers and all that fun

It looks like you're kind of using ApplicationUser for this purpose, but I think your question alone shows one of the downsides to that.
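A minimal sketch of what that normalization could look like (all names hypothetical; the real model should be driven by the business):

```sql
-- One row per human being; role-specific tables reference it.
CREATE TABLE dbo.Person (
    Id    INT IDENTITY(1, 1) CONSTRAINT PK_Person PRIMARY KEY CLUSTERED,
    Name  NVARCHAR(200) NOT NULL,
    Email NVARCHAR(320) NOT NULL CONSTRAINT UQ_Person_Email UNIQUE
);

-- ApplicationUser, Buyer, Owner, etc. become thin role tables:
CREATE TABLE dbo.Buyer (
    Id       INT IDENTITY(1, 1) CONSTRAINT PK_Buyer PRIMARY KEY CLUSTERED,
    PersonId INT NOT NULL CONSTRAINT FK_Buyer_Person
                 FOREIGN KEY REFERENCES dbo.Person (Id)
);
-- Joins then go through PersonId instead of matching on e-mail strings.
```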