SQL Server – How to efficiently recompute the data in a rather large X table

database-design, performance, sql-server, sql-server-2017

I have recently noticed performance issues for the main application I am developing right now (it is a legacy one). A job triggers a stored procedure that computes the security for all the documents and all the application users in the database. The table looks like the following:

CREATE TABLE dbo.DocumentSecurity (
    Id INT IDENTITY(1, 1) CONSTRAINT PK_DocumentSecurity PRIMARY KEY CLUSTERED,
    DocumentId INT CONSTRAINT FK_DocumentSecurity_Document FOREIGN KEY REFERENCES Document,
    UserId INT CONSTRAINT FK_DocumentSecurity_ApplicationUser FOREIGN KEY REFERENCES ApplicationUser,
    CanWrite BIT NOT NULL
);

CREATE INDEX IX_DocumentSecurity_DocumentId_UserId
    ON dbo.DocumentSecurity (DocumentId, UserId);

This table has about 3.6M records in production, but this keeps growing due to the increasing number of documents and users.

This happens very often (once a minute), and I have noticed using sp_who2 that it sometimes blocks other SPIDs. The profiler also shows a very large number of reads and high CPU, so this process is clearly very heavy.

Sometimes, on the test environments, the computation seems to take forever, probably due to a very inefficient query plan that happens from time to time (test servers are significantly lower on resources than the production one).

I have checked the code and the current algorithm looks like the following:

  • compute a hash over all the records in the tables involved in the computation
  • if the hash is the same as before, do nothing
  • otherwise, insert the result of a big and complicated SELECT statement (details below) into a persistent table with the exact schema shown above
  • when done, alter the view used for joining to apply the security so it points at the freshly computed table
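The hash check in the first two steps can be sketched roughly like this (a minimal sketch only; the SecurityHash table and the use of CHECKSUM_AGG/BINARY_CHECKSUM are assumptions, not the actual procedure):

```sql
-- Hypothetical change-detection step: compare a hash of the source tables
-- against the value stored after the last recomputation.
DECLARE @currentHash INT;

SELECT @currentHash = CHECKSUM_AGG(BINARY_CHECKSUM(*))
FROM dbo.Document;  -- in reality, combined across all tables involved

IF EXISTS (SELECT 1 FROM dbo.SecurityHash WHERE LastHash = @currentHash)
    RETURN;  -- nothing changed since the last run, skip the recomputation

UPDATE dbo.SecurityHash SET LastHash = @currentHash;
-- ...continue with the full recomputation...
```

Note that BINARY_CHECKSUM is not collision-free, so a scheme like this trades some accuracy for speed.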

The actual computation looks like the following (actual details removed for brevity):

SELECT DISTINCT ISNULL(ds.UserId, 0), ISNULL(ds.DocumentId, 0)
FROM (
    SELECT ... FROM Document ... CROSS JOIN ApplicationUser ...
    UNION
    SELECT ... FROM Document ... CROSS JOIN ApplicationUserRole ...
    UNION
    SELECT ... FROM Document ...
    UNION
    SELECT DISTINCT ... FROM DocumentDetail ...
) ds
JOIN (
    SELECT ... FROM Document ...
    WHERE (... OR ... OR ... ) AND ...
    UNION
    SELECT ...
) dsp ON dsp.DocumentId = ds.DocumentId AND dsp.UserId = ds.UserId

From my perspective, there are two big problems with this way of computation:

  • maintenance: it is becoming harder and harder to introduce new security rules
  • performance: ORs and DISTINCTs are killing the performance

I have already created a story to refactor this code, but I would like to validate that my reasoning is correct (I have an OOP bias, as I am mainly a .NET developer) before starting work on it:

  • refactor the security rules (to be validated by the business, since they certainly look different from what is already there, which also might not be 100% correct for all cases) so that they all look like the following (precedence matters here):

    • if the role is admin, allow access to all documents
    • if the role is sales and the user has a read role for the document, insert with CanWrite = 0
    • if the role is sales and the user has a write role for the document, insert with CanWrite = 1

  • create a temporary table with the target schema

  • create an index for DocumentId and UserId

  • apply each security rule:

    INSERT INTO #buffer (DocumentId, UserId, CanWrite)
    SELECT ...
    FROM Document D ...
      JOIN ApplicationUser U ...
    WHERE NOT EXISTS (SELECT 1 FROM #buffer B WHERE B.DocumentId = D.DocumentId AND B.UserId = U.UserId)
    
  • truncate the target table

  • INSERT INTO DocumentSecurity SELECT * FROM #buffer

The last two steps might need to be optimized if they take too long (perhaps replaced with the swap-the-table-under-the-view mechanism already in place).
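The swap-the-table-under-the-view mechanism could be sketched as follows (table and view names are hypothetical; the idea is that readers never see a half-built table):

```sql
-- Rebuild the inactive copy while readers keep using the active one.
TRUNCATE TABLE dbo.DocumentSecurity_B;

INSERT INTO dbo.DocumentSecurity_B (DocumentId, UserId, CanWrite)
SELECT DocumentId, UserId, CanWrite
FROM #buffer;

-- Repoint the view in one quick metadata operation.
EXEC ('ALTER VIEW dbo.vDocumentSecurity AS
       SELECT DocumentId, UserId, CanWrite
       FROM dbo.DocumentSecurity_B;');
```

On the next run, the roles of the _A and _B tables are reversed.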

This should make the code more maintainable, and I assume performance will improve once all the ORs and DISTINCTs are removed, but I am interested in a DBA's perspective on such an approach. Is it suboptimal?

Note: I cannot have duplicates in the security table because there are currently many places (Entity Framework used in application layer + stored procedures) that simply JOIN with the security view (as opposed to check for existence).

Best Answer

The amount of data we are talking about here is not nearly enough to be this big of a problem. Odds are, a bit of optimization will do you quite nicely.

So, opening your query plan, as I go along:

  • Fix your WHERE clauses. Your current query forces table scans just about everywhere because you use functions. IsNull(COALESCE(linkedlot.IsPrivate, c.IsPrivate), 0) = 0 should be ((ll.IsPrivate = 0) Or (ll.IsPrivate Is Null And c.IsPrivate = 0) Or (ll.IsPrivate Is Null And c.IsPrivate Is Null)). Side note: are these fields actually nullable? Why? These queries would be MUCH simpler if they were not.

  • Change UNION to UNION ALL. UNION does an implicit DISTINCT, but you're doing one in the main query already. Depending on your data, this may help a lot or hurt a lot.

  • Make sure your temporary tables have indexes/primary keys.

  • If the query actually takes a while, load it into a copy of the table, then MERGE it into the live copy afterwards.
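The last bullet can be sketched with a MERGE (the staging table name is hypothetical; the key columns come from the question's schema):

```sql
-- Load the recomputed rows into a staging table first,
-- then apply only the delta to the live table.
MERGE dbo.DocumentSecurity AS target
USING dbo.DocumentSecurity_Staging AS source
    ON  target.DocumentId = source.DocumentId
    AND target.UserId     = source.UserId
WHEN MATCHED AND target.CanWrite <> source.CanWrite THEN
    UPDATE SET CanWrite = source.CanWrite
WHEN NOT MATCHED BY TARGET THEN
    INSERT (DocumentId, UserId, CanWrite)
    VALUES (source.DocumentId, source.UserId, source.CanWrite)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```

This touches only rows that actually changed, which keeps blocking much shorter than a truncate-and-reload.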

Wait, never mind... very simply, your CROSS JOINs are killing you.

Look through your query plan (or rather, save it and open it up in SQL Server Management Studio), and hover over some of those thick arrows. You're exploding Contract times ApplicationUserRole five times (8 million rows every time), Contract times ApplicationUser twice (6 million rows every time), then slapping all those together (on several levels) to get distinct values (sorting 8-16 million rows every time).

But "explode all possible combinations for every possible way rights can be affected, whether they exist or not, then combine, then sort, then sort again" is not what you want. You want users versus documents, and look up their rights.

So let's do that.

Your query should look like this:

Select
  UserId = U.UserId,
  ContractId = C.ContractId,
  -- If there are explicit denials, MAX should be MIN
  CanRead = Max(Case When <<highest priority condition>> Then 1
                     When <<down the list...>>
                     Else 0 End),
  -- If there are explicit denials, MAX should be MIN
  CanWrite = Max(Case When <<highest priority condition>> Then 1
                      When <<down the list...>>
                      Else 0 End)
From
  dbo.User As U
  Cross Join dbo.Contract As C
  Left Outer Join (<<Look up first thing...>>)
  Left Outer Join (<<down the list...>>)
Group By
  U.UserId, C.ContractId

So, grabbing a few from your plan:

Select
  UserId = U.UserId,
  ContractId = C.ContractId,
  CanRead = Max(Case When ContractOwner.UserId Is Not Null Then 1
                     When <<down the list...>>
                     Else 0 End),
  CanWrite = Max(Case When ContractOwner.UserId Is Not Null Then 1
                      When <<down the list...>>
                      Else 0 End)
From
  dbo.User As U
  Cross Join dbo.Contract As C
  Left Outer Join dbo.LotSupplier As LotSupplier
    On LotSupplier.ContractId = C.Id
  Left Outer Join dbo.ContractBuyer As ContractBuyer 
    On ContractBuyer.ContractId = C.Id
  Left Outer Join dbo.Lot As LinkedLot 
    On LotSupplier.LotId = LinkedLot.Id
  Left Outer Join dbo.ApplicationUserRole As ContractOwner
    On ContractOwner.UserId = U.UserId
    And ContractOwner.BUid = C.SignatoryBUId
    And ((C.IsPrivate Is Null) Or (C.IsPrivate = Cast(0 As Bit)))  -- NO ISNULL!!
    And ContractOwner.RoleId In (16, 19, 20)
  Left Outer Join dbo.ApplicationUserRole As Buyer
    On ((C.IsPrivate = Cast(1 As Bit))
        Or (LinkedLot.IsPrivate = Cast(1 As Bit)))
    And ContractBuyer.BuyerId = Buyer.UserId
    And ...on and on...
  Left Outer Join (<<down the list...>>)
Group By
  U.UserId, C.ContractId

Unless you're running this on a brutally underspecced server, the entire query can be made to run in a few seconds, trust me.

Since you're also obviously working in what I call a many-hats model (a person can be a user, a buyer, an owner, et cetera), I highly recommend you consider normalizing out a Person (or Party, or whatever tickles your fancy) table.

  • Information would have to be updated in one place only
  • You would not have to do that awful join on e-mail
  • That join would not fail on updating info in only one place :-)
  • You would be much better set up for sub-contracting, multi-owner contracting, contract transfers and all that fun

It looks like you're kind of using ApplicationUser for this purpose, but I think your question alone shows one of the downsides to that.
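A minimal sketch of what that normalization could look like (all names hypothetical; the real model should be driven by the business):

```sql
-- One row per human being; role-specific tables reference it.
CREATE TABLE dbo.Person (
    Id    INT IDENTITY(1, 1) CONSTRAINT PK_Person PRIMARY KEY CLUSTERED,
    Name  NVARCHAR(200) NOT NULL,
    Email NVARCHAR(320) NOT NULL CONSTRAINT UQ_Person_Email UNIQUE
);

-- ApplicationUser, Buyer, Owner, etc. become thin role tables:
CREATE TABLE dbo.Buyer (
    Id       INT IDENTITY(1, 1) CONSTRAINT PK_Buyer PRIMARY KEY CLUSTERED,
    PersonId INT NOT NULL CONSTRAINT FK_Buyer_Person
                 FOREIGN KEY REFERENCES dbo.Person (Id)
);
-- Joins then go through PersonId instead of matching on e-mail strings.
```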