SQL Server – Using Dual Databases for Performance

sql-server, sql-server-2008-r2

We have a SQL database that stores application usage logs for about 3,000 PCs. These PCs send their usage data to the SQL server around 10-20 times per day. We used to store only the most recent 60 days of application usage, but the customer asked us to stop purging data. Now that we have about a year's worth of data (about 6,000,000 rows), the database is suffering from some performance issues. Nothing significant, mind you, but far more than any other database we have. A significant number of records (application open events) are added each hour, and within a few hours at most each record is updated exactly once with the associated application close. It is these updates that you can see via SQL Activity Monitor taking considerable time to complete.

The query that finds the record to UPDATE is simple:

SELECT TOP 1 f_ID
FROM tb_applicationusage
WHERE f_application = 'xxxxxxx' AND
      f_computername = 'xxxxxxxxx' AND
      f_endtime IS NULL
ORDER BY f_starttime DESC

Effectively, it finds the most recent matching application start for a specific machine that doesn't yet have an associated application close. I can't think of a more efficient way to run the query, so I'm considering the following alternative:

Move to two databases:

  1. Working database with only the most recent 24 hours' worth of records
  2. Final database with all other records

I'm no SQL guru, so I'm probably missing some drawbacks of this method. The goal would be to just have a SQL Agent job move the completed records over to the final database every night. Then, when the customer wants to run their monthly reports, I can just have that report query only the final database and not the working database. With only maybe 10,000 records to query in the working database instead of 6,000,000 it would seem logical that it would work faster. But again, it seems so simple I'm probably missing something obvious.
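
Roughly, the nightly job would look something like this. This is only a sketch: the final database name (UsageArchive) and the column list are placeholders based on the query above.

-- Nightly SQL Agent job step (sketch): copy completed records to the final
-- database, then remove them from the working database. The cutoff captured
-- up front keeps the two statements working on the same set of rows.
DECLARE @cutoff DATETIME = GETDATE();

BEGIN TRANSACTION;

INSERT INTO UsageArchive.dbo.tb_applicationusage
        (f_ID, f_application, f_computername, f_starttime, f_endtime)
SELECT  f_ID, f_application, f_computername, f_starttime, f_endtime
FROM    dbo.tb_applicationusage
WHERE   f_endtime IS NOT NULL
  AND   f_endtime <= @cutoff;

DELETE FROM dbo.tb_applicationusage
WHERE   f_endtime IS NOT NULL
  AND   f_endtime <= @cutoff;

COMMIT TRANSACTION;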

Version: Microsoft SQL Server 2008 R2

Best Answer

You can do better than two databases. There are two things you should look at in your existing database before sharding off part of the old data:

  1. Choose a good clustered index. There are four rules you should follow for the clustered index to work well with this data:

    1. It should use an increasing value, so that new records will always belong at the end of the table in cluster order, or at least on the last page. This is especially important when you have lots of inserts, as in this case. Something like an identity/autoincrement field would work, but you'll see in a moment why we can do better.
    2. It should uniquely or nearly-uniquely identify the record, so the updates for the application close records will be fast.
    3. You should be able to determine the clustered index key values from the application close records coming into the DB (this rules out the identity column from earlier).
    4. You don't want anything changed by the application close records to be part of the index, as that could force the database to move the record to a new location on disk when there are updates.

    If there is an increasing timestamp (i.e. f_starttime), that may be a good choice for the first field in the index, as long as it's also part of the close record as indicated in requirement #3. Add any other fields that you'll need to uniquely or nearly-uniquely identify a record. Note that you can still use an identity column for the table; just don't use it as the first column in the clustered index. Based on the SQL code in the question, I might go with f_starttime, f_computername, f_application, f_ID (see the sketch after this list).

    Even if you go with the staging table suggested in the other answer, these index changes may still be a good idea.

  2. Table partitioning. Table partitioning helps the database server keep only the recent records in memory, so that older data from the same table can remain on disk (a partitioning sketch follows below). SQL Server 2016 will even let you push the historical data to cloud storage on Azure via Stretch Database.
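
To make the clustered index suggestion in point 1 concrete, here is a minimal sketch. The column types are assumptions inferred from the query in the question, so adjust them to the real schema:

-- Keep f_ID as an identity primary key, but lead the clustered index with the
-- ever-increasing f_starttime so new rows always land at the end of the table.
CREATE TABLE dbo.tb_applicationusage
(
    f_ID           INT IDENTITY(1, 1) NOT NULL,
    f_application  VARCHAR(255)       NOT NULL,
    f_computername VARCHAR(255)       NOT NULL,
    f_starttime    DATETIME           NOT NULL,
    f_endtime      DATETIME           NULL,
    CONSTRAINT PK_tb_applicationusage PRIMARY KEY NONCLUSTERED (f_ID)
);

-- Adding f_ID as the last key column makes the clustered index unique, and the
-- column that gets updated (f_endtime) is deliberately not part of the key.
CREATE UNIQUE CLUSTERED INDEX CIX_tb_applicationusage
    ON dbo.tb_applicationusage (f_starttime, f_computername, f_application, f_ID);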

The other suggestion to keep completed records separate from open records is good, too. Even with that suggestion, though, indexing and table partitioning can help as the table of completed records grows large. You can start looking at sharding old data off to a separate (linked) database only after all of these options fail.
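
For reference, here is a minimal sketch of the partitioning suggested in point 2, using monthly boundaries. The dates and object names are placeholders, and note that table partitioning requires Enterprise Edition on SQL Server 2008 R2:

-- Partition on f_starttime by month so older partitions can age out of memory
-- while the recent ones stay hot.
CREATE PARTITION FUNCTION pf_UsageByMonth (DATETIME)
    AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01');

CREATE PARTITION SCHEME ps_UsageByMonth
    AS PARTITION pf_UsageByMonth ALL TO ([PRIMARY]);

-- Rebuild the clustered index on the partition scheme; the partitioning column
-- (f_starttime) must be part of the unique index key, which it already is.
CREATE UNIQUE CLUSTERED INDEX CIX_tb_applicationusage
    ON dbo.tb_applicationusage (f_starttime, f_computername, f_application, f_ID)
    WITH (DROP_EXISTING = ON)
    ON ps_UsageByMonth (f_starttime);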

Really, though, SQL Server is easily able to handle six million or so records without resorting to these kinds of tricks (changing the index may still be worth doing, though). Are you sure the server is correctly provisioned for this workload? You might do just as well by simply adding RAM to the server.

Finally, separating a reporting database from the live processing database is also common, and not at all a bad thing to do. We sometimes call this a "Data Warehouse", though that also often involves schema changes and an SSIS process to move the data. This is a good feature to have, because it prevents an accidental mistake in a data analysis query from causing performance problems in production. You can best accomplish this via database mirroring/log shipping to a read-only slave, or more recently via an AlwaysOn Availability Group.
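
On 2008 R2 the mirror copy isn't directly readable, so reporting against a mirror usually goes through a database snapshot taken on the mirror server (snapshots, like partitioning, are an Enterprise Edition feature there). A minimal sketch, with the database and file names as placeholders:

-- Run on the mirror server: create a point-in-time, read-only snapshot of the
-- mirrored database and point the reporting queries at it.
CREATE DATABASE UsageLogs_Reporting
    ON (NAME = UsageLogs_Data, FILENAME = 'D:\Snapshots\UsageLogs_Reporting.ss')
    AS SNAPSHOT OF UsageLogs;

Because a snapshot is static, a scheduled job would typically drop and recreate it (nightly, say) to refresh the reporting data.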