We have a SQL database that stores application usage logs for about 3,000 PCs. These PCs send their usage data to the SQL server around 10-20 times per day. We used to store only the most recent 60 days of application usage, but the customer asked us to no longer purge data. Now that we have about a year's worth of data (about 6,000,000 rows), the SQL database is suffering from some performance issues. Not significant, mind you, but far more than any other database we have. A significant number of records are added each hour (application open records), and within a few hours at most each record is updated exactly once with the associated application close. SQL Activity Monitor shows that it is these updates that are taking considerable time to complete.
The SELECT that finds the record to UPDATE is simple:
SELECT TOP 1 f_ID
FROM tb_applicationusage
WHERE f_application = 'xxxxxxx'
  AND f_computername = 'xxxxxxxxx'
  AND f_endtime IS NULL
ORDER BY f_starttime DESC
Effectively, it finds the most recent matching application start for a specific machine that doesn't yet have an associated application close. I can't think of a more efficient way to run the query, so I'm considering the following alternative:
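For context, the find-and-update pattern described above can also be collapsed into a single statement. This is a sketch, not the poster's actual code; it reuses the table and column names from the question and assumes hypothetical `@application`, `@computername`, and `@closetime` parameters:

```sql
-- Sketch: close the most recent open record in one statement.
-- Updating through a CTE is supported on SQL Server 2008 R2;
-- the TOP 1 ... ORDER BY picks the same row the question's SELECT would.
WITH open_record AS (
    SELECT TOP 1 f_ID, f_endtime
    FROM tb_applicationusage
    WHERE f_application  = @application
      AND f_computername = @computername
      AND f_endtime IS NULL
    ORDER BY f_starttime DESC
)
UPDATE open_record
SET f_endtime = @closetime;
```

Doing it in one statement avoids a round trip and a window between the SELECT and the UPDATE, though it does not change how much work the row lookup itself costs.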
Move to two databases:
- Working database with only the most recent 24 hours worth of records
- Final database with all other records
I'm no SQL guru, so I'm probably missing some drawbacks of this method. The goal would be to just have a SQL Agent job move the completed records over to the final database every night. Then, when the customer wants to run their monthly reports, I can just have that report query only the final database and not the working database. With only maybe 10,000 records to query in the working database instead of 6,000,000 it would seem logical that it would work faster. But again, it seems so simple I'm probably missing something obvious.
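The nightly move described above might look something like the following. This is only a sketch: the archive database name (ArchiveDB) is hypothetical, the column list assumes the fields visible in the question's query, and in practice you would wrap it in proper error handling inside the SQL Agent job step:

```sql
-- Sketch: nightly SQL Agent job step that moves completed records
-- older than 24 hours from the working table to the archive table.
-- "ArchiveDB" and the exact column list are assumptions.
DECLARE @cutoff DATETIME = DATEADD(HOUR, -24, GETDATE());

BEGIN TRANSACTION;

INSERT INTO ArchiveDB.dbo.tb_applicationusage
    (f_ID, f_computername, f_application, f_starttime, f_endtime)
SELECT f_ID, f_computername, f_application, f_starttime, f_endtime
FROM dbo.tb_applicationusage
WHERE f_endtime IS NOT NULL
  AND f_endtime < @cutoff;

DELETE FROM dbo.tb_applicationusage
WHERE f_endtime IS NOT NULL
  AND f_endtime < @cutoff;

COMMIT TRANSACTION;
```

Capturing the cutoff in a variable keeps the INSERT and DELETE operating on the same set of rows, and the transaction ensures a row is never deleted without having been copied first.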
Version: Microsoft SQL Server 2008 R2
Best Answer
You can do better than two databases. There are two things you should look at in your existing database before sharding off part of the old data:
Choose a good clustered index. Three things make a clustered index work well with this data: the leading column should be ever-increasing, the key should uniquely (or nearly uniquely) identify a record, and every column in the key should be known when the close record is written. If there is an increasing timestamp (i.e. f_starttime), that may be good as the first field in the index, as long as it's also available to the close record per the third requirement. Add any other fields that you'll need to uniquely or nearly-uniquely identify a record. Note that you can still use an identity column for the table; just don't use it as the first column in the clustered index. Based on the SQL code in the question, I might go with f_starttime, f_computername, f_application, f_ID.
Even if you go with the staging table suggested in the other answer, these index changes may still be a good idea.
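As a sketch, the suggested key order would be created like this (assuming f_ID stays as the table's identity column, with any existing primary key made nonclustered first; the index name is made up):

```sql
-- Sketch: clustered index in the suggested key order.
-- f_starttime leads because it is ever-increasing; the remaining
-- columns make the key (nearly) unique and match the lookup query.
CREATE CLUSTERED INDEX IX_tb_applicationusage_start
ON dbo.tb_applicationusage (f_starttime, f_computername, f_application, f_ID);
```

With an ever-increasing leading column, new open records land at the logical end of the table, which keeps inserts cheap and fragmentation low.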
The other suggestion to keep completed records separate from open records is good, too. Even with that suggestion, though, indexing and table partitioning can help as the table of completed records grows large. Only after all of these options fail should you start looking at sharding old data off to a separate (linked) database.
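One way to get much of the benefit of separating open from completed records without a second table is a filtered index, which SQL Server 2008 and later support. This is a sketch against the question's schema; the index name is made up:

```sql
-- Sketch: a filtered index containing only still-open records.
-- The question's lookup (application + computer, f_endtime IS NULL,
-- most recent f_starttime) can be satisfied entirely from this index,
-- which stays tiny no matter how large the full table grows.
CREATE NONCLUSTERED INDEX IX_tb_applicationusage_open
ON dbo.tb_applicationusage (f_computername, f_application, f_starttime DESC)
WHERE f_endtime IS NULL;
```

Because rows leave the index as soon as f_endtime is set, the structure the UPDATE has to search holds only the handful of currently-open sessions.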
Really, though, SQL Server can easily handle six million or so records without resorting to these kinds of tricks (changing the index may still be worth doing, though). Are you sure the server is correctly provisioned for this workload? You might do just as well simply by adding RAM to the server.
Finally, separating a reporting database from the live processing database is also common, and not at all a bad thing to do. We sometimes call this a "Data Warehouse", though that also often involves schema changes and an SSIS process to move the data. This is a good feature to have, because it prevents an accidental mistake in a data analysis query from causing performance problems in production. You can best accomplish this via database mirroring/log shipping to a read-only slave, or more recently via an AlwaysOn Availability Group.