SQL Server – Predicting Query Performance Over Time

nonclustered-index, performance, query-performance, sql-server, sql-server-2012

I have a SQL Server 2012 database that I am using to save data from processed files. We read data from a folder, process it with Python, and save the results into the database.

One of the first things we do in our ETL process is check whether the file has already been processed. We simply run:

SELECT id FROM table1 WHERE basename = <basename>

If there is a result, we skip the file; if there is no result, we process it. Right now this query takes ~250 ms against ~5 million records. We already have a nonclustered index on the basename column.

We will see about 100-200k records added per month. The files arrive in batches: we might see 2k files, then another 2k files two hours later. Some days we will get 10k files; other days only 4k.

Keeping all other variables the same, is there a rule of thumb for projecting when we might run into performance issues with this query (it taking longer than 1 s), other than inserting 15-20 million records into the table and seeing what happens?

Table DDL:

CREATE TABLE [dbo].[raw_records](
    [id] [int] IDENTITY(1,1) NOT NULL,
    [basename] [varchar](512) NULL,
    [filename] [varchar](1024) NULL,
    [file_size] [int] NULL,
    [machine] [varchar](10) NULL,
    [insert_timestamp] [datetime] NULL,
    [raw_xml] [xml] NULL,
    [process_status] [varchar](2048) NULL,

    PRIMARY KEY CLUSTERED
    (
        [id] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

Index:

CREATE NONCLUSTERED INDEX [basename_index] ON [dbo].[raw_records]
(
    [basename] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

This table was created long before I started working on it, so I'm assuming someone just gave filename a 1024-character maximum to hold "enough". It's certainly changeable.

The files that are produced have a timestamp and unique information within the filename itself (e.g. system1_metadata_timestamp.xml); one "system" couldn't (or at least should never) produce two files with the same timestamp.

SELECT MAX(LEN(basename)), MAX(LEN(filename)) FROM dbo.raw_records;

returns basename = 143 and filename = 168. Bringing both down to a maximum of 260 is probably a good idea.

process_status probably doesn't need to be that long either, but that one feels reasonable to guess at, because the column is there to hold error messages from the processing phase. I ran a query on it and found a maximum of 600 characters. We don't typically query on that column, though; it's mostly informational, for debugging.

I'm going through the application cleaning up things like that. In certain places I can't get away from large values (e.g. I need to actually retrieve the XML column to extract data from it), but in others there is room to tighten up. This question simply stemmed from seeing the performance of the query in question and not wanting it to get away from me. It is the first thing that gets done for every file, so if it doesn't perform, nothing else will either.

Best Answer

With a good index present, the time taken to locate a matching row should scale roughly logarithmically with table size, as long as the index fits in memory.
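
To make that concrete: a B-tree seek reads one page per index level, and the number of levels grows with the logarithm of the row count, so growing from 5 million to 20 million rows should add at most one level to the index. If you want to watch the actual depth over time, a minimal sketch (index_depth is the number of levels a seek has to traverse):

SELECT i.name AS index_name,
       ps.index_depth,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.raw_records'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id = ps.index_id;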

I'd make the index UNIQUE: basename must be unique (otherwise your workflow is invalid), and uniqueness makes the index more efficient.

CREATE UNIQUE INDEX IX_raw_records_basename
ON dbo.raw_records (basename);
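
If the table already holds data, it's worth confirming there are no existing duplicates before creating the unique index; a quick check along these lines:

SELECT basename, COUNT(*) AS duplicate_count
FROM dbo.raw_records
GROUP BY basename
HAVING COUNT(*) > 1;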

Check the execution plan for the query to ensure the index is being used.
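
For example, with SET STATISTICS IO ON (or the actual execution plan enabled in SSMS), a lookup like the following should show a single index seek on the basename index with only a handful of logical reads; the literal value here is just illustrative:

SET STATISTICS IO ON;

SELECT id
FROM dbo.raw_records
WHERE basename = 'system1_metadata_20120101.xml'; -- illustrative value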

Make sure you have enough memory for the index to stay cached; assuming concurrency won't be a massive problem, you should be good for a very large number of rows.

I'd reconsider the lengths of the basename and filename columns, since the query optimizer uses the declared length when calculating how much memory to grant for a query. If, for instance, the basename column will never hold more than 20 characters but is defined as 512 characters, the memory grant for a query that has to sort it (e.g. SELECT basename FROM dbo.raw_records ORDER BY basename;) could be 25.6 times larger than actually required. Column lengths matter much more than most people realize.
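
As a sketch of what that cleanup could look like, using the observed maxima (143 and 168) and the 260-character ceiling floated in the question; note that the nonclustered index on basename has to be dropped and recreated around the change, since you can't shrink an indexed column in place:

DROP INDEX basename_index ON dbo.raw_records;

-- Hypothetical new lengths based on the observed data; adjust as needed.
ALTER TABLE dbo.raw_records ALTER COLUMN basename varchar(260) NULL;
ALTER TABLE dbo.raw_records ALTER COLUMN [filename] varchar(260) NULL;

CREATE UNIQUE INDEX IX_raw_records_basename
ON dbo.raw_records (basename);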

You could also change the query to SELECT 1 FROM table1 WHERE basename = <basename>; that way you don't even need the id, since all you're trying to do is verify existence. Only do what you really need. The index you show in your question would work fine for that.
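
From the application side, an EXISTS wrapper is a common equivalent that stops scanning at the first match (the parameter name here is just illustrative):

SELECT CASE WHEN EXISTS (
           SELECT 1
           FROM dbo.raw_records
           WHERE basename = @basename -- illustrative parameter
       ) THEN 1 ELSE 0 END AS already_processed;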

In addition, if space is an issue, you may want to consider data compression on the indexes and table; that should let the index fit in a smaller memory footprint. Evaluate DATA_COMPRESSION = ROW versus DATA_COMPRESSION = PAGE to see which is the better method for your requirements.
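
A rough outline of that evaluation, assuming PAGE compression is the candidate (note that data compression requires Enterprise Edition on SQL Server 2012):

-- Estimate the savings first; the procedure reports current vs. compressed size.
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'raw_records',
    @index_id = NULL,          -- NULL = all indexes on the table
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- If the estimate looks worthwhile, rebuild with compression:
ALTER INDEX basename_index ON dbo.raw_records
REBUILD WITH (DATA_COMPRESSION = PAGE);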