SQL Server – Using wildcards in a LIKE predicate on an unindexed VARCHAR(MAX) column with more than 1 million records

sql server

In order to troubleshoot a problem, I have a one-time question: does a specific varchar(max) column contain non-printing ASCII characters (other than whitespace)? The following is my straightforward idea for determining whether any such characters are stored in our production database.

SELECT TOP 10 [CaseNoteId]
      ,[CaseId]
      ,[CaseNote]
  FROM [DB].[XY].[ReferralCaseNotes]
  WHERE CaseNote LIKE '%[' + CHAR(1) + '-' + CHAR(8) + CHAR(11) + CHAR(12)
                     + CHAR(14) + '-' + CHAR(31) + CHAR(127) + ']%'
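One caveat worth knowing: character ranges such as `[x-y]` inside a LIKE pattern are evaluated according to the collation's sort order, not by raw code point, so under a typical case-insensitive collation the range CHAR(1)-CHAR(8) may not match exactly the bytes you expect. A sketch of the same probe forced to a binary collation (table and column names as above):

```sql
-- Same probe, but with an explicit binary collation so the [x-y]
-- ranges in the LIKE pattern match by code point rather than by the
-- column's default collation sort order.
SELECT TOP 10 [CaseNoteId]
      ,[CaseId]
      ,[CaseNote]
  FROM [DB].[XY].[ReferralCaseNotes]
  WHERE CaseNote COLLATE Latin1_General_BIN
        LIKE '%[' + CHAR(1) + '-' + CHAR(8) + CHAR(11) + CHAR(12)
           + CHAR(14) + '-' + CHAR(31) + CHAR(127) + ']%';
```

The explicit COLLATE on the column takes precedence over the pattern's implicit collation, so one clause is enough; since the column is unindexed either way, forcing a collation costs nothing extra here.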

My hesitancy to actually run this stems from the wildcards in the LIKE pattern, the more than one million records in the table, the lack of a full-text index on this column, and the likelihood that the search will be exhaustive, because we do not believe any such characters exist.

I am a neophyte. How can I estimate whether running this query will be a significant load on our production system? Also, is there a better way to get at the same information?

Possible Improvements:

  1. I'm not worried about data changing while my query runs. Can I change this query to look at a few rows at a time in a way that is beneficial?
  2. Can I set this query to somehow be a background operation that doesn't get in the way of any other queries?
  3. Can I run it for a limited time and determine what percentage of the table was searched, so that I can estimate the time required for a full search?
  4. Would WITH (READPAST) improve my performance?
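To make item 1 concrete, one way to look at a few rows at a time is to walk the table in key ranges with a short pause between batches. A sketch, assuming (hypothetically) that CaseNoteId is an ascending integer clustered key:

```sql
-- Batched probe: scan a fixed key range per iteration, pausing between
-- batches so concurrent workloads get breathing room. Batch size and
-- delay are illustrative assumptions, not tuned values.
DECLARE @lo INT = 0, @batch INT = 50000, @max INT;
SELECT @max = MAX(CaseNoteId) FROM [DB].[XY].[ReferralCaseNotes];

WHILE @lo <= @max
BEGIN
    SELECT TOP 10 CaseNoteId, CaseId, CaseNote
      FROM [DB].[XY].[ReferralCaseNotes]
      WHERE CaseNoteId >= @lo AND CaseNoteId < @lo + @batch
        AND CaseNote LIKE '%[' + CHAR(1) + '-' + CHAR(8) + CHAR(11) + CHAR(12)
                        + CHAR(14) + '-' + CHAR(31) + CHAR(127) + ']%';

    SET @lo += @batch;
    WAITFOR DELAY '00:00:01';   -- throttle: 1-second pause between batches
END
```

Because each batch is a clustered-key range seek, no batch rescans rows already covered, and you can stop at any point knowing exactly which key range has been checked.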

Why?

The database in question involves sensitive data, the government, and security folks making rules. Restoring a backup to a different server makes a ton of sense, but would cost the taxpayer several orders of magnitude more than makes any sense.

If the answer is, "Don't worry, you're just doing a SELECT," then I say, "Great!"

Best Answer

  1. If snapshot isolation is enabled you will not have any blocking issues. If not, run the query under READ COMMITTED or, to avoid taking shared locks entirely, READ UNCOMMITTED. It is a common myth that a READ COMMITTED scan locks the whole table; it takes and releases row locks as it goes.
  2. You can use Resource Governor for this, or a MAXDOP 1 hint. Controlling the load of bulk operations is hard in SQL Server: depending on the situation you might be 100% fine leaving this running all day, or you might induce timeouts elsewhere in the workload. It is not unreasonable to run the query for 10 seconds, cancel it, and then check whether the application workload was impacted.
  3. I like to estimate progress by dividing the table size (in MB) by the observed disk read rate (in MB/sec). This gives an estimate of the total scan time.
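Items 1–3 above can be sketched in one script: set a non-blocking isolation level, fetch the table size for the scan-time estimate, and run the probe single-threaded so it is easy to cancel and cheap on CPU. The DMV query below uses `sys.dm_db_partition_stats` (8 KB pages, hence the `* 8 / 1024.0` conversion to MB):

```sql
-- Run the one-off probe without taking shared locks, on a single
-- scheduler, with the table size at hand for estimating scan time.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Table size in MB: reserved pages * 8 KB/page / 1024 KB/MB.
SELECT SUM(reserved_page_count) * 8 / 1024.0 AS table_mb
  FROM sys.dm_db_partition_stats
  WHERE object_id = OBJECT_ID(N'[XY].[ReferralCaseNotes]');

-- The probe itself, limited to one thread so it is gentle on CPU.
SELECT TOP 10 CaseNoteId, CaseId, CaseNote
  FROM [DB].[XY].[ReferralCaseNotes]
  WHERE CaseNote LIKE '%[' + CHAR(1) + '-' + CHAR(8) + CHAR(11) + CHAR(12)
                    + CHAR(14) + '-' + CHAR(31) + CHAR(127) + ']%'
  OPTION (MAXDOP 1);
```

Divide `table_mb` by the read rate you observe (e.g. in Performance Monitor or `sys.dm_io_virtual_file_stats`) while the probe runs to estimate how long a full scan would take.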

Full-text search cannot help you because it indexes on a per-word basis. You would need to plug in a custom word breaker that knows how to split on these control characters, which is unrealistic. Your query is fine.