SQL Server – Using wildcards in a LIKE predicate on an unindexed VARCHAR(MAX) column with more than 1 million records

sql server

In order to troubleshoot a problem, I have a one-time question: does a specific varchar(max) column contain non-printing ASCII characters (other than whitespace)? The following is my straightforward idea for determining whether any such characters are stored in our production database.

SELECT TOP 10 [CaseNoteId]
      ,[CaseId]
      ,[CaseNote]
  FROM [DB].[XY].[ReferralCaseNotes]
  WHERE CaseNote LIKE '%[' + CHAR(1) + '-' + CHAR(8) + CHAR(11) + CHAR(12)
                     + CHAR(14) + '-' + CHAR(31) + CHAR(127) + ']%'
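One caveat worth knowing: character ranges such as `[x-y]` inside a LIKE pattern are evaluated according to the collation's sort order, not by raw code point, so under a typical case-insensitive collation the range CHAR(1)-CHAR(8) may not match exactly the bytes you expect. A sketch of the same probe forced to a binary collation (table and column names as above):

```sql
-- Same probe, but with an explicit binary collation so the [x-y]
-- ranges in the LIKE pattern match by code point rather than by the
-- column's default collation sort order.
SELECT TOP 10 [CaseNoteId]
      ,[CaseId]
      ,[CaseNote]
  FROM [DB].[XY].[ReferralCaseNotes]
  WHERE CaseNote COLLATE Latin1_General_BIN
        LIKE '%[' + CHAR(1) + '-' + CHAR(8) + CHAR(11) + CHAR(12)
           + CHAR(14) + '-' + CHAR(31) + CHAR(127) + ']%';
```

The explicit COLLATE on the column takes precedence over the pattern's implicit collation, so one clause is enough; since the column is unindexed either way, forcing a collation costs nothing extra here.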

My hesitancy to actually run this stems from the wildcards in the LIKE pattern, the more than one million records in the table, the lack of a full-text index on this column, and the likelihood that the search will be exhaustive, because we do not believe any such characters exist.

I am a neophyte. How can I estimate whether running this query will be a significant load on our production system? Also, is there a better way to get at the same information?

Possible Improvements:

  1. I'm not worried about data changing while my query runs. Can I change this query to look at a few rows at a time in a way that is beneficial?
  2. Can I set this query to somehow be a background operation that doesn't get in the way of any other queries?
  3. Can I run it for a limited time and determine what percentage of the table was searched, so that I can estimate the time required for a full search?
  4. Would WITH (READPAST) improve my performance?
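To make item 1 concrete, one way to look at a few rows at a time is to walk the table in key ranges with a short pause between batches. A sketch, assuming (hypothetically) that CaseNoteId is an ascending integer clustered key:

```sql
-- Batched probe: scan a fixed key range per iteration, pausing between
-- batches so concurrent workloads get breathing room. Batch size and
-- delay are illustrative assumptions, not tuned values.
DECLARE @lo INT = 0, @batch INT = 50000, @max INT;
SELECT @max = MAX(CaseNoteId) FROM [DB].[XY].[ReferralCaseNotes];

WHILE @lo <= @max
BEGIN
    SELECT TOP 10 CaseNoteId, CaseId, CaseNote
      FROM [DB].[XY].[ReferralCaseNotes]
      WHERE CaseNoteId >= @lo AND CaseNoteId < @lo + @batch
        AND CaseNote LIKE '%[' + CHAR(1) + '-' + CHAR(8) + CHAR(11) + CHAR(12)
                        + CHAR(14) + '-' + CHAR(31) + CHAR(127) + ']%';

    SET @lo += @batch;
    WAITFOR DELAY '00:00:01';   -- throttle: 1-second pause between batches
END
```

Because each batch is a clustered-key range seek, no batch rescans rows already covered, and you can stop at any point knowing exactly which key range has been checked.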

Why?

The database in question involves sensitive data, the government, and security folks making rules. Restoring a backup to a different server makes a ton of sense, but would cost the taxpayer several orders of magnitude more than makes any sense.

If the answer is, "Don't worry, you're just doing a SELECT," then I say, "Great!"

Best Answer

  1. If snapshot isolation is enabled you will not have any blocking issues. If not, run the query under READ COMMITTED or, to avoid taking shared locks entirely, READ UNCOMMITTED. It is a common myth that a READ COMMITTED scan locks the whole table; it takes and releases row locks as it goes.
  2. You can use Resource Governor for this, or a MAXDOP 1 hint. Controlling the load of bulk operations is hard in SQL Server: depending on the situation you might be 100% fine leaving this running all day, or you might induce timeouts elsewhere in the workload. It is not unreasonable to run the query for 10 seconds, cancel it, and then check whether the application workload was impacted.
  3. I like to estimate progress by dividing the table size (in MB) by the observed disk read rate (in MB/sec). This gives an estimate of the total scan time.
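Items 1–3 above can be sketched in one script: set a non-blocking isolation level, fetch the table size for the scan-time estimate, and run the probe single-threaded so it is easy to cancel and cheap on CPU. The DMV query below uses `sys.dm_db_partition_stats` (8 KB pages, hence the `* 8 / 1024.0` conversion to MB):

```sql
-- Run the one-off probe without taking shared locks, on a single
-- scheduler, with the table size at hand for estimating scan time.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Table size in MB: reserved pages * 8 KB/page / 1024 KB/MB.
SELECT SUM(reserved_page_count) * 8 / 1024.0 AS table_mb
  FROM sys.dm_db_partition_stats
  WHERE object_id = OBJECT_ID(N'[XY].[ReferralCaseNotes]');

-- The probe itself, limited to one thread so it is gentle on CPU.
SELECT TOP 10 CaseNoteId, CaseId, CaseNote
  FROM [DB].[XY].[ReferralCaseNotes]
  WHERE CaseNote LIKE '%[' + CHAR(1) + '-' + CHAR(8) + CHAR(11) + CHAR(12)
                    + CHAR(14) + '-' + CHAR(31) + CHAR(127) + ']%'
  OPTION (MAXDOP 1);
```

Divide `table_mb` by the read rate you observe (e.g. in Performance Monitor or `sys.dm_io_virtual_file_stats`) while the probe runs to estimate how long a full scan would take.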

Full-text search cannot help you because it indexes on a per-word basis. You would need to plug in a custom word breaker that knows how to split on these control characters, which is unrealistic. Your query is fine.