SQL Server – Retain Column Statistics on Paragraph-Style Columns

sql serverstatistics

Specific to Microsoft SQL Server…

If a query has a search predicate against a paragraph-style text column, it will create a system-created statistic to answer the query.

Got it.

My question is if those should be removed later as they can't be of very much help since they are based purely on the first few characters of the text string and not actually useful in the search which will be a scan against the table no matter what. Later "grooming" operations via auto-stats update, or a self-developed-stats-grooming process will encounter these and may try and update them.
To complicate it, this column is typically "wide" and can often be "deep" (say, a bajillion rows) and cause a large IO hit to accomplish even a small sample scan to update the stat.

My thinking leads me to say they should be removed. What do others say?

I am not asking if queries against these types of data are advisable. They happen. I am just doubting the usefulness of these auto-generated stats and their justification to exist unless you happen to have an text column with the first few characters actually useful in finding the rows you need – enough so that you'd consider create an index on that column – and I expect that is unusual.

What say you? Keep these albatrosses or get rid of them?

Best Answer

First of all, and perhaps most importantly, albatrosses are not unlucky. They are in fact considered a good omen. In the poem, the young sailor shoots it with his crossbow, and it is this reckless act that is considered unlucky.

Second of all, SQL Server uses a technique called trie trees, for string statistics. These are not well documented but get a brief mention in 'Microsoft SQL Server 2012 Internals' in the 'String Statistics' section. You cannot delete trie trees or control them.

If you delete the ordinary column statistics associated with a column they will be get recreated next time someone queries that column, unless you explicitly disable them which is inadvisable. In the same way, the sailor is forced to wear the dead albatross around his neck as punishment : )

I think Sean's suggestion of using Full Text Indexing is a good one. In summary do not shoot albatrosses, do not disable auto-stats.

Let's do a simple walkthru using 'The Rime of the Ancient Mariner' as an example. I downloaded the text from Project Gutenberg available here, cleaned it up a bit and loaded it up in to a SQL Server table. For a simple Like query with wildcards, I get a full clustered index scan with an estimated rowcount of 1.55 and actual rowcount of 7:

Simple Like query

Looking at the statistics for the column, we can see a lot of stats that probably wouldn't make that much sense for other queries, which I think was your point:

DBCC SHOW_STATISTICS results

Despite this, the estimated versus actual rowcount (1.55 versus 7) is pretty reasonable and is in the same order of magnitude, but we will get effectively a table scan because of the wildcards.

Switching on full-text indexing, we can get more accurate rowcounts and start to interrogate the full-text index for other information it stores about the documents, eg

SELECT * 
FROM dbo.documentText
WHERE CONTAINS ( documentText, 'albatross' )
GO

SELECT *
FROM sys.dm_fts_index_keywords( DB_ID(), OBJECT_ID('dbo.documentText') )


SELECT *
FROM sys.dm_fts_index_keywords_by_document( DB_ID(), OBJECT_ID('dbo.documentText') )


SELECT TOP 10 display_term, COUNT(*)
FROM sys.dm_fts_index_keywords_by_document( DB_ID(), OBJECT_ID('dbo.documentText') )
GROUP BY display_term
ORDER BY 2 DESC

We now get index seeks and potentially this could make a large performance difference on a bigger table.

Full walkthru script available here.

PS No albatrosses were harmed in the making of this post.