SQL Server – Retain Column Statistics on Paragraph-Style Columns

sql serverstatistics

Specific to Microsoft SQL Server…

If a query has a search predicate against a paragraph-style text column, it will create a system-created statistic to answer the query.

Got it.

My question is if those should be removed later as they can't be of very much help since they are based purely on the first few characters of the text string and not actually useful in the search which will be a scan against the table no matter what. Later "grooming" operations via auto-stats update, or a self-developed-stats-grooming process will encounter these and may try and update them.
To complicate it, this column is typically "wide" and can often be "deep" (say, a bajillion rows) and cause a large IO hit to accomplish even a small sample scan to update the stat.

My thinking leads me to say they should be removed. What do others say?

I am not asking if queries against these types of data are advisable. They happen. I am just doubting the usefulness of these auto-generated stats and their justification to exist unless you happen to have an text column with the first few characters actually useful in finding the rows you need – enough so that you'd consider create an index on that column – and I expect that is unusual.

What say you? Keep these albatrosses or get rid of them?

Best Answer

First of all, and perhaps most importantly, albatrosses are not unlucky. They are in fact considered a good omen. In the poem, the young sailor shoots it with his crossbow, and it is this reckless act that is considered unlucky.

Second of all, SQL Server uses a technique called trie trees, for string statistics. These are not well documented but get a brief mention in 'Microsoft SQL Server 2012 Internals' in the 'String Statistics' section. You cannot delete trie trees or control them.

If you delete the ordinary column statistics associated with a column they will be get recreated next time someone queries that column, unless you explicitly disable them which is inadvisable. In the same way, the sailor is forced to wear the dead albatross around his neck as punishment : )

I think Sean's suggestion of using Full Text Indexing is a good one. In summary do not shoot albatrosses, do not disable auto-stats.

Let's do a simple walkthru using 'The Rime of the Ancient Mariner' as an example. I downloaded the text from Project Gutenberg available here, cleaned it up a bit and loaded it up in to a SQL Server table. For a simple Like query with wildcards, I get a full clustered index scan with an estimated rowcount of 1.55 and actual rowcount of 7:

Looking at the statistics for the column, we can see a lot of stats that probably wouldn't make that much sense for other queries, which I think was your point:

Despite this, the estimated versus actual rowcount (1.55 versus 7) is pretty reasonable and is in the same order of magnitude, but we will get effectively a table scan because of the wildcards.

Switching on full-text indexing, we can get more accurate rowcounts and start to interrogate the full-text index for other information it stores about the documents, eg

SELECT * 
FROM dbo.documentText
WHERE CONTAINS ( documentText, 'albatross' )
GO

SELECT *
FROM sys.dm_fts_index_keywords( DB_ID(), OBJECT_ID('dbo.documentText') )


SELECT *
FROM sys.dm_fts_index_keywords_by_document( DB_ID(), OBJECT_ID('dbo.documentText') )


SELECT TOP 10 display_term, COUNT(*)
FROM sys.dm_fts_index_keywords_by_document( DB_ID(), OBJECT_ID('dbo.documentText') )
GROUP BY display_term
ORDER BY 2 DESC

We now get index seeks and potentially this could make a large performance difference on a bigger table.

Full walkthru script available here.

PS No albatrosses were harmed in the making of this post.

Related Solutions

Sql-server – Auto Update Statistics in SQL Server 2008R2: Why do some statistics remain stale despite a large number of row inserts

Some info, if not a definitive answer

It's been blogged recently

When are Statistics Updated – What triggers Statistics to Update
Comment from Paul Randal

There is a whitepaper too. See the section "Maintaining Statistics in SQL Server 2008" where there are some conditions that sound like affect you. Example:

One limitation of the automatic update logic is that it tracks changes to columns in the statistics, but not changes to columns in the predicate. If there are many changes to the columns used in predicates of filtered statistics, consider using manual updates to keep up with the changes.

At the end there are some settings to check too: what if OFF at the DB level which overrides an ON at the index/stat level?

HTH...

SQL Server – Identifying Resampled Partitions in Incremental Statistics

My first question would be to ask why you're actually using incremental in the first place. Here's an answer that I posted regarding incremental statistics, a blog post by Erin Stellato that illuminates one of the primary complaints and pitfalls with incremental statistics (they're not used at the partition level by the optimizer), and two blog posts by me that work through evaluating any potential use case for incremental statistics.

Having said that, to know when a partition's stats have been sampled you can use an undocumented DMF (sys.dm_db_stats_properties_internal()) to get partition-level information. I have a comment on this blog post that describes how to understand the hierarchy at a fairly high level.

select 
    sysdatetime(),                          
    schema_name = sh.name,
    table_name = t.name,
    stat_name = s.name,
    index_name = i.name,
    leading_column = index_col(quotename(sh.name)+'.'+quotename(t.name),s.stats_id,1),
    s.stats_id,
    parition_number = isnull(sp.partition_number,1),
    s.has_filter,                       
    s.is_incremental,
    s.auto_created,
    sp.last_updated,    
    sp.rows,
    sp.rows_sampled,                        
    sp.unfiltered_rows,
    modification_counter = coalesce(sp.modification_counter, n1.modification_counter) 
from sys.stats s 
join sys.tables t 
    on s.object_id = t.object_id
join sys.schemas sh
    on t.schema_id = sh.schema_id
left join sys.indexes i 
    on s.object_id = i.object_id
        and s.name = i.name
cross apply sys.dm_db_stats_properties_internal(s.object_id, s.stats_id) sp
outer apply sys.dm_db_stats_properties_internal(s.object_id, s.stats_id) n1
where n1.node_id = 1
    and (
            (is_incremental = 0)
               or
            (is_incremental = 1 and sp.partition_number is not null)
         )
    and t.name = '<<TABLENAME>>'
    and s.name like '<<STATNAME>>%'
order by s.stats_id,isnull(sp.partition_number,1);

Best Answer

Related Solutions

Sql-server – Auto Update Statistics in SQL Server 2008R2: Why do some statistics remain stale despite a large number of row inserts

SQL Server – Identifying Resampled Partitions in Incremental Statistics

Related Question