SQL Server 2019 – Benefits of UTF-8 Support

collationencodingsql serverutf-8

I'm already quite comfortable with using COMPRESS() and DECOMPRESS() in an internal forum software for our company (Currently in SQL Server 2017), but trying to make the database as efficient as possible, is there an advantage to adding _UTF-8 to my current collation as in Latin1_General_100_CI_AS_SC_UTF8 upon future migration to SQL Server 2019?

Best Answer

Here's a list of recommended uses taken from here:

The UTF-8 encoding, being a variable-length encoding, can be a huge benefit in some scenarios, but it can also make things worse in others. Unfortunately, there is very little use for a “_UTF8” encoding given that Data Compression and Clustered Columnstore Indexes are available across all editions of SQL Server. The only scenario that truly benefits from a UTF-8 encoding is one in which all of the following conditions are true:

  1. Data is mostly standard ASCII (values 0 – 127), but either has, or might have, a small amount of a varying range of Unicode characters (more than would be found on a single 8-bit Code Page, or might not exist on any 8-bit Code Page).
  2. Column is currently (or otherwise would be) NVARCHAR(MAX) (meaning, data won’t fit into NVARCHAR(4000)).
  3. There is a lot of data for this column or set of columns (1 GB or more when stored in NVARCHAR).
  4. Performance would be negatively impacted by making the table a Clustered Columnstore table (due to how the table is used) OR data is typically < 8000 bytes There is no desire to make the column VARBINARY(MAX), use COMPRESS() for INSERT and UPDATE operations, and use DECOMPRESS() for SELECT queries (no need to worry about lack of ability to index the VARBINARY value since it is MAX data anyway that cannot be indexed). Keep in mind that the Gzipped value will be much smaller than even the UTF-8 version of the string, though it would require decompressing before the value could be filtered on (outside of “=”) or manipulated.
  5. The benefits of reducing the size of backups and reducing the time it takes to backup and restore, and reducing the impact on the buffer pool, outweigh the cost of the likely negative impact on query performance (for both CPU and elapsed times). Just keep in mind that Backup Compression (available in Enterprise and Standard Editions) might help here.

Storing HTML pages is a good example of a scenario that fits this description. UTF-8 is, of course, the preferred encoding for the interwebs precisely due to it using the minimal space for the most common characters while still allowing for the full range of Unicode characters.