Regarding:
I tested doing an “add new column, set new=old, drop old, rename old to new” for both of the main two fields I want to convert from nvarchar(max) to varchar(max), and it took 81 minutes on our test server ... before running out of disk space ... it was too slow.
and
Another technique I plan to try is to create the new table def in a new schema, select into it from the old table, then move tables amongst schema and delete the old table.
Generally speaking, making a copy of the table with the ideal schema is my preferred approach. But if, right now, you only have maybe enough space to convert the two columns, are you sure you have enough space to make a copy of the entire table?
Also, the new table would just need to have a different name. It wouldn't need to be in a different schema.
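If you go that route, a rough sketch of the copy-and-rename swap (dbo.MyTable, dbo.MyTable_New, and the column names are placeholders for your own objects):

```sql
-- Build the replacement table under a new name in the same schema and copy the data.
SELECT ID,
       CONVERT(VARCHAR(MAX), BigText) AS BigText   -- nvarchar(max) -> varchar(max)
INTO   dbo.MyTable_New
FROM   dbo.MyTable;

-- SELECT...INTO creates a heap, so recreate keys / indexes / constraints here, then swap:
BEGIN TRANSACTION;
    EXEC sp_rename N'dbo.MyTable',     N'MyTable_Old';
    EXEC sp_rename N'dbo.MyTable_New', N'MyTable';
COMMIT TRANSACTION;

-- DROP TABLE dbo.MyTable_Old;   -- once everything checks out
```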
Since you are on Enterprise Edition, have you considered or even looked at enabling Data Compression? Not only would it have the effect you are looking for on the NCHAR / NVARCHAR fields, but it would also save space on fields of other data types as well.
There are two types of Compression: Row and Page. You should read up on them and run the stored procedure that estimates what your savings would be.
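That estimation procedure is presumably sys.sp_estimate_data_compression_savings; a minimal sketch of calling it, with placeholder schema and table names:

```sql
-- Estimate how much space ROW or PAGE compression would save for one table.
-- dbo.YourTable is a placeholder; run once per table / compression type of interest.
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'YourTable',
     @index_id         = NULL,    -- NULL = all indexes
     @partition_number = NULL,    -- NULL = all partitions
     @data_compression = N'PAGE'; -- or N'ROW' to compare
```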
Enabling compression can be done as an ONLINE operation, but might require a bit of disk space. If you do not have the available space for this, then you might consider a hybrid approach where you build a copy of the table as TableNEW, with the clustered index already created and Compression enabled. Then you should be able to slowly fill TableNEW and the data will compress as it goes in. Of course, you will want to use INSERT INTO...SELECT to do it in batches. And the full benefit of the compression might not be realized until after you drop the original table and do a full rebuild on the Clustered Index of TableNEW.
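A rough sketch of that hybrid approach, assuming a hypothetical dbo.MyTable keyed by an INT column; the names, batch size, and compression type are placeholders to adapt:

```sql
-- 1) Build TableNEW with the clustered index already created and compressed.
CREATE TABLE dbo.MyTableNEW
(
    ID      INT           NOT NULL,
    BigText NVARCHAR(MAX) NULL,
    CONSTRAINT PK_MyTableNEW PRIMARY KEY CLUSTERED (ID)
        WITH (DATA_COMPRESSION = PAGE)
);

-- 2) Copy the rows over in batches so the transaction log stays manageable.
DECLARE @BatchSize INT = 50000,
        @LastID    INT = 0,
        @Rows      INT = 1;

WHILE @Rows > 0
BEGIN
    INSERT INTO dbo.MyTableNEW (ID, BigText)
    SELECT TOP (@BatchSize) ID, BigText
    FROM   dbo.MyTable
    WHERE  ID > @LastID
    ORDER BY ID;

    SET @Rows = @@ROWCOUNT;
    SELECT @LastID = MAX(ID) FROM dbo.MyTableNEW;
END;

-- 3) Once verified: drop the original, rename TableNEW into place with sp_rename,
--    and rebuild the clustered index to get the full compression benefit.
-- ALTER INDEX PK_MyTableNEW ON dbo.MyTableNEW REBUILD WITH (DATA_COMPRESSION = PAGE);
```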
And keep in mind that there are scenarios where you might not save that much space, or saving the space isn't worth the increase in CPU activity. But, that all depends on a lot of factors so it is really something that should be tested on your system.
You could always take the approach of:
- Enable compression, either directly on the current table as an ONLINE operation (if there is enough space to support it), or by loading into a separate table that already has Compression enabled (see the sketch after this list).
- If you find that CPU usage actually increases beyond the benefit of the space savings, then you have the option of building the table again with regular VARCHAR fields and no Compression. Because compression was already enabled, you should definitely have room for this now.
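For the first option in that list, a minimal sketch of compressing the existing table in place (PK_MyTable and dbo.MyTable are placeholder names):

```sql
-- Rebuild the existing clustered index ONLINE with PAGE compression.
-- Note: ONLINE rebuilds of indexes containing (MAX) columns require SQL Server 2012+,
-- and the rebuild itself needs some working space while it runs.
ALTER INDEX PK_MyTable ON dbo.MyTable
    REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);

-- Or rebuild the whole table (heap or clustered index) in one go:
-- ALTER TABLE dbo.MyTable REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);
```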
But again, like anything we do, it should be tested. I have heard for years how horrible "XML parsing" is on CPU, and how bad Compression is supposed to be, but in practice those concerns are quite often overstated. The only way to know is to test on your system. (P.S. Just in case it is not clear, this being a text-only medium: these final statements are not attacking what @Kin said in his answer about needing to be cautious of CPU activity increase. He is correct, at least to a degree. I am just reminding everyone to put everything into the perspective of their current hardware, software, and system setup.)
CHAR(...) CHARACTER SET utf8 always takes 3 bytes per character -- CHAR(100) occupies 300 bytes (no length needed).
VARCHAR occupies 1-2 bytes for a length, plus only as many bytes as needed. So VARCHAR(100) with hello will occupy 7 (2+5) bytes in any character set.
Señor, in CHARACTER SET latin1, takes 5 bytes (plus length). In utf8, it takes 6 bytes (plus length). This is because ñ is the 1-byte hex F1 in latin1 or the 2-byte C3B1 in utf8.
Some Chinese characters and some Emoji need 4 bytes, so utf8mb4 is a better choice for them.
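A quick way to check those byte counts yourself (a sketch using LENGTH(), which counts bytes, versus CHAR_LENGTH(), which counts characters; the expected results assume the literal reaches the server intact):

```sql
-- LENGTH() returns bytes, CHAR_LENGTH() returns characters.
SELECT
    CHAR_LENGTH('Señor')                    AS chars,         -- 5
    LENGTH(CONVERT('Señor' USING latin1))   AS latin1_bytes,  -- 5
    LENGTH(CONVERT('Señor' USING utf8))     AS utf8_bytes,    -- 6 (ñ = 2 bytes)
    LENGTH(CONVERT('Señor' USING utf8mb4))  AS utf8mb4_bytes; -- 6
```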
There is very little reason to ever use CHAR instead of VARCHAR. CHAR is usually for consistent-length fields such as postal code, country code, md5, uuid, etc. And all of those may as well be CHARACTER SET ascii or possibly latin1, but not utf8.
Back to your question... With CHARACTER SET utf8 (or utf8mb4):
- English letters take 1 byte.
- Western European accented letters take 2 bytes.
- Greek, Hebrew, and other languages in that general area take 2 bytes.
- Japanese and Korean take 3 bytes.
- Most Chinese characters are 3 bytes; some are 4 and need utf8mb4.
- A longer list
"mysql will physically store characters using 1 byte for latin and 3 bytes for CJK" is not phrased correctly. Change 'latin' to 'English' to make it correct. ('Latin' tends to include various accented characters.)
Best Answer
Reduction in size is only possible if most of the characters are essentially [space], 0 - 9, A - Z, a - z, and some basic punctuation. Outside of that specific set of characters (in practical usage terms, standard ASCII values 32 - 126), you will be at best equal in size to NVARCHAR / UTF-16, or in many cases larger.

Be careful. UTF-8 is not a magic "fix everything" switch. All other things being equal, yes, reading less does improve performance. But here "all other things" are not equal. Even when storing only standard ASCII characters (meaning: all characters are 1 byte, hence requiring half the space as compared to storing in NVARCHAR), there is a slight performance penalty for using UTF-8. I believe the issue is due to UTF-8 being a variable-length encoding, which means that each byte must be interpreted as it is read in order to know if it is a complete character or if the next byte is a part of it. This means that all string operations need to start at the beginning and proceed byte-by-byte. On the other hand, NVARCHAR / UTF-16 is always 2 bytes (even Supplementary Characters are comprised of two 2-byte Code Points), so everything can be read in 2-byte chunks.

In my testing, even with only standard ASCII characters, storing the data as UTF-8 provided no savings of elapsed time, but was definitely worse for CPU time. And that was without Data Compression, so at least there was less disk space used. But when using compression, the space required for UTF-8 was only 1% - 1.5% smaller. So: effectively no space savings, yet higher CPU time for UTF-8.
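To see the size difference for yourself, a small sketch comparing a UTF-8 collated VARCHAR against NVARCHAR (Latin1_General_100_CI_AS_SC_UTF8 is one of the built-in _UTF8 collations added in SQL Server 2019; the temp table and values are made up for illustration):

```sql
-- Compare storage of the same strings as UTF-8 (VARCHAR) vs UTF-16 (NVARCHAR).
CREATE TABLE #EncodingTest
(
    Utf8Val  VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
    Utf16Val NVARCHAR(100)
);

INSERT INTO #EncodingTest (Utf8Val, Utf16Val)
VALUES (N'Hello', N'Hello'),   -- pure standard ASCII
       (N'Señor', N'Señor');   -- contains one 2-byte UTF-8 character

SELECT Utf16Val,
       DATALENGTH(Utf8Val)  AS Utf8Bytes,   -- 5, then 6
       DATALENGTH(Utf16Val) AS Utf16Bytes   -- 10, then 10
FROM   #EncodingTest;
```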
Things get more complicated when using NVARCHAR(MAX) since Unicode Compression does not work with that datatype, even if the value is small enough to be stored in row. But, if the data is small enough, it should still benefit from Row or Page Compression (in which case it actually becomes faster than UTF-8). However, off-row data cannot use any compression. Still, making the table a Clustered Columnstore Index does greatly reduce the size of NVARCHAR(MAX) (even if it is still slightly larger than UTF-8 when using a Clustered Columnstore Index).

Definitely. In fact, I don't really find a compelling reason to use it in most cases. The only scenario that truly benefits from UTF-8 is data that is mostly standard ASCII but occasionally contains characters that cannot be stored in a single 8-bit code page (i.e. in plain VARCHAR).

My testing shows that in nearly all cases, NVARCHAR was faster, especially when there was more data. In fact, 21k rows with an average of 5k characters per row required 165 MB for UTF-8 and 236 MB for NVARCHAR uncompressed. And yet the NVARCHAR was 2x faster in elapsed time, and at least 2x faster (sometimes more) in CPU time. Still, it did take up 71 MB more on disk.

Outside of that, I still wouldn't recommend using UTF-8, at least as of CTP 2, due to a variety of bugs that I have found in this feature.
For a detailed analysis of this new feature, including an explanation of the differences between UTF-16 and UTF-8, and a listing of those bugs, please see my post:
Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?