Regarding:
I tested doing an “add new column, set new=old, drop old, rename old to new” for both of the main two fields I want to convert from nvarchar(max) to varchar(max), and it took 81 minutes on our test server ... before running out of disk space ... it was too slow.
and
Another technique I plan to try is to create the new table def in a new schema, select into it from the old table, then move tables amongst schema and delete the old table.
Generally speaking, making a copy of the table with the ideal schema is my preferred approach. But if, right now, you only have maybe enough space to convert the two columns, are you sure you have enough space to make a copy of the entire table?
Also, the new table would just need to have a different name. It wouldn't need to be in a different schema.
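If you go that route, a rough sketch of the copy-and-rename swap (dbo.MyTable, dbo.MyTable_New, and the column names are placeholders for your own objects):

```sql
-- Build the replacement table under a new name in the same schema and copy the data.
SELECT ID,
       CONVERT(VARCHAR(MAX), BigText) AS BigText   -- nvarchar(max) -> varchar(max)
INTO   dbo.MyTable_New
FROM   dbo.MyTable;

-- SELECT...INTO creates a heap, so recreate keys / indexes / constraints here, then swap:
BEGIN TRANSACTION;
    EXEC sp_rename N'dbo.MyTable',     N'MyTable_Old';
    EXEC sp_rename N'dbo.MyTable_New', N'MyTable';
COMMIT TRANSACTION;

-- DROP TABLE dbo.MyTable_Old;   -- once everything checks out
```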
Since you are on Enterprise Edition, have you considered or even looked at enabling Data Compression? Not only would it have the effect you are looking for on the NCHAR / NVARCHAR fields, but it would also save space on fields of other data types as well.
There are two types of Compression: Row and Page. You should read up on them and run the stored procedure that estimates what your savings would be.
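That estimation procedure is presumably sys.sp_estimate_data_compression_savings; a minimal sketch of calling it, with placeholder schema and table names:

```sql
-- Estimate how much space ROW or PAGE compression would save for one table.
-- dbo.YourTable is a placeholder; run once per table / compression type of interest.
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'YourTable',
     @index_id         = NULL,    -- NULL = all indexes
     @partition_number = NULL,    -- NULL = all partitions
     @data_compression = N'PAGE'; -- or N'ROW' to compare
```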
Enabling compression can be done as an ONLINE operation, but might require a bit of disk space. If you do not have the available space for this, then you might consider a hybrid approach where you build a copy of the table as TableNEW, with the clustered index already created and Compression enabled. Then you should be able to slowly fill TableNEW and the data will compress as it goes in. Of course, you will want to use INSERT INTO...SELECT to do it in batches. And the full benefit of the compression might not be realized until after you drop the original table and do a full rebuild on the Clustered Index of TableNEW.
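A rough sketch of that hybrid approach, assuming a hypothetical dbo.MyTable keyed by an INT column; the names, batch size, and compression type are placeholders to adapt:

```sql
-- 1) Build TableNEW with the clustered index already created and compressed.
CREATE TABLE dbo.MyTableNEW
(
    ID      INT           NOT NULL,
    BigText NVARCHAR(MAX) NULL,
    CONSTRAINT PK_MyTableNEW PRIMARY KEY CLUSTERED (ID)
        WITH (DATA_COMPRESSION = PAGE)
);

-- 2) Copy the rows over in batches so the transaction log stays manageable.
DECLARE @BatchSize INT = 50000,
        @LastID    INT = 0,
        @Rows      INT = 1;

WHILE @Rows > 0
BEGIN
    INSERT INTO dbo.MyTableNEW (ID, BigText)
    SELECT TOP (@BatchSize) ID, BigText
    FROM   dbo.MyTable
    WHERE  ID > @LastID
    ORDER BY ID;

    SET @Rows = @@ROWCOUNT;
    SELECT @LastID = MAX(ID) FROM dbo.MyTableNEW;
END;

-- 3) Once verified: drop the original, rename TableNEW into place with sp_rename,
--    and rebuild the clustered index to get the full compression benefit.
-- ALTER INDEX PK_MyTableNEW ON dbo.MyTableNEW REBUILD WITH (DATA_COMPRESSION = PAGE);
```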
And keep in mind that there are scenarios where you might not save that much space, or saving the space isn't worth the increase in CPU activity. But, that all depends on a lot of factors so it is really something that should be tested on your system.
You could always take the approach of:
- Enable compression, either directly on the current table as an ONLINE operation (if there is enough space to support it), or by loading into a separate table that already has Compression enabled (see the sketch after this list).
- If you find that CPU usage actually increases beyond the benefit of the space savings, then you have the option of building the table again with regular VARCHAR fields and no Compression. Because compression was already enabled, you should definitely have room for this now.
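For the first option in that list, a minimal sketch of compressing the existing table in place (PK_MyTable and dbo.MyTable are placeholder names):

```sql
-- Rebuild the existing clustered index ONLINE with PAGE compression.
-- Note: ONLINE rebuilds of indexes containing (MAX) columns require SQL Server 2012+,
-- and the rebuild itself needs some working space while it runs.
ALTER INDEX PK_MyTable ON dbo.MyTable
    REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);

-- Or rebuild the whole table (heap or clustered index) in one go:
-- ALTER TABLE dbo.MyTable REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);
```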
But again, like anything we do, it should be tested. I have heard for years how horrible "XML parsing" is on CPU, and how bad Compression is supposed to be, but in practice those concerns are quite often overstated. The only way to know is to test on your system. (P.S. Just in case it is not clear, this being a text-only medium: these final statements are not attacking what @Kin said in his answer about needing to be cautious of CPU activity increase. He is correct, at least to a degree. I am just reminding everyone to put everything into the perspective of their current hardware, software, and system setup.)
CHAR(...) CHARACTER SET utf8 always takes 3 bytes per character -- CHAR(100) occupies 300 bytes (no length needed).
VARCHAR occupies 1-2 bytes for a length, plus only as many bytes as needed. So VARCHAR(100) with hello will occupy 7 (2+5) bytes in any character set.
Señor, in CHARACTER SET latin1, takes 5 bytes (plus length). In utf8, it takes 6 bytes (plus length). This is because ñ is the 1-byte hex F1 in latin1 or the 2-byte C3B1 in utf8.
Some Chinese characters and some Emoji need 4 bytes, so utf8mb4 is a better choice for them.
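A quick way to check those byte counts yourself (a sketch using LENGTH(), which counts bytes, versus CHAR_LENGTH(), which counts characters; the expected results assume the literal reaches the server intact):

```sql
-- LENGTH() returns bytes, CHAR_LENGTH() returns characters.
SELECT
    CHAR_LENGTH('Señor')                    AS chars,         -- 5
    LENGTH(CONVERT('Señor' USING latin1))   AS latin1_bytes,  -- 5
    LENGTH(CONVERT('Señor' USING utf8))     AS utf8_bytes,    -- 6 (ñ = 2 bytes)
    LENGTH(CONVERT('Señor' USING utf8mb4))  AS utf8mb4_bytes; -- 6
```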
There is very little reason to ever use CHAR instead of VARCHAR. CHAR is usually for consistent-length fields such as postal code, country code, md5, uuid, etc. And all of those may as well be CHARACTER SET ascii or possibly latin1, but not utf8.
Back to your question... With CHARACTER SET utf8 (or utf8mb4):
- English letters take 1 byte.
- Western European accented letters take 2 bytes.
- Greek, Hebrew, and other languages in that general area take 2 bytes.
- Japanese and Korean take 3 bytes.
- Most Chinese characters are 3 bytes; some are 4 and need utf8mb4.
- A longer list
"mysql will physically store characters using 1 byte for latin and 3 bytes for CJK" is not phrased correctly. Change 'latin' to 'English' to make it correct. ('Latin' tends to include various accented characters.)
Best Answer
Reduction in size is only possible if most of the characters are essentially [space], 0 - 9, A - Z, a - z, and some basic punctuation. Outside of that specific set of characters (in practical usage terms, standard ASCII values 32 - 126), you will be at best equal in size to NVARCHAR / UTF-16, or in many cases larger.

Be careful. UTF-8 is not a magic "fix everything" switch. All other things being equal, yes, reading less does improve performance. But here "all other things" are not equal. Even when storing only standard ASCII characters (meaning: all characters are 1 byte, hence requiring half the space as compared to storing in NVARCHAR), there is a slight performance penalty for using UTF-8. I believe the issue is due to UTF-8 being a variable-length encoding, which means that each byte must be interpreted as it is read in order to know if it is a complete character or if the next byte is a part of it. This means that all string operations need to start at the beginning and proceed byte-by-byte. On the other hand, NVARCHAR / UTF-16 is always 2 bytes (even Supplementary Characters are comprised of two 2-byte Code Points), so everything can be read in 2-byte chunks.

In my testing, even with only standard ASCII characters, storing the data as UTF-8 provided no savings of elapsed time, but was definitely worse for CPU time. And that was without Data Compression, so at least there was less disk space used. But when using compression, the space required for UTF-8 was only 1% - 1.5% smaller. So: effectively no space savings, yet higher CPU time for UTF-8.
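To see the size difference for yourself, a small sketch comparing a UTF-8 collated VARCHAR against NVARCHAR (Latin1_General_100_CI_AS_SC_UTF8 is one of the built-in _UTF8 collations added in SQL Server 2019; the temp table and values are made up for illustration):

```sql
-- Compare storage of the same strings as UTF-8 (VARCHAR) vs UTF-16 (NVARCHAR).
CREATE TABLE #EncodingTest
(
    Utf8Val  VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
    Utf16Val NVARCHAR(100)
);

INSERT INTO #EncodingTest (Utf8Val, Utf16Val)
VALUES (N'Hello', N'Hello'),   -- pure standard ASCII
       (N'Señor', N'Señor');   -- contains one 2-byte UTF-8 character

SELECT Utf16Val,
       DATALENGTH(Utf8Val)  AS Utf8Bytes,   -- 5, then 6
       DATALENGTH(Utf16Val) AS Utf16Bytes   -- 10, then 10
FROM   #EncodingTest;
```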
Things get more complicated when using NVARCHAR(MAX) since Unicode Compression does not work with that datatype, even if the value is small enough to be stored in row. But, if the data is small enough, it should still benefit from Row or Page Compression (in which case it actually becomes faster than UTF-8). However, off-row data cannot use any compression. Still, making the table a Clustered Columnstore Index does greatly reduce the size of NVARCHAR(MAX) (even if it is still slightly larger than UTF-8 when using a Clustered Columnstore Index).

Definitely. In fact, I don't really find a compelling reason to use it in most cases. The only scenario that truly benefits from UTF-8 is data that is mostly standard ASCII but occasionally contains characters that cannot be stored in a single 8-bit code page (i.e. in plain VARCHAR).

My testing shows that in nearly all cases, NVARCHAR was faster, especially when there was more data. In fact, 21k rows with an average of 5k characters per row required 165 MB for UTF-8 and 236 MB for NVARCHAR uncompressed. And yet the NVARCHAR was 2x faster in elapsed time, and at least 2x faster (sometimes more) in CPU time. Still, it did take up 71 MB more on disk.

Outside of that, I still wouldn't recommend using UTF-8, at least as of CTP 2, due to a variety of bugs that I have found in this feature.
For a detailed analysis of this new feature, including an explanation of the differences between UTF-16 and UTF-8, and a listing of those bugs, please see my post:
Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?