I'm in the process of migrating a number of databases from ISO8859-1 to UTF-8. To fit the old string in new tables I figured there are 3 options:
- Multiply the number of bytes with a number the corresponds to the distribution of multibyte characters in the domain, and add check constraints for length validation. I.e CHAR(3) might hypothetically become CHAR(5).
- Declare column as CHAR(3 CODEUNITS32). As far as I can tell this will occupy 12 bytes
- Declare column as GRAPHIC(3). As far as I can tell this will occupy 6 bytes.
I don't really consider 1 an option, that leaves us with 2 and 3. When would 2 be preferable over 3? I think I've seen some info where some characters may not map correctly from CHAR to GRAPHIC, but this seems to work fine for characters from ISO8859-1. Other than that, am I missing something or would GRAPHIC always be preferably over CHAR CODEUNIT32?
Best Answer
The
GRAPHIC
data type seems to be a remnant from the times when there was no widespread Unicode support. It was there to facilitate storage of double-byte characters. In essenceGRAPHIC(3)
is equivalent toCHAR(3 CODEUNITS16)
.Apart from the length semantics (1 byte vs. 2 byte characters)
(VAR)CHAR
and(VAR)GRAPHIC
are functionally equivalent. Now that you have an option of specifying the length units along with the data type, I really see no reason to use theGRAPHIC
types.Since a UTF-8 character can occupy anywhere between 1 to 4 bytes, using any fixed-number-of-bytes character assumption would cause either risk of truncation or wasted space. To avoid the latter you might consider using
VARCHAR
instead ofCHAR
if your application can tolerate that.