Db2 – VARGRAPHIC vs VARCHAR CODEUNITS32

db2db2-luw

I'm in the process of migrating a number of databases from ISO8859-1 to UTF-8. To fit the old string in new tables I figured there are 3 options:

  1. Multiply the number of bytes with a number the corresponds to the distribution of multibyte characters in the domain, and add check constraints for length validation. I.e CHAR(3) might hypothetically become CHAR(5).
  2. Declare column as CHAR(3 CODEUNITS32). As far as I can tell this will occupy 12 bytes
  3. Declare column as GRAPHIC(3). As far as I can tell this will occupy 6 bytes.

I don't really consider 1 an option, that leaves us with 2 and 3. When would 2 be preferable over 3? I think I've seen some info where some characters may not map correctly from CHAR to GRAPHIC, but this seems to work fine for characters from ISO8859-1. Other than that, am I missing something or would GRAPHIC always be preferably over CHAR CODEUNIT32?

Best Answer

The GRAPHIC data type seems to be a remnant from the times when there was no widespread Unicode support. It was there to facilitate storage of double-byte characters. In essence GRAPHIC(3) is equivalent to CHAR(3 CODEUNITS16).

Apart from the length semantics (1 byte vs. 2 byte characters) (VAR)CHAR and (VAR)GRAPHIC are functionally equivalent. Now that you have an option of specifying the length units along with the data type, I really see no reason to use the GRAPHIC types.

Since a UTF-8 character can occupy anywhere between 1 to 4 bytes, using any fixed-number-of-bytes character assumption would cause either risk of truncation or wasted space. To avoid the latter you might consider using VARCHAR instead of CHAR if your application can tolerate that.

Related Question