Why does Oracle use a different byte length than java for the supplementary unicode character chipmunk

javaoracleunicodeutf-8

I have java code trimming a UTF-8 string to the size of my Oracle (11.2.0.4.0) column which ends up throwing an error because java and Oracle see the string as different byte lengths. I've verified my NLS_CHARACTERSET parameter in Oracle is 'UTF8'.

I wrote a test which illustrates my issue below using the unicode chipmunk emoji (?️)

public void test() throws UnsupportedEncodingException, SQLException {
    String squirrel = "\uD83D\uDC3F\uFE0F";
    int squirrelByteLength = squirrel.getBytes("UTF-8").length; //this is 7
    Connection connection = dataSource.getConnection();

    connection.prepareStatement("drop table temp").execute();

    connection.prepareStatement("create table temp (foo varchar2(" + String.valueOf(squirrelByteLength) + "))").execute();

    PreparedStatement statement = connection.prepareStatement("insert into temp (foo) values (?)");
    statement.setString(1, squirrel);
    statement.executeUpdate();
}

This fails on the last line of the test with the following message:

ORA-12899: value too large for column
"MYSCHEMA"."TEMP"."FOO" (actual: 9, maximum: 7)

The setting of NLS_LENGTH_SEMANTICS is BYTE. Unfortunately, I cannot change this as it is a legacy system. I'm not interested in increasing the column size, just reliably being able to predict the Oracle size of a string.

Best Answer

What follows is my speculation.

Java Strings are internally represented using the UTF-16 encoding. When you getBytes("UTF-8") Java converts between the two encodings, and you probably use an up-to-date Java platform.

When you attempt to store a Java String in the database, Oracle also performs conversion between the Java native UTF-16 and the database character set as determined by NLS_CHARACTERSET.

The chipmunk character was approved as part of the Unicode standard in 2014 (according to the page you linked), while the latest release of Oracle 11g rel.2 was published in 2013.

One might assume that Oracle uses a different or outdated character conversion algorithm so the byte representation of ?️) on the server (9 bytes long) is different than what getBytes() returns on the client (7 bytes).

I guess to resolve this issue you could upgrade your Oracle server or use UTF-16 as the database character set.