I have java code trimming a UTF-8 string to the size of my Oracle (11.2.0.4.0) column which ends up throwing an error because java and Oracle see the string as different byte lengths. I've verified my NLS_CHARACTERSET
parameter in Oracle is 'UTF8'.
I wrote a test which illustrates my issue below using the unicode chipmunk emoji (?️)
public void test() throws UnsupportedEncodingException, SQLException {
String squirrel = "\uD83D\uDC3F\uFE0F";
int squirrelByteLength = squirrel.getBytes("UTF-8").length; //this is 7
Connection connection = dataSource.getConnection();
connection.prepareStatement("drop table temp").execute();
connection.prepareStatement("create table temp (foo varchar2(" + String.valueOf(squirrelByteLength) + "))").execute();
PreparedStatement statement = connection.prepareStatement("insert into temp (foo) values (?)");
statement.setString(1, squirrel);
statement.executeUpdate();
}
This fails on the last line of the test with the following message:
ORA-12899: value too large for column
"MYSCHEMA"."TEMP"."FOO" (actual: 9, maximum: 7)
The setting of NLS_LENGTH_SEMANTICS
is BYTE
. Unfortunately, I cannot change this as it is a legacy system. I'm not interested in increasing the column size, just reliably being able to predict the Oracle size of a string.
Best Answer
What follows is my speculation.
Java
String
s are internally represented using the UTF-16 encoding. When yougetBytes("UTF-8")
Java converts between the two encodings, and you probably use an up-to-date Java platform.When you attempt to store a Java
String
in the database, Oracle also performs conversion between the Java native UTF-16 and the database character set as determined byNLS_CHARACTERSET
.The chipmunk character was approved as part of the Unicode standard in 2014 (according to the page you linked), while the latest release of Oracle 11g rel.2 was published in 2013.
One might assume that Oracle uses a different or outdated character conversion algorithm so the byte representation of ?️) on the server (9 bytes long) is different than what
getBytes()
returns on the client (7 bytes).I guess to resolve this issue you could upgrade your Oracle server or use UTF-16 as the database character set.