Why does Oracle use a different byte length than Java for the supplementary Unicode character chipmunk?

java oracle unicode utf-8

I have Java code that trims a UTF-8 string to the size of my Oracle (11.2.0.4.0) column. It ends up throwing an error because Java and Oracle see the string as having different byte lengths. I've verified that my NLS_CHARACTERSET parameter in Oracle is 'UTF8'.

I wrote a test, shown below, which illustrates my issue using the Unicode chipmunk emoji (🐿️)

public void test() throws UnsupportedEncodingException, SQLException {
    // U+1F43F (chipmunk, a surrogate pair in UTF-16) followed by U+FE0F (variation selector-16)
    String squirrel = "\uD83D\uDC3F\uFE0F";
    int squirrelByteLength = squirrel.getBytes("UTF-8").length; // 7 bytes: 4 for U+1F43F, 3 for U+FE0F
    Connection connection = dataSource.getConnection();

    connection.prepareStatement("drop table temp").execute();

    // Size the column to exactly the UTF-8 byte length that Java reports
    connection.prepareStatement("create table temp (foo varchar2(" + squirrelByteLength + "))").execute();

    PreparedStatement statement = connection.prepareStatement("insert into temp (foo) values (?)");
    statement.setString(1, squirrel);
    statement.executeUpdate(); // throws ORA-12899
}

This fails on the last line of the test with the following message:

ORA-12899: value too large for column
"MYSCHEMA"."TEMP"."FOO" (actual: 9, maximum: 7)

The setting of NLS_LENGTH_SEMANTICS is BYTE. Unfortunately, I cannot change this as it is a legacy system. I'm not interested in increasing the column size, just reliably being able to predict the Oracle size of a string.

Best Answer

What follows is my speculation.

Java Strings are represented internally in UTF-16. When you call getBytes("UTF-8"), Java converts from UTF-16 to UTF-8; you are probably running an up-to-date Java platform, so that conversion follows the current standard.

When you attempt to store a Java String in the database, Oracle also performs conversion between the Java native UTF-16 and the database character set as determined by NLS_CHARACTERSET.
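To make the difference between the two client-side representations concrete, here is a minimal sketch that measures the string from the question in UTF-16 code units, Unicode code points, and UTF-8 bytes. The class and method names are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of UTF-16 code units, which is what String.length() reports
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once)
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in standard UTF-8
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String squirrel = "\uD83D\uDC3F\uFE0F"; // U+1F43F followed by U+FE0F
        System.out.println(utf16Units(squirrel)); // 3
        System.out.println(codePoints(squirrel)); // 2
        System.out.println(utf8Bytes(squirrel));  // 7
    }
}
```

The chipmunk itself (U+1F43F) lies outside the Basic Multilingual Plane, so it occupies two UTF-16 code units but only one 4-byte sequence in UTF-8.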

The chipmunk character was approved as part of the Unicode standard in 2014 (according to the page you linked), while the latest release of Oracle 11g rel.2 was published in 2013.

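One might assume that Oracle uses a different or outdated character conversion algorithm, so the byte representation of 🐿️ on the server (9 bytes) differs from what getBytes() returns on the client (7 bytes). The numbers fit Oracle's legacy UTF8 character set, which stores a supplementary character as two 3-byte surrogate halves (CESU-8) rather than one 4-byte sequence: 3 + 3 bytes for U+1F43F plus 3 bytes for U+FE0F gives 9, whereas standard UTF-8 gives 4 + 3 = 7.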

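I guess to resolve this issue you could switch the database character set to AL32UTF8, Oracle's standard-conforming UTF-8 character set, in which a supplementary character takes 4 bytes. UTF-16 is not an option here: Oracle offers AL16UTF16 only as the national character set (for NCHAR and NVARCHAR2 columns), not as the database character set.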
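If the character set cannot be changed, the stored length can probably be estimated on the client instead. The sketch below assumes that the server encodes each UTF-16 code unit separately, so every surrogate half costs 3 bytes (CESU-8 style); that assumption is consistent with the 9 bytes reported in the question but should be verified against your own database. The class and method names are hypothetical.

```java
class OracleUtf8Length {
    // Bytes needed for one UTF-16 code unit when each unit is encoded separately.
    // Assumption: surrogate halves are written as 3-byte sequences (CESU-8 style).
    static int unitLength(char c) {
        if (c <= 0x7F) return 1;
        if (c <= 0x7FF) return 2;
        return 3; // rest of the BMP, including each surrogate half
    }

    // Estimated byte length of the whole string under that assumption
    static int cesu8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            bytes += unitLength(s.charAt(i));
        }
        return bytes;
    }

    // Longest prefix that fits in maxBytes without splitting a surrogate pair
    static String truncateToBytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int units = Character.charCount(s.codePointAt(i)); // 1, or 2 for a pair
            int cost = 0;
            for (int k = 0; k < units; k++) {
                cost += unitLength(s.charAt(i + k));
            }
            if (bytes + cost > maxBytes) break;
            bytes += cost;
            i += units;
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String squirrel = "\uD83D\uDC3F\uFE0F";
        System.out.println(cesu8Length(squirrel));               // 9
        System.out.println(truncateToBytes(squirrel, 7).length()); // 2 (chipmunk only)
    }
}
```

Trimming with truncateToBytes instead of the UTF-8 byte count would keep the chipmunk (6 bytes under this assumption) and drop the variation selector, so the value fits a varchar2(7) column.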

Related Solutions

Why does an Oracle database require both the SYSTEM and the SYSAUX tablespaces

Tradition, and the ability to degrade gracefully on failure.

As the database has evolved over 11 versions, important views and tables that maintain and support the database have been programmed into the two tablespaces. Furthermore, they represent a fantastic logical separation between the "NEVER EVER TOUCH" and the "third parties can put stuff here."

From the docs:

The SYSAUX tablespace was installed as an auxiliary tablespace to the SYSTEM tablespace when you created your database. Some database components that formerly created and used separate tablespaces now occupy the SYSAUX tablespace.

If the SYSAUX tablespace becomes unavailable, core database functionality will remain operational. The database features that use the SYSAUX tablespace could fail, or function with limited capability.

Thus, while it's critical that the SYSTEM tablespace never fail, one doesn't need to be nearly as paranoid about SYSAUX. This lets the designers keep the size of SYSTEM down while still allowing third-party auxiliary features to be part of the "core" database.

Does Oracle RAC require that the Ethernet aliases be in a different IP range than the base addresses

From what I've seen RAC requires that you create Ethernet alias interfaces (eth0:0) on each node, with addresses in a different range.

That's not correct. If eth0 is the interface for the public network, then the VIPs and SCAN VIPs must be on the same subnet as eth0, which is exactly what you are asking for. Also, it's not you but Oracle Grid Infrastructure that creates and manages these virtual IPs; you just specify the name and address for them.

Furthermore, the private virtual IPs (called HAIP) on the private network (let's say eth1) will be created in the 169.254.0.0/16 subnet, no matter what.

So here is a sample configuration for a 3 node cluster:

Public addresses:

node1, eth0: 192.168.1.1/24
node2, eth0: 192.168.1.2/24
node3, eth0: 192.168.1.3/24

VIPs, with names defined in DNS or hosts file:

node1-vip, eth0:0: 192.168.1.11/24
node2-vip, eth0:0: 192.168.1.12/24
node3-vip, eth0:0: 192.168.1.13/24

Private addresses (note the different subnet); names need not be defined:

node1-priv, eth1: 192.168.2.1/24
node2-priv, eth1: 192.168.2.2/24
node3-priv, eth1: 192.168.2.3/24

SCAN VIPs (11.2.0.1 or above), with one single name defined in DNS:

192.168.1.21/24
192.168.1.22/24
192.168.1.23/24

These can run on any node as eth0:X. The number of SCAN VIPs is independent of the number of nodes.

HAIP (11.2.0.2 or above), no names, the IPs are dynamic, so I have just put some random IPs here to complete the example:

node1, eth1:1: 169.254.190.49/16
node2, eth1:1: 169.254.203.17/16
node3, eth1:1: 169.254.243.66/16
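The subnet relationships in the sample configuration above can be checked mechanically. The following is a minimal sketch, not an Oracle tool: it compares two IPv4 addresses in CIDR notation and reports whether they share a network. The class and method names are invented for illustration, and no input validation is done.

```java
class SubnetCheck {
    // Converts a dotted-quad IPv4 address to its 32-bit integer form
    static int toInt(String ip) {
        int value = 0;
        for (String part : ip.split("\\.")) {
            value = (value << 8) | Integer.parseInt(part);
        }
        return value;
    }

    // Prefix length of an address written as "a.b.c.d/prefix"
    static int prefix(String cidr) {
        return Integer.parseInt(cidr.split("/")[1]);
    }

    // Network address of an address written as "a.b.c.d/prefix"
    static int network(String cidr) {
        int p = prefix(cidr);
        int mask = (p == 0) ? 0 : -1 << (32 - p);
        return toInt(cidr.split("/")[0]) & mask;
    }

    // True when both addresses have the same prefix length and network address
    static boolean sameSubnet(String a, String b) {
        return prefix(a) == prefix(b) && network(a) == network(b);
    }

    public static void main(String[] args) {
        // Public address and VIP of node1: expected to share a subnet
        System.out.println(sameSubnet("192.168.1.1/24", "192.168.1.11/24")); // true
        // Public and private address of node1: expected to differ
        System.out.println(sameSubnet("192.168.1.1/24", "192.168.2.1/24"));  // false
    }
}
```

With the sample values, every VIP and SCAN VIP lands in 192.168.1.0/24 together with the public addresses, while the private addresses and the HAIP addresses each form their own subnet.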
