It looks like the problem was caused by a data flow task in which I had two data flows running in parallel. Both flows read from the same source tables but wrote to different destinations. When I broke the task up into two tasks running one after the other, the truncation problem seems to have gone away.
The reason for the truncation is quite simple. Some characters (accented ones, for example) in the WE8ISO8859P1 character set are stored as a single byte, but in AL32UTF8 they end up being stored as multiple bytes. As a result of conversion, a 4000 character string may end up actually requiring more than 4000 bytes.
By way of example, this query shows you that byte 0x80 (the Euro symbol in the closely related WE8MSWIN1252 character set) becomes 2 bytes in AL32UTF8:
SQL> select length(convert(chr(128),'AL32UTF8','WE8ISO8859P1')) from dual
2 /
LENGTH(CONVERT(CHR(128),'AL32UTF8','WE8ISO8859P1'))
---------------------------------------------------
2
SQL>
To list all characters that will be affected by the change, you can use the following query:
with n as (
  select level as c
  from dual
  connect by level <= 255
)
select c as "WE8ISO8859P1 value",
       '"' || chr(c) || '"' as character,
       length(convert(chr(c), 'AL32UTF8', 'WE8ISO8859P1')) as "New length"
from n
where length(convert(chr(c), 'AL32UTF8', 'WE8ISO8859P1')) > 1;
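If you want to estimate which existing rows are actually at risk, a rough check along the same lines is possible. The table and column names below (my_table, id, my_col) are purely illustrative; also note that CONVERT itself returns a VARCHAR2, so treat this as an indicator rather than an exact measurement for values sitting right at the limit:
select id,
       lengthb(my_col) as current_bytes,
       lengthb(convert(my_col, 'AL32UTF8', 'WE8ISO8859P1')) as converted_bytes
from my_table
where lengthb(convert(my_col, 'AL32UTF8', 'WE8ISO8859P1')) > 4000;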
Unfortunately the maximum length of a VARCHAR2 column in Oracle is 4000 bytes (a CHAR column is capped at 2000 bytes). The only option available to you if character set conversion pushes you over this limit is to convert the columns to the CLOB datatype, but be warned - CLOBs are difficult to deal with and can present challenges.
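For what it's worth, the usual way to move such a column to a CLOB is the add/copy/drop/rename pattern. A minimal sketch, again with hypothetical names (my_table, my_col):
alter table my_table add (my_col_clob clob);
update my_table set my_col_clob = my_col;
commit;
alter table my_table drop column my_col;
alter table my_table rename column my_col_clob to my_col;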
Best Answer
Assuming that your database character set is AL32UTF8 (if there is a difference between the two, this is almost certainly your character set) and that all the data consists of characters that exist in the US7ASCII character set ("English" might constitute more than that depending on the definitions we're using), there won't be any noticeable performance difference between a varchar2(n CHAR) and a varchar2(n BYTE). Based on your database character set, Oracle has to do a tiny bit more work when reading a string to figure out that every byte represents a single character rather than being part of a multi-byte character, but checking the length of a string in either characters or bytes is a pretty trivial operation. Unless you're doing something like running a TPC benchmark, where you're already doing a crazy number of things you'd never do in reality to squeeze out tiny fractions of performance, it's not something worth worrying about.
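To make the distinction concrete, here is a small sketch (hypothetical table names) of the two declarations:
create table t_char_semantics (name varchar2(100 CHAR)); -- up to 100 characters, however many bytes they need
create table t_byte_semantics (name varchar2(100 BYTE)); -- up to 100 bytes, which may be fewer than 100 characters in AL32UTF8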
If you were going to go crazy with a TPC-benchmark level of effort to get every last thousandth of a percent, it would probably be slightly more efficient to use a database character set of US7ASCII and declare everything using byte semantics, given that you only have to store English. Practically, though, if your application is so efficient that this sort of thing matters, you've already solved every other performance problem, so you should be focusing much more heavily on supportability and maintainability, which argues for an AL32UTF8 character set and character length semantics.
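If you do standardise on character semantics, you can also set the default so that a plain varchar2(n) is interpreted in characters rather than bytes; a minimal sketch:
alter session set nls_length_semantics = CHAR;
create table t_default_semantics (name varchar2(100)); -- 100 characters under the session setting above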