Mysql – Special Characters in MySQL using UTF8 Unicode Collation and LOAD DATA INFILE

MySQLunicodeutf-8

I have several csv files in am loading into MySQl using Java. In the Description field I have several Special Characters that are causing the load to fail. I am using LOAD DATA INFILE. This is nested in a for each loop which parses an array of filenames / tables and runs through each combination until it is finished with all the files.

Here is my jdbc connection string where I am passing a definitive collation param/value for UTF8 collation

 static String  url = "jdbc:mysql://localhost:3306/stage?verifyServerCertificate=false&characterEncoding=UTF8";

Here is my LOAD DATA string.

final String sql2 = ("LOAD DATA INFILE" + filetoEat  +
       "INTO TABLE staging." +tableName +
       "CHARACTER SET UTF8
       FIELDS TERMINATED BY','  ENCLOSED BY '\"\'
       LINES TERMINATED BY '\n'  IGNORE 1 LINES");

Create Table Statement

CREATE TABLE `zmi064_02` (
  `Material` varchar(11) CHARACTER SET utf8 DEFAULT NULL,
  `Material Description` varchar(60) CHARACTER SET utf8 DEFAULT NULL,
  `MatType` text CHARACTER SET utf8,
  `MatGrp` text CHARACTER SET utf8,
  `ClassType` text CHARACTER SET utf8,
  `Class type Text` text CHARACTER SET utf8,
  `Class` text CHARACTER SET utf8,
  `Class Text` text CHARACTER SET utf8,
  `CharactName` text CHARACTER SET utf8,
  `CharactValue` text CHARACTER SET utf8,
  `Plant` varchar(6) CHARACTER SET utf8 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Show variables

character_set_client        utf8
character_set_connection    utf8
character_set_database      utf8
character_set_filesystem    binary
character_set_results       utf8
character_set_server        utf8
character_set_system        utf8
collation_connection        utf8_general_ci
collation_database          utf8_unicode_ci
collation_server            utf8_general_ci

The code is working fine until it comes across a special character like a degree symbol or micro symbol µ within a Material Description . At that point it throws an Exception

Invalid utf8 character string: 'LUG'

The string LUG is followed by a µ symbol. The DB is set to utf8 – utf8_unicode_ci and the column in question is a VARCHAR(60) that holds material descriptions. I have tried using ESCAPED BY '\' but I can't seem to get it working correctly. I have also tried CHARACTER SET UTF8. I have also tried different collation ie, utf8_general_ci to no avail.

Any insight is greatly appreciated

Best Answer

I figured that I would answer this now that I found the solution. Because I am using Java to run the LOAD DATA INFILE via JDBC the JDBC driver seems to be checking the collation at the DB and not the actual table being loaded as it is parsing the file. So you can't have the DB set to UTF-8 and have a Latin collated table as you would be able to do with an INSERT statement. I had tried to set the Table collation as Latin and even had the field in question Latin, but until I changed the entire DB to Latin it was failing. The CSV files are large so checking every char in question is not easy, but I was catching the Exceptions in Java and was able to determine that the error was generated by the JDBC driver and was complaining that "Character at line xx is not a UTF-8 character" Running in Debug allowed me to see more details.

I then concluded it must not be looking at the Latin collated table it would be filling, but was looking at the DB which was still set to UTF-8. Changing the DB to Latin was all I needed to do.

I want to thank @Rick James for his help, I hope this will help others in the future.

Pat