MySQL – Import geonames allCountries.txt Using LOAD INFILE – ERROR 1300 (HY000)

MySQLmysql-5.7

I'm trying to import the allCountries.txt file from http://download.geonames.org into MySQL 5.7 using LOAD INFILE
but getting the following error:

ERROR 1300 (HY000): Invalid utf8 character string:
''Afikanisitani,'Apekanikana,A Phu Han (Afghanistan),A Phú Hãn '

procedure:
– download & unzip the allCountries.zip from http://download.geonames.org/export/dump/
– create a table in mysql / phpmyadmin / MySQLWorkbench for the import:

CREATE TABLE `geoname` (
 `geonameid` int,
 `name` varchar(200),
 `asciiname` varchar(200),
 `alternatenames` varchar(4000),
 `latitude` float,
 `longitude` float,
 `feature_class` char(1),
 `feature_code` varchar(10),
 `country_code` varchar(2),
 `cc2` varchar(60),
 `admin1_code` varchar(20),
 `admin2_code` varchar(80),
 `admin3_code` varchar(20),
 `admin4_code` varchar(20),
 `population` int,
 `elevation` int,
 `dem` int,
 `timezone` varchar(40),
 `modification_date` date,
 `id` int NOT NULL AUTO_INCREMENT PRIMARY KEY
) CHARACTER SET utf8;

– run the following command to import the .txt file:
LOAD DATA LOCAL INFILE '/location/of/my/allCountries.txt' INTO TABLE geoname;

This procedure worked fine in MySQL 5.6 – but fails with the aforementioned error in MySQL 5.7.13

Best Answer

(Caution: TMI coming...)

There are definitely utf8mb4 characters in that data.

The LOAD DATA needed CHARACTER SET utf8mb4.

The alternatenames column has 46 cases of 4-byte UTF-8 codes. Here's one:

mysql> SELECT * FROM allcountries WHERE geonameid = 6962506\G
*************************** 1. row ***************************
        geonameid: 6962506
             name: Shidanbian
        asciiname: Shidanbian
   alternatenames: Shidanbian,shi dan pian,石旦?
         latitude: 31.08403
        longitude: 107.27187
    feature_class: P
     feature_code: PPL
     country_code: CN
              cc2:
      admin1_code: 32
      admin2_code:
      admin3_code:
      admin4_code:
       population: 0
        elevation: 0
              dem: 378
         timezone: Asia/Shanghai
modification_date: 2014-10-02

I found such rows with WHERE HEX(alternatenames) REGEXP '^(..)*F0'

The last character is HEX F0A18E9A, which needs utf8mb4.

Another example is id 281184, where ???????????? is the "Gothic" writing for Jerusalem.

FYI, here is the table and the LOAD DATA I used:

CREATE TABLE AllCountries (
    geonameid MEDIUMINT UNSIGNED NOT NULL  COMMENT "integer id of record in geonames database",
    name     VARCHAR(200) CHARACTER SET utf8mb4  COMMENT "name of geographical point (utf8) varchar(200)",
    asciiname  VARCHAR(200)    COMMENT "name of geographical point in plain ascii characters, varchar(200)",
    alternatenames TEXT CHARACTER SET utf8mb4   COMMENT "alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)",
    latitude   DECIMAL(7,5)    COMMENT "latitude in decimal degrees (wgs84)",
    longitude  DECIMAL(8,5)    COMMENT "longitude in decimal degrees (wgs84)",
    feature_class CHAR(1)      COMMENT "see http://www.geonames.org/export/codes.html, char(1)",
    feature_code  VARCHAR(10)  COMMENT "see http://www.geonames.org/export/codes.html, varchar(10)",
    country_code  CHAR(2)      COMMENT "ISO-3166 2-letter country code, 2 characters",
    cc2          VARCHAR(200)  COMMENT "alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters",
    admin1_code  VARCHAR(20)   COMMENT "fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)",
    admin2_code  VARCHAR(80)   COMMENT "code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) ",
    admin3_code  VARCHAR(20)   COMMENT "code for third level administrative division, varchar(20)",
    admin4_code  VARCHAR(20)   COMMENT "code for fourth level administrative division, varchar(20)",
    population  DECIMAL(11,0)  COMMENT "bigint (8 byte int) ",
    elevation  SMALLINT UNSIGNED  COMMENT "in meters, integer",
    dem        SMALLINT UNSIGNED  COMMENT "digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.",
    timezone   VARCHAR(40)     COMMENT "the timezone id (see file timeZone.txt) varchar(40)",
    modification_date DATE     COMMENT "date of last modification in yyyy-MM-dd format",
    PRIMARY KEY(geonameid)
) ENGINE=InnoDB DEFAULT CHARACTER SET ascii COMMENT 'http://download.geonames.org/export/dump/';

LOAD DATA INFILE 'C:/htdocs/misc/allcountries.csv'
    INTO TABLE allcountries
    CHARACTER SET utf8mb4
    FIELDS TERMINATED BY "\t"
    LINES TERMINATED BY '\n';

Note the careful picking of ascii vs utf8mb4 for charset.

A SHOW WARNINGS indicated that that is still not 'correct':

Incorrect integer value: '' for column 'elevation' at row ...

That can be solved via use of an @variable, and a SET in the LOAD DATA.

You were stopped at least by Afghanistan in Gothic: ??????????. That's almost readable!

(That was fun.)