MySQL Import – Speeding Up 50+GB Fixed Width Text File Import

Tags: import, MyISAM, MySQL

I'm using MySQL 5.7.17 to import a ~50GB fixed width text file with over 1000 columns, but after 24 hours it still hasn't finished running. I'm using MyISAM to get around the InnoDB column limit, and also because I thought it would be faster than InnoDB for importing (source here: https://dbahire.com/testing-the-fastest-way-to-import-a-table-into-mysql-and-some-interesting-5-7-performance-results/).

Everything I'm importing is from a fixed width text file, and I'm adding all of the text file data to CHAR columns in a ~1200 column table like this:

CREATE TABLE bigFile.BigFileTest (
  `var1` CHAR(3),
  `var2` CHAR(5),
  `var3` CHAR(12),
  ...
) ENGINE=MyISAM CHARSET = latin1;

LOAD DATA LOCAL INFILE
'C:\\path\\to\\file'
INTO TABLE bigFile.BigFileTest
(@row)
SET `var1` = TRIM(SUBSTR(@row,1,3)),
  `var2` = TRIM(SUBSTR(@row,4,5)),
  `var3` = TRIM(SUBSTR(@row,9,12))
  ...
;

I'd definitely prefer to not have a ~1200 column table, of course, but it's a requirement for the data I'm working with and I'm unable to change it.

Is there any way to speed up the import process? I don't have much experience with database administration, so even basic changes that are obvious to you may be things I've missed. I'm using MySQL Workbench with the defaults for everything except that I specified ENGINE=MyISAM (as shown above).

Thanks for your help!

Best Answer

MyISAM is probably as fast as InnoDB for this action.

I suggest you minimize the work during the LOAD and plan on doing TRIM, etc, after the data is loaded. That is, load the data into a staging table; then perform various conversions as you copy the data to the real table; then drop the staging table.
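
A rough sketch of that approach, reusing the table from the question (the staging table name BigFileStaging and its raw_line column are just placeholders, and only the first three columns are shown):

CREATE TABLE bigFile.BigFileStaging (
  raw_line TEXT
) ENGINE=MyISAM CHARSET = latin1;

-- Plain load: one line of the file per row, no per-column work yet.
LOAD DATA LOCAL INFILE
'C:\\path\\to\\file'
INTO TABLE bigFile.BigFileStaging
(raw_line);

-- One pass of splitting/trimming while copying into the real table.
INSERT INTO bigFile.BigFileTest (`var1`, `var2`, `var3`)
  SELECT TRIM(SUBSTR(raw_line, 1, 3)),
         TRIM(SUBSTR(raw_line, 4, 5)),
         TRIM(SUBSTR(raw_line, 9, 12))
    FROM bigFile.BigFileStaging;

DROP TABLE bigFile.BigFileStaging;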

This 'staging' can also give you an easy way of splitting up the columns into tables. Or consolidating the columns into a single JSON string, or something else.
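
If the JSON route is interesting, MySQL 5.7's JSON_OBJECT() can build that string during the same copy step. A minimal sketch, assuming the BigFileJson table with a doc column exists (it is hypothetical) and using the BigFileStaging table sketched above:

INSERT INTO bigFile.BigFileJson (doc)
  SELECT JSON_OBJECT(
           'var1', TRIM(SUBSTR(raw_line, 1, 3)),
           'var2', TRIM(SUBSTR(raw_line, 4, 5)),
           'var3', TRIM(SUBSTR(raw_line, 9, 12))
         )
    FROM bigFile.BigFileStaging;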

If you provide more details on the data, the transformations, and the 1000+ columns, I can be more specific.

More

Oh! Sounds like punched cards. Or mag tape. I would:

  • Read the text into a single column declared as TEXT (see the sketch after this list).
  • Probably I would also have another column: id INT UNSIGNED AUTO_INCREMENT NOT NULL.
  • Meanwhile, I would generate the SUBSTR(t, 12, 23) clauses based on how the line needs to be split up.
  • And decide how to build the schema. Are these medical records? Animal characteristics? Quiz results? What? Census data?
  • Are some of the "columns" repeated a lot? For example, for census data, I might expect the first two columns to be "state" and have about 51 distinct 2-character values, then the next ~20 columns to be the county, with about 3000 distinct values, etc. If so, that definitely suggests a structure (states contain counties contain cities ...)
  • After deciding on the several tables to spread the data across, build SELECT queries to pull the data out and put them into the other tables.
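
For the first two bullets, the staging table (called Raw to match the query below, with the single text column named t as in the SUBSTR clauses above) could be as simple as this; the engine choice and the file path are assumptions:

CREATE TABLE Raw (
    id INT UNSIGNED AUTO_INCREMENT NOT NULL,
    t TEXT NOT NULL,
    PRIMARY KEY(id)
) ENGINE=InnoDB;

-- Each line of the file lands in t; id is assigned automatically.
LOAD DATA LOCAL INFILE
'C:\\path\\to\\file'
INTO TABLE Raw
(t);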

Continuing with the example:

CREATE TABLE Cities (
    id MEDIUMINT UNSIGNED AUTO_INCREMENT NOT NULL,
    state CHAR(2) NOT NULL,
    county VARCHAR(20) NOT NULL,
    city VARCHAR(20) NOT NULL,
    PRIMARY KEY(id),
    INDEX(state, city)
) ENGINE=InnoDB  CHARACTER SET ascii;

INSERT INTO Cities (state, county, city)
    SELECT DISTINCT
            SUBSTR(t, 1, 2) AS state,
            SUBSTR(t, 3, 20) AS county,
            SUBSTR(t, 23, 20) AS city
        FROM Raw;

After that, it gets trickier, because you want to use cities.id in whatever table comes next, rather than repeating the state+county+city again and again.
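
For example, a hypothetical Households table (its name, columns, and the offset 43 are made up purely for illustration) could be filled by joining the raw lines back to Cities to pick up the id:

CREATE TABLE Households (
    city_id MEDIUMINT UNSIGNED NOT NULL,
    household_size CHAR(2) NOT NULL
    -- ... whatever else is carved out of the remaining positions
) ENGINE=InnoDB;

INSERT INTO Households (city_id, household_size)
    SELECT c.id,
           SUBSTR(r.t, 43, 2)
        FROM Raw AS r
        JOIN Cities AS c
          ON c.state  = SUBSTR(r.t, 1, 2)
         AND c.county = SUBSTR(r.t, 3, 20)
         AND c.city   = SUBSTR(r.t, 23, 20);

An index on Cities(state, county, city) would make that lookup cheap.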

Get the idea? Want to share the type of data you have?

Sure, it will take hours, probably even days. But the structured result will be better than a nearly useless regurgitation of the raw data.

Also, think about how you want to eventually query (SELECT) the data.
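
For instance, with the split-out tables, a question like "households per state" becomes a small join instead of a scan over ~1200 CHAR columns (again using the hypothetical Households table from above):

SELECT c.state, COUNT(*) AS households
    FROM Households AS h
    JOIN Cities AS c  ON c.id = h.city_id
    GROUP BY c.state;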