MySQL – Read massive flat file into relational database

Tags: mysql, python

I have a massive flat text file with ~4M records that I would like to read into MySQL using Ruby/Rails, Python, or any other method.

The flat file is roughly this format:
business_name,address,city,state,zip,employee_name

If there are multiple employees at the business, there will be one row for each employee with the business name, address, etc. repeated.

I'd like to read this into two tables, with the employees table holding a foreign key, like so:

businesses (id int, name varchar, address varchar, city varchar, etc.)
employees (id int, name varchar, business_id int)

How do I go about efficiently reading this file in so that the employees/businesses are created and referenced as needed?

Note: it is NOT safe to assume that the file is sorted.

I've tried loading the file into MySQL first and then using SQL to get a unique list of business_name/address/city/state combinations to insert into the businesses table, but my server runs out of memory, and that's with only a quarter of the records loaded.

Best Answer

Try using LOAD DATA INFILE into a MyISAM table with few or no indexes. MyISAM keeps memory consumption low and gets all the data in the fastest possible way. I have never run out of memory doing this with larger data sets than yours, on fairly old servers with a lot less memory than is common today.
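A minimal sketch of that import step. The table name `staging`, the column sizes, and the file path are illustrative, and the delimiters assume a plain comma-separated file with Unix line endings:

```sql
-- Unindexed MyISAM staging table matching the flat file's columns.
-- Column sizes are guesses; adjust to your data.
CREATE TABLE staging (
  business_name VARCHAR(120),
  address       VARCHAR(120),
  city          VARCHAR(60),
  state         CHAR(2),
  zip           VARCHAR(10),
  employee_name VARCHAR(120)
) ENGINE=MyISAM;

-- Bulk-load the file. The path must be readable by the server and
-- permitted by the secure_file_priv setting.
LOAD DATA INFILE '/tmp/import.csv'
INTO TABLE staging
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```

If the file has quoted fields or a header row, add `ENCLOSED BY '"'` and `IGNORE 1 LINES` accordingly.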

Once the data is imported, use ALTER TABLE to add the required indexes; in essence, everything needed to uniquely identify a business (name, address, state, etc.). This may take some time.
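For example, assuming the `staging` table above, a single composite index over the identifying columns might look like this (mind MySQL's per-engine index key-length limits if your columns are wide):

```sql
-- Index the columns that together identify a business.
ALTER TABLE staging
  ADD INDEX idx_business (business_name, address, city, state, zip);
```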

Create another MyISAM table with an auto-increment primary key plus all the columns from the first table that are required to uniquely identify a business (i.e. the ones that were indexed, but nothing related to employees). Add a UNIQUE key covering all columns in the table except the primary key.
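Sketched with the same illustrative column names (the table name `business_keys` is a placeholder):

```sql
-- One row per distinct business; the UNIQUE key does the deduplication.
CREATE TABLE business_keys (
  id            INT NOT NULL AUTO_INCREMENT,
  business_name VARCHAR(120),
  address       VARCHAR(120),
  city          VARCHAR(60),
  state         CHAR(2),
  zip           VARCHAR(10),
  PRIMARY KEY (id),
  UNIQUE KEY uq_business (business_name, address, city, state, zip)
) ENGINE=MyISAM;
```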

Use REPLACE INTO to fill the second table with data from the first; in the process you will generate unique numeric keys for all the businesses. Then add the key column to the first table and UPDATE it using data from the second, which can be dropped afterwards.
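Continuing the sketch (note that `INSERT IGNORE` would work equally well here, since we only care that each distinct business ends up with one id):

```sql
-- Duplicate businesses collapse onto the UNIQUE key, leaving one
-- auto-generated id per distinct business.
REPLACE INTO business_keys (business_name, address, city, state, zip)
SELECT business_name, address, city, state, zip FROM staging;

-- Copy the generated ids back into the staging rows.
ALTER TABLE staging ADD COLUMN business_id INT;

UPDATE staging s
JOIN business_keys k
  ON  s.business_name = k.business_name
  AND s.address       = k.address
  AND s.city          = k.city
  AND s.state         = k.state
  AND s.zip           = k.zip
SET s.business_id = k.id;
```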

From there you have a table in which every business has a unique, numeric, indexed key, and you can easily split the data into the (InnoDB) tables you require.
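The final split could then be as simple as the following, assuming `businesses` and `employees` have been created as InnoDB tables with the schema from the question (the extra address columns on `businesses` are an assumption, since the question elides them with "etc."):

```sql
-- Businesses come straight from the dedup table, keeping its ids.
INSERT INTO businesses (id, name, address, city, state, zip)
SELECT id, business_name, address, city, state, zip
FROM business_keys;

-- Employees come from the staging rows, now carrying a business_id.
INSERT INTO employees (name, business_id)
SELECT employee_name, business_id
FROM staging;
```

Afterwards the `staging` and `business_keys` tables can be dropped.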