MySQL – Quickly updating a table from a large TSV, retaining a timestamp


Consider this TSV file with millions of rows, where uid is a unique identifier that can take any value.

  1  audi    a3      blue
 42  ferrari f60     red
 23  porsche 911     green

It should be used to initially create a MySQL table, and then update it periodically:

imported | uid | type    | model | color
---------+-----+---------+-------+------
monday   |   1 | audi    | a3    | blue
monday   |  42 | ferrari | f40   | red
monday   |  23 | porsche | 911   | green

Now, observe the following requirements:

  1. If uid exists, the row should be updated, except imported, which should be retained.
  2. If uid doesn't exist, new values should be inserted, and imported set to current timestamp.

Thus, the table might look like this later:

imported | uid | type    | model  | color
---------+-----+---------+--------+-------
monday   |   1 | audi    | a3     | blue
monday   |  42 | ferrari | f60    | white
monday   |  23 | porsche | 911    | purple
thursday |   6 | bugatti | veyron | yellow

In other words, the procedure should update the table, but keep track of when a row was first imported. Doing this programmatically is easy, but it would be slow.

So how can I do this using only mysqlimport, LOAD DATA INFILE, or a similarly fast method? Neither --delete nor --ignore seems to offer a way to achieve the desired effect.

Best Answer

It's 4 statements:

CREATE TABLE Staging ...;  (and a DROP or TRUNCATE)
LOAD DATA ... INTO Staging ...;
INSERT INTO Real (...)  SELECT ... FROM Staging LEFT JOIN Real ...;
UPDATE Real JOIN Staging ON ... SET ...;

The LOAD is very fast for getting the TSV/CSV into a table. Staging exists purely for this periodic 'update' process.
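For concreteness, the first two statements might look roughly like this, assuming the four TSV columns map to uid, type, model, color (column types and the file path here are illustrative, not prescribed by the answer):

CREATE TABLE Staging (
    uid   INT NOT NULL,
    type  VARCHAR(50),
    model VARCHAR(50),
    color VARCHAR(50),
    PRIMARY KEY (uid)
) ENGINE=InnoDB;
-- or, if Staging already exists from a previous run:
-- TRUNCATE TABLE Staging;

LOAD DATA LOCAL INFILE '/path/to/cars.tsv'   -- hypothetical path
    INTO TABLE Staging
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    (uid, type, model, color);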

The SELECT finds the 'new' rows and feeds them to the INSERT, which adds them to the real table.
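A sketch of that INSERT, assuming the permanent table is named Real with columns imported, uid, type, model, color; the LEFT JOIN keeps only the uids not yet present, and imported is set to the current timestamp per requirement 2:

INSERT INTO Real (imported, uid, type, model, color)
SELECT NOW(), s.uid, s.type, s.model, s.color
FROM Staging AS s
LEFT JOIN Real AS r  ON r.uid = s.uid
WHERE r.uid IS NULL;            -- only brand-new uids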

The UPDATE refreshes all the rows for which you have new data, while leaving imported untouched.
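And a sketch of that UPDATE with the same assumed names; imported is deliberately absent from the SET list, so existing rows keep their original timestamp per requirement 1:

UPDATE Real AS r
JOIN Staging AS s  ON s.uid = r.uid
SET r.type  = s.type,
    r.model = s.model,
    r.color = s.color;          -- imported is not touched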

I deliberately avoided INSERT ... ON DUPLICATE KEY UPDATE because, if uid is AUTO_INCREMENT, it would 'burn' ids, thereby threatening to eventually overflow the range of uid.

For more discussion, see my blog on ingestion. That section of the blog details how to augment a Normalization table with one INSERT and one UPDATE; your case is very similar. The rest of the blog focuses on 'continual' ingestion, unlike your 'periodic' ingestion.

If your incremental load is bigger than a million rows, there are other aspects we need to discuss.