MySQL – Quickly updating a table from a large TSV, retaining a timestamp


Consider this TSV file with millions of rows, where uid is a unique identifier that can take any value.

  1  audi    a3      blue
 42  ferrari f60     red
 23  porsche 911     green

It should be used to initially create a MySQL table, and then update it periodically:

imported | uid | type    | model | color
---------+-----+---------+-------+------
monday   |   1 | audi    | a3    | blue
monday   |  42 | ferrari | f40   | red
monday   |  23 | porsche | 911   | green

Now, observe the following requirements:

  1. If uid exists, the row should be updated, except imported, which should be retained.
  2. If uid doesn't exist, new values should be inserted, and imported set to current timestamp.

Thus, the table might look like this later:

imported | uid | type    | model  | color
---------+-----+---------+--------+-------
monday   |   1 | audi    | a3     | blue
monday   |  42 | ferrari | f60    | white
monday   |  23 | porsche | 911    | purple
thursday |   6 | bugatti | veyron | yellow

In other words, the procedure should update the table, but keep track of when a row was first imported. Doing this programmatically is easy, but it would be slow.

So how can I do this using only mysqlimport, LOAD DATA INFILE, or a similarly fast method? Neither --delete nor --ignore seems to offer a way to achieve the desired effect.

Best Answer

It's 4 statements:

CREATE TABLE Staging ...;  (and a DROP or TRUNCATE)
LOAD DATA ... INTO Staging ...;
INSERT INTO Real (...)  SELECT ... FROM Staging LEFT JOIN Real ...;
UPDATE Real JOIN Staging ON ... SET ...;

The LOAD is very fast for getting the TSV/CSV into a table. Staging exists purely for this periodic 'update' process.
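For concreteness, the first two statements might look roughly like this, assuming the four TSV columns map to uid, type, model, color (column types and the file path here are illustrative, not prescribed by the answer):

CREATE TABLE Staging (
    uid   INT NOT NULL,
    type  VARCHAR(50),
    model VARCHAR(50),
    color VARCHAR(50),
    PRIMARY KEY (uid)
) ENGINE=InnoDB;
-- or, if Staging already exists from a previous run:
-- TRUNCATE TABLE Staging;

LOAD DATA LOCAL INFILE '/path/to/cars.tsv'   -- hypothetical path
    INTO TABLE Staging
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    (uid, type, model, color);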

The SELECT finds the 'new' rows and feeds them to the INSERT, which adds them to the real table.
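A sketch of that INSERT, assuming the permanent table is named Real with columns imported, uid, type, model, color; the LEFT JOIN keeps only the uids not yet present, and imported is set to the current timestamp per requirement 2:

INSERT INTO Real (imported, uid, type, model, color)
SELECT NOW(), s.uid, s.type, s.model, s.color
FROM Staging AS s
LEFT JOIN Real AS r  ON r.uid = s.uid
WHERE r.uid IS NULL;            -- only brand-new uids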

The UPDATE refreshes all the rows for which you have new data, while leaving imported untouched.
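And a sketch of that UPDATE with the same assumed names; imported is deliberately absent from the SET list, so existing rows keep their original timestamp per requirement 1:

UPDATE Real AS r
JOIN Staging AS s  ON s.uid = r.uid
SET r.type  = s.type,
    r.model = s.model,
    r.color = s.color;          -- imported is not touched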

I deliberately avoided INSERT ... ON DUPLICATE KEY UPDATE because, if uid is AUTO_INCREMENT, it would 'burn' ids, thereby threatening to eventually overflow the range of uid.

For more discussion, see my blog on ingestion. That section of the blog details how to augment a Normalization table with one INSERT and one UPDATE; your case is very similar. The rest of the blog focuses on 'continual' ingestion, unlike your 'periodic' ingestion.

If your incremental load is bigger than a million rows, there are other aspects we need to discuss.