MySQL – How to improve this table, and subsequently the performance of the queries I execute on it

Tags: index, MySQL, normalization

I have inherited this table of POS transaction level data. In my opinion it is terrible. There are 3.5 million rows, which span ~4 years, with a distribution of anywhere between 1 result and 100k results for things like sku, brand, nationality and destination.

There is no primary key, and I am struggling to figure out how I can implement one.

Here is the CREATE TABLE:

CREATE TABLE `transaction_data` (
  `txn_id` varchar(32) DEFAULT NULL,
  `nationality_id` int(3) DEFAULT NULL,
  `transaction date` int(8) DEFAULT NULL,
  `YrQtr` varchar(6) DEFAULT NULL,
  `Yr` int(4) DEFAULT NULL,
  `mth` int(2) DEFAULT NULL,
  `dy` int(2) DEFAULT NULL,
  `destination` varchar(80) DEFAULT NULL,
  `sku` int(6) DEFAULT NULL,
  `sales` decimal(8,2) DEFAULT NULL,
  `units` int(8) DEFAULT NULL,
  KEY `txnid` (`txn_id`),
  KEY `cov` (`sku`,`nationality_id`,`Yr`,`YrQtr`),
  KEY `period` (`Yr`,`YrQtr`,`mth`),
  KEY `natperiod` (`nationality_id`,`Yr`,`YrQtr`),
  KEY `desperiod` (`destination`,`Yr`,`YrQtr`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Here are 4 example rows of data:

txn_id is a non-unique concatenation of several field values, which represents the basket in which the sku was purchased.

Within a transaction, the sku is not necessarily unique either: POS reductions are recorded as another row with the same txn_id and sku, but with negative values in sales and units.

INSERT INTO `2_sar`.`transaction_data`
  (`txn_id`, `nationality_id`, `transaction date`, `YrQtr`, `Yr`,
   `mth`, `dy`, `destination`, `sku`, `sales`, `units`)
VALUES
  ('1_20120113_EZY 1832_3_08:58:00', '143', '20120113', '2012Q1', '2012', '1', '13', 'United Kingdom - Manchstr UK', '30312', '52.00', '1'),
  ('1_20120121_OR  367_1_11:21:00', '143', '20120121', '2012Q1', '2012', '1', '21', 'Neth Antills - Curacao', '710028', '3.95', '1'),
  ('1_20120203_PC  672_2_13:07:00', '200', '20120203', '2012Q1', '2012', '2', '3', 'Turkey - Istanbul', '1840', '27.95', '1'),
  ('1_20120203_PC  672_2_13:07:00', '200', '20120203', '2012Q1', '2012', '2', '3', 'Turkey - Istanbul', '2156', '12.95', '1');
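For illustration (a hypothetical query against the schema above, not part of the original post), summing the rows that share a txn_id gives the net value of each basket, with the negative reduction rows cancelling against their matching sale rows:

```sql
-- Net sales and units per basket; reduction rows carry negative
-- sales/units, so SUM() nets them against the original sale rows.
SELECT txn_id,
       SUM(sales) AS net_sales,
       SUM(units) AS net_units
FROM transaction_data
GROUP BY txn_id;
```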

Here is a query which I run on the data.

This query finds the top 10 brands purchased together with <Brand> (not including <Brand> itself), ordered by total number of transactions.

This query took 11.344 seconds to run directly from MySQL Workbench. The server is a t2.micro AWS RDS instance, and I usually run the queries from a separate EC2 web server, which I believe has roughly a 1 GHz CPU and 1 GB of RAM.

The wildcards in this query stand in for <select> filters on the PHP page from which the query is run. They may or may not be selected.

SELECT  `products`.`brand`, COUNT(DISTINCT transaction_data.txn_id) AS numOrders
    FROM  `products`
    INNER JOIN  `transaction_data` ON `transaction_data`.sku=`products`.sku
    INNER JOIN  `nationalities` ON `transaction_data`.`nationality_id` =
                                      `nationalities`.`nationality_id`
    INNER JOIN  
      ( SELECT  DISTINCT `transaction_data`.`txn_id`
            FROM  `transaction_data`
            INNER JOIN  `products` USING (sku)
            INNER JOIN  `nationalities` USING (nationality_id)
            WHERE  brand = <Brand>
      ) AS tmp_txns ON tmp_txns.txn_id=`transaction_data`.txn_id
    WHERE  brand <> <Brand>
      AND  nationality LIKE '%'
      AND  Yr = '2014'
      AND  YrQtr LIKE '%'
    GROUP BY  brand
    ORDER BY  numOrders DESC, brand ASC
    LIMIT  10

And the explain:

+----+-------------+------------------+--------+----------------------------+----------+---------+---------------------------------------+---------+----------------------------------------------+
| id | select_type |      table       |  type  |       possible_keys        |   key    | key_len |                  ref                  |  rows   |                    Extra                     |
+----+-------------+------------------+--------+----------------------------+----------+---------+---------------------------------------+---------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>       | ALL    |                            |          |         |                                       | 1295294 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY     | transaction_data | ref    | txnid,cov,period,natperiod | txnid    |      99 | tmp_txns.txn_id                       |       1 | Using where                                  |
|  1 | PRIMARY     | nationalities    | eq_ref | PRIMARY,natv               | PRIMARY  |       4 | 2_sar.transaction_data.nationality_id |       1 | Using where                                  |
|  1 | PRIMARY     | products         | ref    | skubrand                   | skubrand |       5 | 2_sar.transaction_data.sku            |       1 | Using where; Using index                     |
|  2 | DERIVED     | products         | index  | skubrand                   | skubrand |     128 |                                       |    2669 | Using where; Using index; Using temporary    |
|  2 | DERIVED     | transaction_data | ref    | txnid,cov,natperiod        | cov      |       5 | 2_sar.products.sku                    |     647 | Using index condition                        |
|  2 | DERIVED     | nationalities    | eq_ref | PRIMARY,natv               | PRIMARY  |       4 | 2_sar.transaction_data.nationality_id |       1 | Using index                                  |
+----+-------------+------------------+--------+----------------------------+----------+---------+---------------------------------------+---------+----------------------------------------------+

Here are create table statements for the reference tables:

CREATE TABLE `nationalities` (
  `nationality_id` int(3) NOT NULL,
  `nationality` varchar(14) DEFAULT NULL,
  PRIMARY KEY (`nationality_id`),
  KEY `natv` (`nationality_id`,`nationality`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `products` (
  `sku` int(6) DEFAULT NULL,
  `category` varchar(40) DEFAULT NULL,
  `subcategory` varchar(40) DEFAULT NULL,
  `brand` varchar(40) DEFAULT NULL,
  `product name` varchar(50) DEFAULT NULL,
  `pack size` decimal(8,2) DEFAULT NULL,
  `rk_group` varchar(20) DEFAULT NULL,
  `rk_category` varchar(30) DEFAULT NULL,
  KEY `skubrand` (`sku`,`brand`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

My Thoughts So Far

The txn_id is almost guaranteed to cause poor performance, not only because it is non-unique, but also because it is a string rather than an int. But I need some sort of non-unique identifier for the basket.
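One way to get a compact, non-unique basket identifier (a sketch only; the baskets table and basket_id column are invented names, and the migration is untested) is to assign each distinct txn_id string an integer surrogate once, and join on that instead:

```sql
-- Lookup table mapping each distinct txn_id string to an INT surrogate.
CREATE TABLE baskets (
  basket_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  txn_id    VARCHAR(32) NOT NULL,
  PRIMARY KEY (basket_id),
  UNIQUE KEY (txn_id)
) ENGINE=InnoDB;

INSERT INTO baskets (txn_id)
SELECT DISTINCT txn_id
FROM transaction_data;

-- transaction_data could then store the 4-byte basket_id instead of the
-- 32-character string, shrinking every index that includes it.
```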

I think my indexes are okay. All my queries use them, and some indexes are necessarily query-specific, since a single composite index cannot serve the different WHERE clauses across queries.

EDIT: This question has been elaborated on HERE

Best Answer

  • InnoDB tables really need an explicit PRIMARY KEY. If you don't have a 'natural' key, add an AUTO_INCREMENT.
  • INT(3) and INT(99) mean the same thing -- a 4-byte binary number. Consider changing to TINYINT UNSIGNED, which takes 1 byte and stores 0..255. Or SMALLINT, etc. (Smaller -> more cacheable -> less I/O -> faster.)
  • The subquery needs an index starting with brand. Either flip (sku, brand) or add INDEX(brand). That will be the biggest performance boost.
  • In nationalities, there is no need for the secondary key natv: InnoDB clusters the data with the PRIMARY KEY, which effectively makes the primary key equivalent to that index. If you ever need to look up a nationality to find its id, add UNIQUE(nationality) instead.
  • Every column but one is NULLable -- are they really optional? Make them NOT NULL where appropriate.
  • Splitting a date into its component parts wastes space and usually provides no performance benefit. The parts can be extracted during the queries.
  • I suspect transaction_date is a date; if so, use the DATE datatype so that functions are easy to apply to it.
  • After the above date changes, AND Yr = '2014' becomes AND transaction_date >= '2014-01-01' AND transaction_date < '2014-01-01' + INTERVAL 1 YEAR. Do something similar for quarters, etc.: + INTERVAL 3 MONTH.
  • YrQtr LIKE '%' has terrible performance characteristics: a bare LIKE wildcard provides no selectivity for the optimizer to narrow an index range, and it silently excludes rows where YrQtr is NULL. (The above changes obviate the need for fixing it.)
  • When something is not relevant (eg AND nationality LIKE '%') have the app not include it in the query.
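Taken together, the points above might translate into something like the following (a sketch only, untested; the new column sizes are guesses from the sample data, and the int-to-DATE conversion should be verified on a copy of the table first):

```sql
-- transaction_data: add a surrogate PK, shrink the ints, convert the
-- date, and drop the redundant derived date parts.
ALTER TABLE transaction_data
  ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST,
  CHANGE COLUMN `transaction date` transaction_date DATE NOT NULL,
  MODIFY nationality_id SMALLINT UNSIGNED NOT NULL,
  MODIFY sku MEDIUMINT UNSIGNED NOT NULL,
  DROP COLUMN `YrQtr`,
  DROP COLUMN `Yr`,
  DROP COLUMN `mth`,
  DROP COLUMN `dy`;

-- products: give the subquery an index that starts with brand.
ALTER TABLE products
  ADD INDEX brand_sku (`brand`, `sku`);

-- nationalities: the clustered PK makes natv redundant; a UNIQUE on the
-- name covers lookups in the other direction.
ALTER TABLE nationalities
  DROP KEY natv,
  ADD UNIQUE KEY (`nationality`);

-- The old  AND Yr = '2014'  filter then becomes a range on the DATE:
--   WHERE transaction_date >= '2014-01-01'
--     AND transaction_date <  '2014-01-01' + INTERVAL 1 YEAR
```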

Is the transaction data "write-once"? That is, once written, do you ever delete or update rows, or insert more "old" rows? If not, then building and maintaining summary tables will further speed up the queries, perhaps 10-fold.
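As a sketch of what such a summary table might look like (the table name, column names, and daily granularity are my own guesses, and it assumes transaction_date has been converted to a DATE column as suggested above):

```sql
-- One row per (day, nationality, sku); append one day's worth of rows
-- after each load instead of re-scanning 3.5M rows per query.
CREATE TABLE daily_sales_summary (
  transaction_date DATE NOT NULL,
  nationality_id   SMALLINT UNSIGNED NOT NULL,
  sku              MEDIUMINT UNSIGNED NOT NULL,
  total_sales      DECIMAL(10,2) NOT NULL,
  total_units      INT NOT NULL,
  basket_count     INT UNSIGNED NOT NULL,
  PRIMARY KEY (transaction_date, nationality_id, sku)
) ENGINE=InnoDB;

-- Nightly refresh for the most recently loaded day.
INSERT INTO daily_sales_summary
SELECT transaction_date, nationality_id, sku,
       SUM(sales), SUM(units), COUNT(DISTINCT txn_id)
FROM transaction_data
WHERE transaction_date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY transaction_date, nationality_id, sku;
```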

Make those changes, then come back for more advice. (Start a new Question; this one would become too messy to follow with new schema, queries, etc.)