I have inherited this table of POS transaction level data. In my opinion it is terrible. There are 3.5 million rows, which span ~4 years, with a distribution of anywhere between 1 result and 100k results for things like sku
, brand
, nationality
and destination
.
There is no primary key, and I am struggling to figure how I can implement one.
here is the CREATE TABLE
CREATE TABLE `transaction_data` (
`txn_id` varchar(32) DEFAULT NULL,
`nationality_id` int(3) DEFAULT NULL,
`transaction date` int(8) DEFAULT NULL,
`YrQtr` varchar(6) DEFAULT NULL,
`Yr` int(4) DEFAULT NULL,
`mth` int(2) DEFAULT NULL,
`dy` int(2) DEFAULT NULL,
`destination` varchar(80) DEFAULT NULL,
`sku` int(6) DEFAULT NULL,
`sales` decimal(8,2) DEFAULT NULL,
`units` int(8) DEFAULT NULL,
KEY `txnid` (`txn_id`),
KEY `cov` (`sku`,`nationality_id`,`Yr`,`YrQtr`),
KEY `period` (`Yr`,`YrQtr`,`mth`),
KEY `natperiod` (`nationality_id`,`Yr`,`YrQtr`),
KEY `desperiod` (`destination`,`Yr`,`YrQtr`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Here are 4 example rows of data:
txn_id – is a non unique weird concatenate of field values, which represent which basket the sku was purchased in.
In some transactions, the sku is not unique, as POS reductions are recorded by another row with the same txn_id
and sku
but with negative values in sales and units to reflect those.
INSERT INTO `2_sar`.`transaction_data`
(`txn_id`,
`nationality_id`,
`transaction date`,
`YrQtr`,
`Yr`,
`mth`,
`dy`,
`destination`,
`sku`,
`sales`,
`units`)
VALUES
('1_20120113_EZY 1832_3_08:58:00', '143', '20120113', '2012Q1', '2012', '1', '13', 'United Kingdom - Manchstr UK', '30312', '52.00', '1'
), ('1_20120121_OR 367_1_11:21:00', '143', '20120121', '2012Q1', '2012', '1', '21', 'Neth Antills - Curacao', '710028', '3.95', '1'
), ('1_20120203_PC 672_2_13:07:00', '200', '20120203', '2012Q1', '2012', '2', '3', 'Turkey - Istanbul', '1840', '27.95', '1'
), ('1_20120203_PC 672_2_13:07:00', '200', '20120203', '2012Q1', '2012', '2', '3', 'Turkey - Istanbul', '2156', '12.95', '1'
);
;
Here is a query which I run on the data.
This query finds the top 10 brands that are purchased with <Brand>
(not including <brand>
) ordered by total number of transactions.
This query took 11.344 seconds to run directly from mysql workbench. The server is a t2.micro AWS RDS Instance, and I usually run the queries from a separate EC2 web server. which I believe is (1GHz, 1GB).
The wildcards in this query represent <select>
filters on the PHP page which the query is run from. They may or may not be selected.
SELECT `products`.`brand`, COUNT(DISTINCT transaction_data.txn_id) AS numOrders
FROM `products`
INNER JOIN `transaction_data` ON `transaction_data`.sku=`products`.sku
INNER JOIN `nationalities` ON `transaction_data`.`nationality_id` =
`nationalities`.`nationality_id`
INNER JOIN
( SELECT DISTINCT `transaction_data`.`txn_id`
FROM `transaction_data`
INNER JOIN `products` USING (sku)
INNER JOIN `nationalities` USING (nationality_id)
WHERE brand = <Brand>
) AS tmp_txns ON tmp_txns.txn_id=`transaction_data`.txn_id
WHERE brand <> <Brand>
AND nationality LIKE '%'
AND Yr = '2014'
AND YrQtr LIKE '%'
GROUP BY brand
ORDER BY numOrders DESC, brand ASC
LIMIT 10
And the explain:
+----+-------------+------------------+--------+----------------------------+----------+---------+---------------------------------------+---------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+------------------+--------+----------------------------+----------+---------+---------------------------------------+---------+----------------------------------------------+ | 1 | PRIMARY | <derived2> | ALL | | | | | 1295294 | Using where; Using temporary; Using filesort | | 1 | PRIMARY | transaction_data | ref | txnid,cov,period,natperiod | txnid | 99 | tmp_txns.txn_id | 1 | Using where | | 1 | PRIMARY | nationalities | eq_ref | PRIMARY,natv | PRIMARY | 4 | 2_sar.transaction_data.nationality_id | 1 | Using where | | 1 | PRIMARY | products | ref | skubrand | skubrand | 5 | 2_sar.transaction_data.sku | 1 | Using where; Using index | | 2 | DERIVED | products | index | skubrand | skubrand | 128 | | 2669 | Using where; Using index; Using temporary | | 2 | DERIVED | transaction_data | ref | txnid,cov,natperiod | cov | 5 | 2_sar.products.sku | 647 | Using index condition | | 2 | DERIVED | nationalities | eq_ref | PRIMARY,natv | PRIMARY | 4 | 2_sar.transaction_data.nationality_id | 1 | Using index | +----+-------------+------------------+--------+----------------------------+----------+---------+---------------------------------------+---------+----------------------------------------------+
Here are create table statements for the reference tables:
CREATE TABLE `nationalities` (
`nationality_id` int(3) NOT NULL,
`nationality` varchar(14) DEFAULT NULL,
PRIMARY KEY (`nationality_id`),
KEY `natv` (`nationality_id`,`nationality`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `products` (
`sku` int(6) DEFAULT NULL,
`category` varchar(40) DEFAULT NULL,
`subcategory` varchar(40) DEFAULT NULL,
`brand` varchar(40) DEFAULT NULL,
`product name` varchar(50) DEFAULT NULL,
`pack size` decimal(8,2) DEFAULT NULL,
`rk_group` varchar(20) DEFAULT NULL,
`rk_category` varchar(30) DEFAULT NULL,
KEY `skubrand` (`sku`,`brand`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
My Thoughts So Far
The txn_id
is almost guaranteed to bring poor performance, not only due to the fact it's non-unique, but also because it's a string rather than an int. But I need some sort of non-unique identifier for the basket.
I think my indexes
are okay. All my queries use them, and obviously some indexes
are specific to certain queries where a single composite index
won't serve the different where
clauses in queries.
EDIT: This question has been elaborated on HERE
Best Answer
PRIMARY KEY
. If you don't have a 'natural' key, add anAUTO_INCREMENT
.INT(3)
andINT(99)
mean the same thing -- a 4-byte binary number. Consider changing toTINYINT UNSIGNED
, which takes 1 byte and stores 0..255. Or SMALLINT, etc. (Smaller -> more cacheable -> less I/O -> faster.)brand
. Either flip(sku, brand)
or addINDEX(brand)
. That will be the biggest performance boost.nationalities
, there is no need for the secondary key. The data and thePRIMARY KEY
are clustered together, effectively making it like that key. If you ever need to look up bynationality
to find the id, you need to addUNIQUE(nationality)
NULLable
-- are they really optional? Make themNOT NULL
where appropriate.transaction_date
is a date; if so, use theDATE
datatype so that functions are easy to apply to it.AND Yr = '2014'
becomesAND transaction_date >= '2014-01-01' AND transaction_date < '2014-01-01' + INTERVAL 1 YEAR
. Do something similar for quarters, etc.:+ INTERVAL 3 MONTH
.YrQtr LIKE '%'
has terrible performance characteristics since YrQtr isINT
. (The above changes obviate the need for fixing it.)AND nationality LIKE '%'
) have the app not include it in the query.Is the transaction data "write-once"? That is, once written, do you delete or update the rows or insert more "old" rows? If not, the building and maintaining Summary tables will further speed up the queries, perhaps 10-fold.
Make those changes, then come back for more advice. (Start a new Question; this one would become too messy to follow with new schema, queries, etc.)