I have a table structure similar to the following:
CREATE TABLE `ProductCatalog` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`SerialNumber` varchar(20) DEFAULT NULL,
`BasePrice` decimal(10,2) DEFAULT NULL,
`BatchCode` tinyint(3) unsigned DEFAULT NULL,
`Type` varchar(5) DEFAULT NULL,
`ItemCode` varchar(5) DEFAULT NULL,
`ArrivalDate` datetime DEFAULT NULL,
`InsertTimestamp` int(10) unsigned NOT NULL,
`BrandID` tinyint(3) unsigned DEFAULT NULL,
`Model` varchar(10) NOT NULL DEFAULT 'RX209',
`Description` text,
PRIMARY KEY (`id`),
KEY `idx_ic_sn_ad` (`ItemCode`, `SerialNumber`, `ArrivalDate`),
KEY `idx_sn_ad` (`SerialNumber`, `ArrivalDate`, `ItemCode`, `BasePrice`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
There are about 600 million rows in this table and it is growing fast. About 0.5 million records are inserted every day, sometimes more, sometimes less, so there is massive write activity during a certain period of the day (6 am to 8 pm).
Following are the queries I run against this table
SELECT *
FROM ProductCatalog
WHERE SerialNumber='1234567890'
AND ItemCode!="ABCD"
ORDER BY id DESC LIMIT 1;
SELECT BasePrice
FROM ProductCatalog
WHERE SerialNumber='123456789'
AND ItemCode!="ABCD"
and ItemCode!="PQRS"
AND ItemCode!="MNOP"
ORDER BY id DESC LIMIT 1;
SELECT *
FROM ProductCatalog
WHERE SerialNumber='123456789' AND ItemCode='ABCD'
AND (ArrivalDate>='2019-01-01 00:00:00' AND ArrivalDate<='2020-12-31 23:59:59')
ORDER BY ArrivalDate ASC;
SELECT BatchCode
FROM ProductCatalog
WHERE SerialNumber='123456789'
AND ItemCode!="ABCD"
and ItemCode!="PQRS"
AND ItemCode!="MNOP"
ORDER BY id DESC LIMIT 1;
The above queries and table indexes are optimized as suggested in my previous question here: MySQL table proper indexes for performance optimization
PROBLEM
Since the ProductCatalog table is getting bigger, and the third query mentioned above is used frequently, I am thinking of partitioning this table. For this I am considering range partitioning on the ArrivalDate column. The following is the statement I came up with.
ALTER TABLE ProductCatalog PARTITION BY RANGE (TO_DAYS(ArrivalDate)) (
PARTITION p11 VALUES LESS THAN (TO_DAYS('2011-01-01')),
PARTITION p12 VALUES LESS THAN (TO_DAYS('2012-01-01')),
PARTITION p13 VALUES LESS THAN (TO_DAYS('2013-01-01')),
PARTITION p14 VALUES LESS THAN (TO_DAYS('2014-01-01')),
PARTITION p15 VALUES LESS THAN (TO_DAYS('2015-01-01')),
PARTITION p16 VALUES LESS THAN (TO_DAYS('2016-01-01')),
PARTITION p17 VALUES LESS THAN (TO_DAYS('2017-01-01')),
PARTITION p18 VALUES LESS THAN (TO_DAYS('2018-01-01')),
PARTITION p19 VALUES LESS THAN (TO_DAYS('2019-01-01')),
PARTITION p20 VALUES LESS THAN (TO_DAYS('2020-01-01')),
PARTITION p21 VALUES LESS THAN (TO_DAYS('2021-01-01')),
PARTITION p22 VALUES LESS THAN MAXVALUE);
Later every year I will re-organize the partition like this
ALTER TABLE ProductCatalog
REORGANIZE PARTITION p22 INTO (
PARTITION p22 VALUES LESS THAN (TO_DAYS('2022-01-01')),
PARTITION p23 VALUES LESS THAN (TO_DAYS('2023-01-01')),
PARTITION p24 VALUES LESS THAN MAXVALUE
);
QUESTIONS
On average, about 200 million rows are inserted every year. Taking this into consideration:
- Does this partitioning do any good in terms of performance with my existing table structure and indexes?
- Does this partitioning do any good for the SELECT statements I have specified above?
- Does this partitioning do any good for the large number of inserts happening all the time?
- I know we can delete particular partition, but is it easily possible to archive particular partition to another archive database before deleting?
- Is there any other better idea, to get maximum possible read and write performance in the database?
- Do I need to worry about the column id int(10) when the data is increasing rapidly?
- What impact will this partitioning have on the other three queries, where ArrivalDate is not used in the WHERE condition?
Best Answer
So going down your list of questions:
No! If you always query on date, maybe.
Only your third query, and that's a maybe.
It depends on if the secondary indexes are also partitioned. But generally there can be more overhead as you have to query each partition that possibly contains rows. Partitioning also breaks a lot of features in MySQL so it shouldn't be undertaken lightly.
No. You're still appending according to the meaningless row identifier. Everything will be stuck on the last page.
I'm not sure about MySQL - moving/deleting partitions is relatively simple in other databases.
Yes! You either fix the problem now or have to rebuild the table with a larger, equally meaningless row identifier and further your woes.
Yes. It's called define the primary key, and cluster on that.
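One way to check the first few points is to ask MySQL which partitions a query would touch. The sketch below assumes the table has already been partitioned as in the question and a MySQL version (5.7 or later) whose EXPLAIN output includes a partitions column; older versions need EXPLAIN PARTITIONS instead. The behaviour described in the comments is what I would expect, not measured output, so verify it on your own data.

```sql
-- Query 3 filters on ArrivalDate, so the optimizer can prune the scan to a
-- few partitions: chiefly p20 (2019) and p21 (2020) in the question's scheme.
-- MySQL may also list the first partition for TO_DAYS() ranges.
EXPLAIN
SELECT *
FROM ProductCatalog
WHERE SerialNumber = '123456789'
  AND ItemCode = 'ABCD'
  AND ArrivalDate >= '2019-01-01 00:00:00'
  AND ArrivalDate <= '2020-12-31 23:59:59'
ORDER BY ArrivalDate ASC;

-- Query 1 has no ArrivalDate predicate, so every partition is a candidate
-- and the index in each partition has to be probed.
EXPLAIN
SELECT *
FROM ProductCatalog
WHERE SerialNumber = '1234567890'
  AND ItemCode != 'ABCD'
ORDER BY id DESC
LIMIT 1;
```

Also note that MySQL requires every unique key, including the primary key, to contain all columns used in the partitioning expression, so the ALTER TABLE in the question would be rejected while the primary key is (id) alone.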
Right now you have a big unorganized mess of rows. I'd call it a heap, but people reserve that for a certain type of unorganized mess of rows that hides the row pointer from the user instead of declaring it a primary key.
Your rows are located in the table according to their approximate insert order. To locate these rows more efficiently you slap an index on top so you can find where they're hidden, but even with the index you could be reading one page per row, or the entire table, depending on how the data is distributed. See my answer here for a very basic illustration: Save performance with large update on Index with Included Column
So it's not entirely obvious from your question if (SerialNumber, ArrivalDate, ItemCode) is unique, but let's assume it is (we can adjust later if an additional column is necessary to define uniqueness). If we define your table like so:
[...]
Three things happen:
[...] (SerialNumber, ItemCode, ArrivalDate).
The consequence of #3 is that all of the queries you submitted will traverse one b-tree and usually read no more than 1-2 pages to satisfy the query.
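As a rough sketch of what clustering on the primary key could look like in InnoDB, assuming (SerialNumber, ItemCode, ArrivalDate) really is unique and that the surrogate id is dropped. The table name ProductCatalogClustered is hypothetical, and the column list is copied from the question rather than taken from a definition given in this answer:

```sql
CREATE TABLE `ProductCatalogClustered` (
  `SerialNumber` varchar(20) NOT NULL,   -- primary key columns cannot be NULL
  `ItemCode` varchar(5) NOT NULL,
  `ArrivalDate` datetime NOT NULL,
  `BasePrice` decimal(10,2) DEFAULT NULL,
  `BatchCode` tinyint unsigned DEFAULT NULL,
  `Type` varchar(5) DEFAULT NULL,
  `InsertTimestamp` int unsigned NOT NULL,
  `BrandID` tinyint unsigned DEFAULT NULL,
  `Model` varchar(10) NOT NULL DEFAULT 'RX209',
  `Description` text,
  -- InnoDB stores rows in primary key order, so this key is the clustered index.
  PRIMARY KEY (`SerialNumber`, `ItemCode`, `ArrivalDate`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Query 3 then becomes a range scan over adjacent rows of the clustered index.
SELECT *
FROM ProductCatalogClustered
WHERE SerialNumber = '123456789'
  AND ItemCode = 'ABCD'
  AND ArrivalDate BETWEEN '2019-01-01 00:00:00' AND '2020-12-31 23:59:59'
ORDER BY ArrivalDate ASC;
```

Without id, the "latest row" queries would have to order by ArrivalDate instead of id. That is a change in meaning, and the sketch assumes it is acceptable.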
Yes, this can cause page splits (fragmentation), but the impact of page splits is generally less than the cost of additional indexes (which also fragment) and more read I/O. If it gets really bad (which you'll monitor just like anything else), you can rebuild the table (clustered index).
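A minimal sketch of the monitoring and rebuild mentioned above, assuming the current schema is the one holding ProductCatalog. DATA_FREE is only an estimate for InnoDB, so treat it as a rough indicator rather than an exact measure of fragmentation:

```sql
-- Rough view of table size and of the space allocated but currently unused.
SELECT TABLE_NAME, DATA_LENGTH, INDEX_LENGTH, DATA_FREE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'ProductCatalog';

-- Rebuild the table together with its clustered index.
-- On a table of this size the rebuild is expensive, so schedule it
-- outside the heavy write window.
ALTER TABLE ProductCatalog ENGINE=InnoDB;
```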
Other general observations:
- Use DATETIME/TIMESTAMP. The data type exists for a reason; if you have to convert on insert, that's better than having to convert on read.
- If Brand, Model, or Description is dependent on the ItemCode, then that data should be stored in its own table. That makes your main table even more compact.
- If SerialNumber and ItemCode are fixed length, you're probably better off using CHAR(<the max length>) versus VARCHAR.
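For the date/time point, a conversion could be sketched as follows. It assumes InsertTimestamp holds Unix epoch seconds, and InsertedAt is a hypothetical column name used only for illustration:

```sql
-- Add a real temporal column next to the integer one.
ALTER TABLE ProductCatalog ADD COLUMN InsertedAt DATETIME NULL;

-- Backfill from the integer column. FROM_UNIXTIME() converts using the
-- session time zone, so set that deliberately before running this.
-- On a table this large the update would normally be run in batches.
UPDATE ProductCatalog
SET InsertedAt = FROM_UNIXTIME(InsertTimestamp)
WHERE InsertedAt IS NULL;
```

New rows would then be written with the temporal value directly, so the conversion happens once on insert rather than on every read.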