Mysql – How to efficiently partition a large MySQL (Aurora) table with multiple search columns

MySQLpartitioning

I have a large and rapidly growing table. It takes in about 5k rows / second and that will be doubling in about a month. The table is Amazon's Aurora (heavily modified InnoDB.)

As of today, total aggregate rows is around 4B.

Data is mostly searched on in 6 ways:

count(distinct col1,col2) where timestamp range (1 day range)
count(distinct col1,col2) where timestamp range (30 day range)
count(1) where country IN (list,of,countries) and timestamp range (30 day range)
count(1) where foreign_key = int and timestamp range (1 day range)
count(1) where foreign_key = int and timestamp range (30 day range)
select * where timestamp range (1 day range)

The foreign_key, timestamp and country columns are all indexed.

Data is only really "active" for a 90 day periods with most select's happening between -42 and -12 days.

How do I effectively partition this able to speed up those 6 queries?

And, almost as important, how do I accomplish this with minimal downtime?
(assumption here is to create a new table, move all inserts to the new table & copy old data in… like I would do with any new index)

Best Answer

I would consider using a date range as the partitioning scheme, and splitting by week which you can do using a function against the timestamp field. Using days would create too many partitions, and months would not help your queries much especially when the range spans two months.

Using the range partition, you can create mutliple partitions covering your active time period, and out into the future for a year, then nearer the end of that range you can add partitions to the schema as required.
If necessary, you can also drop the old partitions when they are no longer needed.

I can't prepare a sample partition schema for you, as I'm running short of time at the moment, but the MySQL docs have pretty good coverage for this type of partitioning.

Hope that helps,
Dave

Related Solutions

Mysql – effective thesql table/index design for 35 million rows+ table, with 200+ corresponding columns (double), any combination of which may be queried

coincidently I am also looking into one of the client support where we designed key-value pair structure for flexibility and currently table is over 1.5B rows and ETL is way too slow. well there are lot of other things in my case but have you thought about that design. you will have one row with all 200 columns present value, that row will convert in to 200 rows in Key-Value pair design. you will gain space advantage with this design depending on for a given AssetID and Date how many rows has actually all 200 f1 to f200 values present? if you say even 30% od columns have NULL value than that is your space saving. because in key-value pair design if value id NULL that row doesn't need to be in table. but in existing column structure design even NULL takes space.(I am not 100% sure but if you have more that 30 columns NULL in table then NULL take 4bytes). if you see this design and assume that all 35M rows has values in all 200 columns then you current db will become 200*35M=700M rows in table right away. but it will not be much high in table space what you had with all columns in single table as we are just Transposing the Columns in to row. in this transpose operation actually we will not have rows where the values are NULL. so you can actually run query against this table and see how many nulls are there and estimate you target table size before you actually implement it.

second advantage is read performance. as you mentioned that new way of querying the data is any combination this f1 to f200 column in where clause. with key value pair design f1 to f200 are present in one column lets say "FildName" and their values are present in second column lets say "FieldValue". you can have CLUSTERED index on both columns. your query will be UNION of those Selects.

WHERE (FiledName = 'f1' and FieldValue BETWEEN 5 AND 6)

UNION

(FiledName = 'f2' and FieldValue BETWEEN 8 AND 10)

etc.....

I will give you some performance numbers form actual prod server. we have 75 price columns for each security TICKER.

Mysql – Implications of a second index on a timestamp field in a sequential table

Implication is that you need ~1144Mb extra storage to index column time for ~100M rows...

Because the InnoDB engine wiil store the data off an PRIMARY or UNIQUE index within an non PRIMARY or UNIQUE index, as result your secondary index will be become larger

How much larger?? you can calculate it with this formula

int = 4 bytes
datetime = 8 bytes

100000000 records * (4 + 8 bytes) = 
100000000 * 12 bytes ~ 1200000000 bytes ( 1144.40918 Mb ) extra storage (note index records/page overhead are not in the calculation)

An larger index size will slow down inserts, delete and only updates when you update an value whats indexed.. An larger index size in thoery can slow down selects because off the InnoDB index page off 16K (read http://www.ovaistariq.net/733/) But still it depends on innodb configuration and cached data within the innodb buffer pool..

Or maybe you can use your approach by using an lookup table

CREATE TABLE tracker_snapshot_lookup (
    tracker_date DATE NOT NULL
  , tracker_snapshot_start_id INT UNSIGNED NOT NULL 
  , tracker_snapshot_end_id INT UNSIGNED NOT NULL

  , PRIMARY KEY(tracker_date)
  --   Covering index below is overkill... 
  -- , PRIMARY KEY(tracker_date, tracker_snapshot_start_id, tracker_snapshot_end_id)
) ENGINE = InnoDB;


insert into tracker_snapshot_lookup values('2013-11-13', 1, 10000);
insert into tracker_snapshot_lookup values('2013-11-14', 10001, 20000);

If you use an JOIN or deliverd table the MySQL optimzer can use in worse case

1 index key (Random disk I/O) lookup on tracker_snapshot_lookup.date (assuming with WHERE tracker_date = '2013-11-13' )
1 table (Random disk I/O) record key for tracker_snapshot_start_id and tracker_snapshot_end_id (not necessary when you make it an covering index)

Based on tracker_snapshot_start_id and tracker_snapshot_end_id MySQL will most likly choose an range scan (sequential disk I/O what is low costing with I/O waittime) on the tracker_snapshot table.

Your savings

DATE            3 bytes
INT NOT NULL    4 bytes

So in one year you lose on storage...

Table data

356 days * (3 + 4 bytes) 
356 * 12 = 4272 bytes ( 0.004 Mb )

Index data 

356 days * (3 bytes) = 1068 bytes ( 0.001 Mb )

It's magic because you use that ~1143Mb storage space for more important data

Best Answer

Related Solutions

Mysql – effective thesql table/index design for 35 million rows+ table, with 200+ corresponding columns (double), any combination of which may be queried

Mysql – Implications of a second index on a timestamp field in a sequential table

Related Question