MySQL – Creating Partition Expression Based on Two Keys or Indexes

MySQLpartitioning

if i have a table with, lets say 500 million rows, and among others there are two main individual indexes on that table. the table looks like:

CREATE TABLE `t1` (
    id_1 bigint unsigned not null,
    id_2 bigint unsigned not null,
    col3, col4 ... colN ...
    KEY `index1` (`id_1`),
    KEY `index2` (`id_2`),
    ...
) ENGINE=InnoDB;

100% of the queries on the table are using WHERE clause on index1 OR index2.

Taking this into account, and assuming it is now very reasonable time to partition that table, into 100 partitions, I would like to ask your help to understand these issues:

Is it possible to create a partition expression in such a way that after the partitioning, for any query that using one of these indexes:
```
SELECT * FROM `t1` WHERE id_1 = 123;
-- or
SELECT * FROM `t1` WHERE id_2 = 456;
```
the engine will scan one partition only?
issue 1 must happen without changing the indexes. I mean, without making back references in them, because it will be too much RAM consuming. So, this must not happen:
```
KEY `index1` (`id_1`,`id_2`),
KEY `index1` (`id_2`,`id_1`),
```
Am I thinking in a wrong direction? How would you solve this issue with Amazon RDS 7.5 GB RAM instance ?

feel free to edit this question, to make it more possible to be answered, thanks

reference to stackoverflow.com

Best Answer

I had written an interesting alternative of implementing your own manual hash indexes for your table, and then I made maths and realised your constraints:

Having into memory 3 bigints will cost you 500*10^6*(8*8*8)/(1024*1024*1024) = 11.17GB that you do not have. RDS is simply not adequate for you anymore, as it is not flexible enough to try some alternative engines -other than InnoDB (you need an engine that works well with indexes on disk/clustering on several keys/hash indexes)- and probably too costly to handle a large table like that.

You need either a higher-end instance or migrate to EC2 to deploy an alternative engine.

Best recommendation that I could give you for your current constraints (7GB ram, InnoDB):

Use your most frequently accessed keys as your primary key, partition by RANGE on that (lets call it id_1). Do not create any other secondary keys:

CREATE TABLE `t1` (
    id_1 bigint unsigned PRIMARY KEY,
    id_2 bigint unsigned not null,
    col3, col4 ... colN ...
) ENGINE=InnoDB
PARTITION BY RANGE (id_1) (
    PARTITION p0 VALUES LESS THAN (n),
    PARTITION p1 VALUES LESS THAN (m),
    ...
);

Create a separate table with (id_2, id_1):

CREATE TABLE `t1_id2_index` (
    id_2 bigint unsigned PRIMARY KEY,
    id_1 bigint unsigned not null,
) ENGINE=InnoDB
PARTITION BY RANGE (id_2) (
    PARTITION p0 VALUES LESS THAN (n),
    PARTITION p1 VALUES LESS THAN (m),
    ...
);

Obviously you will have to insert on this second table each time you insert on the first one. You may think this is worse, but it will not be that bad as you are getting rid of huge merging processes of the secondary keys and minimising memory usage (which is your goal, afterwards).

This will only access 1 partition on access by id_1 and 2 partitions (one on each separate table) on access through id_2:

SELECT * FROM `t1` WHERE id_1 = 123;
-- or
SELECT STRAIGHT_JOIN t1.* 
FROM `t1_id2_index` 
JOIN `t1`
ON t1_id2_index.id_2 = 456 
   and 
   t1.id_1 = t2.id_1;

If your most frequent accesses are on the latest partitions, you will get the desired improvements -make sure you partition with that in mind. You can check partition pruning by using EXPLAIN PARTITIONS. Of course, if access patterns are completely random, you will not get any advantage. The goal is to maintain everything on disk except for a small set of primary keys for both id_1 and id_2 and selected rows.

You may want to minimise read ahead caching and tune innodb_old_blocks_pct and innodb_old_blocks_time for more effective caching/eviction on the buffer pool. I hope you are also using SSDs.

This is not beautiful, but please refer to my initial suggestion of migrating away from SAAS for custom requirements.

Related Solutions

Mysql – Are two indexes needed

An index can seek by a subset of characters, as long as you're searching from the left. E.g., "Inter%" can seek, "%net" will not.

However, the first character is not necessarily the character under which the article would be sorted. "The Internet" should go under "I", not "T". You probably need two fields, DisplayTitle and SortTitle; a single-character index on the latter may be worthwhile, but most likely a full-length index will be just fine.

Indexes are typically B-trees, and a seek will jump to the right location about equally quickly whether you have 10 or 100 entries per page. Scans are another matter, but I'd start with the simplest solution and add an extra index only if performance proves inadequate in practice.

MySQL looking up more rows than needed (indexing issue)

Your indexes are fine for the two types of queries you mentioned.

This query will be satisfied by traversing the clustered index on the primary key...

[...] WHERE participant_id = x AND question_id = y AND given_answer_id = z;

...and this one is satisfied by the index on 'question_id':

[...] WHERE question_id = x;

The output of EXPLAIN SELECT is not telling you what you think it is telling you, because the value shown in rows is an estimate of the number of rows the server will need to consider, not the actual rows it will examine. For InnoDB these are based on index statistics.

rows

The rows column indicates the number of rows MySQL believes it must examine to execute the query.

For InnoDB tables, this number is an estimate, and may not always be exact.

^{— http://dev.mysql.com/doc/refman/5.5/en/explain-output.html#explain_rows}

The optimizer gathers information about different possible query plans, and chooses the one with the lowest cost. The information shown in EXPLAIN is the information the optimizer gathered about the plan it selected.

When type is ref and key is not NULL, this means that the name listed in the key column is the name of the index that the optimizer has chosen to use to find the desired rows, so your query plan looks exactly as it should.

Note, sometimes you will see Using index in the Extra column and a lot of people assume that this means an index is being used, or that no index is being used when that doesn't appear, but that's not correct, either. Using index describes a special case called a "covering index" -- it does not indicate whether an index is being used to locate the rows of interest.

It's possible that running ANALYZE [LOCAL] TABLE would cause the numbers in rows shown by EXPLAIN to differ, but this is a simple query and selecting this index is an obvious choice for the optimizer to make, so ANALYZE TABLE is unlikely to make any actual difference in performance.

It is possible, however, that your overall performance might see some marginal improvement with an occasional OPTIMIZE [LOCAL] TABLE, because you are not inserting rows in primary key order (as would be the case with an auto_increment primary key)... but on large tables this can be time-consuming because it rebuilds a new copy of the table... but, again, I wouldn't expect any significant change.

Best Answer

Related Solutions

Mysql – Are two indexes needed

MySQL looking up more rows than needed (indexing issue)

Related Question