Mysql – InnoDB: secondary key that’s ‘parallel’ to primary–do joins wherein the resulting primary keys are contiguous perform better

indexMySQLprimary-key

Say I have a table

(
person_id INT AUTO_INCREMENT, 
birthday_timestamp BIGINT, 
country, 
state, 
..., 
PRIMARY KEY (person_id), 
KEY (birthday_timestamp)
)

where people are added the instant they're born. In other words, the order of person_id is the same as that of birthday_timestamp. More formally, if I had two rows, A, and B,

A.person_id < B.person_id --> A.birthday_timestamp <= B.birthday_timestamp

Would it be reasonable to assume that performance on a WHERE by birthday_timestamp range should not be significantly worse than the equivalent WHERE on person_id range?

My rudimentary understanding of how this would be executed is the engine would fetch the list of person_id in the secondary index on birthday_timestamp and then fetch those results from the clustered index. Since these person_ids will be 100% contiguous, this should lead to minimal fetching of clustered index pages, right?

In contrast, if instead of birthday_timestamp, I had something whose order was completely irrelevant…say, favorite_number, this wouldn't be the case. In general, the person_id's corresponding to favorite numbers between 15 and 1,000,000 would be completely spread out and non-contiguous.

Back to the former case. The one alternative I can think of that wouldn't involve a join at all is the following. Say I want person_id, country, etc., for people born between 0 and 12345678. I could do

select * from people where birthday_timestamp >= 0 and birthday_timestamp <= 12345678

Which is what I described above and I assume would necessitate a join since MySQL doesn't actually know the keys are parellel, OR I could do two (almost)-point queries into the secondary index to grab the primary key range myself, i.e.

start = $(select person_id from people where birthday_timestamp >= 0 order by birthday_timestamp asc limit 1)

end = $(select person_id from people where birthday_timestamp <= 12345678 order by birthday_timestamp desc limit 1)

select * from people where person_id >= start and person_id <= end

Which seems like it could be a bit more optimal, though not by an order of magnitude. The downside is it's less concise.

Any thoughts?

Thanks

Best Answer

a) .. engine would fetch the list of person_id in the secondary index on birthday_timestamp and then fetch those results from the clustered index .. - Only InnoDB uses clustering by PK, MyISAM uses HEAP tables (no clustering) and the primary key is just another index with all indexes using pointer/offset to the heap to find the right row.

b) sequential vs random order of fetches only matters when you read data pages from disk, but when your query by birthday fetches first row from new page, innodb loads the entire page anyway and fetching next row of following born is "free" - no read from disk is needed because the page has already been loaded - it does not matter that mysql does not know the primary and secondary indexes are "parallel", it will have the right effect even without this knowledge - but the effect might be diminished if you have long rows so not many of them in one page. And MySQL won't use such index if it leads to reading more than ~20% of the table because it will prefer the clustered index for full scan in such case.

c) clustering by primary key does not actually mean that all the rows are stored totaly sequentialy in some file. It means that the table is organized as a B-Tree ordered by the PK. That B-Tree is stored as separate pages (nodes), each one containing some number of rows (depending on row size). So one page contains rows whose PK values are close together. But there is no guarantee that the logically next page will be stored physically next to the previous one. It may be "anywhere" in the table file (or in the big ibdata file if tables are not stored separately). When you only append to the table and never delete, you may expect that the new page will usually be allocated at the end so after the previous one, but there are possibilities of page splits even in that case and more so if the records are updated and deleted. That leads to "fragmentation" and non-continuous storage of the B-Tree pages. Afaik there is no direct defragmentation the way MyISAM offers (but rebuild might do that) for this but from what I gathered in the most extreme cases it may slow down table scans significantly. This is separate from file-level fragmentation but is very similar to that.

Mostly I agree with your assesment of the situation, the 3-query version looks promising and might be worth benchmarking for some cases. But you asked for some thoughts and this is what I got.

Related Solutions

Sql-server – Will an index that is a partial key of the begining of another index perform any better than the longer index

There really isn't even any guarantee that the optimizer will use the index in the first place. The difference of only one column between the two indexes is (for the most part) trivial. But if it did, any performance gain you would see (if any) would be trivial.

The reason for this is in how the B-Tree index is implemented by SQL Server. Both indexes are equally capable of satisfying the predicate (WHERE clause) and therefore locating only the rows that meet the search parameters. The only difference between the two would be how many pages in the index SQL Server would have to read in order to locate the required rows. And how many pages will ultimately be determined by the size of the index rows (Storage size in bytes of (TransactionDate, ClientID) vs. the alternative, with only the size of the State column being the difference between the two.

Just as an example, if a single index row was 60 bytes, SQL Server would only have to read 2 pages in order to locate a row up until there were around 2.5 million rows in the index. Then it would only require 3 pages to be read until the table reached more than 300 million rows.

So, what does that mean? Whether it is looking through an index of 2 or all 3 columns, SQL Server is still only looking through the same number of pages to locate the rows, unless the State column is so large as to create a significant difference in the size of a single row, thus causing the index pages to fill considerably faster. Any noticeable performance difference would only be caused if SQL Server had to read through more levels of the index to satisfy the query (3 levels vs. 4 levels, etc) and a single column of text, especially if State is a 2 char state abbreviation, simply won't be enough to make a significant difference.

If you really want to boost performance of the query, and the table is quite large, you might be better off exploring table partitioning, or if not on Enterprise Edition perhaps a partitioned view, all depending on whether there are certain ranges that are primarily searched and others that are searched much less often.

Mysql – query records to fetch by time stamp

Since the seconds are from 0 (1970-01-01 00:00:00 UTC), you should look for every multiple of 60

SELECT * FROM mytable WHERE MOD(TimeStamp,60)=0;

or if TimeStamp is indexed, you can do

SELECT T.* FROM
(SELECT TimeStamp FROM mytable WHERE MOD(TimeStamp,60)=0) M
INNER JOIN mytable T USING (TimeStamp);

Give it a Try !!!

SUGGESTION #1

You should store the timestamp of the minute and index it

ALTER TABLE mytable ADD COLUMN MinuteTimeStamp AFTER TimeStamp;
UPDATE mytable SET MinuteTimeStamp = TimeStamp - MOD(TimeStamp,60);
ALTER TABLE mytable ADD INDEX MinuteTimeStamp_UniqueKey_ndx (MinuteTimeStamp,UniqueKey);

Then, you can do MIN aggregation on MinuteTimeStamp.

SELECT MinuteTimeStamp,MIN(UniqueKey) UniqueKey
FROM mytable GROUP BY MinuteTimeStamp;

and use it get those records

SELECT B.* FROM
(SELECT MinuteTimeStamp,MIN(UniqueKey) UniqueKey
FROM mytable GROUP BY MinuteTimeStamp) A
INNER JOIN mytable B USING (UniqueKey);

It was tactfully pointed out that triggers would degrade performance

Perhaps doing INSERTs like this may help

INSERT INTO mytable (UniqueKey,TimeStamp,MinuteTimeStamp) VALUES
(
    uniquevalue,
    UNIX_TIMESTAMP(NOW()),
    UNIX_TIMESTAMP(NOW() - INTERVAL SECOND(NOW()) SECOND)
);

SUGGESTION #2

Since you have over 1000 columns (Ugh), perhaps a table of those minute timestamps would be better.

CREATE TABLE MinuteKeys
(
    MinuteTimeStamp INT UNSIGNED NOT NULL,
    UniqueKey INT UNSIGNED NOT NULL,
    PRIMARY KEY (UniqueKey)
    KEY MinuteTimeStamp_UniqueKey_ndx (MinuteTimeStamp,UniqueKey)
) ENGINE=MyISAM;
ALTER TABLE MinuteKeys DISABLE KEYS;
INSERT INTO MinuteKeys SELECT TimeStamp - MOD(TimeStamp,60),UniqueKey FROM mytable;
ALTER TABLE MinuteKeys ENABLE KEYS;

Then, you could use that table for the aggregation

SELECT B.* FROM
(SELECT MinuteTimeStamp,MIN(UniqueKey) UniqueKey
FROM MinuteKeys GROUP BY MinuteTimeStamp) A
INNER JOIN mytable B USING (UniqueKey);

EPILOGUE

Other suggestions are possible but you should really consider normalization of the table

See my post Too many columns in MySQL as to why