Mysql – way to improve performance of this simple SELECT query

awsinnodbMySQLmysql-5.7

I'm still relatively new to MySQL so please forgive any ignorance on my part. I've tried looking through other posts on here and StackExchange, but haven't had much luck in improving the performance of this simple SELECT query. Let me start with the table schema, the SELECT statement, and the EXPLAIN for it:

Table:

CREATE TABLE `Observations` (
  `InstrumentID` bigint(20) NOT NULL,
  `Epoch` bigint(20) NOT NULL,
  `EndingEpoch` bigint(20) DEFAULT NULL,
  `PhenomenonName` varchar(16) NOT NULL,
  `Value` decimal(18,5) DEFAULT NULL,
  `IsEstimated` bit(1) DEFAULT NULL,
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`InstrumentID`,`PhenomenonName`,`Epoch`,`Value`),
  KEY `FK_31` (`PhenomenonName`),
  KEY `idx_Epoch` (`Epoch`),
  CONSTRAINT `FK_27` FOREIGN KEY (`InstrumentID`) REFERENCES `Instruments` (`InstrumentID`),
  CONSTRAINT `FK_31` FOREIGN KEY (`PhenomenonName`) REFERENCES `Phenomena` (`PhenomenonName`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

SELECT:

SELECT PhenomenonName, Value, Epoch
FROM Observations
WHERE InstrumentID=2
AND Epoch BETWEEN 1514782800 AND 1546318740
AND PhenomenonName IN ('demand')
ORDER BY Epoch

EXPLAIN:

{
    "id": 1,
    "select_type": "SIMPLE",
    "table": "Observations",
    "partitions": null,
    "type": "range",
    "possible_keys": "PRIMARY,FK_31,idx_Epoch",
    "key": "PRIMARY",
    "key_len": "34",
    "ref": null,
    "rows": 1042464,
    "filtered": 100,
    "Extra": "Using where"
}

Sorry the explain is in JSON, I know that's not typical but the network configuration needs to be updated for my company's VPN so I can't connect via Workbench right now.

So, this query is for 1 year of 1-minute data, which is ~525K rows out of a table of ~18M rows and it takes ~22 seconds. Originally the PRIMARY key had Epoch before PhenomenonName, which was forcing it to do an index_merge_intersect. But, fixing the order of the key (which as you can see leads to it just using the PRIMARY key now) only shaved off 1 second. I've also tried running ANALYZE on the table, and changing Epoch BETWEEN 1514782800 AND 1546318740 to Epoch >= 1514782800 AND Epoch <= 1546318740, both of which seemed to make no difference.

Granted the hardware right now is very small since we're still developing the server–only a db.t2.small, which is 1 vCPU and 2GB RAM. However, increasing it to a db.t2.large, 2 vCPU and 8GB RAM, brings it down to 13 seconds. And then increasing it again to a db.t2.2xlarge, 8 vCPU and 32GB RAM, doesn't seem to improve performance–it's still ~13 seconds. Also, the storage is 100 GiB SSD and the buffer pool size is 75% of total RAM as per AWS's default setting.

Is this just a hardware limitation? If so, that's fine, but I want to make sure I'm not missing any opportunities to optimize the table schema or select statement, especially since this is just a query for one instrument, and queries will regularly have to be made aggregating several of them together.

Thanks in advance, and please let me know if I can provide anymore information.

Edit: By the way, this is unrelated, but as you can see I have two foreign key restraints, but only one of them has created an actual key. Is there any reason for this? My understanding was that foreign key restraints automatically create a key.

Edit2: The query took ~23 seconds on the small instance, and since adding Value to the PRIMARY KEY as per @SQLRaptor's suggestion, it's down to ~21 seconds. Still about ~13 seconds on the db.t2.2xlarge.

Best Answer

525K rows takes time to shovel. It takes time to read from disk (unless it happens to be cached in RAM. It takes time to send that many rows across the wire to the client. And the client has to receive them.

The index is optimal; it has the desired columns in the desired order. But it is suspicious. Keep in mind that a PRIMARY KEY is, by definition, unique. It is usually not reasonable to have a "value" in the PK. And there is no difference if you leave it out.

One thing that will make a small improvement is to shrink the datatypes where practical. Almost no application need BIGINT, but you have 3. BIGINT is 8 bytes. INT is only 4 bytes and can hold a timestamp (a la epoch). Surely you don't have billions of 'Instruments'. PhenomenonName could be normalized or perhaps turned into a 1-byte ENUM.

Related Solutions

MySQL looking up more rows than needed (indexing issue)

Your indexes are fine for the two types of queries you mentioned.

This query will be satisfied by traversing the clustered index on the primary key...

[...] WHERE participant_id = x AND question_id = y AND given_answer_id = z;

...and this one is satisfied by the index on 'question_id':

[...] WHERE question_id = x;

The output of EXPLAIN SELECT is not telling you what you think it is telling you, because the value shown in rows is an estimate of the number of rows the server will need to consider, not the actual rows it will examine. For InnoDB these are based on index statistics.

rows

The rows column indicates the number of rows MySQL believes it must examine to execute the query.

For InnoDB tables, this number is an estimate, and may not always be exact.

^{— http://dev.mysql.com/doc/refman/5.5/en/explain-output.html#explain_rows}

The optimizer gathers information about different possible query plans, and chooses the one with the lowest cost. The information shown in EXPLAIN is the information the optimizer gathered about the plan it selected.

When type is ref and key is not NULL, this means that the name listed in the key column is the name of the index that the optimizer has chosen to use to find the desired rows, so your query plan looks exactly as it should.

Note, sometimes you will see Using index in the Extra column and a lot of people assume that this means an index is being used, or that no index is being used when that doesn't appear, but that's not correct, either. Using index describes a special case called a "covering index" -- it does not indicate whether an index is being used to locate the rows of interest.

It's possible that running ANALYZE [LOCAL] TABLE would cause the numbers in rows shown by EXPLAIN to differ, but this is a simple query and selecting this index is an obvious choice for the optimizer to make, so ANALYZE TABLE is unlikely to make any actual difference in performance.

It is possible, however, that your overall performance might see some marginal improvement with an occasional OPTIMIZE [LOCAL] TABLE, because you are not inserting rows in primary key order (as would be the case with an auto_increment primary key)... but on large tables this can be time-consuming because it rebuilds a new copy of the table... but, again, I wouldn't expect any significant change.

MYSQL improve query performance when using OR

What about using a union?

SELECT * FROM table WHERE (col1 = '123' AND col2 = '456') UNION ALL SELECT * FROM table WHERE (col1 = '456' AND col2 = '123')

Best Answer

Related Solutions

MySQL looking up more rows than needed (indexing issue)

MYSQL improve query performance when using OR

Related Question