I'm running a relatively simply query against a very large DB (~2B records) which, even with indexes, is taking a very long time (currently several minutes and running).
The script relies on two tables (rd and prices). Running the query with the below script runs in ~5 seconds, but when I included the commented out clause (as I'd ultimately like a timeseries) performance deteriorates. As I already have indexes on the columns used, I'm not sure how to speed this up further. Sample data, query and indexes are included below.
rd
datadate, sedol, ticker, name, q, year, reportDate
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5901, 2018, 2018-02-14
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5902, 2018, 2018-05-16
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5903, 2018, 2018-08-15
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5904, 2018, 2018-11-14
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5901, 2019, 2019-02-13
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5902, 2019, 2019-05-15
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5903, 2019, 2019-08-14
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5904, 2019, 2019-11-13
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5901, 2020, 2020-02-12
prices
dataDate, ticker, optionSymbol, expDate, type, price, strike, last, bid, ask, volume, OI, IV, delta, gamma, theta, vega, aka
2002-02-08, CSCO, CYQ020216C00007500, 2002-02-16, call, 16.76, 7.5, 9.8, 9.2, 9.3, 0, 444, 0.7553, 1, 0, -0.1439, 0, CYQBU
2002-02-08, CSCO, CYQ020216P00007500, 2002-02-16, put, 16.76, 7.5, 0, 0, 0.05, 0, 0, 1.4853, 0, 0.0001, -0.0157, 0.0004, CYQNU
2002-02-08, CSCO, CYQ020216C00010000, 2002-02-16, call, 16.76, 10, 6.7, 6.7, 6.8, 20, 372, 0.7553, 1, 0, -0.192, 0, CYQBB
2002-02-08, CSCO, CYQ020216P00010000, 2002-02-16, put, 16.76, 10, 0, 0, 0.05, 0, 0, 1.4853, -0.0052, 0.0043, -1.3207, 0.0358, CYQNB
2002-02-08, CSCO, CYQ020216C00012500, 2002-02-16, call, 16.76, 12.5, 4, 4.2, 4.3, 11, 159, 0.7553, 0.9974, 0.0045, -0.5961, 0.019, CYQBV
2002-02-08, CSCO, CYQ020216P00012500, 2002-02-16, put, 16.76, 12.5, 0.05, 0, 0.05, 2, 311, 1.4853, -0.0668, 0.0367, -11.3483, 0.3077, CYQNV
2002-02-08, CSCO, CYQ020216C00015000, 2002-02-16, call, 16.76, 15, 1.9, 1.85, 1.95, 1577, 3887, 0.7553, 0.8629, 0.1223, -10.0396, 0.5213, CYQBC
2002-02-08, CSCO, CYQ020216P00015000, 2002-02-16, put, 16.76, 15, 0.1, 0.05, 0.15, 2595, 43062, 0.6816, -0.1149, 0.1198, -7.7802, 0.461, CYQNC
2002-02-08, CSCO, CYQ020216C00017500, 2002-02-16, call, 16.76, 17.5, 0.3, 0.25, 0.3, 9144, 32613, 0.5865, 0.3179, 0.256, -12.4631, 0.8472, CYQBW
2002-02-08, CSCO, CYQ020216P00017500, 2002-02-16, put, 16.76, 17.5, 0.95, 0.95, 1.1, 12218, 54357, 0.6048, -0.676, 0.2502, -12.6152, 0.8539, CYQNW
Query
SELECT rd.reportDate, o.*
FROM op.prices o
JOIN (
SELECT *
, CASE WHEN dayofweek(date_add(rd.reportDate, INTERVAL -21 DAY)) = 1
THEN date_add(rd.reportDate, INTERVAL -23 DAY)
WHEN dayofweek(date_add(rd.reportDate, INTERVAL -21 DAY)) = 7
THEN date_add(rd.reportDate, INTERVAL -22 DAY)
ELSE date_add(rd.reportDate, INTERVAL -21 DAY)
END as rdadj
FROM eq.rd
WHERE ticker = 'CSCO'
AND reportDate > '2018-01-01'
) r
ON r.ticker = o.ticker
AND o.strike = 45
WHERE r.reportDate between date_add(o.expDate, INTERVAL -6 MONTH) AND o.expDate
AND o.dataDate = rdadj #and o.dataDate >= rdadj
AND r.reportDate > '2018-01-01'
ORDER BY o.expDate, o.dataDate
Indexes
Create Table
| rd | CREATE TABLE `rd` (
`datadate` date DEFAULT NULL,
`sedol` text,
`ticker` text,
`name` text,
`q` int DEFAULT NULL,
`year` int DEFAULT NULL,
`reportDate` date DEFAULT NULL,
KEY `reportDate` (`reportDate`,`ticker`(255),`year`,`q`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
| prices | CREATE TABLE `prices` (
`dataDate` date NOT NULL,
`ticker` varchar(8) NOT NULL,
`optionSymbol` varchar(30) NOT NULL,
`expDate` date NOT NULL,
`type` char(4) NOT NULL,
`price` double DEFAULT NULL,
`strike` double DEFAULT NULL,
`last` double DEFAULT NULL,
`bid` double DEFAULT NULL,
`ask` double DEFAULT NULL,
`volume` int DEFAULT NULL,
`OI` int DEFAULT NULL,
`IV` double DEFAULT NULL,
`delta` double DEFAULT NULL,
`gamma` double DEFAULT NULL,
`theta` double DEFAULT NULL,
`vega` double DEFAULT NULL,
`aka` varchar(30) DEFAULT NULL,
KEY `dataDate` (`dataDate`,`ticker`,`expDate`,`type`),
KEY `optionSymbol` (`optionSymbol`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
Explain
mysql> explain select rd.reportDate, op.*
-> from op.prices op
-> inner join (select *
-> , CASE WHEN dayofweek(date_add(rd.reportDate, INTERVAL -21 DAY)) = 1 THEN date_add(rd.reportDate, INTERVAL -23 DAY)
-> WHEN dayofweek(date_add(rd.reportDate, INTERVAL -21 DAY)) = 7 THEN date_add(rd.reportDate, INTERVAL -22 DAY)
-> ELSE date_add(rd.reportDate, INTERVAL -21 DAY)
-> END as rdadj
-> from eq.rd where ticker = 'CSCO' and reportDate > '2018-01-01') rd
-> on rd.ticker = op.ticker
-> and op.strike = 45
-> where rd.reportDate between date_add(op.expDate, INTERVAL -6 MONTH) and op.expDate
-> and op.dataDate > rdadj #and op.dataDate >= rdadj
-> and rd.reportDate > '2018-01-01' order by op.expDate, op.dataDate;
+----+-------------+-------+------------+-------+---------------+------------+---------+------+------------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+------------+---------+------+------------+----------+---------------------------------------------------------------------+
| 1 | SIMPLE | rd | NULL | range | reportDate | reportDate | 4 | NULL | 47538 | 10.00 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | op | NULL | ALL | dataDate | NULL | NULL | NULL | 2014263835 | 0.04 | Range checked for each record (index map: 0x1) |
+----+-------------+-------+------------+-------+---------------+------------+---------+------+------------+----------+---------------------------------------------------------------------+
2 rows in set, 1 warning (0.01 sec)
Best Answer
Some hints:
r.*
is pulling out lots of columns, yet only 2(?) are needed: reportDate and ticket. It is better (by a little or by a lot) to fetch only what you need.rd
refers to a table and to a subquery. That makes it very difficult for me to read. Please change. Ditto forop
being a database and an alias.Don't use
TEXT
when the value cannot be very long. (There are multiple reasons -- some minor, some major.)Don't use prefix indexing such as
KEY ... ticker(255)
, especially whenticker
seems to be limited to 8 characters. This can have a big impact in your quer.Have an explicit
PRIMARY KEY
on each table.If
year
is part ofreportDate
ordataDate
, it may be counterproductive to have it as a separate column. (Also, there is aYEAR
datatype.)If it is possible to "summarize" the day's data each night, that would speed up queries a lot.
Are you saying that changing > to >= causes a significant slowdown?Indexes:
It is usually inefficient to start an index with a date that is used in a "range".
A minor simplification, assuming "21 days" is always a multiple of 7:
Sargeable...
This may help significantly.
The outer
AND r.reportDate > '2018-01-01'
seems to be redundant.In picking a
PRIMARY KEY
, is there a combination of columns that is unique? And does it includeticker
? And are you I/O-bound? If yes to all those, be sure to haveticker
first; this will make any query based on single ticker run much faster. This is because of the I/O savings by having all the CSCO rows "clustered" together. (Let me know what combo would work; I'll probably have an opinion of how to order the rest.) Caveat: Adding a PK is much more invasive than adding a secondary index; it will rebuild the entire multi-billion-row table, and probably block any activity during theALTER
.The table is about 800GB, correct? How much RAM? What is the value of
innodb_buffer_pool_size
? Was the table built withinnodb_file_per_table=ON
? How much spare disk space do you have?DOUBLE
takes 8 bytes;FLOAT
takes 4 bytes. Some of those metrics could probably be put intoFLOAT
, thereby saving 8GB per column changed. (Again, this needs a costly table rebuild. But it could be combined with theADD PRIMARY KEY(...)
.)Suggest looking carefully at each column to see if it can be shrunk or eliminated or easily recomputed when fetched.