MySQL index performance

MySQLmysql-8.0Ubuntu

I'm running a relatively simply query against a very large DB (~2B records) which, even with indexes, is taking a very long time (currently several minutes and running).

The script relies on two tables (rd and prices). Running the query with the below script runs in ~5 seconds, but when I included the commented out clause (as I'd ultimately like a timeseries) performance deteriorates. As I already have indexes on the columns used, I'm not sure how to speed this up further. Sample data, query and indexes are included below.

rd

datadate, sedol, ticker, name, q, year, reportDate
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5901, 2018, 2018-02-14
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5902, 2018, 2018-05-16
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5903, 2018, 2018-08-15
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5904, 2018, 2018-11-14
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5901, 2019, 2019-02-13
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5902, 2019, 2019-05-15
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5903, 2019, 2019-08-14
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5904, 2019, 2019-11-13
2020-05-29, 2198163, CSCO, Cisco Systems Inc, 5901, 2020, 2020-02-12

prices

dataDate, ticker, optionSymbol, expDate, type, price, strike, last, bid, ask, volume, OI, IV, delta, gamma, theta, vega, aka
2002-02-08, CSCO, CYQ020216C00007500, 2002-02-16, call, 16.76, 7.5, 9.8, 9.2, 9.3, 0, 444, 0.7553, 1, 0, -0.1439, 0, CYQBU
2002-02-08, CSCO, CYQ020216P00007500, 2002-02-16, put, 16.76, 7.5, 0, 0, 0.05, 0, 0, 1.4853, 0, 0.0001, -0.0157, 0.0004, CYQNU
2002-02-08, CSCO, CYQ020216C00010000, 2002-02-16, call, 16.76, 10, 6.7, 6.7, 6.8, 20, 372, 0.7553, 1, 0, -0.192, 0, CYQBB
2002-02-08, CSCO, CYQ020216P00010000, 2002-02-16, put, 16.76, 10, 0, 0, 0.05, 0, 0, 1.4853, -0.0052, 0.0043, -1.3207, 0.0358, CYQNB
2002-02-08, CSCO, CYQ020216C00012500, 2002-02-16, call, 16.76, 12.5, 4, 4.2, 4.3, 11, 159, 0.7553, 0.9974, 0.0045, -0.5961, 0.019, CYQBV
2002-02-08, CSCO, CYQ020216P00012500, 2002-02-16, put, 16.76, 12.5, 0.05, 0, 0.05, 2, 311, 1.4853, -0.0668, 0.0367, -11.3483, 0.3077, CYQNV
2002-02-08, CSCO, CYQ020216C00015000, 2002-02-16, call, 16.76, 15, 1.9, 1.85, 1.95, 1577, 3887, 0.7553, 0.8629, 0.1223, -10.0396, 0.5213, CYQBC
2002-02-08, CSCO, CYQ020216P00015000, 2002-02-16, put, 16.76, 15, 0.1, 0.05, 0.15, 2595, 43062, 0.6816, -0.1149, 0.1198, -7.7802, 0.461, CYQNC
2002-02-08, CSCO, CYQ020216C00017500, 2002-02-16, call, 16.76, 17.5, 0.3, 0.25, 0.3, 9144, 32613, 0.5865, 0.3179, 0.256, -12.4631, 0.8472, CYQBW
2002-02-08, CSCO, CYQ020216P00017500, 2002-02-16, put, 16.76, 17.5, 0.95, 0.95, 1.1, 12218, 54357, 0.6048, -0.676, 0.2502, -12.6152, 0.8539, CYQNW

Query

SELECT rd.reportDate, o.*
FROM op.prices o
JOIN (
    SELECT *
        , CASE WHEN dayofweek(date_add(rd.reportDate, INTERVAL -21 DAY)) = 1 
               THEN date_add(rd.reportDate, INTERVAL -23 DAY)
               WHEN dayofweek(date_add(rd.reportDate, INTERVAL -21 DAY)) = 7 
               THEN date_add(rd.reportDate, INTERVAL -22 DAY)
               ELSE date_add(rd.reportDate, INTERVAL -21 DAY) 
          END as rdadj 
    FROM eq.rd 
    WHERE ticker = 'CSCO' 
      AND reportDate > '2018-01-01'
) r
    ON r.ticker = o.ticker
    AND o.strike = 45
WHERE r.reportDate between date_add(o.expDate, INTERVAL -6 MONTH) AND o.expDate
  AND o.dataDate = rdadj  #and o.dataDate >= rdadj
  AND r.reportDate > '2018-01-01' 
ORDER BY o.expDate, o.dataDate

Indexes

enter image description here

Create Table

    | rd    | CREATE TABLE `rd` (
      `datadate` date DEFAULT NULL,
      `sedol` text,
      `ticker` text,
      `name` text,
      `q` int DEFAULT NULL,
      `year` int DEFAULT NULL,
      `reportDate` date DEFAULT NULL,
      KEY `reportDate` (`reportDate`,`ticker`(255),`year`,`q`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 

| prices | CREATE TABLE `prices` (
  `dataDate` date NOT NULL,
  `ticker` varchar(8) NOT NULL,
  `optionSymbol` varchar(30) NOT NULL,
  `expDate` date NOT NULL,
  `type` char(4) NOT NULL,
  `price` double DEFAULT NULL,
  `strike` double DEFAULT NULL,
  `last` double DEFAULT NULL,
  `bid` double DEFAULT NULL,
  `ask` double DEFAULT NULL,
  `volume` int DEFAULT NULL,
  `OI` int DEFAULT NULL,
  `IV` double DEFAULT NULL,
  `delta` double DEFAULT NULL,
  `gamma` double DEFAULT NULL,
  `theta` double DEFAULT NULL,
  `vega` double DEFAULT NULL,
  `aka` varchar(30) DEFAULT NULL,
  KEY `dataDate` (`dataDate`,`ticker`,`expDate`,`type`),
  KEY `optionSymbol` (`optionSymbol`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |

Explain

mysql> explain select rd.reportDate, op.*
    ->             from op.prices op 
    ->             inner join (select *
    ->                                 , CASE WHEN dayofweek(date_add(rd.reportDate, INTERVAL -21 DAY)) = 1 THEN date_add(rd.reportDate, INTERVAL -23 DAY)
    ->                                         WHEN dayofweek(date_add(rd.reportDate, INTERVAL -21 DAY)) = 7 THEN date_add(rd.reportDate, INTERVAL -22 DAY)
    ->                                         ELSE date_add(rd.reportDate, INTERVAL -21 DAY) 
    ->                                         END as rdadj 
    ->                                         from eq.rd where ticker = 'CSCO' and reportDate > '2018-01-01') rd
    ->             on rd.ticker = op.ticker
    ->             and op.strike = 45
    ->             where rd.reportDate between date_add(op.expDate, INTERVAL -6 MONTH) and op.expDate
    ->             and op.dataDate > rdadj  #and op.dataDate >= rdadj
    ->             and rd.reportDate > '2018-01-01' order by op.expDate, op.dataDate;
+----+-------------+-------+------------+-------+---------------+------------+---------+------+------------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key        | key_len | ref  | rows       | filtered | Extra                                                               |
+----+-------------+-------+------------+-------+---------------+------------+---------+------+------------+----------+---------------------------------------------------------------------+
|  1 | SIMPLE      | rd    | NULL       | range | reportDate    | reportDate | 4       | NULL |      47538 |    10.00 | Using index condition; Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | op    | NULL       | ALL   | dataDate      | NULL       | NULL    | NULL | 2014263835 |     0.04 | Range checked for each record (index map: 0x1)                      |
+----+-------------+-------+------------+-------+---------------+------------+---------+------+------------+----------+---------------------------------------------------------------------+
2 rows in set, 1 warning (0.01 sec)

Best Answer

Some hints:

  • r.* is pulling out lots of columns, yet only 2(?) are needed: reportDate and ticket. It is better (by a little or by a lot) to fetch only what you need.

  • rd refers to a table and to a subquery. That makes it very difficult for me to read. Please change. Ditto for op being a database and an alias.

  • Don't use TEXT when the value cannot be very long. (There are multiple reasons -- some minor, some major.)

  • Don't use prefix indexing such as KEY ... ticker(255), especially when ticker seems to be limited to 8 characters. This can have a big impact in your quer.

  • Have an explicit PRIMARY KEY on each table.

  • If year is part of reportDate or dataDate, it may be counterproductive to have it as a separate column. (Also, there is a YEAR datatype.)

  • If it is possible to "summarize" the day's data each night, that would speed up queries a lot.

  •     and op.dataDate > rdadj
       #and op.dataDate >= rdadj   -- Is this the commented out part?
    

Are you saying that changing > to >= causes a significant slowdown?

  • Indexes:

      rd:  index(ticker, reportDate)  -- in this order!  and not TEXT!
      prices:  INDEX(strike, expDate, dataDate)   -- in this  order
    
  • It is usually inefficient to start an index with a date that is used in a "range".

  • A minor simplification, assuming "21 days" is always a multiple of 7:

  •     dayofweek(date_add(rd.reportDate, INTERVAL -21 DAY))   
       -->
        dayofweek(rd.reportDate)
    
  • Sargeable...

       WHERE r.reportDate between date_add(o.expDate, INTERVAL -6 MONTH)
                                       AND o.expDate
      -->
       WHERE o.expDate >= r.reportDate
         AND o.expDate <= r.reportDate + INTERVAL 6 MONTH
    
  • This may help significantly.

      INDEX(ticker, strike,   -- in either order,
            expDate, dataDate)  -- then these, in this order
    
  • The outer AND r.reportDate > '2018-01-01' seems to be redundant.

  • In picking a PRIMARY KEY, is there a combination of columns that is unique? And does it include ticker? And are you I/O-bound? If yes to all those, be sure to have ticker first; this will make any query based on single ticker run much faster. This is because of the I/O savings by having all the CSCO rows "clustered" together. (Let me know what combo would work; I'll probably have an opinion of how to order the rest.) Caveat: Adding a PK is much more invasive than adding a secondary index; it will rebuild the entire multi-billion-row table, and probably block any activity during the ALTER.

  • The table is about 800GB, correct? How much RAM? What is the value of innodb_buffer_pool_size? Was the table built with innodb_file_per_table=ON? How much spare disk space do you have?

  • DOUBLE takes 8 bytes; FLOAT takes 4 bytes. Some of those metrics could probably be put into FLOAT, thereby saving 8GB per column changed. (Again, this needs a costly table rebuild. But it could be combined with the ADD PRIMARY KEY(...).)

  • Suggest looking carefully at each column to see if it can be shrunk or eliminated or easily recomputed when fetched.