Mysql – Significant processing time creating indexes on a large MySQL table

MySQLmysql-8.0

I'm creating indexes on a few tables I've recently created within MySQL.

The first (call below) completed within ~20 minutes and was applied to a table with a little over 20M records:

mysql> alter table fu.myTable add index(date), add index(id), add index(symbol), add index(method), algorithm=inplace, lock=none;
Query OK, 0 rows affected (20 min 20.52 sec)
Records: 0  Duplicates: 0  Warnings: 0

The second, applied to a much larger table consisting of nearly 1.1B rows, is taking considerably longer (~19 hours in at the moment). Sorting is obviously an O(n*log(n)) operation, so for a table 50x the size, by my calcs, processing on the larger table would be expected to take ~22 hours. Call I'm using for it below:

mysql> alter table op.myOtherTable add index(dataDate), add index(ticker), add index(symbol), add index(expDate), algorithm=inplace, lock=none;

With that, I wonder if there are any settings I may be able to adjust to speed things up some given my machine doesn't appear to be overly taxed at this point.

Edit:
The dataset consists of ~20 years of historical data and is parsed into daily CSVs (of ~+/-1M rows) to make it a little easier to work with. I've successfully gotten the data into MySQL but now need to apply indexes to make it usable. Fields I will likely search on are those I'm indexing on (dataDate, ticker, symbol, expDate). Additional detail below:

mysql> select table_rows from information_schema.tables where table_name = 'options';
+------------+
| table_rows |
+------------+
| 1097863845 |
+------------+
1 row in set (2.70 sec)


mysql> show create table op.options;
+---------+------------------------------------------------------------+
| Table   | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+---------+------------------------------------------------------------+
| options | CREATE TABLE `options` (
  `dataDate` date NOT NULL,
  `ticker` varchar(8) NOT NULL,
  `symbol` varchar(30) NOT NULL,
  `expDate` date NOT NULL,
  `type` char(4) NOT NULL,
  `price` double DEFAULT NULL,
  `strike` double DEFAULT NULL,
  `last` double DEFAULT NULL,
  `bid` double DEFAULT NULL,
  `ask` double DEFAULT NULL,
  `volume` int(11) DEFAULT NULL,
  `OI` int(11) DEFAULT NULL,
  `IV` double DEFAULT NULL,
  `delta` double DEFAULT NULL,
  `gamma` double DEFAULT NULL,
  `theta` double DEFAULT NULL,
  `vega` double DEFAULT NULL,
  `aka` varchar(30) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+----------------------------------------------------------------------+
1 row in set (0.01 sec)

Problem table layout sample:

mysql> select * from op.options limit 10;
+------------+--------+------------------+------------+------+-------+--------+------+------+------+--------+------+--------+---------+--------+---------+--------+------+
| dataDate   | ticker | optionSymbol     | expDate    | type | price | strike | last | bid  | ask  | volume | OI   | IV     | delta   | gamma  | theta   | vega   | aka  |
+------------+--------+------------------+------------+------+-------+--------+------+------+------+--------+------+--------+---------+--------+---------+--------+------+
| 2003-02-26 | A      | A030322C00010000 | 2003-03-22 | call |  12.8 |     10 |  3.7 | 2.8 ||  3.1 |      0 |   66 | 0.8389 |  0.8987 | 0.0653 | -3.8774 | 0.5735 | ACB
| 2003-02-26 | A      | A030322P00010000 | 2003-03-22 | put  |  12.8 |     10 | 0.15 |  0. ||  0.2 |      0 | 1436 | 0.8448 | -0.1027 | 0.0654 | -3.8059 | 0.5789 | AOB
| 2003-02-26 | A      | A030322C00012500 | 2003-03-22 | call |  12.8 |   12.5 |  1.1 | 0.9 || 1.05 |      0 | 6274 | 0.6549 |  0.5914 | 0.1833 | -6.5268 | 1.2571 | ACV
| 2003-02-26 | A      | A030322P00012500 | 2003-03-22 | put  |  12.8 |   12.5 |  0.6 | 0.6 || 0.75 |     75 |  929 | 0.6644 | -0.4089 | 0.1807 | -6.4545 | 1.2573 | AOV
| 2003-02-26 | A      | A030322C00015000 | 2003-03-22 | call |  12.8 |     15 | 0.15 |  0. ||  0.2 |     45 |  903 | 0.5812 |  0.1586 | 0.1286 | -3.5849 | 0.7831 | ACC
| 2003-02-26 | A      | A030322P00015000 | 2003-03-22 | put  |  12.8 |     15 | 2.15 | 2.2 ||  2.4 |      0 |  107 |  0.558 | -0.8526 | 0.1276 | -3.0784 | 0.7459 | AOC
| 2003-02-26 | A      | A030322C00017500 | 2003-03-22 | call |  12.8 |   17.5 | 0.05 |     ||  0.1 |      0 |  439 | 0.7012 |  0.0474 | 0.0435 | -1.7616 | 0.3198 | ACW
| 2003-02-26 | A      | A030322P00017500 | 2003-03-22 | put  |  12.8 |   17.5 |    5 |  4. ||  4.8 |      0 |  215 |  0.558 | -0.9838 | 0.0224 | -0.3395 | 0.1307 | AOW
| 2003-02-26 | A      | A030322C00020000 | 2003-03-22 | call |  12.8 |     20 | 0.05 |     || 0.05 |      0 |  224 | 0.8113 |  0.0193 | 0.0179 | -0.9668 | 0.1519 | ACD
| 2003-02-26 | A      | A030322P00020000 | 2003-03-22 | put  |  12.8 |     20 |  7.1 |  7. ||  7.3 |      0 |    6 |  0.558 |  -0.999 | 0.0019 |  0.2187 |  0.011 | AOD
+------------+--------+------------------+------------+------+-------+--------+------+------+------+--------+------+--------+---------+--------+---------+--------+------+
10 rows in set (0.44 sec)

Best Answer

  • Every day you bulk-insert 1M rows to add to a 1B-row table, correct?
  • There are several indexes on the table?
  • The goal is to add the rows reasonably fast?
  • The table is otherwise "readonly", so we are not worrying about writes?
  • The table is InnoDB, I hope.

Then...

Leave the indexes in place. The table is growing by only 0.1% each day. It would be faster to incrementally update the indexes than to drop and rebuild them.

  • InnoDB indexes are B+Trees.
  • Inserting one row, for example, will augment the B+Tree for each index.
  • The "Change buffer" delays the actual I/O involved in each index update, thereby making it not as bad as you might imagine.
  • Keep innodb_buffer_pool_size at a suitable size. (How much RAM do you have. The simple default of 70% of RAM may be close to optimal.) (The Change buffer is in the buffer_pool.)

Schema

For a billion-row table, space is important...

  • symbol and ticker? Do you need both? Consider normalizing the pair.
  • DOUBLE (8 bytes, 16 significant digits) is probably overkill for gamma, etc. Consider FLOAT (4 bytes, 7 significant digits)
  • Unless you have some limitations, volume can exceed a 32-bit INT (2B limit) or even INT UNSIGNED (4B limit)
  • I don't see id. If it is an AUTO_INCREMENT you add later, then I suggest you are risking overflow if it is INT.
  • A 'good' PRIMARY KEY would be (date, symbol_id) or possibly the opposite order.
  • What SELECTs do you have? If you are fetching "all info for one ticker", (symbol_id, date) is optimal. If you are talking about "today's data" then the opposite is better.
  • type could be an ENUM (1 byte)

INDEX(a), INDEX(b) is not the same as INDEX(a,b). The latter is significantly better in some situations. For querying a billion rows, you need to understand the differences. http://mysql.rjweb.org/doc.php/index_cookbook_mysql