Mysql – Database slows down at millions of rows, how to optimize schema and queries

innodb, MySQL, performance, query-performance, scalability, schema

I am building a database to store Ethereum transactions that are connected to specific addresses. Once the table eth_addresses_tx starts hitting 3 million+ rows, things slow down a lot and CPU usage hits 100%. Storage is fine; 1 million rows is about 300 MB. Before spending more money on a server, I would like to see if there are any ways to improve the schema and queries. The project will have 15-20 million rows to start and may have a lot more in the near future. There will be 200-500 unique addresses, and each address may contain anywhere from a small number of rows (1,000) up to a large number (3 million). I could use some advice. Thank you.

Update: one more index has been added, on datetimeTx in the transactions table eth_addresses_tx, which fixed performance on almost all the queries besides the last two in this post.
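
For reference, the added index would have been created along these lines (the index name is illustrative):

```sql
ALTER TABLE `eth_addresses_tx` ADD INDEX `datetimeTx` (`datetimeTx`);
```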

Table purposes:

eth_addresses – Stores the addresses being tracked and metadata about each address.

eth_addresses_stats – Stores statistics computed from the transactions table (eth_addresses_tx), such as how many transactions occurred in certain periods. A script run on a cron job performs these queries.

eth_addresses_tx – TX is short for transaction in the queries and schema. This table stores data from blockchain transactions attached to certain addresses. The value column contains the ETH amount for that transaction. There is one secondary index, from_tx.

MYSQL SCHEMA

CREATE DATABASE addresses CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

USE addresses;

CREATE TABLE `eth_addresses` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `name` VARCHAR(32) NOT NULL,
    `website` VARCHAR(255) NOT NULL,
    `address` VARCHAR(1024) NOT NULL COMMENT 'ETH addresses',
    PRIMARY KEY (`id`), UNIQUE `unique` (`website`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `eth_addresses_stats` (
    `id` int(11) NOT NULL,
    `balance` decimal(16,6) NULL DEFAULT NULL,
    `tx_1h` int(11) NULL DEFAULT NULL COMMENT 'transactions',
    `tx_24h` int(11) NULL DEFAULT NULL COMMENT 'transactions',
    `tx_7d` int(11) NULL DEFAULT NULL COMMENT 'transactions',
    `tx_all` int(11) NULL DEFAULT NULL COMMENT 'transactions',
    `volume_1h` decimal(14,6) NULL DEFAULT NULL,
    `volume_24h` decimal(14,6) NULL DEFAULT NULL,
    `volume_7d` decimal(14,6) NULL DEFAULT NULL,
    `volume_all` decimal(14,6) NULL DEFAULT NULL,
    `users_1h` int(11) NULL DEFAULT NULL,
    `users_24h` int(11) NULL DEFAULT NULL,
    `users_7d` int(11) NULL DEFAULT NULL,
    `users_all` int(11) NULL DEFAULT NULL,
    `last_scraped` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    CONSTRAINT FOREIGN KEY (id) REFERENCES eth_addresses (id),
    UNIQUE (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

and

CREATE TABLE `eth_addresses_tx` (
    `hash` varchar(66) NOT NULL,
    `address_id` INT(11) NOT NULL,
    `blockNumber` int(11) NOT NULL,
    `datetimeTx` datetime NOT NULL,
    `from_tx` varchar(42) NOT NULL,
    `to_tx` varchar(42) NOT NULL,
    `value` decimal(14,6) NOT NULL,
    CONSTRAINT FOREIGN KEY (address_id) REFERENCES eth_addresses (id),
    PRIMARY KEY (`hash`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Transactions';
ALTER TABLE `eth_addresses_tx` ADD INDEX `from_tx` (`from_tx`);

Slow Queries

The queries below are run periodically and the results are stored in a caching table.

These queries gather total stats from the transactions table eth_addresses_tx to be sent to eth_addresses_stats. I was unsure how to combine them into one query, so I wrote four queries.

UPDATE The 4 queries below are very fast with an index on datetimeTx. Addresses with a million+ rows now take 0.14 seconds.

SELECT IFNULL(SUM(value), 0) AS volume_1h,
       count(*) AS tx_1h,
       COUNT(DISTINCT from_tx) AS users_1h
FROM `eth_addresses_tx`
WHERE datetimeTx >= DATE_SUB(NOW(),INTERVAL 1 HOUR)
  AND address_id = $address_id;


-- This query is slow on popular addresses with 1M+ rows
-- 1 row in set (32.30 sec) without explain
+----+-------------+------------------+------------+------+---------------+------------+---------+-------+---------+----------+-------------+
| id | select_type | table            | partitions | type | possible_keys | key        | key_len | ref   | rows    | filtered | Extra       |
+----+-------------+------------------+------------+------+---------------+------------+---------+-------+---------+----------+-------------+
|  1 | SIMPLE      | eth_addresses_tx | NULL       | ref  | address_id    | address_id | 4       | const | 1440328 |    33.33 | Using where |
+----+-------------+------------------+------------+------+---------------+------------+---------+-------+---------+----------+-------------+


SELECT IFNULL(SUM(value), 0) AS volume_24h,
       count(*) AS tx_24h, COUNT(DISTINCT from_tx) AS users_24h
FROM `eth_addresses_tx`
WHERE datetimeTx >= DATE_SUB(NOW(),INTERVAL 24 HOUR)
  AND address_id = $address_id;

SELECT IFNULL(SUM(value), 0) AS volume_7d,
       count(*) AS tx_7d, COUNT(DISTINCT from_tx) AS users_7d
FROM `eth_addresses_tx`
WHERE datetimeTx >= DATE_SUB(NOW(),INTERVAL 7 DAY)
  AND address_id = $address_id;

SELECT IFNULL(SUM(value), 0) AS volume_all,
       count(*) AS tx_all,
       COUNT(DISTINCT from_tx) AS users_all
FROM `eth_addresses_tx`
WHERE address_id = $address_id;
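
One possible way to fold the time windows into a single pass over the rows, using conditional aggregation (an untested sketch kept here for reference; the per-window distinct count relies on `COUNT(DISTINCT ...)` ignoring NULLs):

```sql
SELECT IFNULL(SUM(IF(datetimeTx >= NOW() - INTERVAL 1 HOUR, value, 0)), 0)     AS volume_1h,
       SUM(datetimeTx >= NOW() - INTERVAL 1 HOUR)                              AS tx_1h,
       COUNT(DISTINCT IF(datetimeTx >= NOW() - INTERVAL 1 HOUR, from_tx, NULL)) AS users_1h,
       IFNULL(SUM(value), 0)   AS volume_all,
       COUNT(*)                AS tx_all,
       COUNT(DISTINCT from_tx) AS users_all
FROM eth_addresses_tx
WHERE address_id = $address_id;
```

The 24-hour and 7-day columns would follow the same pattern. Note this always scans all rows for the address (as the all-time query already must), so it trades the datetimeTx pruning of the windowed queries for fewer passes.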

These queries get recent transactions from eth_addresses_tx with a few different WHERE clauses. Ordering by datetime is very slow; without the ordering, the queries are very fast.

UPDATE The following 3 queries below are now very fast with an index on datetimeTx. 10 rows in set (0.00 sec)

SELECT  blockNumber, datetimeTx, address_id, from_tx, to_tx,
    value
     FROM  eth_addresses_tx
    WHERE  from_tx = $from_tx
    ORDER BY  datetimeTx DESC
    LIMIT  $limit;

SELECT  blockNumber, datetimeTx, address_id, from_tx, to_tx,
    value
     FROM  eth_addresses_tx
    WHERE  address_id = $address_id
    ORDER BY  datetimeTx DESC
    LIMIT  $limit;

SELECT  blockNumber, datetimeTx, address_id, from_tx, to_tx,
    value
     FROM  eth_addresses_tx
    ORDER BY  datetimeTx DESC
    LIMIT  $limit;

and

-- This query without explain = 10 rows in set (1.20 sec) 
-- This one is faster than the ones below
EXPLAIN SELECT  blockNumber, datetimeTx, address_id, from_tx, to_tx,
    value
     FROM  eth_addresses_tx
    WHERE  from_tx = $from_tx
    ORDER BY  datetimeTx DESC
    LIMIT  10;
+----+-------------+------------------+------------+------+---------------+---------+---------+-------+-------+----------+---------------------------------------+
| id | select_type | table            | partitions | type | possible_keys | key     | key_len | ref   | rows  | filtered | Extra                                 |
+----+-------------+------------------+------------+------+---------------+---------+---------+-------+-------+----------+---------------------------------------+
|  1 | SIMPLE      | eth_addresses_tx | NULL       | ref  | from_tx       | from_tx | 128     | const | 51680 |   100.00 | Using index condition; Using filesort |
+----+-------------+------------------+------------+------+---------------+---------+---------+-------+-------+----------+---------------------------------------+

-- This query without explain = 10 rows in set (31.51 sec)   
-- Address ids with a small number of transactions query fast; addresses with large numbers of rows query very slowly
EXPLAIN     SELECT  blockNumber, datetimeTx, address_id, from_tx, to_tx,
    value
     FROM  eth_addresses_tx
    WHERE  address_id = $address_id
    ORDER BY  datetimeTx DESC
    LIMIT  10;
+----+-------------+------------------+------------+------+---------------+------------+---------+-------+---------+----------+---------------------------------------+
| id | select_type | table            | partitions | type | possible_keys | key        | key_len | ref   | rows    | filtered | Extra                                 |
+----+-------------+------------------+------------+------+---------------+------------+---------+-------+---------+----------+---------------------------------------+
|  1 | SIMPLE      | eth_addresses_tx | NULL       | ref  | address_id    | address_id | 4       | const | 1440242 |   100.00 | Using index condition; Using filesort |
+----+-------------+------------------+------------+------+---------------+------------+---------+-------+---------+----------+---------------------------------------+

-- This query without explain = 10 rows in set (9.93 sec)
EXPLAIN SELECT blockNumber, datetimeTx, address_id, from_tx, to_tx, value FROM eth_addresses_tx ORDER BY datetimeTx DESC LIMIT 10;
+----+-------------+------------------+------------+------+---------------+------+---------+------+---------+----------+----------------+
| id | select_type | table            | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra          |
+----+-------------+------------------+------------+------+---------------+------+---------+------+---------+----------+----------------+
|  1 | SIMPLE      | eth_addresses_tx | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 2880430 |   100.00 | Using filesort |
+----+-------------+------------------+------------+------+---------------+------+---------+------+---------+----------+----------------+

These queries plot transaction data on a chart to track volume.

UPDATE The following 3 queries are now very fast with an index on datetimeTx. 73 rows in set (0.02 sec)

SELECT value FROM eth_addresses_tx WHERE address_id = $address_id AND value > 0 AND datetimeTx >= DATE_SUB(NOW(),INTERVAL 1 HOUR);


-- This query is slow on popular addresses with 1M+ rows
-- 88 rows in set (31.81 sec) without explain
+----+-------------+------------------+------------+------+---------------+------------+---------+-------+---------+----------+-------------+
| id | select_type | table            | partitions | type | possible_keys | key        | key_len | ref   | rows    | filtered | Extra       |
+----+-------------+------------------+------------+------+---------------+------------+---------+-------+---------+----------+-------------+
|  1 | SIMPLE      | eth_addresses_tx | NULL       | ref  | address_id    | address_id | 4       | const | 1440342 |    11.11 | Using where |
+----+-------------+------------------+------------+------+---------------+------------+---------+-------+---------+----------+-------------+

SELECT value FROM eth_addresses_tx WHERE address_id = $address_id AND value > 0 AND datetimeTx >= DATE_SUB(NOW(),INTERVAL 24 HOUR);
SELECT value FROM eth_addresses_tx WHERE address_id = $address_id AND value > 0 AND datetimeTx >= DATE_SUB(NOW(),INTERVAL 7 DAY);

These queries figure out who is interacting with those addresses the most.

SELECT d.name, count(d.id) AS total
        FROM `eth_addresses_tx` AS tx
        LEFT JOIN eth_addresses AS d ON d.id = tx.address_id
        WHERE tx.from_tx = :address
        GROUP BY d.name;

-- This query is not bad, pretty fast on smaller amounts
-- 1 row in set (1.06 sec) without explain
+----+-------------+-------+------------+------+---------------+---------+---------+-------+-------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref   | rows  | filtered | Extra                                              |
+----+-------------+-------+------------+------+---------------+---------+---------+-------+-------+----------+----------------------------------------------------+
|  1 | SIMPLE      | tx    | NULL       | ref  | from_tx       | from_tx | 128     | const | 51680 |   100.00 | Using temporary; Using filesort                    |
|  1 | SIMPLE      | d     | NULL       | ALL  | PRIMARY       | NULL    | NULL    | NULL  |     5 |   100.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+---------+---------+-------+-------+----------+----------------------------------------------------+


-----------------------------------------------------


SELECT from_tx, count(*) AS total_tx, SUM(value) as total_volume, count(DISTINCT address_id) as total_interacted
        FROM `eth_addresses_tx`
        GROUP BY from_tx
        ORDER BY total_tx DESC, from_tx DESC
        LIMIT 100;


-- This query is very bad
-- 100 rows in set (2 min 26.45 sec) without explain
-- sorting with any of the 3 items selected is slow
+----+-------------+------------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------------------+
| id | select_type | table            | partitions | type  | possible_keys | key     | key_len | ref  | rows    | filtered | Extra                           |
+----+-------------+------------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------------------+
|  1 | SIMPLE      | eth_addresses_tx | NULL       | index | from_tx       | from_tx | 128     | NULL | 2880587 |   100.00 | Using temporary; Using filesort |
+----+-------------+------------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------------------+

Best Answer

Hash indexing is bad when the table is big

Assuming hash is some form of "hash", eth_addresses_tx has a very inefficient PRIMARY KEY. When a table (or index) gets bigger than can be cached in RAM and the key is "random" (as hashes are), things slow down because the block needed is less and less likely to be cached in RAM. What to do?

  • Consider using some non-'random' index
  • Shrink the data -- Do you need a 4-byte INT for the id fields? For eth_addresses, even 3M rows could be handled by a 3-byte MEDIUMINT UNSIGNED.
  • address VARCHAR(42) -- If that is really 0x plus 40 hex digits, pack it into BINARY(20) (20 bytes) with the help of LEFT and UNHEX.
  • Could from/to_tx be normalized? Are they hashes, too?
  • Do the DECIMALs need to be that big? (14,6) takes 7 bytes; FLOAT takes 4 (but has fewer significant digits). What kind of value is "volume"? Cubic furlongs? Stock shares traded? Books in a trilogy? Fluid ounces? If you need exact values, then DECIMAL is required. But is 6 decimal places enough? (Not for Gwei or Wei.) FLOAT's 7 significant digits may be good enough for summary totals.
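
To illustrate the BINARY(20) packing mentioned above (the sample address is arbitrary; SUBSTRING is used here to strip the 0x prefix):

```sql
SET @addr = '0xde0b295669a9fd93d5f28d9ec85e40f4cb697bae';
-- pack: drop the leading '0x', then turn 40 hex digits into 20 bytes
SET @packed = UNHEX(SUBSTRING(@addr, 3));   -- fits a BINARY(20) column
-- unpack for display: back to the familiar 0x... form
SELECT CONCAT('0x', LOWER(HEX(@packed))) AS address;
```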

Better indexes

WHERE datetimeTx >= DATE_SUB(NOW(),INTERVAL 1 HOUR)
  AND address_id = $address_id;

needs INDEX(address_id, datetimeTx) in this order.

WHERE  from_tx = $from_tx
ORDER BY  datetimeTx DESC
LIMIT  $limit;

very much needs INDEX(from_tx, datetimeTx) in this order. Ditto for (address_id, datetimeTx). This should eliminate the 'filesort', but more importantly, it will need to look at only 10 rows, not thousands (see your EXPLAINs).
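
In DDL terms, the two suggested composite indexes would look something like this (the index names are illustrative):

```sql
ALTER TABLE eth_addresses_tx
    ADD INDEX idx_from_dt (from_tx, datetimeTx),
    ADD INDEX idx_addr_dt (address_id, datetimeTx);
```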

This gets trickier:

WHERE address_id = $address_id
  AND value > 0
  AND datetimeTx >= DATE_SUB(NOW(), ...)

You cannot effectively use two "ranges" in a single index. The best the optimizer can do is pick between the above index and INDEX(address_id, value).
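
If the value > 0 filter turns out to be the more selective one, the alternative index the optimizer could choose between would be (name illustrative):

```sql
ALTER TABLE eth_addresses_tx
    ADD INDEX idx_addr_val (address_id, value);
```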

The query with the LEFT JOIN is difficult to improve on because the WHERE and the GROUP BY reference more than one table.

Can't improve ORDER BY total_tx DESC, from_tx DESC LIMIT 100 because total_tx is an aggregate. (But see "Summary Table(s)" below.)

More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

eth_addresses_tx

Assuming that is the 'big' table, we should look seriously at its PRIMARY KEY. Assuming hash is needed, here is the best thing to do:

PRIMARY KEY(address_id, datetimeTx,   -- to make queries efficient
            hash)                     -- to assure uniqueness
UNIQUE(hash)      -- constraint for `hash`

The table is "clustered" on the PK. The queries would be fetching 'consecutive' rows with this PK. This is much better than the random nature caused by PK(hash).
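
As an ALTER, the change might look like this (it rewrites the whole table, so expect it to take a while on millions of rows; the unique-key name is illustrative):

```sql
ALTER TABLE eth_addresses_tx
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (address_id, datetimeTx, hash),
    ADD UNIQUE KEY uniq_hash (hash);
```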

Summary Table(s)

But the real winner would probably be a Summary table. It would get new rows added hourly; reports on daily, weekly, etc. periods could then be computed from this much smaller table.
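
A minimal sketch of such a summary table and its hourly feed (all names are illustrative; in practice the window bounds would come from the cron job):

```sql
CREATE TABLE eth_addresses_tx_hourly (
    address_id INT NOT NULL,
    hr DATETIME NOT NULL,              -- start of the hour
    tx_count INT NOT NULL,
    volume DECIMAL(20,6) NOT NULL,
    PRIMARY KEY (address_id, hr)
) ENGINE=InnoDB;

-- run once per hour, e.g. for the hour that just ended
SET @hr = DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00');
INSERT INTO eth_addresses_tx_hourly (address_id, hr, tx_count, volume)
SELECT address_id, @hr, COUNT(*), IFNULL(SUM(value), 0)
FROM eth_addresses_tx
WHERE datetimeTx >= @hr
  AND datetimeTx <  @hr + INTERVAL 1 HOUR
GROUP BY address_id;
```

The 1h/24h/7d stats then become cheap sums over at most a few hundred summary rows per address instead of scans over millions of transactions.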

Summary tables: http://mysql.rjweb.org/doc.php/summarytables

Tuning (unlikely)

If you provide the GLOBAL STATUS and VARIABLES, I will look at them. But you "can't tune your way out of a performance problem". Especially when there are indexes to be added and/or summary tables to be built.
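
For completeness, the requested diagnostics can be captured with:

```sql
SHOW GLOBAL STATUS;
SHOW GLOBAL VARIABLES;
```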