I am building a database to store Ethereum transactions connected to specific addresses. Once the table eth_addresses_tx
passes 3 million rows, things slow down a lot and CPU usage hits 100%. Storage is fine: 1 million rows is about 300 MB. Before spending more money on a server, I would like to see if there is any way to improve the schema and queries. The project will have 15-20 million rows to start and may have a lot more in the near future. There will be 200-500 unique addresses, and each address may have anywhere from a small number of rows (1,000) up to a large number (3 million). I could use some advice. Thank you.
Update: one more index has been added, on datetimeTx in the transactions table eth_addresses_tx, which fixed performance on almost all the queries besides the last two in this post.
Table purposes:
eth_addresses
– It stores addresses and metadata about each address being tracked.
eth_addresses_stats
– It stores statistics computed from the transactions table (eth_addresses_tx), such as how many transactions occurred in certain periods. It is populated by a script run on a cron job.
eth_addresses_tx
– TX is short for transaction in the queries and schema. This table stores data from blockchain transactions attached to certain addresses. The value
column contains the ETH amount for that transaction. There is one index, on from_tx.
MYSQL SCHEMA
CREATE DATABASE addresses CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
USE addresses;
CREATE TABLE `eth_addresses` (
`id` INT(11) NOT NULL AUTO_INCREMENT ,
`name` VARCHAR(32) NOT NULL ,
`website` VARCHAR(255) NOT NULL ,
`address` VARCHAR(1024) NOT NULL COMMENT 'ETH addresses' ,
PRIMARY KEY (`id`), UNIQUE `unique` (`website`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `eth_addresses_stats` (
`id` int(11) NOT NULL,
`balance` decimal(16,6) NULL DEFAULT NULL,
`tx_1h` int(11) NULL DEFAULT NULL COMMENT 'transactions',
`tx_24h` int(11) NULL DEFAULT NULL COMMENT 'transactions',
`tx_7d` int(11) NULL DEFAULT NULL COMMENT 'transactions',
`tx_all` int(11) NULL DEFAULT NULL COMMENT 'transactions',
`volume_1h` decimal(14,6) NULL DEFAULT NULL,
`volume_24h` decimal(14,6) NULL DEFAULT NULL,
`volume_7d` decimal(14,6) NULL DEFAULT NULL,
`volume_all` decimal(14,6) NULL DEFAULT NULL,
`users_1h` int(11) NULL DEFAULT NULL,
`users_24h` int(11) NULL DEFAULT NULL,
`users_7d` int(11) NULL DEFAULT NULL,
`users_all` int(11) NULL DEFAULT NULL,
`last_scraped` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
CONSTRAINT FOREIGN KEY (id) REFERENCES eth_addresses (id),
UNIQUE (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and
CREATE TABLE `eth_addresses_tx` (
`hash` varchar(66) NOT NULL,
`address_id` INT(11) NOT NULL,
`blockNumber` int(11) NOT NULL,
`datetimeTx` datetime NOT NULL,
`from_tx` varchar(42) NOT NULL,
`to_tx` varchar(42) NOT NULL,
`value` decimal(14,6) NOT NULL,
CONSTRAINT FOREIGN KEY (address_id) REFERENCES eth_addresses (id),
PRIMARY KEY (`hash`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Transactions';
ALTER TABLE `eth_addresses_tx` ADD INDEX `from_tx` (`from_tx`);
Slow Queries
The queries below are run periodically and the results are stored in a caching table. These queries gather total stats from the transactions table eth_addresses_tx
to be sent to eth_addresses_stats. I was unsure how to combine them into one query, so I wrote four queries.
UPDATE: The four queries below are very fast with an index on datetimeTx. Addresses with a million-plus rows now take 0.14 seconds.
SELECT IFNULL(SUM(value), 0) AS volume_1h,
count(*) AS tx_1h,
COUNT(DISTINCT from_tx) AS users_1h
FROM `eth_addresses_tx`
WHERE datetimeTx >= DATE_SUB(NOW(),INTERVAL 1 HOUR)
AND address_id = $address_id;
-- This query is slow on popular addresses with 1M+ rows
-- 1 row in set (32.30 sec) without explain
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+---------+----------+------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+---------+----------+------------------------+
| 1 | SIMPLE | eth_addresses_tx | NULL | ref | address_id | address_id | 4 | const | 1440328 | 33.33 | Using where |
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+---------+----------+------------------------+
SELECT IFNULL(SUM(value), 0) AS volume_24h,
count(*) AS tx_24h, COUNT(DISTINCT from_tx) AS users_24h
FROM `eth_addresses_tx`
WHERE datetimeTx >= DATE_SUB(NOW(),INTERVAL 24 HOUR)
AND address_id = $address_id;
SELECT IFNULL(SUM(value), 0) AS volume_7d,
count(*) AS tx_7d, COUNT(DISTINCT from_tx) AS users_7d
FROM `eth_addresses_tx`
WHERE datetimeTx >= DATE_SUB(NOW(),INTERVAL 7 DAY)
AND address_id = $address_id;
SELECT IFNULL(SUM(value), 0) AS volume_all,
count(*) AS tx_all,
COUNT(DISTINCT from_tx) AS users_all
FROM `eth_addresses_tx`
WHERE address_id = $address_id;
These queries get recent transactions from eth_addresses_tx
with a few different WHERE clauses. Ordering by datetime is very slow; without the ordering, the queries are very fast.
UPDATE: The following three queries are now very fast with an index on datetimeTx. 10 rows in set (0.00 sec)
SELECT blockNumber, datetimeTx, address_id, from_tx, to_tx,
value
FROM eth_addresses_tx
WHERE from_tx = $from_tx
ORDER BY datetimeTx DESC
LIMIT $limit;
SELECT blockNumber, datetimeTx, address_id, from_tx, to_tx,
value
FROM eth_addresses_tx
WHERE address_id = $address_id
ORDER BY datetimeTx DESC
LIMIT $limit;
SELECT blockNumber, datetimeTx, address_id, from_tx, to_tx,
value
FROM eth_addresses_tx
ORDER BY datetimeTx DESC
LIMIT $limit;
and
-- This query without explain = 10 rows in set (1.20 sec)
-- This one is faster than the ones below
EXPLAIN SELECT blockNumber, datetimeTx, address_id, from_tx, to_tx,
value
FROM eth_addresses_tx
WHERE from_tx = $from_tx
ORDER BY datetimeTx DESC
LIMIT 10;
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+-------+----------+---------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+-------+----------+---------------------------------------+
| 1 | SIMPLE | eth_addresses_tx | NULL | ref | from_tx | from_tx | 128 | const | 51680 | 100.00 | Using index condition; Using filesort |
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+-------+----------+---------------------------------------+
-- This query without explain = 10 rows in set (31.51 sec)
-- Address ids with a small number of transactions query fast; addresses with large numbers of rows query very slow
EXPLAIN SELECT blockNumber, datetimeTx, address_id, from_tx, to_tx,
value
FROM eth_addresses_tx
WHERE address_id = $address_id
ORDER BY datetimeTx DESC
LIMIT 10;
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+---------+----------+---------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+---------+----------+---------------------------------------+
| 1 | SIMPLE | eth_addresses_tx | NULL | ref | address_id | address_id | 4 | const | 1440242 | 100.00 | Using index condition; Using filesort |
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+---------+----------+---------------------------------------+
-- This query without explain = 10 rows in set (9.93 sec)
EXPLAIN SELECT blockNumber, datetimeTx, address_id, from_tx, to_tx, value FROM eth_addresses_tx ORDER BY datetimeTx DESC LIMIT 10;
+----+-------------+-------------+------------+------+---------------+------+---------+------+---------+----------+---------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+------+---------------+------+---------+------+---------+----------+---------------------+
| 1 | SIMPLE | eth_addresses_tx | NULL | ALL | NULL | NULL | NULL | NULL | 2880430 | 100.00 | Using filesort |
+----+-------------+-------------+------------+------+---------------+------+---------+------+---------+----------+---------------------+
These queries plot transaction data to a chart to track volume.
UPDATE: The following three queries are now very fast with an index on datetimeTx. 73 rows in set (0.02 sec)
SELECT value FROM eth_addresses_tx WHERE address_id = $address_id AND value > 0 AND datetimeTx >= DATE_SUB(NOW(),INTERVAL 1 HOUR);
-- This query is slow on popular addresses with 1M+ rows
-- 88 rows in set (31.81 sec) without explain
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+---------+----------+-----------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+---------+----------+-----------------------+
| 1 | SIMPLE | eth_addresses_tx | NULL | ref | address_id | address_id| 4 | const | 1440342 | 11.11 | Using where |
+----+-------------+-------------+------------+------+---------------+---------+---------+-------+---------+----------+-----------------------+
SELECT value FROM eth_addresses_tx WHERE address_id = $address_id AND value > 0 AND datetimeTx >= DATE_SUB(NOW(),INTERVAL 24 HOUR);
SELECT value FROM eth_addresses_tx WHERE address_id = $address_id AND value > 0 AND datetimeTx >= DATE_SUB(NOW(),INTERVAL 7 DAY);
These queries figure out who is interacting with those addresses the most.
SELECT d.name, COUNT(d.id) AS total
FROM `eth_addresses_tx` AS tx
LEFT JOIN eth_addresses AS d ON d.id = tx.address_id
WHERE tx.from_tx = :address
GROUP BY d.name;
-- This query is not bad, pretty fast on smaller amounts
-- 1 row in set (1.06 sec) without explain
+----+-------------+-------+------------+------+---------------+---------+---------+-------+-------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+---------+---------+-------+-------+----------+----------------------------------------------------+
| 1 | SIMPLE | tx | NULL | ref | from_tx | from_tx | 128 | const | 51680 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | d | NULL | ALL | PRIMARY | NULL | NULL | NULL | 5 | 100.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+---------+---------+-------+-------+----------+----------------------------------------------------+
-----------------------------------------------------
SELECT from_tx, count(*) AS total_tx, SUM(value) as total_volume, count(DISTINCT address_id) as total_interacted
FROM `eth_addresses_tx`
GROUP BY from_tx
ORDER BY total_tx DESC, from_tx DESC
LIMIT 100;
-- This query is very bad
-- 100 rows in set (2 min 26.45 sec) without explain
-- sorting with any of the 3 items selected is slow
+----+-------------+-------------+------------+-------+---------------+---------+---------+------+---------+----------+--------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+-------+---------------+---------+---------+------+---------+----------+--------------------------------------+
| 1 | SIMPLE | eth_addresses_tx | NULL | index | from_tx | from_tx | 128 | NULL | 2880587 | 100.00 | Using temporary; Using filesort |
+----+-------------+-------------+------------+-------+---------------+---------+---------+------+---------+----------+--------------------------------------+
Best Answer
Hash indexing is bad when the table is big
Assuming hash is some form of "hash", eth_addresses_tx has a very inefficient PRIMARY KEY. When a table (or index) gets bigger than can be cached in RAM and the key is "random" (as hashes are), things slow down because the block needed is less and less likely to be cached in RAM. What to do?
– eth_addresses: 3M rows could be handled by a 3-byte MEDIUMINT UNSIGNED.
– address VARCHAR(42): if that is really 0x plus 40 hex digits, pack it into BINARY(20) (20 bytes) with the help of LEFT and UNHEX.
– Do the DECIMALs need to be that big? (14,6) takes 7 bytes; FLOAT takes 4 (but has fewer significant digits). What kind of value is "volume"? Cubic furlongs? Stock shares traded? Books in a trilogy? Fluid ounces? If you need exact values, then DECIMAL is required. But is 6 decimal places enough? (Not for Gwei or Wei.) FLOAT's 7 significant digits may be good enough for summary totals.
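The suggested hex-to-binary packing can be sketched as follows. Here from_bin is a hypothetical replacement column (this example uses SUBSTRING rather than LEFT to strip the 0x prefix), assuming values are always 0x plus 40 hex digits:

```sql
-- Hypothetical packed column alongside the existing VARCHAR one.
ALTER TABLE eth_addresses_tx ADD COLUMN from_bin BINARY(20);

-- Pack: drop the 2-character '0x' prefix, then convert 40 hex digits to 20 bytes.
UPDATE eth_addresses_tx SET from_bin = UNHEX(SUBSTRING(from_tx, 3));

-- Unpack for display.
SELECT CONCAT('0x', LOWER(HEX(from_bin))) AS from_address
FROM eth_addresses_tx LIMIT 1;
```

Once all rows are backfilled, the original from_tx column (and its index) could be dropped, cutting the indexed key from 42 characters to 20 bytes.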
Better indexes
eth_addresses_tx needs INDEX(address_id, datetimeTx), in this order. The from_tx queries very much need INDEX(from_tx, datetimeTx), in this order; ditto for (address_id, datetimeTx). This should eliminate the 'filesort'; more importantly, the LIMIT 10 queries will need to look at only 10 rows, not thousands (see your EXPLAINs).
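The two composite indexes can be added in one statement (the index names here are illustrative):

```sql
ALTER TABLE eth_addresses_tx
    ADD INDEX idx_addr_dt (address_id, datetimeTx),
    ADD INDEX idx_from_dt (from_tx, datetimeTx);
```

With idx_from_dt in place, the existing single-column from_tx index becomes redundant (it is a prefix of the composite index) and can be dropped.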
This gets trickier: you cannot effectively use two "ranges" in a single index, so for the chart queries that filter on both value > 0 and a datetimeTx range, the best the optimizer can do is pick between the above index and INDEX(address_id, value).
The query with the LEFT JOIN is difficult to improve on, because the WHERE and the GROUP BY reference more than one table.
ORDER BY total_tx DESC, from_tx DESC LIMIT 100 can't be improved, because total_tx is an aggregate. (But see "Summary Table(s)" below.)
More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
eth_addresses_tx
Assuming that is the 'big' table, we should look seriously at its PRIMARY KEY. Assuming hash is needed, the best thing to do is restructure the PK so that the table is "clustered" on it: the queries would then be fetching 'consecutive' rows, which is much better than the random placement caused by PRIMARY KEY(hash).
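One plausible reading of that advice, as a sketch (the exact column order is an assumption on my part, not the answer's verbatim DDL):

```sql
-- Assumption: cluster each address's rows together, in time order,
-- while keeping hash for uniqueness and point lookups.
ALTER TABLE eth_addresses_tx
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (address_id, datetimeTx, hash),
    ADD UNIQUE KEY uniq_hash (hash);
```

With this PK, a query filtered on address_id (optionally with a datetimeTx range) reads one contiguous slice of the clustered index instead of random pages scattered across the table.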
Summary Table(s)
But the real winner would probably be to have a summary table. It would get new rows added hourly; daily, weekly, etc. reports could then be computed from it.
Summary tables: http://mysql.rjweb.org/doc.php/summarytables
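A minimal sketch of such a summary table, assuming hourly buckets per address (all names and column choices here are illustrative):

```sql
CREATE TABLE eth_addresses_tx_hourly (
    address_id INT NOT NULL,
    hour_start DATETIME NOT NULL,          -- bucket start, truncated to the hour
    tx_count   INT NOT NULL,
    volume     DECIMAL(20,6) NOT NULL,
    PRIMARY KEY (address_id, hour_start)
);

-- Cron job: fold the previous hour into one row per address.
INSERT INTO eth_addresses_tx_hourly (address_id, hour_start, tx_count, volume)
SELECT address_id,
       DATE_FORMAT(datetimeTx, '%Y-%m-%d %H:00:00'),
       COUNT(*),
       SUM(value)
FROM eth_addresses_tx
WHERE datetimeTx >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
  AND datetimeTx <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
GROUP BY address_id, DATE_FORMAT(datetimeTx, '%Y-%m-%d %H:00:00');
```

One caveat: COUNT(DISTINCT from_tx) cannot be derived by summing hourly counts, so the distinct-user stats would need their own summary design or would still have to be computed from the base table.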
Tuning (unlikely)
If you provide the GLOBAL STATUS and VARIABLES, I will look at them. But you "can't tune your way out of a performance problem", especially when there are indexes to be added and/or summary tables to be built.