MySQL – Improving SQL query execution time for a 3-billion-row table

Tags: clustering, index, MySQL, query-performance

I have a MySQL table with 2+ billion rows, and I am also writing 1 million new rows to it each day. The table contains some large columns, such as varchar(255) (to hold long URLs).

To perform analytics on this table I created 5 specific indexes, which really speed up execution time (from 25+ minutes down to 2 minutes for some queries).

The problem is that 2 minutes is still a lot of time for a single query, and I would like to run multiple queries for analytics and reporting.

Also, this table is growing rapidly every day, and I am pretty sure the indexes are as optimised as they can be.

Is this the point where clustering would solve my issues? I.e., is it unusual that a table this size still runs on a single SQL node?

Or is it still possible to run queries on such a large table in milliseconds?

An example query of mine is:

SELECT name, url, SUM(visits), AVG(price), AVG(loc) FROM mytable
    WHERE sname IN ('white') AND `usage` IN ('three') AND date BETWEEN '2001-01-01' AND '2003-03-10'
    GROUP BY name, url ORDER BY SUM(visits);

I am new to clustering and HPC in general, so any advice on what I should do here is appreciated.

Best Answer

There is a lot of information not included in your question, which makes providing complete answers rather difficult. However, using just what you've shared:

Is this the point where clustering would solve my issues?

Not likely. Clustering offers a lot of advantages, but it does not seem to be the correct solution for what you're trying to do. With a million rows being added daily, your primary system needs to be optimized for writes. When talking about reporting, you could probably do with a system that is optimized for reads.

Is it unusual that a table this size still runs on a single SQL node?

The level of unusualness here depends greatly on the needs (and expectations) of the business. I would hope that there is a hot spare or replicated instance somewhere that is ready to jump in should the main server suffer a failure. At roughly 11.5 row insertions per second, there's not much room for downtime.

Is it still possible to run queries on such a large table in milliseconds?

Given enough hardware, I don't see why not. However, very rarely are people given access to the full computing power of an entire data centre.

Generally when I have had to work with situations like this, I've tried to keep things simple and ask specific questions regarding the types of reports that people are trying to collect from the system. If there are common patterns, then flattened historical tables have saved a great deal of time for everyone. Why query the same data from 2003 a thousand times a week when you can summarize it in a manner that makes long-view reports faster and just as accurate, after all?
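
As a rough sketch of what I mean (the table and column names below are borrowed from your example query; the summary grain, data types, and everything else are assumptions on my part), a flattened daily summary might look something like this:

    -- Hypothetical summary table: one row per (day, sname, usage, name, url)
    -- combination instead of one row per raw row in mytable.
    CREATE TABLE mytable_daily_summary (
        summary_date DATE            NOT NULL,
        sname        VARCHAR(32)     NOT NULL,
        `usage`      VARCHAR(32)     NOT NULL,
        name         VARCHAR(255)    NOT NULL,
        url          VARCHAR(255)    NOT NULL,
        total_visits BIGINT UNSIGNED NOT NULL,
        sum_price    DECIMAL(18,4)   NOT NULL,
        sum_loc      DECIMAL(18,4)   NOT NULL,
        row_count    BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (summary_date, sname, `usage`, name, url)
    ) ENGINE=InnoDB;

    -- The report then reads the much smaller summary. Note that AVG() over
    -- pre-aggregated rows has to be weighted, hence SUM(...) / SUM(row_count).
    SELECT name, url, SUM(total_visits),
           SUM(sum_price) / SUM(row_count) AS avg_price,
           SUM(sum_loc)   / SUM(row_count) AS avg_loc
    FROM mytable_daily_summary
    WHERE sname IN ('white') AND `usage` IN ('three')
      AND summary_date BETWEEN '2001-01-01' AND '2003-03-10'
    GROUP BY name, url
    ORDER BY SUM(total_visits);

With this grain, a two-year report scans roughly one summary row per day per (name, url) group instead of every matching raw row.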

However, one of the main ways I've solved this sort of issue — usually for universities that are collecting millions of records daily from seismometers and weather stations spread throughout the country and across the Pacific — is to "cheat" and have a replicated instance that updates its source tables only once or twice a day. This allows for a system that can be optimized for reads with a myriad of indexes, leaving the main server(s) optimized for writes with fewer indexes (if any at all).
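
To make the "more indexes on the read side" idea concrete (the index names and column choices below are purely illustrative, based on the example query, rather than a recommendation for your actual schema), the extra indexes are created on the replica only:

    -- Run on the reporting replica only; the write-optimized primary keeps its
    -- minimal index set. A replica can usually carry extra secondary indexes as
    -- long as the table's columns themselves stay identical to the primary's.
    ALTER TABLE mytable
        ADD INDEX idx_report_filter (sname, `usage`, date),
        ADD INDEX idx_report_group  (name, url);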

For common reports, patterns in the data are found and put into summary tables on an hourly/nightly schedule, allowing for common reports to be quick. Ad hoc or more specific queries can also be run against the replicated instance without worrying about affecting performance on the main system(s). So long as the reports being generated do not need to be "real-time", this method generally works and can be done on a reasonable budget, which management types tend to appreciate.
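
If you stay within MySQL for the summary-table route, the refresh can be as simple as a scheduled event. The sketch below assumes the hypothetical mytable_daily_summary table from earlier and that the event scheduler is enabled; treat it as a starting point, not a finished job:

    -- Nightly job on the reporting instance: rebuild yesterday's slice of the
    -- summary. REPLACE keeps the job idempotent if it is re-run for the same day.
    CREATE EVENT IF NOT EXISTS ev_refresh_daily_summary
    ON SCHEDULE EVERY 1 DAY
    DO
        REPLACE INTO mytable_daily_summary
            (summary_date, sname, `usage`, name, url,
             total_visits, sum_price, sum_loc, row_count)
        SELECT date, sname, `usage`, name, url,
               SUM(visits), SUM(price), SUM(loc), COUNT(*)
        FROM mytable
        WHERE date = CURRENT_DATE - INTERVAL 1 DAY
        GROUP BY date, sname, `usage`, name, url;

A cron job running the same REPLACE ... SELECT works just as well if you would rather not rely on the event scheduler.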

Mind you, do not take this answer as anything beyond something to think about. As mentioned at the beginning, there is a lot of information that was not included in the original question, such as the target audience for the reports, what other tasks the database(s) are being used for, how often historical data is queried vs. current data, and the like. This is just an option based on past experiences when I have been asked to solve a similar-sounding problem.