MySQL – Improving SQL query execution time for a 3-billion-row table

Tags: clustering, index, MySQL, query-performance

I have a MySQL table with 2+ billion rows, and I am also writing 1 million new rows to it each day. The table contains some large columns, such as varchar(255) (to hold long URLs).

To perform analytics on this table I created 5 specific indexes, which really speed up execution time (from 25+ minutes down to 2 minutes for some queries).

The problem is that 2 minutes is still a lot of time for a single query, and I would like to run multiple queries for analytics and reporting.

Also, this table is growing rapidly every day, and I am pretty sure the indexes are as optimised as they can be.

Is this the point where clustering would solve my issues? I.e., is it unusual that a table this size still runs on a single SQL node?

Or is it still possible to run queries on such a large table in milliseconds?

An example query of mine is:

SELECT name, url, SUM(visits), AVG(price), AVG(loc) FROM mytable
    WHERE sname IN ('white') AND `usage` IN ('three') AND date BETWEEN '2001-01-01' AND '2003-03-10'
    GROUP BY name, url ORDER BY SUM(visits);

I am new to clustering and HPC in general, so any advice on what I should do here is appreciated.

Best Answer

There is a lot of information not included in your question, which makes providing complete answers rather difficult. However, using just what you've shared:

Is this the point where clustering would solve my issues?

Not likely. Clustering offers a lot of advantages, but it does not seem to be the correct solution for what you're trying to do. With a million rows being added daily, your primary system needs to be optimized for writes. When talking about reporting, you could probably do with a system that is optimized for reads.

Is it unusual that a table this size still runs on a single SQL node?

The level of unusualness here depends greatly on the needs (and expectations) of the business. I would hope that there is a hot spare or replicated instance somewhere that is ready to jump in should the main server suffer a failure. At roughly 11.5 row insertions per second, there's not much room for downtime.

Is it still possible to run queries on such a large table in milliseconds?

Given enough hardware, I don't see why not. However, very rarely are people given access to the full computing power of an entire data centre.

Generally when I have had to work with situations like this, I've tried to keep things simple and ask specific questions regarding the types of reports that people are trying to collect from the system. If there are common patterns, then flattened historical tables have saved a great deal of time for everyone. Why query the same data from 2003 a thousand times a week when you can summarize it in a manner that makes long-view reports faster and just as accurate, after all?
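
As a rough sketch of what I mean (the table and column names below are borrowed from your example query; the summary grain, data types, and everything else are assumptions on my part), a flattened daily summary might look something like this:

    -- Hypothetical summary table: one row per (day, sname, usage, name, url)
    -- combination instead of one row per raw row in mytable.
    CREATE TABLE mytable_daily_summary (
        summary_date DATE            NOT NULL,
        sname        VARCHAR(32)     NOT NULL,
        `usage`      VARCHAR(32)     NOT NULL,
        name         VARCHAR(255)    NOT NULL,
        url          VARCHAR(255)    NOT NULL,
        total_visits BIGINT UNSIGNED NOT NULL,
        sum_price    DECIMAL(18,4)   NOT NULL,
        sum_loc      DECIMAL(18,4)   NOT NULL,
        row_count    BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (summary_date, sname, `usage`, name, url)
    ) ENGINE=InnoDB;

    -- The report then reads the much smaller summary. Note that AVG() over
    -- pre-aggregated rows has to be weighted, hence SUM(...) / SUM(row_count).
    SELECT name, url, SUM(total_visits),
           SUM(sum_price) / SUM(row_count) AS avg_price,
           SUM(sum_loc)   / SUM(row_count) AS avg_loc
    FROM mytable_daily_summary
    WHERE sname IN ('white') AND `usage` IN ('three')
      AND summary_date BETWEEN '2001-01-01' AND '2003-03-10'
    GROUP BY name, url
    ORDER BY SUM(total_visits);

With this grain, a two-year report scans roughly one summary row per day per (name, url) group instead of every matching raw row.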

However, one of the main ways I've solved this sort of issue — usually for universities that are collecting millions of records daily from seismometers and weather stations spread throughout the country and across the Pacific — is to "cheat" and have a replicated instance that updates its source tables only once or twice a day. This allows for a system that can be optimized for reads with a myriad of indexes, leaving the main server(s) optimized for writes with fewer indexes (if any at all).
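
To make the "more indexes on the read side" idea concrete (the index names and column choices below are purely illustrative, based on the example query, rather than a recommendation for your actual schema), the extra indexes are created on the replica only:

    -- Run on the reporting replica only; the write-optimized primary keeps its
    -- minimal index set. A replica can usually carry extra secondary indexes as
    -- long as the table's columns themselves stay identical to the primary's.
    ALTER TABLE mytable
        ADD INDEX idx_report_filter (sname, `usage`, date),
        ADD INDEX idx_report_group  (name, url);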

For common reports, patterns in the data are found and put into summary tables on an hourly/nightly schedule, allowing for common reports to be quick. Ad hoc or more specific queries can also be run against the replicated instance without worrying about affecting performance on the main system(s). So long as the reports being generated do not need to be "real-time", this method generally works and can be done on a reasonable budget, which management types tend to appreciate.
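
If you stay within MySQL for the summary-table route, the refresh can be as simple as a scheduled event. The sketch below assumes the hypothetical mytable_daily_summary table from earlier and that the event scheduler is enabled; treat it as a starting point, not a finished job:

    -- Nightly job on the reporting instance: rebuild yesterday's slice of the
    -- summary. REPLACE keeps the job idempotent if it is re-run for the same day.
    CREATE EVENT IF NOT EXISTS ev_refresh_daily_summary
    ON SCHEDULE EVERY 1 DAY
    DO
        REPLACE INTO mytable_daily_summary
            (summary_date, sname, `usage`, name, url,
             total_visits, sum_price, sum_loc, row_count)
        SELECT date, sname, `usage`, name, url,
               SUM(visits), SUM(price), SUM(loc), COUNT(*)
        FROM mytable
        WHERE date = CURRENT_DATE - INTERVAL 1 DAY
        GROUP BY date, sname, `usage`, name, url;

A cron job running the same REPLACE ... SELECT works just as well if you would rather not rely on the event scheduler.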

Mind you, do not take this answer as anything beyond something to think about. As mentioned at the beginning, there is a lot of information that was not included in the original question, such as the target audience for the reports, what other tasks the database(s) are being used for, how often historical data is queried vs. current data, and the like. This is just an option based on past experiences when I have been asked to solve a similar-sounding problem.