Mysql – COUNT and GROUP BY performance

MySQL

I have a table where I store visits (id, datetime, country, browser). I want to count how many visits are from which country.

I use this query:

SELECT country, COUNT(*) as hits 
FROM hits_table 
WHERE timestamp >= CURDATE()
GROUP BY country

But the dataset is huge. So if I add LIMIT 10 to only get the top 10 countries will it improve the performance?

SELECT country, COUNT(*) as hits 
FROM hits_table 
WHERE timestamp >= CURDATE()
GROUP BY country LIMIT 10

Also I don't need exact results I only need some approximate results. So can't I make the query so it skips every 2nd result. Will it improve the performance? Are there any other methods?

Best Answer

Short answer: No.

Detailed answer: Unless you add/already have an index there that covers the country column the database engine has no way of telling the total number of instances to count. Even if you limit the output the entire table has to be scanned to retrieve every instance first. Of course, since this is a logging table, with potentially a large dataset, I'd avoid adding indexes to the table so:

  • not to increase INSERT time
  • to minimize space/memory usage the indexes would otherwise require

Solution: create a second table that will keep a running total for all countries, that gets updated by a trigger on the hits_table or directly by the application that inserts the data into hits_table. Something in the lines of:

CREATE TABLE `country_hits_totals` (
  `country` VARCHAR(255),
  `date`    DATE,
  `visits_total` BIGINT
);

This would make the table very small and allow you to create a PRIMARY KEY on the country and date columns, dedicating minimum resources. Once that key is set up, you could employ a trivial upsert statement (see this SQL Fiddle session for a full test case with a trigger-based usage):

INSERT INTO `country_hits_totals` (
  `country`,
  `date`,
  `visits_total`
)
SELECT
  NEW.country,
  NEW.timestamp,
  1
ON DUPLICATE KEY UPDATE `visits_total` = `visits_total` + 1;

Query blocking note: In a heavy traffic scenario you ideally want to query country_hits_totals with non-blocking statements only. Otherwise the upsert on the trigger gets delayed, and by extensions entire transaction that kicked it off. In a perfect world country_hits_totals would get copied over to a data warehouse at the time of least daily traffic, and you'd run your BI queries on that warehouse. That said, this is large traffic/enterprise territory and possible solutions depend on how quickly you want to get the data. Say, if you need an update only once a day you're better off pre-aggregating data onto country_hits_totals on a schedule instead of relying on a trigger updating the rows live.