Mysql – Indexing for group by + SUM query on thesql

MySQL

I have a table with the following schema:

    CREATE TABLE `wordtrend` (
      `oid` bigint(20) NOT NULL AUTO_INCREMENT,
      `monitorId` bigint(20) DEFAULT NULL,
      `nGram` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
      `nGramWord` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
      `negatives` bigint(20) DEFAULT NULL,
      `neutrals` bigint(20) DEFAULT NULL,
      `positives` bigint(20) DEFAULT NULL,
      `total` bigint(20) DEFAULT NULL,
      `trendCut` datetime DEFAULT NULL,
      PRIMARY KEY (`oid`)
    ) ENGINE=MyISAM AUTO_INCREMENT=358539 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Is it possible to create an index in order to run the following query efficiently?

SELECT nGram, nGramWord, SUM(total) AS sTotal, SUM(positives), SUM(negatives), SUM(neutrals) FROM WordTrend WHERE monitorId = 21751021 AND trendCut >= '2011-01-01 00:00:00' GROUP BY nGram, nGramWord ORDER BY sTotal DESC

We already tried the following:

KEY `RollupIndex` (`monitorId`,`trendCut`)
KEY `RollupIndex2` (`monitorId`,`nGram`,`trendCut`)

but we are getting "Using where; Using temporary; Using filesort" on the extra column.

Best Answer

I noticed something interesting about your query: you are getting these totals for this year. The trendCut date is not a static value. It is a very bad candidate for an index column since the column value changes and it is neither first nor last in the column order of the index.

The index you need is this

monitorID,trendCut,nGram,nGramWord,total

However, this would still be deficient.

Suggestion: If you do queries that require trendCut by year, you should have a trendCutYear column. Try the following:

  • Create New wordtrend Table
  • Create TrendCutYear Column in the New wordtrend Table
  • Create New Index in the New wordtrend Table
  • Load the New wordtrend Table from the Old wordtrend Table
  • Keep the Old wordtrend table under another name

Here are those steps:

CREATE TABLE wordtrendnew LIKE wordtrend;
ALTER TABLE wordtrendnew ADD COLUMN trendCutYear SMALLINT NOT NULL;
ALTER TABLE wordtrendnew ADD INDEX (monitorID,trendCutYear,nGram,nGramWord,total);
ALTER TABLE wordtrendnew DISABLE KEYS;
INSERT INTO wordtrendnew
SELECT oid,monitorId,nGram,nGramWord,negatives,
neutrals,positives,total,trendCut,YEAR(trendCut)
FROM wordtrend;
ALTER TABLE wordtrendnew ENABLE KEYS;
ALTER TABLE wordtrend RENAME wordtrendold;
ALTER TABLE wordtrendnew RENAME wordtrend;

Now, you have a static trendCutYear. You can now query using a static value in the middle of the index. The query can now be rewritten using trendCutYear instead of trendCut:

SELECT nGram, nGramWord, SUM(total) AS sTotal,
SUM(positives), SUM(negatives), SUM(neutrals) FROM WordTrend
WHERE monitorId = 21751021 AND trendCutYear = 2011
GROUP BY nGram, nGramWord ORDER BY sTotal DESC;

This should perform a whole lot better for you.

Give it a Try !!!