Mysql – Can Group By and Count on MySQL be optimized

group byMySQLpivot

I have a 100 million row table on MySQL. I need to count rows for certain ranges and I have the proper indices for filtering rows. Let's say a SELECT statement returns 20000 results, but all I need is the count. Is there other technique in addition to indexing that I can use? Is there another option such as Cassandra that would handle grouping and counting in a faster way?

Here's the table structure:

mysql> desc activity;
+------------------------+---------+------+-----+---------+-------+
| Field                  | Type    | Null | Key | Default | Extra |
+------------------------+---------+------+-----+---------+-------+
| Source                 | text    | YES  | MUL | NULL    |       |
| Customer               | text    | YES  | MUL | NULL    |       |
| Month                  | int(11) | YES  |     | NULL    |       |
| Day                    | int(11) | YES  |     | NULL    |       |
| Year                   | int(11) | YES  | MUL | NULL    |       |
| Time                   | text    | YES  |     | NULL    |       |
| User                   | text    | YES  |     | NULL    |       |
| TimeStamp              | date    | YES  |     | NULL    |       |
| EmailEventType         | text    | NO   |     | NULL    |       |
+------------------------+---------+------+-----+---------+-------+

Indices:

mysql> show index from activity;
+----------+------------+-----------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table    | Non_unique | Key_name              | Seq_in_index | Column_name       | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-----------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| activity |          1 | idx_activity_customer |            1 | Customer          | A         |      180035 |       64 | NULL   | YES  | BTREE      |         |               |
| activity |          1 | customer_date         |            1 | Customer          | A         |      202831 |       64 | NULL   | YES  | BTREE      |         |               |
| activity |          1 | customer_date         |            2 | Year              | A         |      303263 |     NULL | NULL   | YES  | BTREE      |         |               |
| activity |          1 | customer_date         |            3 | Month             | A         |      307744 |     NULL | NULL   | YES  | BTREE      |         |               |
| activity |          1 | customer_date         |            4 | Day               | A         |     1388270 |     NULL | NULL   | YES  | BTREE      |         |               |
| activity |          1 | dates                 |            1 | Year              | A         |        1286 |     NULL | NULL   | YES  | BTREE      |         |               |
| activity |          1 | dates                 |            2 | Month             | A         |       20604 |     NULL | NULL   | YES  | BTREE      |         |               |
| activity |          1 | dates                 |            3 | Day               | A         |      146993 |     NULL | NULL   | YES  | BTREE      |         |               |
| activity |          1 | timestamp             |            1 | Year              | A         |        1554 |     NULL | NULL   | YES  | BTREE      |         |               |
| activity |          1 | timestamp             |            2 | TimeStamp         | A         |      119908 |     NULL | NULL   | YES  | BTREE      |         |               |
| activity |          1 | timestamp_customer    |            1 | Customer          | A         |      188169 |       64 | NULL   | YES  | BTREE      |         |               |
| activity |          1 | timestamp_customer    |            2 | Year              | A         |      261389 |     NULL | NULL   | YES  | BTREE      |         |               |
| activity |          1 | timestamp_customer    |            3 | TimeStamp         | A         |      743716 |     NULL | NULL   | YES  | BTREE      |         |               |
+----------+------------+-----------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
14 rows in set (0.01 sec)

Query:

mysql> SELECT Customer, 
   User, 
   Source, 
   Count(CASE 
           WHEN ( Year = 2018 
                  AND Week(TimeStamp) = 1 ) THEN Source 
           ELSE NULL 
         END) AS '2018-W1', 
   Count(CASE 
           WHEN ( Year = 2018 
                  AND Week(TimeStamp) = 2 ) THEN Source 
           ELSE NULL 
         END) AS '2018-W2', 
   Count(CASE 
           WHEN ( Year = 2018 
                  AND Week(TimeStamp) = 3 ) THEN Source 
           ELSE NULL 
         END) AS '2018-W3', 
   Count(CASE 
           WHEN ( Year = 2018 
                  AND Week(TimeStamp) = 4 ) THEN Source 
           ELSE NULL 
         END) AS '2018-W4' 
FROM   activity 

WHERE  customer LIKE 'jones%' 
       AND ( ( Year = 2018 
               AND Week(TimeStamp) = 1 ) 
              OR ( Year = 2018 
                   AND Week(TimeStamp) = 2 ) 
              OR ( Year = 2018 
                   AND Week(TimeStamp) = 3 ) 
              OR ( Year = 2018 
                   AND Week(TimeStamp) = 4 ) ) 
       AND Source IN ( 'online', 'other' ) 
GROUP  BY Source, 
          User 
ORDER  BY Customer, 
          Source; 
+-------------------------+-----------------------------------+----------+---------+---------+---------+---------+
| Customer                | User                              | Source   | 2018-W1 | 2018-W2 | 2018-W3 | 2018-W4 |
+-------------------------+-----------------------------------+----------+---------+---------+---------+---------+
| Jones corporation       | 00000000@sample.com               | OTHER    |      87 |      51 |      75 |      20 |
| Jones corporation       | 000000000000@sample.com           | OTHER    |     125 |      98 |     115 |      62 |
| Jones corporation       | 000000000000000@sample.com        | OTHER    |      30 |       0 |       0 |       0 |

...
Truncated
...

| Jones cpa               | 111111111111@sample.com           | ONLINE   |       0 |       0 |       0 |      18 |
| Jones cpa               | 1111111111@sample.com             | ONLINE   |       0 |       0 |       0 |     225 |
+-------------------------+-----------------------------------+----------+---------+---------+---------+---------+
241 rows in set (9 min 10.93 sec)

So basically the optimizer is selecting and index, but then it takes a long time to calculate only 241 rows. Each User has a small row count of each type. How should the GROUP BY fields be related to the index?

Best Answer

Based on these large row numbers I'm making the assumption that an estimate is sufficient.

You could look at the filtered or rows column on an EXPLAIN SELECT count(*) FROM tbl WHERE ref=value to provide an estimate of the total based on the query planner sampling. ref: EXPLAIN OUTPUT

If using mariadb and histograms you might be able to derive an estimate from the raw tables(255 granularity).