I have a 100 million row table on MySQL. I need to count rows for certain ranges and I have the proper indices for filtering rows. Let's say a SELECT statement returns 20000 results, but all I need is the count. Is there other technique in addition to indexing that I can use? Is there another option such as Cassandra that would handle grouping and counting in a faster way?
Here's the table structure:
mysql> desc activity;
+------------------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+---------+------+-----+---------+-------+
| Source | text | YES | MUL | NULL | |
| Customer | text | YES | MUL | NULL | |
| Month | int(11) | YES | | NULL | |
| Day | int(11) | YES | | NULL | |
| Year | int(11) | YES | MUL | NULL | |
| Time | text | YES | | NULL | |
| User | text | YES | | NULL | |
| TimeStamp | date | YES | | NULL | |
| EmailEventType | text | NO | | NULL | |
+------------------------+---------+------+-----+---------+-------+
Indices:
mysql> show index from activity;
+----------+------------+-----------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+-----------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| activity | 1 | idx_activity_customer | 1 | Customer | A | 180035 | 64 | NULL | YES | BTREE | | |
| activity | 1 | customer_date | 1 | Customer | A | 202831 | 64 | NULL | YES | BTREE | | |
| activity | 1 | customer_date | 2 | Year | A | 303263 | NULL | NULL | YES | BTREE | | |
| activity | 1 | customer_date | 3 | Month | A | 307744 | NULL | NULL | YES | BTREE | | |
| activity | 1 | customer_date | 4 | Day | A | 1388270 | NULL | NULL | YES | BTREE | | |
| activity | 1 | dates | 1 | Year | A | 1286 | NULL | NULL | YES | BTREE | | |
| activity | 1 | dates | 2 | Month | A | 20604 | NULL | NULL | YES | BTREE | | |
| activity | 1 | dates | 3 | Day | A | 146993 | NULL | NULL | YES | BTREE | | |
| activity | 1 | timestamp | 1 | Year | A | 1554 | NULL | NULL | YES | BTREE | | |
| activity | 1 | timestamp | 2 | TimeStamp | A | 119908 | NULL | NULL | YES | BTREE | | |
| activity | 1 | timestamp_customer | 1 | Customer | A | 188169 | 64 | NULL | YES | BTREE | | |
| activity | 1 | timestamp_customer | 2 | Year | A | 261389 | NULL | NULL | YES | BTREE | | |
| activity | 1 | timestamp_customer | 3 | TimeStamp | A | 743716 | NULL | NULL | YES | BTREE | | |
+----------+------------+-----------------------+--------------+-------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
14 rows in set (0.01 sec)
Query:
mysql> SELECT Customer,
User,
Source,
Count(CASE
WHEN ( Year = 2018
AND Week(TimeStamp) = 1 ) THEN Source
ELSE NULL
END) AS '2018-W1',
Count(CASE
WHEN ( Year = 2018
AND Week(TimeStamp) = 2 ) THEN Source
ELSE NULL
END) AS '2018-W2',
Count(CASE
WHEN ( Year = 2018
AND Week(TimeStamp) = 3 ) THEN Source
ELSE NULL
END) AS '2018-W3',
Count(CASE
WHEN ( Year = 2018
AND Week(TimeStamp) = 4 ) THEN Source
ELSE NULL
END) AS '2018-W4'
FROM activity
WHERE customer LIKE 'jones%'
AND ( ( Year = 2018
AND Week(TimeStamp) = 1 )
OR ( Year = 2018
AND Week(TimeStamp) = 2 )
OR ( Year = 2018
AND Week(TimeStamp) = 3 )
OR ( Year = 2018
AND Week(TimeStamp) = 4 ) )
AND Source IN ( 'online', 'other' )
GROUP BY Source,
User
ORDER BY Customer,
Source;
+-------------------------+-----------------------------------+----------+---------+---------+---------+---------+
| Customer | User | Source | 2018-W1 | 2018-W2 | 2018-W3 | 2018-W4 |
+-------------------------+-----------------------------------+----------+---------+---------+---------+---------+
| Jones corporation | 00000000@sample.com | OTHER | 87 | 51 | 75 | 20 |
| Jones corporation | 000000000000@sample.com | OTHER | 125 | 98 | 115 | 62 |
| Jones corporation | 000000000000000@sample.com | OTHER | 30 | 0 | 0 | 0 |
...
Truncated
...
| Jones cpa | 111111111111@sample.com | ONLINE | 0 | 0 | 0 | 18 |
| Jones cpa | 1111111111@sample.com | ONLINE | 0 | 0 | 0 | 225 |
+-------------------------+-----------------------------------+----------+---------+---------+---------+---------+
241 rows in set (9 min 10.93 sec)
So basically the optimizer is selecting and index, but then it takes a long time to calculate only 241 rows. Each User has a small row count of each type. How should the GROUP BY fields be related to the index?
Best Answer
Based on these large row numbers I'm making the assumption that an estimate is sufficient.
You could look at the filtered or rows column on an
EXPLAIN SELECT count(*) FROM tbl WHERE ref=value
to provide an estimate of the total based on the query planner sampling. ref: EXPLAIN OUTPUTIf using mariadb and histograms you might be able to derive an estimate from the raw tables(255 granularity).