MySQL: expected time for a simple group by on a single table, 2M records, single user

MySQL performance

Related to MySQL tuning (my.cnf) for very large tables and ad hoc queries.

I have a really small table with 2M records on a 32GB machine running Windows 10 Pro, with nothing serious running except MySQL 5.7. I expected a simple GROUP BY on any unindexed column to take seconds, since MySQL would cache the entire table. It takes about 3 minutes. Am I missing some important tuning tip, or are these the expected times?

the query

mysql> select specialty_code, count(*)
    -> from physicians
    -> group by specialty_code;
+----------------+----------+
| specialty_code | count(*) |
+----------------+----------+
| FM             |   199604 |
| GYN            |    14263 |
| OBG            |    78878 |
+----------------+----------+
3 rows in set (2 min 56.27 sec)

the table

mysql> describe physicians;
+-----------------------+---------+------+-----+---------+-------+
| Field                 | Type    | Null | Key | Default | Extra |
+-----------------------+---------+------+-----+---------+-------+
| physician_id          | int(11) | YES  |     | NULL    |       |
| practitioner_id       | int(11) | YES  | MUL | NULL    |       |
| state                 | text    | YES  |     | NULL    |       |
| specialty_code        | text    | YES  |     | NULL    |       |
| specialty_description | text    | YES  |     | NULL    |       |
| CBSA                  | text    | YES  |     | NULL    |       |
+-----------------------+---------+------+-----+---------+-------+
6 rows in set (0.10 sec)    

mysql> show table status like '%physicians%';
+------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+
| Name       | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time | Check_time | Collation       | Checksum | Create_options | Comment |
+------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+
| physicians | InnoDB |      10 | Dynamic    | 1963005 |             66 |   130711552 |               0 |            0 |   2097152 |           NULL | 2016-01-04 08:41:30 | NULL        | NULL       | utf8_general_ci |     NULL |                |         |
+------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+

sizing:

mysql> select count(*) from physicians;
+----------+
| count(*) |
+----------+
|  1957997 |
+----------+
1 row in set (9.91 sec) 

explain:

mysql> explain select specialty_code, count(*)
    -> from physicians
    -> group by specialty_code;
+----+-------------+------------+------------+------+---------------+------+---------+------+---------+----------+---------------------------------+
| id | select_type | table      | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra                           |
+----+-------------+------------+------------+------+---------------+------+---------+------+---------+----------+---------------------------------+
|  1 | SIMPLE      | physicians | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 1963005 |   100.00 | Using temporary; Using filesort |
+----+-------------+------------+------------+------+---------------+------+---------+------+---------+----------+---------------------------------+    

params I tuned:

join_buffer_size=2G 
tmp_table_size=5G
innodb_buffer_pool_size=20G
sort_buffer_size=2G

The table uses the default InnoDB engine.
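For completeness, the values the server is actually using can be confirmed at runtime with standard MySQL syntax (sizes are reported in bytes):

-- check the effective values of the tuned parameters
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'sort_buffer_size';
SHOW VARIABLES LIKE 'join_buffer_size';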

NOTE: this is just a simplified symptom of my bigger problem: I have ~5-6 tables with a total volume of 10 GB. I expected MySQL to swoosh all of them into memory and answer any ad-hoc query within a minute or so… Should I index every column before filtering/grouping/joining on it?

Thanks, and sorry if this sounds like a rant. I'm just used to distributed databases crunching data in milliseconds for me 🙂

Best Answer

tmp_table_size=5G

No. Go back to the default. If multiple queries need tmp tables at the same time, you could run out of RAM.
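A sketch of putting it back, assuming the documented 5.7 default of 16M (if it was set in my.cnf, change it there as well so the value survives a restart):

-- back to the 5.7 default of 16M; applies to new connections only
SET GLOBAL tmp_table_size = 16777216;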

Change the TEXT fields to reasonable-length VARCHARs.
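For example (the lengths here are guesses; size them from the actual data, e.g. SELECT MAX(LENGTH(specialty_code)) FROM physicians):

-- shrink the TEXT columns to VARCHARs sized to the data (lengths are guesses)
ALTER TABLE physicians
  MODIFY state                 VARCHAR(2),
  MODIFY specialty_code        VARCHAR(10),
  MODIFY specialty_description VARCHAR(255),
  MODIFY CBSA                  VARCHAR(20);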

The first table scan will do I/O to bring the table into RAM (if possible); subsequent references to the table will see the data cached. Did you run the GROUP BY twice to see this effect?
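For example (the second run should reflect pure in-memory work; the status check is a rough way to see how much of the buffer pool is populated):

-- run the same GROUP BY a second time and compare the timing
SELECT specialty_code, COUNT(*) FROM physicians GROUP BY specialty_code;

-- rough check of how full the buffer pool is
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';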