MySQL: expected time for a simple group by on a single table, 2M records, single user

MySQL performance

Related to MySQL tuning (my.cnf) for very large tables and ad hoc queries.

I have a really small table with 2M records on a 32GB machine running Windows 10 Pro, with nothing serious running except MySQL 5.7. I expected a simple GROUP BY on any unindexed column to take seconds, since MySQL would cache the entire table. It takes about 3 minutes. Am I missing some important tuning tip, or are these the expected times?

the query

mysql> select specialty_code, count(*)
    -> from physicians
    -> group by specialty_code;
+----------------+----------+
| specialty_code | count(*) |
+----------------+----------+
| FM             |   199604 |
| GYN            |    14263 |
| OBG            |    78878 |
+----------------+----------+
3 rows in set (2 min 56.27 sec)

the table

mysql> describe physicians;
+-----------------------+---------+------+-----+---------+-------+
| Field                 | Type    | Null | Key | Default | Extra |
+-----------------------+---------+------+-----+---------+-------+
| physician_id          | int(11) | YES  |     | NULL    |       |
| practitioner_id       | int(11) | YES  | MUL | NULL    |       |
| state                 | text    | YES  |     | NULL    |       |
| specialty_code        | text    | YES  |     | NULL    |       |
| specialty_description | text    | YES  |     | NULL    |       |
| CBSA                  | text    | YES  |     | NULL    |       |
+-----------------------+---------+------+-----+---------+-------+
6 rows in set (0.10 sec)    

mysql> show table status like '%physicians%';
+------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+
| Name       | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time | Check_time | Collation       | Checksum | Create_options | Comment |
+------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+
| physicians | InnoDB |      10 | Dynamic    | 1963005 |             66 |   130711552 |               0 |            0 |   2097152 |           NULL | 2016-01-04 08:41:30 | NULL        | NULL       | utf8_general_ci |     NULL |                |         |
+------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+

sizing:

mysql> select count(*) from physicians;
+----------+
| count(*) |
+----------+
|  1957997 |
+----------+
1 row in set (9.91 sec) 

explain:

mysql> explain select specialty_code, count(*)
    -> from physicians
    -> group by specialty_code;
+----+-------------+------------+------------+------+---------------+------+---------+------+---------+----------+---------------------------------+
| id | select_type | table      | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra                           |
+----+-------------+------------+------------+------+---------------+------+---------+------+---------+----------+---------------------------------+
|  1 | SIMPLE      | physicians | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 1963005 |   100.00 | Using temporary; Using filesort |
+----+-------------+------------+------------+------+---------------+------+---------+------+---------+----------+---------------------------------+    

params I tuned:

join_buffer_size=2G 
tmp_table_size=5G
innodb_buffer_pool_size=20G
sort_buffer_size=2G

The table uses the default InnoDB engine.
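For completeness, the values the server is actually using can be confirmed at runtime with standard MySQL syntax (sizes are reported in bytes):

-- check the effective values of the tuned parameters
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'sort_buffer_size';
SHOW VARIABLES LIKE 'join_buffer_size';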

NOTE: this is just a simplified symptom of my bigger problem: I have ~5-6 tables with a total volume of 10 GB. I expected MySQL to swoosh all of them into memory and answer any ad-hoc query within a minute or so… Should I index every column before filtering/grouping/joining on it?

Thanks, and sorry if this sounds like a rant. I'm just used to distributed databases crunching data in milliseconds for me 🙂

Best Answer

tmp_table_size=5G

No. Go back to the default. If multiple queries need tmp tables at the same time, you could run out of RAM.
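A sketch of putting it back, assuming the documented 5.7 default of 16M (if it was set in my.cnf, change it there as well so the value survives a restart):

-- back to the 5.7 default of 16M; applies to new connections only
SET GLOBAL tmp_table_size = 16777216;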

Change the TEXT fields to reasonable-length VARCHARs.
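For example (the lengths here are guesses; size them from the actual data, e.g. SELECT MAX(LENGTH(specialty_code)) FROM physicians):

-- shrink the TEXT columns to VARCHARs sized to the data (lengths are guesses)
ALTER TABLE physicians
  MODIFY state                 VARCHAR(2),
  MODIFY specialty_code        VARCHAR(10),
  MODIFY specialty_description VARCHAR(255),
  MODIFY CBSA                  VARCHAR(20);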

The first table scan will do I/O to bring the table into RAM (if possible); subsequent references to the table will see the data cached. Did you run the GROUP BY twice to see this effect?
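For example (the second run should reflect pure in-memory work; the status check is a rough way to see how much of the buffer pool is populated):

-- run the same GROUP BY a second time and compare the timing
SELECT specialty_code, COUNT(*) FROM physicians GROUP BY specialty_code;

-- rough check of how full the buffer pool is
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';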