Mysql – GROUP BY on two large joined tables

group byMySQLscalability

I have two tables:

Person
------
PersonID (PK)
GroupID
DeletedDate


PersonData
----------
PersonID (PK)
FieldID (PK)
Value

The fields are custom-created; hence the schema. I am trying to run a histogram-type query to discover the range of values for a given FieldID together with a count of the times each value occurs. Here is the query:

SELECT pd.Value, count(*) 
FROM Person p 
INNER JOIN PersonData pd ON p.PersonID=cd.PersonID
WHERE p.DeletedDate IS NULL AND p.GroupID=‘cAPHSWovx9d2yaN’
GROUP BY pd.Value

We use GUIDs for the GroupIDs.

Performance is fine for small groups but degrades rapidly. The group in question has 5 million records in table Person and an order of magnitude beyond that in PersonData. The above query takes an hour to run. It would be acceptable to get it down to a few minutes.

I created an index on PersonData on (PersonID, FieldID, Value) in that order. MySql generally chooses not to use it when I run explain. (Actual results of explain vary based on FieldID.) I also have an index on Person on (GroupID, DeletedDate).

I have two questions:

  • On the explain I always get a “Using where; Using index; Using temporary; Using filesort” on the Person table. Why does MySql need to sort the Person table before joining to PersonData?

  • Why does MySql generally decline to use the (PersonID, FieldID, Value) index? Is there a better index I could use? I assumed I needed the PersonID in the index to help with the join.

Edit Here is the relevant portion of the explain:

id    select_type    table    type    key              key_len    ref          rows       Extra
1     SIMPLE         p        ref     deleted_person   67         const,const  3476105    Using where; Using index; Using temporary; Using filesort
1     SIMPLE         pd       ref     person_field_val 62         p.PersonID   10         Using index

Edit 2 Forgot the GROUP BY in the query above.

Best Answer

Sorry, but you have designed yourself into a corner.

Strike 1: EAV (key-value) schema sucks when the tables get large. Further discussion and possible workarounds: http://mysql.rjweb.org/doc.php/eav

Strike 2: GUIDs (and UUIDs, MD5s, etc) suck as KEYs because they are so random. Further discussion and a possible (but unlikely) workaround: http://mysql.rjweb.org/doc.php/uuid

Your best choice may be to keep buying RAM so that everything can be cached. And/or switch from spinning drives to SSDs. Of course, neither hardware solution scales well.

Software solution involves a major rework of the schema and application code. That is beyond the scope of this forum, other than to point you at those two blogs.

As for your specific questions:

On the explain I always get a “Using where; Using index; Using temporary; Using filesort” on the Person table. Why does MySql need to sort the Person table before joining to PersonData?

What you are seeing is bogus. EXPLAIN likes to put those flags on the first line, even though they might apply to some other line. If you have a new enough version of MySQL, do EXPLAIN FORMAT=JSON SELECT ...; it should spell out things correctly.

Why does MySql generally decline to use the PersonID/FieldID/Value index? Is there a better index I could use? I assumed I needed the PersonID in the index to help with the join.

I need to see SHOW CREATE TABLE. But here is a guess... If the WHERE clause matches more than ~20%, it will choose to do a table scan because it is likely to be faster. Another guess: You have PRIMARY KEY(person, field) and INDEX(person, field, value)?? And you are using InnoDB?? The PK includes all columns and is sorted by (person, field). That's identical to the secondary key! So, the optimizer may as well flip a coin to decide between them.

Anyway, it says "Using index"; that about as efficient as it gets.