I have two tables:
Person
------
PersonID (PK)
GroupID
DeletedDate
PersonData
----------
PersonID (PK)
FieldID (PK)
Value
The fields are custom-created; hence the schema. I am trying to run a histogram-type query to discover the range of values for a given FieldID together with a count of the times each value occurs. Here is the query:
SELECT pd.Value, count(*)
FROM Person p
INNER JOIN PersonData pd ON p.PersonID=cd.PersonID
WHERE p.DeletedDate IS NULL AND p.GroupID=‘cAPHSWovx9d2yaN’
GROUP BY pd.Value
We use GUIDs for the GroupIDs.
Performance is fine for small groups but degrades rapidly. The group in question has 5 million records in table Person
and an order of magnitude beyond that in PersonData
. The above query takes an hour to run. It would be acceptable to get it down to a few minutes.
I created an index on PersonData
on (PersonID, FieldID, Value)
in that order. MySql generally chooses not to use it when I run explain. (Actual results of explain vary based on FieldID
.) I also have an index on Person
on (GroupID, DeletedDate)
.
I have two questions:
-
On the explain I always get a “Using where; Using index; Using temporary; Using filesort” on the Person table. Why does MySql need to sort the
Person
table before joining toPersonData
? -
Why does MySql generally decline to use the
(PersonID, FieldID, Value)
index? Is there a better index I could use? I assumed I needed thePersonID
in the index to help with the join.
Edit Here is the relevant portion of the explain:
id select_type table type key key_len ref rows Extra
1 SIMPLE p ref deleted_person 67 const,const 3476105 Using where; Using index; Using temporary; Using filesort
1 SIMPLE pd ref person_field_val 62 p.PersonID 10 Using index
Edit 2 Forgot the GROUP BY
in the query above.
Best Answer
Sorry, but you have designed yourself into a corner.
Strike 1: EAV (key-value) schema sucks when the tables get large. Further discussion and possible workarounds: http://mysql.rjweb.org/doc.php/eav
Strike 2: GUIDs (and UUIDs, MD5s, etc) suck as KEYs because they are so random. Further discussion and a possible (but unlikely) workaround: http://mysql.rjweb.org/doc.php/uuid
Your best choice may be to keep buying RAM so that everything can be cached. And/or switch from spinning drives to SSDs. Of course, neither hardware solution scales well.
Software solution involves a major rework of the schema and application code. That is beyond the scope of this forum, other than to point you at those two blogs.
As for your specific questions:
What you are seeing is bogus. EXPLAIN likes to put those flags on the first line, even though they might apply to some other line. If you have a new enough version of MySQL, do
EXPLAIN FORMAT=JSON SELECT ...
; it should spell out things correctly.I need to see SHOW CREATE TABLE. But here is a guess... If the WHERE clause matches more than ~20%, it will choose to do a table scan because it is likely to be faster. Another guess: You have PRIMARY KEY(person, field) and INDEX(person, field, value)?? And you are using InnoDB?? The PK includes all columns and is sorted by (person, field). That's identical to the secondary key! So, the optimizer may as well flip a coin to decide between them.
Anyway, it says "Using index"; that about as efficient as it gets.