SQL getting IDs from one table with multiple entries in another table

performancequery-performance

In my database I have two tables: Persons and PersonAttributes.
One Person may have several Attributes associated with it.

What I would like to do is get a list of all Persons(/IDs) that have at least the specified attributes associated with them in the PersonAttributes table.

For one attribute this can easily be done with just one WHERE statement.
My problem is that I want to do this for multiple attributes.

The only solution I could come up with is doing a SELECT for each attribute and then JOINing them.
While I can build such a query programmatically it seems pretty complex and I was hoping that there may be an easier solution.

For better understanding here is a SQLFiddle including my solution for 3 attributes.

EDIT: Changed the SQLFiddle link. Query now looks like this:

SELECT Persons.ID 
FROM Persons
  JOIN (SELECT * FROM PersonAttributes WHERE PersonAttributes.Attr = 'b') t1
    ON Persons.ID = t1.ID
  JOIN (SELECT * FROM PersonAttributes WHERE PersonAttributes.Attr = 'c') t2
    ON t1.ID = t2.ID
  JOIN (SELECT * FROM PersonAttributes WHERE PersonAttributes.Attr = 'd') t3
    ON t2.ID = t3.ID;

Best Answer

Apparently, your PersonAttributes table is designed using the EAV model. This model has the advantage of being easily extensible: attributes are stored as rows, and adding new rows is easy. However, querying this kind of table is more difficult than those designed the traditional way (attributes stored as columns).

Your solution is quite illustrative of how much more trouble it can be to accomplish a fairly simple task with an EAV-modelled table. It is actually one of the common ways to solve a problem like yours, although I would suggest you try rewriting it without using derived tables – like this:

SELECT p.ID 
FROM Persons AS p
  JOIN PersonAttributes AS paB ON p.ID = paB.ID
  JOIN PersonAttributes AS paC ON p.ID = paC.ID
  JOIN PersonAttributes AS paD ON p.ID = paD.ID
WHERE paB.Attr = 'b'
  AND paC.Attr = 'c'
  AND paD.Attr = 'd';

The performance will likely remain the same as with your syntax, but without making the query any faster this rewrite will at least make it more concise and arguably more readable.

That being said, there is another method, fairly common as well, that you could employ, which might offer better performance as the number of the attributes increases. It uses grouping and aggregation:

SELECT
  ID
FROM
  PersonAttributes
WHERE
  Attr IN ('b', 'c', 'd')
GROUP BY
  ID
HAVING
  COUNT(*) = 3
;

By this method, all rows that have any of the specified attributes are retrieved and grouped by ID. In order to determine the groups (persons) having all three attributes, a HAVING filter is introduced to compare the number of rows* in each group to the total number of attributes in the IN list.

The method can be slightly generalised if you can afford storing the attributes to search for in a (temporary) table. Here is what it would look like in that case:

SELECT
  pa.ID
FROM
  PersonAttributes AS pa
  INNER JOIN QueriedAttributes AS qa ON pa.Attr = qa.Attr
GROUP BY
  pa.ID
HAVING
  COUNT(*) = (SELECT COUNT(*) FROM QueriedAttributes)
;

No WHERE clause here – it is replaced by the join to the table of queried attributes, and the total number of attributes required to match is derived from the same table instead of being hard-coded.

This kind of problem is commonly known as relational division. It is discussed in detail in this article by Joe Celko:


*This particular implementation of the grouping method assumes there is always one row per attribute per person, so COUNT(*) works correctly. If attributes of the same kind may, or will later be allowed to, repeat per person, use COUNT(DISTINCT Attr) instead.