SQL getting IDs from one table with multiple entries in another table

performancequery-performance

In my database I have two tables: Persons and PersonAttributes.
One Person may have several Attributes associated with it.

What I would like to do is get a list of all Persons(/IDs) that have at least the specified attributes associated with them in the PersonAttributes table.

For one attribute this can easily be done with just one WHERE statement.
My problem is that I want to do this for multiple attributes.

The only solution I could come up with is doing a SELECT for each attribute and then JOINing them.
While I can build such a query programmatically it seems pretty complex and I was hoping that there may be an easier solution.

For better understanding here is a SQLFiddle including my solution for 3 attributes.

EDIT: Changed the SQLFiddle link. Query now looks like this:

SELECT Persons.ID 
FROM Persons
  JOIN (SELECT * FROM PersonAttributes WHERE PersonAttributes.Attr = 'b') t1
    ON Persons.ID = t1.ID
  JOIN (SELECT * FROM PersonAttributes WHERE PersonAttributes.Attr = 'c') t2
    ON t1.ID = t2.ID
  JOIN (SELECT * FROM PersonAttributes WHERE PersonAttributes.Attr = 'd') t3
    ON t2.ID = t3.ID;

Best Answer

Apparently, your PersonAttributes table is designed using the EAV model. This model has the advantage of being easily extensible: attributes are stored as rows, and adding new rows is easy. However, querying this kind of table is more difficult than those designed the traditional way (attributes stored as columns).

Your solution is quite illustrative of how much more trouble it can be to accomplish a fairly simple task with an EAV-modelled table. It is actually one of the common ways to solve a problem like yours, although I would suggest you try rewriting it without using derived tables – like this:

SELECT p.ID 
FROM Persons AS p
  JOIN PersonAttributes AS paB ON p.ID = paB.ID
  JOIN PersonAttributes AS paC ON p.ID = paC.ID
  JOIN PersonAttributes AS paD ON p.ID = paD.ID
WHERE paB.Attr = 'b'
  AND paC.Attr = 'c'
  AND paD.Attr = 'd';

The performance will likely remain the same as with your syntax, but without making the query any faster this rewrite will at least make it more concise and arguably more readable.

That being said, there is another method, fairly common as well, that you could employ, which might offer better performance as the number of the attributes increases. It uses grouping and aggregation:

SELECT
  ID
FROM
  PersonAttributes
WHERE
  Attr IN ('b', 'c', 'd')
GROUP BY
  ID
HAVING
  COUNT(*) = 3
;

By this method, all rows that have any of the specified attributes are retrieved and grouped by ID. In order to determine the groups (persons) having all three attributes, a HAVING filter is introduced to compare the number of rows^* in each group to the total number of attributes in the IN list.

The method can be slightly generalised if you can afford storing the attributes to search for in a (temporary) table. Here is what it would look like in that case:

SELECT
  pa.ID
FROM
  PersonAttributes AS pa
  INNER JOIN QueriedAttributes AS qa ON pa.Attr = qa.Attr
GROUP BY
  pa.ID
HAVING
  COUNT(*) = (SELECT COUNT(*) FROM QueriedAttributes)
;

No WHERE clause here – it is replaced by the join to the table of queried attributes, and the total number of attributes required to match is derived from the same table instead of being hard-coded.

This kind of problem is commonly known as relational division. It is discussed in detail in this article by Joe Celko:

Divided We Stand: The SQL of Relational Division

_{^*This particular implementation of the grouping method assumes there is always one row per attribute per person, so COUNT(*) works correctly. If attributes of the same kind may, or will later be allowed to, repeat per person, use COUNT(DISTINCT Attr) instead.}

Related Solutions

PostgreSQL 9.3 – Performance Issue: Counting table entries slower with jdbc postgres driver

There are too many different topics in one question, and even after many updates not all is clear. I'll just pick the elephant in th room and ignore the rest:

Query 1

->  Index Scan using r_pkey_index on relationships r (cost=0.43..57.53 rows=13 width=0)
                                                     (actual rows=1 loops=1)
         Index Cond: (id = 947367::bigint)
         Filter: ((date >= '2010-02-09'::date) AND (date <= '2010-11-24'::date))

Query 2

Filter: ((date >= '2010-02-09'::date) AND (date <= '2010-11-24'::date))
   Rows Removed by Filter: 44104
      ->  Bitmap Index Scan on e_pkey_index (cost=0.00..4652.34 rows=251720 width=0)
                                            (actual rows=263818 loops=1)
          Index Cond: (id = 879::bigint)

Obviously, a single row matches your criteria in the first query, while there are more than 200.000 in the second. Obviously, the first can be much faster.
Leaving aside that we are comparing two queries on two different tables.

Mysql – How to optimize very slow SELECT with LEFT JOINs over big tables

Pick a few attributes to include in person. Index them in a few combinations -- use composite indexes, not single-column indexes.

That is essentially the only way out of EAV-sucks-at-performance, which is where you are.

Here is more discussion: http://mysql.rjweb.org/doc.php/eav including a suggestion of using JSON instead of the key-value table.

Best Answer

Related Solutions

PostgreSQL 9.3 – Performance Issue: Counting table entries slower with jdbc postgres driver

Mysql – How to optimize very slow SELECT with LEFT JOINs over big tables

Related Question