SQLite – Simplifying and Improving Inner Join (Self Joining a Table)

sqlite

I have table which has the following schema:

cid         name        type         notnull     dflt_value  pk
----------  ----------  -----------  ----------  ----------  ----------
0           DatasetID   VARCHAR(15)  0                       1
1           QuestionID  VARCHAR(20)  0                       2
2           PersonID    INTEGER      0                       3
3           answer      text         0                       0

And I have constructed a query that joins several answer rows into one, uses some of them, (age in this case) as filters:

SELECT i.answer AS graduation_year, j.answer AS education, k.answer AS country FROM

(SELECT * FROM answer
WHERE QuestionID = 'year'
AND answer <> ''
AND DatasetID = 'test') i

inner join answer j
    on i.PersonID = j.PersonID
    AND i.DatasetID = j.DatasetID
    AND j.QuestionID = 'education'

inner join answer k
    on i.PersonID = k.PersonID
    AND i.datasetID = k.datasetID
    AND k.QuestionID = 'country'

inner join answer m
    on i.PersonID = m.PersonID
    AND i.datasetID = m.datasetID
    AND m.QuestionID = 'age'
    AND CAST(m.answer as NUMERIC) > 25;

The output is following:

graduation_year  education   country
---------------  ----------  ----------
1995             Bachelors   Finland
1980             PhD eng.    Austria

I'm wondering if the same query can be accomplished in a more efficient way in terms of query time, or if it can be simplified. I'm currently using sqlite3 as the test database, but if there's a way to simplify the query in for example MySQL, that would also be an acceptable solution.

For reference, the answer table contains these values (where datasetID is 'test'), i.e. the results of the query SELECT * FROM answer WHERE datasetID = 'test';

DatasetID   QuestionID  PersonID    answer
----------  ----------  ----------  ----------
test        year        12345       1992
test        country     12345       Austria 
test        age         12345       24
test        education   12345       M. Sc.
test        other       12345       boring...
test        year        123456      1995
test        country     123456      Finland
test        age         123456      26
test        education   123456      Bachelors
test        other       123456      meh
test        year        1234567     1980
test        country     1234567     Austria
test        age         1234567     35
test        education   1234567     PhD eng.
test        other       1234567     interesting 

Best Answer

When you need to retrieve rather many attributes from an EAV structure, you can try pivoting it with grouping & conditional aggregation as an alternative to multiple self-joins:

SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
GROUP BY
  PersonID
;

That avoids multiple table scans and, as a bonus, makes your query shorter. But how much the former will affect the performance can ultimately be determined only by testing.

Still, a minor (or not – depends on the total number of attributes) improvement to the above could be to reduce the underlying dataset to just the attributes you are actually retrieving, like this:

SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
  AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
  PersonID
;

Logically, the above is still not completely equivalent to the series of inner joins because the latter would give you only entries that have all specified attributes. The method in this answer can return entries that do not have some of the attributes requested. If you would like to exclude those, add a HAVING filter that makes sure the number of attributes matches the WHERE filter:

SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
  AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
  PersonID
HAVING
  COUNT(*) = 4
;

Your query also has a filter on an attribute that in this case is an aggregated column. Such filters would need to be implemented in the HAVING clause. For your example, the filter on age would look like this:

SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
  AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
  PersonID
HAVING
  COUNT(*) = 4
  AND CAST(MAX(CASE questionID WHEN 'age' THEN answer END) AS numeric) > 25
;

Again, you will need to test the final query in your environment to see how much, if at all, these additional modifications affect the performance.