I have table which has the following schema:
cid name type notnull dflt_value pk
---------- ---------- ----------- ---------- ---------- ----------
0 DatasetID VARCHAR(15) 0 1
1 QuestionID VARCHAR(20) 0 2
2 PersonID INTEGER 0 3
3 answer text 0 0
And I have constructed a query that joins several answer rows into one, uses some of them, (age in this case) as filters:
SELECT i.answer AS graduation_year, j.answer AS education, k.answer AS country FROM
(SELECT * FROM answer
WHERE QuestionID = 'year'
AND answer <> ''
AND DatasetID = 'test') i
inner join answer j
on i.PersonID = j.PersonID
AND i.DatasetID = j.DatasetID
AND j.QuestionID = 'education'
inner join answer k
on i.PersonID = k.PersonID
AND i.datasetID = k.datasetID
AND k.QuestionID = 'country'
inner join answer m
on i.PersonID = m.PersonID
AND i.datasetID = m.datasetID
AND m.QuestionID = 'age'
AND CAST(m.answer as NUMERIC) > 25;
The output is following:
graduation_year education country
--------------- ---------- ----------
1995 Bachelors Finland
1980 PhD eng. Austria
I'm wondering if the same query can be accomplished in a more efficient way in terms of query time, or if it can be simplified. I'm currently using sqlite3 as the test database, but if there's a way to simplify the query in for example MySQL, that would also be an acceptable solution.
For reference, the answer table contains these values (where datasetID is 'test'), i.e. the results of the query SELECT * FROM answer WHERE datasetID = 'test';
DatasetID QuestionID PersonID answer
---------- ---------- ---------- ----------
test year 12345 1992
test country 12345 Austria
test age 12345 24
test education 12345 M. Sc.
test other 12345 boring...
test year 123456 1995
test country 123456 Finland
test age 123456 26
test education 123456 Bachelors
test other 123456 meh
test year 1234567 1980
test country 1234567 Austria
test age 1234567 35
test education 1234567 PhD eng.
test other 1234567 interesting
Best Answer
When you need to retrieve rather many attributes from an EAV structure, you can try pivoting it with grouping & conditional aggregation as an alternative to multiple self-joins:
That avoids multiple table scans and, as a bonus, makes your query shorter. But how much the former will affect the performance can ultimately be determined only by testing.
Still, a minor (or not – depends on the total number of attributes) improvement to the above could be to reduce the underlying dataset to just the attributes you are actually retrieving, like this:
Logically, the above is still not completely equivalent to the series of inner joins because the latter would give you only entries that have all specified attributes. The method in this answer can return entries that do not have some of the attributes requested. If you would like to exclude those, add a HAVING filter that makes sure the number of attributes matches the WHERE filter:
Your query also has a filter on an attribute that in this case is an aggregated column. Such filters would need to be implemented in the HAVING clause. For your example, the filter on age would look like this:
Again, you will need to test the final query in your environment to see how much, if at all, these additional modifications affect the performance.