PostgreSQL – Optimizing Multi-Table GROUP BY Queries

database-designindexoptimizationpostgresqlquery-performance

Currently I have a multi table/column GROUP BY query which is pretty slow. Below is the tables schema used by the query:

report Table

                                                                Table "report"
       Column       |           Type           |                           Modifiers                           | Storage  | Stats target | Description 
--------------------+--------------------------+---------------------------------------------------------------+----------+--------------+-------------
 id                 | integer                  | not null default nextval('dashboard_report_id_seq'::regclass) | plain    |              | 
 county             | character varying(200)   | not null                                                      | extended |              | 
 district           | character varying(200)   | not null                                                      | extended |              | 
 report_name        | character varying(500)   | not null                                                      | extended |              | 
Indexes:
    "dashboard_report_pkey" PRIMARY KEY, btree (id)

question Table

                                                       Table "question"
   Column    |          Type          |                            Modifiers                            | Storage  | Stats target | Description 
-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------
 id          | integer                | not null default nextval('dashboard_question_id_seq'::regclass) | plain    |              | 
 name        | character varying(255) | not null                                                        | extended |              | 
 label       | character varying(255) | not null                                                        | extended |              | 
 report_type | character varying(255) | not null                                                        | extended |              | 
Indexes:
    "dashboard_question_pkey" PRIMARY KEY, btree (id)

questionanswer Table

                                                           Table "questionanswer"
   Column    |          Type          |                               Modifiers                               | Storage  | Stats target | Description 
-------------+------------------------+-----------------------------------------------------------------------+----------+--------------+-------------
 id          | integer                | not null default nextval('dashboard_questionanswer_id_seq'::regclass) | plain    |              | 
 answer      | character varying(255) | not null                                                              | extended |              | 
 question_id | integer                | not null                                                              | plain    |              | 
 report_id   | integer                | not null                                                              | plain    |              | 
Indexes:
    "dashboard_questionanswer_pkey" PRIMARY KEY, btree (id)
    "dashboard_questionanswer_6f78b20c" btree (report_id)
    "dashboard_questionanswer_7aa0f6ee" btree (question_id)
Foreign-key constraints:
    "dashboard_que_report_id_4c2a87ee585b6121_fk_dashboard_report_id" FOREIGN KEY (report_id) REFERENCES dashboard_report(id) DEFERRABLE INITIALLY DEFERRED
    "dashboard_question_id_522f82097923c241_fk_dashboard_question_id" FOREIGN KEY (question_id) REFERENCES dashboard_question(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

Query:

SELECT 
    "report"."county", 
    "report"."district", 
    "report"."report_name", 
    "question"."report_type", 
    COUNT(DISTINCT "questionanswer"."report_id")
FROM "questionanswer" 
    INNER JOIN "question" ON ( "questionanswer"."question_id" = "question"."id" )
    INNER JOIN "report" ON ( "questionanswer"."report_id" = "report"."id" )
WHERE 
    "question"."name" = 'touch' 
GROUP BY 
    "report"."county", 
    "report"."district", 
    "report"."report_name", 
    "question"."report_type" ;

I'm not sure where to start with optimising this query, at the moment it is fairly slow on my system ~300 ms. I did try individually indexing all the columns referred by the GROUP BY clause but that didn't help the query time either.

So basically user submits reports, the data of which is stored in reports, each report has multiple QuestionAnswers, which are stored in QuestionAnswer Table. The Question meta data is in Question table. The QuestionAnswer table records will be large next comes Report table data, the Question table has a set of only 40 questions. The objective of the query is to figure out per county, district and report type for a specific question how many reports we have that have answered that question.

My Postgres Sql version is PostgreSQL 9.3.5.
Query plan is here.
My machine hardware spec is MacOSX 10.9.5 4-Core 8 GM RAM SSD, encrypted file system.

Would indexing all the columns referred by GROUP BY help?
Or should I go for a multicolumn index?

Best Answer

I do wonder, why you have the report_type as attribute of the question?
Be that as it may, your objective:

The objective of the query is to figure out per county, district and report type for a specific question how many reports we have that have answered that question.

Why would you include report_name in GROUP BY step? That conflicts with your definition. I think you should remove that:

SELECT r.county, r.district, q.report_type
     , count(DISTINCT r.id) AS reports
FROM   question q 
JOIN   questionanswer qa ON qa.question_id = q.id
JOIN   report         r  ON qa.report_id = r.id
WHERE  q.name = 'touch' 
GROUP  BY 1,2,3;

Also, as long as you restrict the query to a single question, there is only one report_type in the result per definition. Including it in the result and GROUP BY clause doesn't change the numbers.

As for performance: either create a UNIQUE constraint on (question_id, report_id) (in that order!) like I suspect you should have:

ALTER TABLE questionanswer ADD CONSTRAINT qa_uni UNIQUE (question_id, report_id);

Or, barring that, at least create an index on (question_id, report_id).
Why is the order of columns in the index / constraint important?

With the UNIQUE constraint in place, the query gets considerably cheaper:

     , count(*) AS reports

As long as you have only 40 questions you don't need an index on question.name, but as long as you select questions by name, you should still have a UNIQUE constraint on that column.

The PK on report does the rest.


Related query if you really want to count distinct counties and districts per question:

SELECT q.id, q.name, q.report_type
     , count(DISTINCT r.county)      AS distinct_counties
     , count(DISTINCT r.district)    AS distinct_districts
FROM   question q 
JOIN   questionanswer qa ON qa.question_id = q.id
JOIN   report         r  ON qa.report_id = r.id
WHERE  q.name = 'touch' 
GROUP  BY 1;  -- the PK column covers the whole table