PostgreSQL – Optimizing Multi-Table GROUP BY Queries

database-designindexoptimizationpostgresqlquery-performance

Currently I have a multi table/column GROUP BY query which is pretty slow. Below is the tables schema used by the query:

report Table

                                                                Table "report"
       Column       |           Type           |                           Modifiers                           | Storage  | Stats target | Description 
--------------------+--------------------------+---------------------------------------------------------------+----------+--------------+-------------
 id                 | integer                  | not null default nextval('dashboard_report_id_seq'::regclass) | plain    |              | 
 county             | character varying(200)   | not null                                                      | extended |              | 
 district           | character varying(200)   | not null                                                      | extended |              | 
 report_name        | character varying(500)   | not null                                                      | extended |              | 
Indexes:
    "dashboard_report_pkey" PRIMARY KEY, btree (id)

question Table

                                                       Table "question"
   Column    |          Type          |                            Modifiers                            | Storage  | Stats target | Description 
-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------
 id          | integer                | not null default nextval('dashboard_question_id_seq'::regclass) | plain    |              | 
 name        | character varying(255) | not null                                                        | extended |              | 
 label       | character varying(255) | not null                                                        | extended |              | 
 report_type | character varying(255) | not null                                                        | extended |              | 
Indexes:
    "dashboard_question_pkey" PRIMARY KEY, btree (id)

questionanswer Table

                                                           Table "questionanswer"
   Column    |          Type          |                               Modifiers                               | Storage  | Stats target | Description 
-------------+------------------------+-----------------------------------------------------------------------+----------+--------------+-------------
 id          | integer                | not null default nextval('dashboard_questionanswer_id_seq'::regclass) | plain    |              | 
 answer      | character varying(255) | not null                                                              | extended |              | 
 question_id | integer                | not null                                                              | plain    |              | 
 report_id   | integer                | not null                                                              | plain    |              | 
Indexes:
    "dashboard_questionanswer_pkey" PRIMARY KEY, btree (id)
    "dashboard_questionanswer_6f78b20c" btree (report_id)
    "dashboard_questionanswer_7aa0f6ee" btree (question_id)
Foreign-key constraints:
    "dashboard_que_report_id_4c2a87ee585b6121_fk_dashboard_report_id" FOREIGN KEY (report_id) REFERENCES dashboard_report(id) DEFERRABLE INITIALLY DEFERRED
    "dashboard_question_id_522f82097923c241_fk_dashboard_question_id" FOREIGN KEY (question_id) REFERENCES dashboard_question(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

Query:

SELECT 
    "report"."county", 
    "report"."district", 
    "report"."report_name", 
    "question"."report_type", 
    COUNT(DISTINCT "questionanswer"."report_id")
FROM "questionanswer" 
    INNER JOIN "question" ON ( "questionanswer"."question_id" = "question"."id" )
    INNER JOIN "report" ON ( "questionanswer"."report_id" = "report"."id" )
WHERE 
    "question"."name" = 'touch' 
GROUP BY 
    "report"."county", 
    "report"."district", 
    "report"."report_name", 
    "question"."report_type" ;

I'm not sure where to start with optimising this query, at the moment it is fairly slow on my system ~300 ms. I did try individually indexing all the columns referred by the GROUP BY clause but that didn't help the query time either.

So basically user submits reports, the data of which is stored in reports, each report has multiple QuestionAnswers, which are stored in QuestionAnswer Table. The Question meta data is in Question table. The QuestionAnswer table records will be large next comes Report table data, the Question table has a set of only 40 questions. The objective of the query is to figure out per county, district and report type for a specific question how many reports we have that have answered that question.

My Postgres Sql version is PostgreSQL 9.3.5.
Query plan is here.
My machine hardware spec is MacOSX 10.9.5 4-Core 8 GM RAM SSD, encrypted file system.

Would indexing all the columns referred by GROUP BY help?
Or should I go for a multicolumn index?

Best Answer

I do wonder, why you have the report_type as attribute of the question?
Be that as it may, your objective:

The objective of the query is to figure out per county, district and report type for a specific question how many reports we have that have answered that question.

Why would you include ~~report_name~~ in GROUP BY step? That conflicts with your definition. I think you should remove that:

SELECT r.county, r.district, q.report_type
     , count(DISTINCT r.id) AS reports
FROM   question q 
JOIN   questionanswer qa ON qa.question_id = q.id
JOIN   report         r  ON qa.report_id = r.id
WHERE  q.name = 'touch' 
GROUP  BY 1,2,3;

Also, as long as you restrict the query to a single question, there is only one report_type in the result per definition. Including it in the result and GROUP BY clause doesn't change the numbers.

As for performance: either create a UNIQUE constraint on (question_id, report_id) (in that order!) like I suspect you should have:

ALTER TABLE questionanswer ADD CONSTRAINT qa_uni UNIQUE (question_id, report_id);

Or, barring that, at least create an index on (question_id, report_id).
Why is the order of columns in the index / constraint important?

Is a composite index also good for queries on the first field?

With the UNIQUE constraint in place, the query gets considerably cheaper:

     , count(*) AS reports

As long as you have only 40 questions you don't need an index on question.name, but as long as you select questions by name, you should still have a UNIQUE constraint on that column.

The PK on report does the rest.

Related query if you really want to count distinct counties and districts per question:

SELECT q.id, q.name, q.report_type
     , count(DISTINCT r.county)      AS distinct_counties
     , count(DISTINCT r.district)    AS distinct_districts
FROM   question q 
JOIN   questionanswer qa ON qa.question_id = q.id
JOIN   report         r  ON qa.report_id = r.id
WHERE  q.name = 'touch' 
GROUP  BY 1;  -- the PK column covers the whole table

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server you need to use SQL EXPLAIN ANALYZE or the built in Shift + F7keyboard shortcut or Query -> Explain analyze. Then, at the bottom of the explain output you get something like this:

Total runtime: 0.269 ms

Postgresql – Improve performance on concurrent UPDATEs for a timestamp column in Postgres

Would a different kind of column be faster? For example an integer

No. timestamp and timestamptz are just unsigned 64-bit integers internally anyway.

Is there some way to not lock the column?

It doesn't lock the column. It takes weak table lock that doesn't really block anything except DDL, and takes a row level lock on the row you're updating.

There is no way to prevent the row level lock. It exists because without it behaviour and ordering concurrent updates would be undefined. We don't like undefined behaviour in RDBMSs.

It only blocks concurrent updates of the same row anyway.

Any other tips to improve this?

Not with the detail provided. There's likely a better way to do what you're trying to do, but it'll probably involve taking a few steps back and looking for a different strategy for solving the underlying problem.

In the specific case of cache invalidation I think you might want to look into LISTEN and NOTIFY. Again though, there just isn't enough info here to go on.

Best Answer

Related Solutions

Postgresql – Postgres multiple joins slow query, how to store default child record

pgAdmin timing

Postgresql – Improve performance on concurrent UPDATEs for a timestamp column in Postgres

Related Question