Currently I have a multi table/column GROUP BY
query which is pretty slow. Below is the tables schema used by the query:
report Table
Table "report"
Column | Type | Modifiers | Storage | Stats target | Description
--------------------+--------------------------+---------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('dashboard_report_id_seq'::regclass) | plain | |
county | character varying(200) | not null | extended | |
district | character varying(200) | not null | extended | |
report_name | character varying(500) | not null | extended | |
Indexes:
"dashboard_report_pkey" PRIMARY KEY, btree (id)
question Table
Table "question"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+------------------------+-----------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('dashboard_question_id_seq'::regclass) | plain | |
name | character varying(255) | not null | extended | |
label | character varying(255) | not null | extended | |
report_type | character varying(255) | not null | extended | |
Indexes:
"dashboard_question_pkey" PRIMARY KEY, btree (id)
questionanswer Table
Table "questionanswer"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+------------------------+-----------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('dashboard_questionanswer_id_seq'::regclass) | plain | |
answer | character varying(255) | not null | extended | |
question_id | integer | not null | plain | |
report_id | integer | not null | plain | |
Indexes:
"dashboard_questionanswer_pkey" PRIMARY KEY, btree (id)
"dashboard_questionanswer_6f78b20c" btree (report_id)
"dashboard_questionanswer_7aa0f6ee" btree (question_id)
Foreign-key constraints:
"dashboard_que_report_id_4c2a87ee585b6121_fk_dashboard_report_id" FOREIGN KEY (report_id) REFERENCES dashboard_report(id) DEFERRABLE INITIALLY DEFERRED
"dashboard_question_id_522f82097923c241_fk_dashboard_question_id" FOREIGN KEY (question_id) REFERENCES dashboard_question(id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no
Query:
SELECT
"report"."county",
"report"."district",
"report"."report_name",
"question"."report_type",
COUNT(DISTINCT "questionanswer"."report_id")
FROM "questionanswer"
INNER JOIN "question" ON ( "questionanswer"."question_id" = "question"."id" )
INNER JOIN "report" ON ( "questionanswer"."report_id" = "report"."id" )
WHERE
"question"."name" = 'touch'
GROUP BY
"report"."county",
"report"."district",
"report"."report_name",
"question"."report_type" ;
I'm not sure where to start with optimising this query, at the moment it is fairly slow on my system ~300 ms. I did try individually indexing all the columns referred by the GROUP BY
clause but that didn't help the query time either.
So basically user submits reports, the data of which is stored in reports, each report has multiple QuestionAnswers, which are stored in QuestionAnswer Table. The Question meta data is in Question table. The QuestionAnswer table records will be large next comes Report table data, the Question table has a set of only 40 questions. The objective of the query is to figure out per county, district and report type for a specific question how many reports we have that have answered that question.
My Postgres Sql version is PostgreSQL 9.3.5.
Query plan is here.
My machine hardware spec is MacOSX 10.9.5 4-Core 8 GM RAM SSD, encrypted file system.
Would indexing all the columns referred by GROUP BY
help?
Or should I go for a multicolumn index?
Best Answer
I do wonder, why you have the
report_type
as attribute of the question?Be that as it may, your objective:
Why would you include
inreport_name
GROUP BY
step? That conflicts with your definition. I think you should remove that:Also, as long as you restrict the query to a single question, there is only one
report_type
in the result per definition. Including it in the result andGROUP BY
clause doesn't change the numbers.As for performance: either create a
UNIQUE
constraint on(question_id, report_id)
(in that order!) like I suspect you should have:Or, barring that, at least create an index on
(question_id, report_id)
.Why is the order of columns in the index / constraint important?
With the
UNIQUE
constraint in place, the query gets considerably cheaper:As long as you have only
40 questions
you don't need an index onquestion.name
, but as long as you select questions byname
, you should still have aUNIQUE
constraint on that column.The PK on
report
does the rest.Related query if you really want to count distinct counties and districts per question: