PostgreSQL – Indices for a join query with datetime range

optimization, performance, postgresql, query-performance

Given the following query:

SELECT DISTINCT COUNT(DISTINCT "patients"."id") 
FROM "patients" 
INNER JOIN "transactions" ON "transactions"."patient_data_id" = "patients"."patient_data_id" 
WHERE "patients"."practice_id" = 6 
AND "patients"."species" = 'canine' 
AND "transactions"."practice_id" = 6 
AND (("transactions"."transaction_date" BETWEEN '2015-01-01' AND '2015-12-31' OR 1=0));

This query is generated by Rails, which is where the OR 1=0 shortcut comes from.

What indices would you make sure were in place to ensure that this query runs as quickly as possible?

I'm working with a large database in which the transactions table has around 200 million rows. The patients table is much smaller.

Execution time with just one query running is around 2.3 seconds, but we need to be able to run about 50 of these queries simultaneously. When that happens, they bog down dramatically, to roughly one query per minute.

How would you make this run faster?

Best Answer

For this specific query, I would create indexes on patients(practice_id, species) and transactions(patient_data_id, practice_id, transaction_date) and then rewrite the query as follows:

select count(distinct p.id)
from patients p
where p.practice_id = 6 and p.species = 'canine'
    and exists (select 1 
                from transactions 
                where patient_data_id = p.patient_data_id 
                and practice_id = 6 
                and transaction_date BETWEEN '2015-01-01' AND '2015-12-31')
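
For completeness, the two indexes could be created along these lines (a sketch; the index names are just illustrative, and CONCURRENTLY only serves to avoid blocking writes while building an index on a 200-million-row table):

-- index names are illustrative
CREATE INDEX CONCURRENTLY patients_practice_species_idx
    ON patients (practice_id, species);

CREATE INDEX CONCURRENTLY transactions_pdid_practice_date_idx
    ON transactions (patient_data_id, practice_id, transaction_date);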

With EXISTS, the database doesn't have to go through all of a patient's transactions; it can stop at the first one that meets the conditions, which is all you need for the final result (essentially, patients that have at least one matching transaction). I also removed the outer DISTINCT since you don't need it: COUNT returns only one row.
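
If you want to confirm the rewrite behaves as expected, you could look at the plan with EXPLAIN (ANALYZE, BUFFERS); a sketch, to be run against your own data:

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(distinct p.id)
FROM patients p
WHERE p.practice_id = 6 AND p.species = 'canine'
  AND EXISTS (SELECT 1
              FROM transactions
              WHERE patient_data_id = p.patient_data_id
                AND practice_id = 6
                AND transaction_date BETWEEN '2015-01-01' AND '2015-12-31');

You would hope to see an index scan on the new transactions index inside the semi-join rather than a sequential scan over the 200 million rows.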