Postgresql – Optimizing read query in PostgreSQL

performance | postgresql | postgresql-performance

I'm having a tough time optimizing a read query on PostgreSQL 10. The query is as follows:

SELECT mp.id_person, (
    COUNT(CASE 
          WHEN mp.id_g_type = mc.id_g_type
               and (mc.g1 = mp.g1 or mc.g1 = mp.g2 or mc.g2 = mp.g1 or mc.g2 = mp.g2)
          THEN 1
          END
         ) / CAST(COUNT(mp.id_g_type) AS decimal)) as probability
FROM significant_mg mp
INNER JOIN significant_mg mc ON mp.id_g_type = mc.id_g_type
WHERE mp.id_person != 28076 AND mc.id_person = 28076
GROUP BY mp.id_person
HAVING COUNT(CASE 
          WHEN mp.id_g_type = mc.id_g_type
               and (mc.g1 = mp.g1 or mc.g1 = mp.g2 or mc.g2 = mp.g1 or mc.g2 = mp.g2)
          THEN 1
          END
         ) / CAST(COUNT(mp.id_g_type) AS decimal) > 0.97

I've created two indexes: one on the id_person column, the other on the id_g_type column. significant_mg is not a table but a materialized view with columns id_person and id_g_type (both bigint) and columns g1 and g2 (both smallint).

The id 28076 in the WHERE clause is not fixed; I'm only using it for testing, and it will be replaced by a variable once I'm done optimizing the query.

The result of EXPLAIN ANALYZE is as follows:

"Finalize HashAggregate  (cost=140784.56..141226.84 rows=6552 width=40) (actual time=2869.754..2891.226 rows=14 loops=1)"
"  Group Key: mp.id_person"
"  Filter: (((count(CASE WHEN ((mp.id_g_type = mc.id_g_type) AND ((mc.g1 = mp.g1) OR (mc.g1 = mp.g2) OR (mc.g2 = mp.g1) OR (mc.g2 = mp.g1))) THEN 1 ELSE NULL::integer END))::numeric / (count(mp.id_g_type))::numeric) > 0.97)"
"  Rows Removed by Filter: 19818"
"  ->  Gather  (cost=136165.16..140293.13 rows=39314 width=24) (actual time=2796.706..2823.029 rows=59496 loops=1)"
"        Workers Planned: 2"
"        Workers Launched: 2"
"        ->  Partial HashAggregate  (cost=135165.16..135361.73 rows=19657 width=24) (actual time=2744.110..2753.896 rows=19832 loops=3)"
"              Group Key: mp.id_person"
"              ->  Nested Loop  (cost=0.43..103303.58 rows=1593079 width=32) (actual time=11.259..1589.526 rows=1283988 loops=3)"
"                    ->  Parallel Seq Scan on significant_mg mc  (cost=0.00..44870.04 rows=82 width=12) (actual time=11.189..373.041 rows=66 loops=3)"
"                          Filter: (id_person = 28076)"
"                          Rows Removed by Filter: 1291777"
"                    ->  Index Scan using i_g_type on significant_mg mp  (cost=0.43..518.83 rows=19377 width=20) (actual time=0.041..11.846 rows=19454 loops=198)"
"                          Index Cond: (id_g_type = mc.id_g_type)"
"                          Filter: (id_person <> 28076)"
"                          Rows Removed by Filter: 1"
"Planning Time: 0.498 ms"
"Execution Time: 2893.309 ms"

Is there any way I could improve the query performance?

UPDATED (simplified) query:

SELECT mp.id_person
FROM significant_mg mp
INNER JOIN significant_mg mc on mp.id_g_type = mc.id_g_type
WHERE mp.id_person != 28076 AND mc.id_person = 28076
GROUP BY mp.id_person
HAVING SUM(CASE
          WHEN mc.g1 = mp.g1 or mc.g1 = mp.g2 or mc.g2 = mp.g1 or mc.g2 = mp.g2
          THEN 1
          END
         ) / CAST(COUNT(mp.id_g_type) AS decimal) > 0.97 

Best Answer

You are fundamentally asking it to do a lot of work; there is no way around that. You can perhaps shave some time off, but you won't get orders-of-magnitude improvement. If you explain some of the motivation behind this, perhaps there is a way to materialize the view differently that would lead to large improvements.

For just optimizing the query as given:

You could create an index on all four columns, leading with "id_g_type", so that the index scan can be turned into an index-only scan. You probably need to manually vacuum the materialized view to make this effective.
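A sketch of that covering index (the index name here is arbitrary, not from the original post):

```sql
-- Covering index: leading with id_g_type serves the join probe, and
-- including the remaining columns makes an index-only scan possible
CREATE INDEX significant_mg_g_type_covering_idx
    ON significant_mg (id_g_type, id_person, g1, g2);

-- Index-only scans depend on the visibility map, which autovacuum does
-- not maintain for materialized views; vacuum manually after each REFRESH
VACUUM ANALYZE significant_mg;
```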

The reason it is doing a seq scan on "mc", rather than using the index on "id_person", is that the seq scan can make use of parallelism while the index scan cannot. You can play around with "max_parallel_workers_per_gather" (or with alter table significant_mg set (parallel_workers = ...)) to see how much you actually benefit from parallelism.
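For example (the worker counts below are illustrative; ALTER MATERIALIZED VIEW is the materialized-view form of the ALTER TABLE statement mentioned above):

```sql
-- Allow more workers per Gather node for this session
SET max_parallel_workers_per_gather = 4;

-- Or request a specific worker count for scans of this relation
ALTER MATERIALIZED VIEW significant_mg SET (parallel_workers = 4);
```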

There is no point in including mp.id_g_type = mc.id_g_type in the WHENs, because the join condition will ensure all rows pass that criterion.
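Applied to the original query, that trims the CASE to just the g-column comparisons (a sketch; COUNT(*) replaces COUNT(mp.id_g_type), which is equivalent here because the join makes id_g_type non-null):

```sql
-- The join condition already guarantees mp.id_g_type = mc.id_g_type,
-- so the CASE only needs to compare the g columns
SELECT mp.id_person,
       COUNT(CASE WHEN mc.g1 IN (mp.g1, mp.g2)
                    OR mc.g2 IN (mp.g1, mp.g2)
                  THEN 1 END) / CAST(COUNT(*) AS decimal) AS probability
FROM significant_mg mp
INNER JOIN significant_mg mc ON mp.id_g_type = mc.id_g_type
WHERE mp.id_person != 28076 AND mc.id_person = 28076
GROUP BY mp.id_person
HAVING COUNT(CASE WHEN mc.g1 IN (mp.g1, mp.g2)
                    OR mc.g2 IN (mp.g1, mp.g2)
                  THEN 1 END) / CAST(COUNT(*) AS decimal) > 0.97;
```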