PostgreSQL Query Performance – Why Queries Take Longer Together Than Separately

execution-planpgadminpostgresqlpostgresql-9.5

I have a series of update statements that follow a general pattern: one update aggregates values from another table (or sometimes multiple tables), the next update produces a rank based on the aggregated values. This process is repeated 23 times for a total 46 update statements. Each update couplet takes 30-40 seconds to run on its own, but when I run them all together as a single transaction via PgAdmin it takes over an hour instead of the ~15 mins I would expect based on the individual query times. (I ended up stopping execution and running them separately last time I tried.)

If I run the same set of updates in a file via psql the process completes in the expected 15 min timeframe.

Is there some quirk to the query planner that would change the execution plan based on there being a lot of update statements running in a single transaction? Given the different behavior between psql and PgAdmin, I assume this is related to the way the queries are packaged for execution but I'm not familiar enough to know the difference.

Is there a way to write my code so as to improve performance when running it all as a single transaction through PgAdmin?

I'm on PostgreSQL 9.5 on Ubuntu 16.04.

Here are two example couplets from the code:

-- bike_driver_aggressive
UPDATE  generated.crash_aggregates
SET     bike_driver_aggressive = (
            SELECT  COUNT(*)
            FROM    crashes_bike2 c
            WHERE   c.int_id = crash_aggregates.int_id
            AND     c.aggressive_driverfault
        );
WITH ranks AS (
    SELECT  int_id,
            rank() OVER (ORDER BY bike_driver_aggressive DESC) AS rank
    FROM    crash_aggregates
)
UPDATE  generated.crash_aggregates
SET     bike_driver_aggressive_rank = ranks.rank
FROM    ranks
WHERE   crash_aggregates.int_id = ranks.int_id;

-- bike_allinjury
UPDATE  generated.crash_aggregates
SET     bike_allinjury = (
            SELECT  COUNT(*)
            FROM    crashes_bike1 c
            WHERE   c.int_id = crash_aggregates.int_id
            AND     c.injurycrash
        ) + (
            SELECT  COUNT(*)
            FROM    crashes_bike2 c
            WHERE   c.int_id = crash_aggregates.int_id
            AND     c.injurycrash
        );
WITH ranks AS (
    SELECT  int_id,
            rank() OVER (ORDER BY bike_allinjury DESC) AS rank
    FROM    crash_aggregates
)
UPDATE  generated.crash_aggregates
SET     bike_allinjury_rank = ranks.rank
FROM    ranks
WHERE   crash_aggregates.int_id = ranks.int_id;

The crash_aggregates table is created thusly:

CREATE TABLE crash_aggregates
(
  int_id integer NOT NULL,
  geom geometry(Point,2231),
  bike_driver_aggressive integer,
  bike_driver_aggressive_rank integer,
  bike_driver_failyield integer,
  bike_driver_failyield_rank integer,
  bike_driver_disregardsignal integer,
  bike_driver_disregardsignal_rank integer,
  bike_highspeed integer,
  bike_highspeed_rank integer,
  bike_biker_aggressive integer,
  bike_biker_aggressive_rank integer,
  bike_biker_failyield integer,
  bike_biker_failyield_rank integer,
  bike_biker_disregardsignal integer,
  bike_biker_disregardsignal_rank integer,
  bike_influence integer,
  bike_influence_rank integer,
  bike_driver_distracted integer,
  bike_driver_distracted_rank integer,
  bike_driver_reckless integer,
  bike_driver_reckless_rank integer,
  bike_tbone integer,
  bike_tbone_rank integer,
  bike_opp_lhook integer,
  bike_opp_lhook_rank integer,
  bike_samedir integer,
  bike_samedir_rank integer,
  bike_samedir_rhook1 integer,
  bike_samedir_rhook1_rank integer,
  bike_samedir_rhook2 integer,
  bike_samedir_rhook2_rank integer,
  bike_perp_rhook integer,
  bike_perp_rhook_rank integer,
  bike_perp_rhook_swalk1 integer,
  bike_perp_rhook_swalk1_rank integer,
  bike_perp_rhook_swalk2 integer,
  bike_perp_rhook_swalk2_rank integer,
  bike_tbone_swalk1 integer,
  bike_tbone_swalk1_rank integer,
  bike_tbone_swalk2 integer,
  bike_tbone_swalk2_rank integer,
  bike_allfatal integer,
  bike_allfatal_rank integer,
  bike_allinjury integer,
  bike_allinjury_rank integer,
  bike_injuryfatal integer,
  bike_injuryfatal_rank integer,
  bike_top10 integer,
  bike_num1s integer,
  bike_num2s integer,
  bike_num3s integer,
  bike_num4s integer,
  bike_num5s integer,
  bike_num6s integer,
  bike_num7s integer,
  bike_num8s integer,
  bike_num9s integer,
  bike_num10s integer,
  CONSTRAINT crash_aggregates_pkey PRIMARY KEY (int_id)
);

The bike1 table:

CREATE TABLE received.crashes_bike1
(
  caseid text,
  year integer,
  unittype_one text,
  unittype_two text,
  unittype_three text,
  circumstance text,
  primary_contrib text,
  condition_1 text,
  condition_2 text,
  condition_3 text,
  dirfromint text,
  diroftravel_one text,
  diroftravel_two text,
  tdg_directions text,
  dir_key integer,
  directions text,
  bike_mvmt text,
  veh_mvmt text,
  comb_mvmt text,
  comb_mvmt_sw text,
  map_code integer,
  injury boolean,
  diroftravel_three text,
  disabled_st1 text,
  disabled_st2 text,
  contrib_1 text,
  contrib_2 text,
  contrib_3 text,
  enteredby text,
  entereddate text,
  estvehspeed_one integer,
  estvehspeed_two integer,
  estvehspeed_three integer,
  feetfromint integer,
  firstharmful text,
  mostharmful text,
  secondharmful text,
  internamedir text,
  lightingcondition text,
  location text,
  masterid integer,
  precrashmaneuv_1 text,
  precrashmaneuv_2 text,
  precrashmaneuv_3 text,
  node integer,
  numberinjured integer,
  numberoffatalities integer,
  pedaction_one text,
  pedaction_two text,
  pedaction_three text,
  publicproperty text,
  railroadcrossing text,
  roadcondition text,
  roadcontour text,
  roaddescription text,
  roadsurface text,
  safetyequipmenthelmet_one text,
  safetyequipmenthelmet_two text,
  safetyequipmenthelmet_three text,
  safetyequipsystem_one text,
  safetyequipsystem_two text,
  safetyequipsystem_three text,
  safetyequipuse_one text,
  safetyequipuse_two text,
  safetyequipuse_three text,
  speedlimit_one integer,
  speedlimit_two integer,
  speedlimit_three integer,
  street1 integer,
  street2 integer,
  streetname_st1 text,
  streetname_st2 text,
  street_intersection text,
  totvehs integer,
  unitage_one integer,
  unitage_two integer,
  unitage_three integer,
  vehcomb_one text,
  vehcomb_two text,
  vehcomb_three text,
  vehicledefect_one text,
  vehicledefect_two text,
  vehicledefect_three text,
  technicaljudgement text,
  notes text,
  typology text,
  same_dir boolean,
  opp_dir boolean,
  perpen boolean,
  angle boolean,
  notes2 text,
  sw text,
  ww_sw text,
  cw_dwy_alley text,
  day_week text,
  weekday text,
  trail_access text,
  bike_s_veh_s_st_p boolean,
  bike_s_veh_lt_st_od boolean,
  bike_s_veh_rt_st_p boolean,
  bike_s_veh_rt_st_sd boolean,
  bike_s_veh_s_st_sd boolean,
  bike_s_veh_rt_st_ww_p boolean,
  bike_s_veh_s_sw_ww_p boolean,
  bike_s_veh_rt_sw_ww_p boolean,
  highspeed boolean,
  injurycrash boolean,
  id integer NOT NULL DEFAULT nextval('crashes_bike1_id_seq'::regclass),
  road_id1 integer,
  road_id2 integer,
  at_intersection boolean,
  int_id integer,
  CONSTRAINT crashes_bike1_pkey PRIMARY KEY (id)
);

CREATE INDEX idx_crashbike1bsvlso
  ON received.crashes_bike1
  USING btree
  (bike_s_veh_lt_st_od);

CREATE INDEX idx_crashbike1bsvrsp
  ON received.crashes_bike1
  USING btree
  (bike_s_veh_rt_st_p);

CREATE INDEX idx_crashbike1bsvrssd
  ON received.crashes_bike1
  USING btree
  (bike_s_veh_rt_st_sd);

CREATE INDEX idx_crashbike1bsvrswp
  ON received.crashes_bike1
  USING btree
  (bike_s_veh_rt_sw_ww_p);

CREATE INDEX idx_crashbike1bsvssp
  ON received.crashes_bike1
  USING btree
  (bike_s_veh_s_st_p);

CREATE INDEX idx_crashbike1bsvsstsd
  ON received.crashes_bike1
  USING btree
  (bike_s_veh_s_st_sd);

CREATE INDEX idx_crashbike1bsvsswwwp
  ON received.crashes_bike1
  USING btree
  (bike_s_veh_s_sw_ww_p);

CREATE INDEX idx_crashbike1inj
  ON received.crashes_bike1
  USING btree
  (injurycrash);

CREATE INDEX idx_crashbike1int
  ON received.crashes_bike1
  USING btree
  (int_id);

The bike2 table:

CREATE TABLE received.crashes_bike2
(
  accidentdate date,
  accidenttime time without time zone,
  adverseweather text,
  appovertaketurn text,
  caseid text,
  constructionzone text,
  contribfact_one text,
  contribfact_two text,
  contribfact_three text,
  dirfromint text,
  diroftravel_one text,
  diroftravel_two text,
  diroftravel_three text,
  disabled_st1 text,
  disabled_st2 text,
  driveraction_one text,
  driveraction_two text,
  enteredby text,
  entereddate text,
  estvehspeed_one integer,
  estvehspeed_two integer,
  estvehspeed_three integer,
  feetfromint integer,
  firstharmful text,
  mostharmful text,
  secondharmful text,
  internamedir text,
  lightingcondition text,
  location text,
  masterid integer,
  node integer,
  numberinjured integer,
  numberoffatalities integer,
  roadcontour text,
  roaddescription text,
  roadsurface text,
  rownum integer,
  speedlimit_one integer,
  speedlimit_two integer,
  speedlimit_three integer,
  street1 integer,
  street2 integer,
  streetname_st1 text,
  streetname_st2 text,
  totvehs integer,
  unitage_one integer,
  unitage_two integer,
  unitage_three integer,
  vehcomb_one text,
  vehcomb_two text,
  vehcomb_three text,
  unittype_one text,
  unittype_two text,
  unittype_three text,
  movement_one text,
  movement_two text,
  movement_three text,
  circumstance text,
  sw text,
  othercw text,
  motoristplacement text,
  relationshipofplacements text,
  wwswriding text,
  bicyclelane text,
  dooring text,
  bicyclewwstreetriding text,
  crashmonth integer,
  crashday text,
  hour time without time zone,
  ridinglocation text,
  crashyear integer,
  injurycrash boolean,
  fatalcrash boolean,
  noinjuryfatality boolean,
  unit1_veh boolean,
  unit2_veh boolean,
  unit1_bike boolean,
  unit2_bike boolean,
  intdistance_ft text,
  hrgrp text,
  bicyclist boolean,
  bikeaction_one text,
  bikeaction_two text,
  newdriveraction_one text,
  newdriveraction_two text,
  bikeaction_one2 text,
  bikeaction_two2 integer,
  newbikeaction text,
  newdriveraction_one2 text,
  newdriveraction_two2 integer,
  newdriveraction2 text,
  bike_movement text,
  driver_movement text,
  unit1 text,
  unit2 text,
  unit3 text,
  complex boolean,
  nobike boolean,
  bike_movement2 text,
  bike_fault boolean,
  driver_movement2 text,
  crashtype text,
  relationship text,
  ww boolean,
  xwalk boolean,
  bikelane text,
  sidewalk boolean,
  location2 text,
  direction text,
  newcrashtype text,
  bike_s_veh_s_st_p boolean,
  bike_s_veh_lt_st_od boolean,
  bike_s_veh_rt_st_p boolean,
  bike_s_veh_rt_st_sd boolean,
  bike_s_veh_s_st_sd boolean,
  bike_s_veh_rt_st_ww_p boolean,
  highspeed boolean,
  atfault text,
  influence boolean,
  distracted_driverfault boolean,
  aggressive_driverfault boolean,
  inexperience_bikerfault boolean,
  aggressive_bikerfault boolean,
  failyield_driverfault boolean,
  carereckless_driverfault boolean,
  disregardsignal_driverfault boolean,
  failyield_bikerfault boolean,
  disregardsignal_bikerfault boolean,
  allothercrashtype boolean,
  bike_s_veh_rt_sw_ww_p boolean,
  bike_s_veh_s_sw_ww_p boolean,
  id integer NOT NULL DEFAULT nextval('crashes_bike2_id_seq'::regclass),
  road_id1 integer,
  road_id2 integer,
  at_intersection boolean,
  int_id integer,
  CONSTRAINT crashes_bike2_pkey PRIMARY KEY (id)
);

CREATE INDEX idx_crashbike2aggbkflt
  ON received.crashes_bike2
  USING btree
  (aggressive_bikerfault);

CREATE INDEX idx_crashbike2bsvhrtstsd
  ON received.crashes_bike2
  USING btree
  (bike_s_veh_rt_st_sd);

CREATE INDEX idx_crashbike2bsvlso
  ON received.crashes_bike2
  USING btree
  (bike_s_veh_lt_st_od);

CREATE INDEX idx_crashbike2bsvrsp
  ON received.crashes_bike2
  USING btree
  (bike_s_veh_rt_st_p);

CREATE INDEX idx_crashbike2bsvrswp
  ON received.crashes_bike2
  USING btree
  (bike_s_veh_rt_sw_ww_p);

CREATE INDEX idx_crashbike2bsvssp
  ON received.crashes_bike2
  USING btree
  (bike_s_veh_s_st_p);

CREATE INDEX idx_crashbike2bsvsstsd
  ON received.crashes_bike2
  USING btree
  (bike_s_veh_s_st_sd);

CREATE INDEX idx_crashbike2bsvsswwwp
  ON received.crashes_bike2
  USING btree
  (bike_s_veh_s_sw_ww_p);

CREATE INDEX idx_crashbike2carreckdrv
  ON received.crashes_bike2
  USING btree
  (carereckless_driverfault);

CREATE INDEX idx_crashbike2drvflt
  ON received.crashes_bike2
  USING btree
  (aggressive_driverfault);

CREATE INDEX idx_crashbike2dsrgdsgndrv
  ON received.crashes_bike2
  USING btree
  (disregardsignal_driverfault);

CREATE INDEX idx_crashbike2dsrgsgnbk
  ON received.crashes_bike2
  USING btree
  (disregardsignal_bikerfault);

CREATE INDEX idx_crashbike2dstdrv
  ON received.crashes_bike2
  USING btree
  (distracted_driverfault);

CREATE INDEX idx_crashbike2dui
  ON received.crashes_bike2
  USING btree
  (influence);

CREATE INDEX idx_crashbike2flyldbik
  ON received.crashes_bike2
  USING btree
  (failyield_bikerfault);

CREATE INDEX idx_crashbike2flylddrv
  ON received.crashes_bike2
  USING btree
  (failyield_driverfault);

CREATE INDEX idx_crashbike2ftl
  ON received.crashes_bike2
  USING btree
  (fatalcrash);

CREATE INDEX idx_crashbike2hispd
  ON received.crashes_bike2
  USING btree
  (highspeed);

CREATE INDEX idx_crashbike2inj
  ON received.crashes_bike2
  USING btree
  (injurycrash);

CREATE INDEX idx_crashbike2int
  ON received.crashes_bike2
  USING btree
  (int_id);

Best Answer

Debunking

If you are doing all your updates in the same transaction, each of them will have to work an increasingly bigger set of (physical) tuples. See the following example:

CREATE TABLE explode (id integer, something text);

INSERT INTO explode SELECT i, md5(i::text) FROM generate_series(1, 100000) t(i);

\dt+ explode -- done in psql
                     List of relations
 Schema │  Name   │ Type  │ Owner  │  Size   │ Description 
────────┼─────────┼───────┼────────┼─────────┼─────────────
 test   │ explode │ table │ avaczi │ 6704 kB │ 

BEGIN;

UPDATE explode SET something = something || 'a';

\dt+ explode 
 test   │ explode │ table │ avaczi │ 13 MB │ 

UPDATE explode SET something = something || 'a';

\dt+ explode 
 test   │ explode │ table │ avaczi │ 20 MB │ 

COMMIT;

You might see this even when the different updates are not in the same transaction. This will be the case if autovacuum (or a manual VACUUM) does not keep up with the speed of change.

The reason is how MVCC in PostgreSQL works. When doing an update, it creates a new physical row with the new values, marking the old one invisible to transactions that start after the current one. This means that there will be row versions on the physical level that are 'wasted space' once the transaction commits. These have to be eventually freed by (auto)vacuuming.

Now when you are inside in a transaction, it is not clear if it will be committed or rolled back, therefore the DB has to keep all obsolete row versions at hand, so that it can revert to the original ones. (Well, not all, there seem to be some optimizations around this, but definitely some: in my experiments, the table seen above grew up to 200 MB (instead of 6.6).) This means the physical size of your table will grow and grow - the size of each step depends what exactly you do in the UPDATE statements.

And this is where you can improve your process a lot. Currently, you have many updates that change the whole table unconditionally, effectively meaning that in each iteration, you have twice the size of the table you had in the previous one. After 46 (or 23) rounds, even a relatively small table can grow to something really big.

Accordingly, the time needed to operate on all the versions that are kept around takes more and more time. In my runs, the first UPDATE finished in 190-200 ms, the 20th iteration reached 400 ms. As you have nice wide tables and the updates take already 30-40 seconds when run alone, you can expect some serious slowdown.

The (possible) solution

As your updates look very similar, and they just affect different columns of the same table, you could try something like

UPDATE  generated.crash_aggregates
SET     bike_driver_aggressive = (
            SELECT  COUNT(*)
            FROM    crashes_bike2 c
            WHERE   c.int_id = crash_aggregates.int_id
            AND     c.aggressive_driverfault
        ),
        bike_allinjury = (
            SELECT  COUNT(*)
            FROM    crashes_bike1 c
            WHERE   c.int_id = crash_aggregates.int_id
            AND     c.injurycrash
        ) + (
            SELECT  COUNT(*)
            FROM    crashes_bike2 c
            WHERE   c.int_id = crash_aggregates.int_id
            AND     c.injurycrash
        ),
        [...];

This means the CPUs will have to think a bit about all the counts, but you paid for them to do their work. At the same time, the table will be rewritten only once.

When all this is done, it seems to be enough to calculate the ranks only once, by similarly composing one UPDATE query for updating them in only one round.

Improve more

Well, the above UPDATE (as also suggested in a comment) is far from being the most efficient ever. Here is a take on improving it a bit more.

So, as it looks like, the counts are calculated one by one, meaning that the two source tables will be visited many times. Depending on their size (and some other factors), this might be terrible. In your case it stays under the pain threshold, otherwise you would already complain about that, too ;) In other cases, it might not be so.

The idea is that it is possible to collect all the aggregates in a single run, and to use this as a source for the updates. For this, we can build a big aggregate-all structure:

SELECT int_id,
       sum(c1.injurycrash::integer) + sum(c2.injurycrash::integer), 
       sum(aggressive_driverfault::integer),
       ...
  FROM crashes_bike1 AS c1, crashes_bike2 AS c2
 GROUP BY int_id;

Here I suppose that the distribution of the different flags (like injurycrash) is so that TRUE values are not very rare. If they are, this one big scan of the whole crashes_bike table (well, the two of them) might be worse than many index (only) scans. However, I don't see you have everything indexed there (and it might make no sense - it's hard to judge without actual data).

I've replaced count() for sum() in the aggregates above. The trick is to convert the boolean values to integer and summing them up, so avoiding a really complicated set of CASE expressions.

Once we have the above result set, we can plug it into the UPDATE itself:

WITH aggregates1 AS (
    SELECT int_id,
           sum(injurycrash::integer), 
           sum(aggressive_driverfault::integer) AS aggressive_driverfault,
           ...
      FROM crashes_bike1
     GROUP BY int_id
), aggregates2 AS (
    SELECT sum(injurycrash::integer) AS injurycrash,
           ...
      FROM crashes_bike2
     GROUP BY int_id
)
UPDATE crash_aggregates AS ca
   SET bike_allinjury = a1.injurycrash + a2.injurycrash,
       bike_driver_aggressive = a.aggressive_driverfault,
       ...
  FROM aggregates1 AS a1
  FULL JOIN aggregates2 AS a2 USING (int_id)
 WHERE ca.int_id = a1.int_id OR ca.int_id = a2.int_id;