Postgresql – How to rewrite the query to use less temporary space

disk-spaceperformancepostgresqlquery-performancetemporary-tables

I'm using Postgres 9.5. I want to identify (and then delete) rows from my table that are "duplicates" in the sense that they contain three fields that are the same, namely

my_object_id
name
time_in_ms

The table is defined like so

myproject_production=> \d my_object_times;
                               Table "public.my_object_times"
      Column       |            Type             |              Modifiers
-------------------+-----------------------------+-------------------------------------
 first_name        | character varying           |
 last_name         | character varying           |
 time_in_ms        | bigint                      |
 created_at        | timestamp without time zone | not null
 updated_at        | timestamp without time zone | not null
 name              | character varying           |
 racer_id          | character varying           |
 age               | integer                     |
 city              | character varying           |
 state_id          | integer                     |
 country_id        | integer                     |
 my_object_id           | character varying           | not null
 id                | character varying           | not null default uuid_generate_v4()
Indexes:
    "my_object_times_pkey" PRIMARY KEY, btree (id)
    "index_my_object_times_on_country_id" btree (country_id)
    "index_my_object_times_on_my_object_id" btree (my_object_id)
    "index_my_object_times_on_state_id" btree (state_id)
    "my_object_times_name_idx" btree (upper(name::text) text_pattern_ops)
    "my_object_times_rev_name_idx" btree (reverse(upper(name::text)) text_pattern_ops)
Foreign-key constraints:
    "fk_rails_0fe1d25967" FOREIGN KEY (country_id) REFERENCES countries(id)
    "fk_rails_a8771b3575" FOREIGN KEY (state_id) REFERENCES states(id)
    "fk_rails_ba656ceafa" FOREIGN KEY (my_object_id) REFERENCES my_objects(id) ON DELETE CASCADE
Referenced by:
    TABLE "user_my_object_time_matches" CONSTRAINT "fk_rails_2e7860946c" FOREIGN KEY (my_object_time_id) REFERENCES my_object_times(id) ON DELETE CASCADE
    TABLE "my_objects" CONSTRAINT "fk_rails_dda3297b57" FOREIGN KEY (linked_my_object_time_id) REFERENCES my_object_times(id) ON DELETE CASCADE

I thought this query would do the job

select rt1.id, rt1.name 
FROM my_object_times rt1, 
     my_object_times rt2 
where rt1.my_object_id = rt2.my_object_id 
  and rt1.name = rt2.name 
  and rt1.time_in_ms = rt2.time_in_ms 
  and rt1.id > rt2.id;

But I guess I have so much data, the above query is dying with the below error:

ERROR:  could not write block 1862514 of temporary file: No space left on device

This is after pointing my temporary space (pgsql_tmp directory) at a partition that gives it 20 GB of temp space. So my question is, how can I rewrite the above so that it does not use as much temp space? That is fine if it takes the query a really long time to run.

Best Answer

Try the query using window functions which gets rid of the self join:

select id, name
from (
 select id, name, my_object_id, time_in_ms, 
        count(*) over (partition by name, my_object_id, time_in_ms) as cnt
 from my_object_times t1
) t 
where cnt > 1;

The partition by will still need to sort the rows, so it might be that it still needs a lot of temp space. If your server has a lot of memory, you can also increase work_mem for your session before running the statement.

Another possible way is to use an exists query instead of the self join:

select id, name
from my_object_times t1
where exists (select *
              from my_object_times t2
              where t1.my_object_id = t2.my_object_id 
                and t1.name = t2.name 
                and t1.time_in_ms = t2.time_in_ms 
                and t1.id <> rt2.id);

An index on (my_object_id, name, time_in_ms, id) would help for this.

Check the execution plan for both to see if one is cheaper then then other.