Postgresql – Best way to delete large set of rows knowing field to NOT delete

deleteoptimizationperformancepostgresqlpostgresql-performance

Coming from a Rails background, I have a large production database that I have a copy of where I only need records from 4 of the 2000+ companies. I'm trying to delete all the rows except the ones belonging to the 4, and I know the way I have it is not optimal.

DELETE FROM appointments 
WHERE (appointments.company_id NOT IN (6, 753, 785, 1611))

another example is when I have to delete records on a table where the company_id is on an associated table:

DELETE FROM mappings 
WHERE mappings.id IN (SELECT mappings.id 
                      FROM code_mappings 
                      INNER JOIN codes ON codes.remote_id = mappings.code_remote_id 
                      WHERE (codes.company_id NOT IN (6, 753, 785, 1611)))

Best Answer

In relation to the first table, appointments, make sure that you have an index on company_id column.

In relation to the mappings table, using EXISTS rather than IN may yield better performance. You can re-write your query as following:

DELETE FROM mappings AS m
WHERE EXISTS (  SELECT 1
                FROM code_mappings AS cm
                  INNER JOIN codes AS c
                    ON c.remote_id = cm.code_remote_id
                WHERE 
                (
                c.company_id NOT IN (6, 753, 785, 1611)
                AND cm.id = m.id
                )
)

In the above query, you will also benefit from indexes on the mappings and code tables.

Documentation for creating indexes is @ https://www.postgresql.org/docs/current/static/sql-createindex.html. In your case, you can create indexes on the relevant tables as following:

CREATE INDEX company_id_idx ON appointments (company_id);

CREATE INDEX remote_id_company_id_idx ON codes (remote_id, company_id);

CREATE INDEX code_remote_id_id_idx ON code_mappings (code_remote_id, id);

-- If you don't already have a primary key OR index on `id` column in the `mappings` table, then create one:

ALTER TABLE mappings ADD PRIMARY KEY (id);
-- Choose primary key, or index: CREATE INDEX id_idx ON mappings (id);