Postgresql – Best way to delete large set of rows knowing field to NOT delete

Tags: delete, optimization, performance, postgresql, postgresql-performance

I come from a Rails background. I have a copy of a large production database in which I only need the records from 4 of the 2000+ companies. I'm trying to delete all the rows except the ones belonging to those 4 companies, and I know the way I'm doing it is not optimal.

DELETE FROM appointments 
WHERE (appointments.company_id NOT IN (6, 753, 785, 1611))

Another example is when I have to delete records from a table where the company_id is on an associated table:

DELETE FROM mappings 
WHERE mappings.id IN (SELECT mappings.id 
                      FROM code_mappings 
                      INNER JOIN codes ON codes.remote_id = mappings.code_remote_id 
                      WHERE (codes.company_id NOT IN (6, 753, 785, 1611)))

Best Answer

For the first table, appointments, make sure that you have an index on the company_id column.
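One way to see whether that index is actually used for this DELETE is to look at the query plan. The following is a minimal sketch, assuming you run it against your copy rather than production. EXPLAIN ANALYZE really executes the statement, so it is wrapped in a transaction that is rolled back. With a NOT IN list that matches almost every row, the planner may still choose a sequential scan, so it is worth checking the plan rather than assuming.

```sql
-- EXPLAIN ANALYZE executes the DELETE, so roll it back to keep the data.
BEGIN;

EXPLAIN ANALYZE
DELETE FROM appointments
WHERE company_id NOT IN (6, 753, 785, 1611);

ROLLBACK;
```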

For the mappings table, using EXISTS rather than IN may yield better performance. You can rewrite your query as follows:

DELETE FROM mappings AS m
WHERE EXISTS (  SELECT 1
                FROM code_mappings AS cm
                  INNER JOIN codes AS c
                    ON c.remote_id = cm.code_remote_id
                WHERE 
                (
                c.company_id NOT IN (6, 753, 785, 1611)
                AND cm.id = m.id
                )
)
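PostgreSQL also supports a USING clause on DELETE, which lets you write the same condition as a join. This is a sketch of an equivalent formulation under the same assumed schema as the EXISTS query (code_mappings.id matching mappings.id). Whether it performs better than EXISTS depends on your data, so compare the plans before settling on one.

```sql
-- Same rows as the EXISTS version: a mapping is deleted if at least one
-- joined code row belongs to a company outside the keep list.
DELETE FROM mappings AS m
USING code_mappings AS cm
  INNER JOIN codes AS c
    ON c.remote_id = cm.code_remote_id
WHERE cm.id = m.id
  AND c.company_id NOT IN (6, 753, 785, 1611);
```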

The above query will also benefit from indexes on the mappings, code_mappings, and codes tables.

Documentation for creating indexes is at https://www.postgresql.org/docs/current/static/sql-createindex.html. In your case, you can create indexes on the relevant tables as follows:

CREATE INDEX company_id_idx ON appointments (company_id);

CREATE INDEX remote_id_company_id_idx ON codes (remote_id, company_id);

CREATE INDEX code_remote_id_id_idx ON code_mappings (code_remote_id, id);

-- If you don't already have a primary key or an index on the `id` column of the `mappings` table, create one:

ALTER TABLE mappings ADD PRIMARY KEY (id);
-- Or, if you prefer a plain index over a primary key:
-- CREATE INDEX id_idx ON mappings (id);
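Before creating any of these, it may be worth checking which indexes already exist, since a duplicate index only adds write overhead. A sketch using the pg_indexes system view, assuming the four tables are the ones named in this answer:

```sql
-- List existing indexes and their definitions for the tables involved.
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('appointments', 'codes', 'code_mappings', 'mappings')
ORDER BY tablename, indexname;
```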

Related Solutions

Sql-server – What’s better for large changes to a table: DELETE and INSERT every time or UPDATE existing

It really depends on how much of the data is changing. Let's say this table has 20 columns and 5 indexes, each on a different column.

If the values in all 20 columns are changing, or even if only 5 columns are changing but all 5 are indexed, then you may be better off deleting and inserting. But if only 2 columns are changing and they are not part of any non-clustered index, then you may be better off updating the records, because in this case only the clustered index will be updated (and the non-clustered indexes will not have to be updated).

On further research, I found that my comment above is somewhat redundant, because SQL Server internally has two separate mechanisms for performing an UPDATE: an "in-place update" (changing a column's value to a new one in the original row) and a "not-in-place update" (a DELETE followed by an INSERT).

In-place updates are the rule and are performed whenever possible. The rows stay at exactly the same location, on the same page, in the same extent. Only the affected bytes are changed. The transaction log has only one record (provided there are no update triggers). Updates happen in place if a heap is being updated (and there is enough space on the page). Updates also happen in place if the clustering key changes but the row does not need to move at all.

For example, suppose you have a clustered index on last name and the names Able, Baker, and Charlie. If you update Baker to Becker, no rows have to be moved, so the update can happen in place. If you update Able to Kumar, however, the rows have to be shifted (even though they will still be on the same page). In this case, SQL Server will do a DELETE followed by an INSERT.
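The scenario above can be set up as follows. This is only an illustrative sketch: the table name people and the index name cix_people_last_name are hypothetical. Which update mechanism the engine picks is an internal decision that this script does not expose.

```sql
-- Hypothetical table, clustered on last_name.
CREATE TABLE people (last_name varchar(50) NOT NULL);
CREATE CLUSTERED INDEX cix_people_last_name ON people (last_name);

INSERT INTO people (last_name) VALUES ('Able'), ('Baker'), ('Charlie');

-- Baker -> Becker: still sorts between Able and Charlie,
-- so the row does not need to change position.
UPDATE people SET last_name = 'Becker' WHERE last_name = 'Baker';

-- Able -> Kumar: now sorts after Charlie,
-- so the row's position in the clustered index changes.
UPDATE people SET last_name = 'Kumar' WHERE last_name = 'Able';
```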

Considering the above, I would suggest that you do a normal UPDATE and let SQL Server figure out the best way to do it internally.

For more details on UPDATE internals, or any other SQL Server internals, check out the book SQL Server 2008 Internals by Kalen Delaney, Paul Randal, et al.

Mysql – remove duplicate rows in mysql table that does not contain primary key

In the spirit of @yercube's answer, I have an answer that has an added twist.

CREATE TABLE stage
(
    id int not null auto_increment,
    name varchar(20),
    primary key (id)
);
CREATE TABLE stage2 LIKE stage;
INSERT INTO stage (name) SELECT name FROM item;
INSERT INTO stage2 (id) SELECT min_id FROM
(SELECT MIN(id) min_id,name FROM stage GROUP BY name) A;
UPDATE stage2 A INNER JOIN stage B USING (id) SET A.name=B.name;
TRUNCATE TABLE item;
INSERT INTO item (name) SELECT name FROM stage2;
DROP TABLE stage;
DROP TABLE stage2;

This will load stage2 with the first occurrence of each name from item, zap the item table, and load the unique occurrences back.
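To confirm the result, a quick check like the one below should return no rows. It is a sketch that assumes name is the only column that defines a duplicate, as in the script above.

```sql
-- Any row returned here is a name that still appears more than once in item.
SELECT name, COUNT(*) AS occurrences
FROM item
GROUP BY name
HAVING COUNT(*) > 1;
```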

If you look back at @yercube's answer and compare it to mine, his is much simpler because:

@yercube uses one temp table, while I use two
I had to create a column for iteration control, @yercube did not need to
@yercube has fewer steps
both answers achieve the same thing

I do not expect my answer to be accepted. The sole purpose of my answer was to demonstrate that other answers lose the concise clarity needed to solve your problem. Again, hats off to @yercube.
