PostgreSQL – How to Quickly Cascade Rare Delete to Large Table

Tags: index, performance, postgresql

I have a schema that looks like this:

create table items (
    id serial primary key
);

create table revisions (
    id serial primary key,
    item_id int not null references items(id),
    name text not null,
    property_a text,
    property_b text,
    property_c text
);

create table deltas (
  id serial primary key,
  item_id int not null references items(id) on delete cascade,
  old_revision_id int not null references revisions(id) on delete cascade,
  new_revision_id int not null references revisions(id) on delete cascade
);

This schema can't be changed.

We need to support the rare operation of deleting from items or revisions, with the delete cascading to the deltas table. These deletes are rare compared to inserts/selects on the deltas table, but they still happen at least ~20-30 times a day.

A major problem is that the deltas table is very big, so these cascades were very slow (taking > 30 seconds). So we added separate indexes on item_id, old_revision_id, and new_revision_id, which turned the cascading delete into a sub-millisecond operation. Side note: our actual table has other columns and other indexes to support the application's normal query patterns.
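For reference, the per-column B-tree indexes we added look like this (the index names here are illustrative, not our real ones):

create index deltas_item_id_idx on deltas (item_id);
create index deltas_old_revision_id_idx on deltas (old_revision_id);
create index deltas_new_revision_id_idx on deltas (new_revision_id);

Each one lets the corresponding foreign-key check or cascade find matching deltas rows without a sequential scan.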

This introduced a new problem: the combined size of these indexes for the deltas table is now ~7-8 times the size of the actual table and the rate of growth is pretty high since the deltas table has a lot of writes to it.

It seems silly to use many large indexes to support such a rare operation. We have a couple of options to solve this:

  1. Leave it with multiple indexes and deal with the disk size as a separate problem.
  2. Re-architect the application to avoid doing any deletes for these tables to prevent the cascades from happening in the first place.
  3. Change our indexes.

For #3, I read in the Postgres docs that we might be able to use a multi-column GIN index on (item_id, old_revision_id, new_revision_id), which could support the cascading delete, e.g. delete from deltas where old_revision_id = <deleted_revision_id>:

A multicolumn GIN index can be used with query conditions that involve any subset of the index's columns. Unlike B-tree or GiST, index search effectiveness is the same regardless of which index column(s) the query conditions use.

Am I correct in understanding this, and will it make a difference compared to 3 separate indexes, or should we take a different approach?

Best Answer

The GIN index certainly can be used for the cascading delete (you will need to create extension btree_gin before creating the index).
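A minimal sketch, assuming the deltas schema from the question (the index name is arbitrary). btree_gin supplies GIN operator classes for scalar types like int, which plain GIN does not have:

create extension if not exists btree_gin;

create index deltas_ids_gin on deltas
    using gin (item_id, old_revision_id, new_revision_id);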

How much space it will save you is hard to predict, it would depend on how much duplication there is in the values of those columns.
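Rather than predicting, you can measure directly once the candidate indexes exist, for example via the statistics views (assuming the table is named deltas):

select indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
where relname = 'deltas';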

Having 3 separate GIN indexes will also work, and their combined size is likely slightly smaller than the one multicolumn index. Each value in a multicolumn GIN index has to "remember" which column it came from, which takes up extra space.
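That variant would look like this (again assuming btree_gin is installed; index names are illustrative):

create index deltas_item_id_gin on deltas using gin (item_id);
create index deltas_old_revision_id_gin on deltas using gin (old_revision_id);
create index deltas_new_revision_id_gin on deltas using gin (new_revision_id);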

If the btree indexes are bloated, whatever is causing that might cause the gin index to get bloated as well.