PostgreSQL Performance – TEMP Table Alternatives for PL/pgSQL

performancepostgresqlpostgresql-performancetemporary-tablesupsert

I am trying to emulate MERGE behavior with pl/pgsql:

-- Generate the data from funtion
CREATE TEMP TABLE temp_x  (id int, id2 int, data text, created_at timestamp, updated_at timestamp)  ON COMMIT DROP;
INSERT INTO temp_x SELECT * FROM set_gernating_function(p);

-- DELETE record with same id2
DELETE FROM x WHERE NOT EXISTS (SELECT 1 FROM temp_x WHERE temp_x.id=x.id) AND id2=p.id2;

-- UPSERT by (id, id2)
INSERT INTO x
    SELECT * FROM temp_x z
  ON CONFLICT(id, id2) DO UPDATE
    SET
      updated_at=excluded.updated_at,
      data=excluded.data;

But the temp table usage is generating bloat in pg_class and pg_attr table very fast — Faster then I could do vacuum — and affect other queries. Any idea?

There are some restrictions:

Preserve the created_at time if the id already exist
x have a delete-trigger, so no unnecessary delete.
The set_generating_function is slow
The size for temp_x is small. (< 50)
The run rate is very high (thousands per second)

Best Answer

You don't need the temp table:

You can do this in a single statement:

with temp_x (id, id2, data, created_at, updated_at) AS (
   SELECT * 
   FROM set_gernating_function(p)
), deleted as (
  DELETE FROM x 
  WHERE NOT EXISTS (SELECT 1 
                    FROM temp_x 
                    WHERE temp_x.id = x.id) 
    AND id2=p.id2
)
-- UPSERT by (id, id2)
INSERT INTO x
SELECT * 
FROM temp_x z
  ON CONFLICT(id, id2) DO UPDATE
    SET updated_at = excluded.updated_at,
        data = excluded.data;

Related Solutions

Postgresql – Consolidating large table

I want to group records that have the same combination of class, ip_address, and hostname, and keep the highest timestamp for each day from each group.

Not using the column name timestamp (like you shouldn't either). It's a reserved word in SQL and a basic type name in Postgres. Using ts instead.

The query is surprisingly simple with DISTINCT ON:

SELECT DISTINCT ON (class, ip_address, hostname, ts::date) *
FROM   agent_log
WHERE  ts < now() - interval '7 days'
ORDER  BY class, ip_address, hostname, ts::date, ts DESC;

Detailed explanation:
Select first row in each GROUP BY group?

Database Design for Handling 1 Billion Rows in SQL Server

5000 inserts per minute are about 83 inserts per second. With 5 indexes that's 400 physical rows inserted per second. If the workload was in-memory this would not pose a problem even to the smallest of servers. Even if this was a row-by-row insert using the most inefficient way I can think of. 83 trivial queries per second are just not interesting from a CPU standpoint.

Probably, you are disk-bound. You can verify this by looking at wait stats or STATISTICS IO.

Your queries probably touch a lot of different pages so that the buffer pool does not have space for all of them. This causes frequent page reads and probably random disk writes as well.

Imagine a table where you only physically insert at the end because of an ever-increasing key. The working set would be one page: the last one. This would generate sequential IO as well wen the lazy writer or checkpoint process writes the "end" of the table to disk.

Imagine a table with randomly-placed inserts (classic example: a guid key). Here, all pages are the working set because a random page will be touched for each insert. IOs are random. This is the worst case when it comes to working set.

You're in the middle. Your indexes are of the structure (SomeValue, SequentialDateTime). The first component partially randomizes the sequentiality provided by the second. I guess there are quite a few possible values for "SomeValue" so that you have many randomly-placed insert-points in your indexes.

You say that data is split into 10GB tables per week. That's a good starting point because the working set is now bounded by 10GB (disregarding any reads you might do). With 12GB of server memory it is unlikely, though, that all relevant pages can stay in memory.

If you could reduce the size of the weekly "partitions" or increase server memory by a bit you are probably fine.

I'd expect that inserts at the beginning of the week are faster then at the end. You can test this theory on a dev server by running a benchmark with a certain data size and gradually reducing server memory until you see performance tank.

Now even if all reads and writes fit into memory you might still have random dirty page flushing IO. The only way to get rid of that is to write into co-located positions in your indexes. If you can at all convert your indexes to use (more) sequential keys that would help a lot.

As a quick solution I'd add a buffering layer between the clients and the main table. Maybe accumulate 15min of writes into a staging table and periodically flush it. That takes away the load spikes and uses a more efficient plan to write to the big table.

Best Answer

Related Solutions

Postgresql – Consolidating large table

Database Design for Handling 1 Billion Rows in SQL Server

Related Question