Postgresql – Plpgsql seems to be deleting and inserting instead of updating – Why

plpgsql postgresql

I am using PostgreSQL 7.4.

I have a large table, call it table_a:

key1 INT NOT NULL, 
key2 INT NOT NULL, 
data INT NOT NULL, 
itstamp INT NOT NULL DEFAULT (date_part('EPOCH'::text, (timeofday())::timestamp without time zone))::INTEGER

and a table that summarizes the last update time for each key1, call it table_b:

key1        INT NOT NULL,
max_itstamp INT NOT NULL

I created a trigger function in plpgsql to update or insert rows in table_b as necessary:

CREATE OR REPLACE FUNCTION table_b_update() RETURNS TRIGGER AS '
 DECLARE
  l_key1 INT;
  l_itstamp INT;
 BEGIN
  l_key1 := new.key1;
  l_itstamp := new.itstamp;
  PERFORM TRUE FROM table_b WHERE key1=l_key1;
  IF NOT FOUND THEN 
   INSERT INTO table_b(key1, max_itstamp) values (l_key1, l_itstamp);
  ELSE
   UPDATE table_b SET max_itstamp=l_itstamp WHERE key1=l_key1;
  END IF;
  RETURN NULL;
 END'
LANGUAGE plpgsql IMMUTABLE;

and then I attached a trigger to table_a:

CREATE TRIGGER table_a_trigger1 AFTER INSERT OR UPDATE ON table_a FOR EACH ROW
EXECUTE PROCEDURE table_b_update();

Now, the time to insert new data into table_a grows incrementally. The file footprint of table_b grows steadily.

I have used RAISE NOTICE commands in the function to confirm that the IF statement causes an UPDATE and not an INSERT after the first call per key.

Since the table_a insert time grows for each INSERT, I tried a VACUUM FULL on table_b. The table_a insert time was reduced considerably. The file size for table_b was reduced considerably. After the VACUUM FULL the table_a insert time started to grow again. I don't want to do a VACUUM FULL after every INSERT into table_a though.

Is it possible that the UPDATE is actually doing a DELETE and INSERT in table_b?

Best Answer

I don't have 7.4 to test on, but I'm guessing:

every time you run VACUUM FULL, the table is compacted
every time you update, the new version of the row (see MVCC) gets added at the end of the heap, and the old version stays in place until a vacuum removes it
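One way to see this row versioning is to look at the system columns ctid (physical location of the row version) and xmin (ID of the transaction that created it) before and after an update. This is only a sketch against the question's table_b, and it assumes a row with key1 = 1 already exists:

```sql
-- Physical location and creating transaction of the current row version.
SELECT ctid, xmin, key1, max_itstamp FROM table_b WHERE key1 = 1;

-- An UPDATE writes a new row version rather than overwriting in place.
UPDATE table_b SET max_itstamp = max_itstamp + 1 WHERE key1 = 1;

-- Same logical row, but ctid and xmin should now be different.
-- The old version stays in the table as a dead row until it is vacuumed.
SELECT ctid, xmin, key1, max_itstamp FROM table_b WHERE key1 = 1;
```

So, in terms of storage, an UPDATE does behave much like a DELETE followed by an INSERT.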

See the docs for a more detailed explanation, but the simple solution is not to run VACUUM FULL at all - just plain VACUUM. Then your table will probably settle into a steady state where the 'holes' left by dead rows are reused by later updates.
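As a rough illustration of that advice, a plain vacuum of the summary table could be run periodically (for example from cron) between batches of inserts. How often to run it is an assumption that depends on the update rate:

```sql
-- Plain VACUUM marks the space of dead row versions as reusable
-- without rewriting the table; VERBOSE reports what was found.
VACUUM VERBOSE table_b;

-- Optionally refresh planner statistics in the same pass.
VACUUM ANALYZE table_b;
```

Unlike VACUUM FULL, a plain VACUUM does not take an exclusive lock on the table, so the trigger can keep updating table_b while it runs.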

As for "insert time", I'm surprised at your results. My expectation would be that inserts are slower after a VACUUM FULL. But if all the blocks are in the cache, the overhead of finding free space inside an existing block might be higher than that of adding the new row at the end of the heap, even if the number of blocks accessed is higher.

Related Solutions

Sql-server – PostgreSQL Initial Database Size

No, the only thing close to that is compiling the server with the --with-segsize switch. This might help if your table takes up more than a gigabyte and your file system can handle a single file larger than that. If you're inserting 20 GB, the server will have to create 20 files if you don't use this switch. If your file system can handle files over a gigabyte, you can set it to a large value and will most likely see some benefit; worst case, a small one.
Take a look at CLUSTER (http://www.postgresql.org/docs/9.1/static/sql-cluster.html) and FILLFACTOR (http://www.postgresql.org/docs/9.1/static/sql-createtable.html, http://www.postgresql.org/docs/9.1/static/sql-createindex.html).

Note that FILLFACTOR can be applied to both tables and indexes.
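A minimal sketch of both features, using the 9.1 syntax from the linked docs. The names my_table, key_col and my_table_key_idx are hypothetical, and the fillfactor values are arbitrary examples rather than recommendations:

```sql
-- Leave 30% of each new table page free for later row versions.
-- This only affects pages written after the change.
ALTER TABLE my_table SET (fillfactor = 70);

-- Indexes take their own fillfactor storage parameter.
CREATE INDEX my_table_key_idx ON my_table (key_col) WITH (fillfactor = 90);

-- Rewrite the table in index order; the rewrite applies the table's fillfactor.
CLUSTER my_table USING my_table_key_idx;
```

Note that CLUSTER takes an exclusive lock on the table while it runs.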

Postgresql – the best approach to process huge amount of data insertion efficiently

The fragment you posted so far can be simplified to:

INSERT INTO table2 (id, name, date)  -- why "date" if you insert a timestamp?
SELECT NEW.id, t1.name, NEW.timestamp
FROM   table1 t1
WHERE  ST_DWithin(NEW.position
                , ST_SetSRID(ST_MakePoint(t1.longitudedecimal, t1.latitudedecimal), 4326)
                , 0.01447534783)
AND    t1.id > 0;  -- probably redundant!

IF NOT FOUND THEN ...

It is very inefficient to run separate queries with assignments in plpgsql instead of a single query.
Basic type names like date or timestamp, used as identifiers, lead to confusing error messages and other conflicts. id and name are the worst possible column names: non-descriptive and with countless duplicates all over your tables. Revisit your naming convention ...

You probably do not need any of this. Inserting millions of rows shouldn't be handled by triggers which are fired for each row. Extremely expensive. You need a set-based solution without triggers.

Probably best to COPY to a temporary staging table and INSERT / UPDATE from there. But basic information is missing.
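Since the basic information is missing, the following is only a sketch of the staging-table approach. The table target, its columns, and the file path are all hypothetical, the COPY option syntax assumes PostgreSQL 9.0 or later, and the staged file is assumed to contain no duplicate ids:

```sql
BEGIN;

-- Session-local staging table, dropped automatically at commit.
CREATE TEMP TABLE staging (id int, name text, ts timestamp) ON COMMIT DROP;

-- Bulk load; a server-side path needs suitable privileges,
-- psql's \copy is the client-side alternative.
COPY staging FROM '/path/to/input.csv' WITH (FORMAT csv);

-- Set-based update of rows that already exist in the target.
UPDATE target t
SET    name = s.name, ts = s.ts
FROM   staging s
WHERE  t.id = s.id;

-- Set-based insert of rows that do not exist yet.
INSERT INTO target (id, name, ts)
SELECT s.id, s.name, s.ts
FROM   staging s
WHERE  NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id);

COMMIT;
```

Two statements over the whole batch replace one trigger invocation per row.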
