PostgreSQL 9.x – Efficient Way to Insert/Update/Delete Table Records from Complex Query

best-practices | performance | postgresql | postgresql-performance

I have a function that returns a set of records, and I need to persist those records into a table. I have to do this hundreds of times a day.

My initial approach was simply to clear the data from my table and re-insert all records:

-- CLEAR MY TABLE
DELETE FROM MY_TABLE;

-- POPULATE MY TABLE WITH MY FUNCTION'S RESULT
INSERT INTO MY_TABLE (COLUMN1, COLUMN2, COLUMN3)
SELECT COLUMN1, COLUMN2, COLUMN3 
FROM MY_FUNCTION(PARAM1, PARAM2, PARAM3);

So far, so good. But my table has many triggers, and when the function returns thousands of records this approach proves very inefficient.

Then, I moved to this approach:

-- CREATE A TEMPORARY TABLE
CREATE GLOBAL TEMPORARY TABLE MY_TEMP_TABLE 
(COLUMN1 TEXT, COLUMN2 TEXT, COLUMN3 TEXT);

-- POPULATE MY TEMP TABLE WITH MY FUNCTION'S RESULT
INSERT INTO MY_TEMP_TABLE (COLUMN1, COLUMN2, COLUMN3)
SELECT COLUMN1, COLUMN2, COLUMN3 
FROM MY_FUNCTION(PARAM1, PARAM2, PARAM3);

-- CREATE AN INDEX TO HELP PERFORMANCE
CREATE INDEX MY_TEMP_TABLE_INDEX ON MY_TEMP_TABLE (COLUMN1, COLUMN2, COLUMN3);

-- DELETE FROM MY TABLE WHERE NOT EXISTS IN MY TEMP TABLE
DELETE FROM MY_TABLE T 
WHERE NOT EXISTS (SELECT 1 
                  FROM MY_TEMP_TABLE T2 
                  WHERE T2.COLUMN1 = T.COLUMN1);

-- UPDATE MY TABLE WHERE COLUMNS ARE DIFFERENT IN MY TEMP TABLE
UPDATE MY_TABLE T 
SET COLUMN2 = T2.COLUMN2,
    COLUMN3 = T2.COLUMN3 
FROM MY_TEMP_TABLE T2 
WHERE T2.COLUMN1 = T.COLUMN1
  AND (T2.COLUMN2 <> T.COLUMN2 OR T2.COLUMN3 <> T.COLUMN3);

-- INSERT INTO MY TABLE WHERE EXISTS IN MY TEMP TABLE AND NOT IN MY TABLE
INSERT INTO MY_TABLE (COLUMN1, COLUMN2, COLUMN3) 
SELECT COLUMN1, COLUMN2, COLUMN3
  FROM MY_TEMP_TABLE T2
 WHERE NOT EXISTS (SELECT 1 
                     FROM MY_TABLE T3 
                    WHERE T3.COLUMN1 = T2.COLUMN1);

But I'm still having performance issues. I believe creating this temp table consumes a lot of resources. Besides, I don't think this is the best way of doing it.

Can you guys suggest another approach? Or do you believe this is the best way?

EDIT:

For testing, you can run the scripts below.

Here is the script for creating the tables / triggers / functions / etc.:

-- THIS TABLE CONTAINS INFORMATION THAT USERS NEED
CREATE TABLE USER_INFO (USER_ID TEXT, INFO_ID TEXT, INFO1 TEXT, INFO2 TEXT, INFO3 TEXT, INFO4 TEXT, INFO5 TEXT);
ALTER TABLE USER_INFO ADD CONSTRAINT USER_INFO_PK PRIMARY KEY (USER_ID, INFO_ID);

-- THIS TABLE CONTAINS A KIND OF FLAG, INDICATING FOR USERS THEIR INFORMATION HAS BEEN "REFRESHED" AND THEY SHOULD GET ROWS FROM "USER_INFO"
CREATE TABLE USER_HAS_NEW_INFO (USER_ID TEXT, INFO_DATE TIMESTAMP);
ALTER TABLE USER_HAS_NEW_INFO ADD CONSTRAINT USER_HAS_NEW_INFO_PK PRIMARY KEY (USER_ID, INFO_DATE);


-- CREATE TRIGGER FUNCTION 
CREATE OR REPLACE FUNCTION TF_USER_INFO()
  RETURNS trigger AS
$BODY$
begin

  -- IF SOME INFO HAS CHANGED 

  if (TG_OP = 'INSERT') 
     OR 
     (
       (TG_OP = 'UPDATE') 
       AND
       (
         (COALESCE(NEW.INFO1,'') <> COALESCE(OLD.INFO1,'')) OR
         (COALESCE(NEW.INFO2,'') <> COALESCE(OLD.INFO2,'')) OR
         (COALESCE(NEW.INFO3,'') <> COALESCE(OLD.INFO3,'')) OR
         (COALESCE(NEW.INFO4,'') <> COALESCE(OLD.INFO4,'')) OR
         (COALESCE(NEW.INFO5,'') <> COALESCE(OLD.INFO5,'')) 
       )
     )
  then

    -- INSERT A NEW ROW INTO USER_HAS_NEW_INFO 
    INSERT INTO USER_HAS_NEW_INFO (USER_ID, INFO_DATE)
    SELECT NEW.USER_ID, CURRENT_TIMESTAMP
    WHERE  NOT EXISTS (SELECT 1 
                         FROM USER_HAS_NEW_INFO 
                        WHERE USER_ID = NEW.USER_ID
                          AND INFO_DATE = CURRENT_TIMESTAMP
                      );
  end if;

  RETURN NEW;
end;
$BODY$
  LANGUAGE plpgsql VOLATILE;



-- CREATE TRIGGER
CREATE TRIGGER T_USER_INFO
  AFTER INSERT OR UPDATE OR DELETE
  ON USER_INFO
  FOR EACH ROW
  EXECUTE PROCEDURE TF_USER_INFO();




CREATE OR REPLACE FUNCTION CALCULATE_USERS_INFO()
RETURNS SETOF USER_INFO AS
$BODY$
DECLARE
    vUSER_INFO          USER_INFO%rowtype;
BEGIN

    -- HERE GOES A COMPLEX QUERY PLUS SOME CALCS AND VALIDATIONS
    -- BUT, FOR TESTING PURPOSES, WE CAN DO THE FOLLOWING:


    FOR vUSER_INFO IN   
            SELECT USER_ID,
                   INFO_ID,
                   'A=' || TRUNC(RANDOM() * 1000) || '|' || 
                   'B=' || TRUNC(RANDOM() * 1000) || '|' || 
                   'C=' || TRUNC(RANDOM() * 1000) AS INFO1,

                   'A=' || TRUNC(RANDOM() * 1000) || '|' || 
                   'B=' || TRUNC(RANDOM() * 1000) || '|' || 
                   'C=' || TRUNC(RANDOM() * 1000) AS INFO2,

                   'A=' || TRUNC(RANDOM() * 1000) || '|' || 
                   'B=' || TRUNC(RANDOM() * 1000) || '|' || 
                   'C=' || TRUNC(RANDOM() * 1000) AS INFO3,

                   'A=' || TRUNC(RANDOM() * 1000) || '|' || 
                   'B=' || TRUNC(RANDOM() * 1000) || '|' || 
                   'C=' || TRUNC(RANDOM() * 1000) AS INFO4,

                   'A=' || TRUNC(RANDOM() * 1000) || '|' || 
                   'B=' || TRUNC(RANDOM() * 1000) || '|' || 
                   'C=' || TRUNC(RANDOM() * 1000) AS INFO5


              FROM GENERATE_SERIES(1,1500) AS USER_ID
              CROSS JOIN GENERATE_SERIES(1,500) AS INFO_ID

    LOOP
        RETURN NEXT vUSER_INFO;
    END LOOP;

END;
$BODY$
  LANGUAGE plpgsql VOLATILE;

And here is the script I run many times a day:

-- CREATE A TEMPORARY TABLE
CREATE GLOBAL TEMPORARY TABLE USER_INFO_TEMP 
(USER_ID TEXT, INFO_ID TEXT, INFO1 TEXT, INFO2 TEXT, INFO3 TEXT, INFO4 TEXT, INFO5 TEXT)
ON COMMIT DROP;

-- POPULATE MY TEMP TABLE WITH MY FUNCTION'S RESULT
INSERT INTO USER_INFO_TEMP (USER_ID, INFO_ID, INFO1, INFO2, INFO3, INFO4, INFO5)
SELECT USER_ID, INFO_ID, INFO1, INFO2, INFO3, INFO4, INFO5 
FROM CALCULATE_USERS_INFO();

-- CREATE AN INDEX TO HELP PERFORMANCE
CREATE INDEX USER_INFO_TEMP_INDEX ON USER_INFO_TEMP (USER_ID, INFO_ID);

-- DELETE FROM MY TABLE WHERE NOT EXISTS IN MY TEMP TABLE
DELETE FROM USER_INFO T 
WHERE NOT EXISTS (SELECT 1 
                  FROM USER_INFO_TEMP T2 
                  WHERE T2.USER_ID = T.USER_ID
                  AND   T2.INFO_ID = T.INFO_ID);

-- UPDATE MY TABLE WHERE COLUMNS ARE DIFFERENT IN MY TEMP TABLE
UPDATE USER_INFO T 
SET INFO1 = T2.INFO1,
    INFO2 = T2.INFO2,
    INFO3 = T2.INFO3,
    INFO4 = T2.INFO4,
    INFO5 = T2.INFO5
FROM USER_INFO_TEMP T2 
WHERE T2.USER_ID = T.USER_ID
  AND T2.INFO_ID = T.INFO_ID
  AND (T2.INFO1 <> T.INFO1 OR 
       T2.INFO2 <> T.INFO2 OR
       T2.INFO3 <> T.INFO3 OR
       T2.INFO4 <> T.INFO4 OR
       T2.INFO5 <> T.INFO5
      );

-- INSERT INTO TABLE WHERE EXISTS IN TEMP AND NOT EXISTS IN TABLE
INSERT INTO USER_INFO (USER_ID, INFO_ID, INFO1, INFO2, INFO3, INFO4, INFO5) 
(SELECT USER_ID, INFO_ID, INFO1, INFO2, INFO3, INFO4, INFO5
   FROM USER_INFO_TEMP T2
  WHERE NOT EXISTS (SELECT 1 
                      FROM USER_INFO T3
                     WHERE T3.USER_ID = T2.USER_ID 
                       AND T3.INFO_ID = T2.INFO_ID
                   )
);

Best Answer

It very much depends on cardinalities.
How many rows in old and new table? How many of those result in DELETE / UPDATE / INSERT?

TRUNCATE is fastest

Generally, if large parts of the table are changed, a TRUNCATE / INSERT from the function is probably the fastest way. If done in the same transaction, Postgres does not need to write WAL (since we start from scratch anyway). Also, you get a pristine table without bloat, which reflects positively on the next iteration of this process. For big tables, drop and recreate indexes as well. Details in this related answer:
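
A minimal sketch of this pattern, using the table and function from the question. Note that the WAL-skipping optimization only applies with `wal_level = minimal` (an assumption about your configuration), and that TRUNCATE takes an ACCESS EXCLUSIVE lock and is not MVCC-safe, so concurrent transactions are blocked or may see an empty table:

```sql
BEGIN;

-- Empty the table. Because the TRUNCATE and the INSERT run in the
-- same transaction, Postgres can skip WAL for the new rows
-- (only under wal_level = minimal).
TRUNCATE user_info;

-- For big tables, optionally DROP indexes here
-- and recreate them after the INSERT.

INSERT INTO user_info (user_id, info_id, info1, info2, info3, info4, info5)
SELECT user_id, info_id, info1, info2, info3, info4, info5
FROM   calculate_users_info();

COMMIT;
```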


The rest only applies if you want to keep existing rows in place for some reason.


This applies if triggers are in the way (as you write; I am not convinced, but let's assume so for now), or if you have additional columns in the table that must not be lost.

Depending on how many rows are in the change set (returned from the function) ...

Few rows

For fewer than ~ 1000 rows (it depends on many factors), a data-modifying CTE (with an automatic, cheap, internal temp table for the function result) is probably fastest:

WITH x AS (SELECT * FROM calculate_users_info())
, del AS (
   DELETE FROM user_info t
   WHERE  NOT EXISTS (
      SELECT 1 FROM x
      WHERE  user_id = t.user_id
      AND    info_id = t.info_id
      )
   )
, upd AS (
    UPDATE user_info t 
    SET   (info1,   info2,   info3,   info4,   info5)
      = (x.info1, x.info2, x.info3, x.info4, x.info5)
    FROM   x
    WHERE  x.user_id = t.user_id
    AND    x.info_id = t.info_id
    AND   (x.info1 <> t.info1 OR 
             x.info2 <> t.info2 OR
             x.info3 <> t.info3 OR
             x.info4 <> t.info4 OR
             x.info5 <> t.info5)
    )
INSERT INTO user_info
      (user_id, info_id, info1, info2, info3, info4, info5) 
SELECT user_id, info_id, info1, info2, info3, info4, info5
FROM   x
WHERE  NOT EXISTS (
    SELECT 1 
    FROM   user_info t3
    WHERE  t3.user_id = x.user_id 
    AND    t3.info_id = x.info_id
    )
;

Many rows

In this case, your script building on a temporary table looks mostly fine. The major advantage is the index on the temp table - of which you did not properly take advantage, due to missing statistics.

I also have a couple of other suggestions:

CREATE TEMP TABLE user_info_tmp ON COMMIT DROP AS  -- directly from SELECT
SELECT * FROM calculate_users_info();

CREATE INDEX user_info_tmp_idx ON user_info_tmp (user_id, info_id);

ANALYZE user_info_tmp;     -- !!!

DELETE FROM user_info t    -- with NOT EXISTS anti-join
WHERE  NOT EXISTS (
   SELECT 1 FROM user_info_tmp
   WHERE  user_id = t.user_id
   AND    info_id = t.info_id
   );

ANALYZE user_info;         -- only if large parts have been removed

UPDATE user_info t         -- with short syntax
SET   (info1,   info2,   info3,   info4,   info5)
  = (x.info1, x.info2, x.info3, x.info4, x.info5)  -- shorter, not faster
FROM   user_info_tmp x 
WHERE  x.user_id = t.user_id
AND    x.info_id = t.info_id
AND   (x.info1 <> t.info1 OR x.info2 <> t.info2 OR x.info3 <> t.info3
    OR x.info4 <> t.info4 OR x.info5 <> t.info5);

INSERT INTO user_info      -- with join syntax
      (user_id, info_id, info1, info2, info3, info4, info5) 
SELECT user_id, info_id, info1, info2, info3, info4, info5
FROM   user_info_tmp x
LEFT   JOIN user_info u USING (user_id, info_id)
WHERE  u.user_id IS NULL;  -- shorter, maybe faster

Major points

  • Temp tables are not analyzed automatically. Also, tables created in the same transaction and used immediately generally don't give autovacuum a chance to kick in. Details:

  • Are regular VACUUM ANALYZE still recommended under 9.1?

    For both reasons you need to run ANALYZE on the table manually, ideally at the points where I put it above. This avoids grossly misguided query plans. A minor additional optimization: you might set the statistics target for irrelevant columns to 0 - all except (user_id, info_id) in the example.

  • Don't use GLOBAL for temp table. Per documentation:

    GLOBAL or LOCAL

    Ignored for compatibility. Use of these keywords is deprecated; refer to CREATE TABLE for details.

  • You can create the temp table from the function result automatically. Much shorter code and a bit faster, too.

  • Consider the shorter syntax variants, which mostly shorten the code and won't change performance much.
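
The statistics-target tweak mentioned above can be sketched like this (column names from the example; a target of 0 stops ANALYZE from collecting statistics for that column):

```sql
-- Skip per-column statistics for columns that never appear in
-- join or filter conditions; only (user_id, info_id) matter here.
ALTER TABLE user_info_tmp
  ALTER COLUMN info1 SET STATISTICS 0,
  ALTER COLUMN info2 SET STATISTICS 0,
  ALTER COLUMN info3 SET STATISTICS 0,
  ALTER COLUMN info4 SET STATISTICS 0,
  ALTER COLUMN info5 SET STATISTICS 0;

-- Then analyze before the temp table is used in queries, so the
-- planner has the row count and (user_id, info_id) statistics.
ANALYZE user_info_tmp;
```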

work_mem / temp_buffers

Either way, you need enough RAM to make it fast. This is hardly relevant for small cardinalities, but crucial for big tables. A couple of thousand rows should not touch any memory limits in a halfway decent setup. For more, be sure to allocate enough work_mem for the CTE variant or temp_buffers for the temp table. Related answer:
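
A sketch of the session-level settings (the values are illustrative, not recommendations; size them to your data and available RAM). Note that `temp_buffers` can only be changed before the first use of temporary tables within a session:

```sql
-- For the temp-table variant: must run before the session
-- touches any temporary table.
SET temp_buffers = '256MB';

-- For the CTE variant (and for the joins either way).
SET work_mem = '256MB';
```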