Postgresql – the best approach to process huge amount of data insertion efficiently

postgresqltrigger

I'm dealing with a HUGE amount of data entry. I need to know the best practice to insert/update table records after new entry inserted in another table.

By huge i mean we get more than 2 million rows a day, it's an ongoing process.

I tried creating a trigger that will insert or update the data depending on a complex logic. But I doubt this is a good approach for doing this, I'm afraid it will be heavy on the database.

I use PostgreSQL 9.1

The current trigger looks like:

DECLARE
  AVar integer;
  AdateVar timestamp without time zone;
  AnameVar character varying(40);
BEGIN 
 SELECT id  
    INTO  AVar  
 FROM table1 
 WHERE ST_DWithin(NEW.position,ST_SetSRID(ST_MakePoint(longitudedecimal,latitudedecimal),4326) ,0.01447534783);

 select Adate 
   INTO AdateVar 
 from table2 
 where id = NEW. id  
 ORDER BY Adate DESC limit 1;  

 IF (aVar > 0) THEN
   select name into AnameVar FROM table1 WHERE id = AVar;
   INSERT into table2 (id,name,date) SELECT  NEW.id,AnameVar,NEW.timestamp;
   ELSE ...
 END IF;

 RETURN NULL;
End;

EDIT

here is a piece of the trigger header and function, the function is too long. coz as i know the trigger cannot call more than one function! so it became very long and complicated, I posted just one case out of 12:

CREATE TRIGGER ts_trigger
  AFTER INSERT
  ON table1
  FOR EACH ROW
  EXECUTE PROCEDURE test_trigger();

declare
Aposition geometry;
Cposition geometry;
Plong double precision;
Plat double precision;
Adate timestamp without time zone;
Cdate timestamp without time zone;
startDate timestamp without time zone;
CnotifDate timestamp without time zone;
AnotifDate timestamp without time zone;
lastmsg timestamp without time zone;
InsideCircle  integer;
InsideSquare  integer;
insidePoint  integer;
distance character varying(40);
-- this variables used to calculate the time in the table3
    inAction character varying(40);
    diff character varying(40);
    days character varying(40);
    hours character varying(40);
    str character varying(40);
    CinAction character varying(40);
    AinAction character varying(40);
BEGIN 


select time_stamp INTO Adate  from table1 where userid=NEW.userid and time_stamp < NEW.time_stamp order by time_stamp desc limit 1 ;
select  position INTO Aposition from table1 where userid=NEW.userid and time_stamp < NEW.time_stamp order by time_stamp desc limit 1;
select time_stamp INTO Cdate  from table1 where userid=NEW.userid and time_stamp > NEW.time_stamp order by time_stamp  limit 1; 
select position INTO Cposition from table1 where userid=NEW.userid and time_stamp > NEW.time_stamp order by time_stamp  limit 1;

SELECT p.num  INTO InsideCircle FROM table3 p WHERE ST_DWithin(Aposition,ST_SetSRID(ST_MakePoint(p.longitudedecimal,p.latitudedecimal ), 4326) ,0.02171302174) ORDER BY ST_Distance(ST_SetSRID(ST_MakePoint(p.longitudedecimal,p.latitudedecimal ), 4326),Aposition) limit 1;
SELECT p.num  INTO InsideSquare FROM table3 p WHERE ST_DWithin(NEW.position ,ST_SetSRID(ST_MakePoint(p.longitudedecimal,p.latitudedecimal ), 4326) ,0.02171302174) ORDER BY ST_Distance(ST_SetSRID(ST_MakePoint(p.longitudedecimal,p.latitudedecimal ), 4326),NEW.position) limit 1;
SELECT p.num  INTO insidePoint  FROM table3 p WHERE ST_DWithin(Cposition,ST_SetSRID(ST_MakePoint(p.longitudedecimal,p.latitudedecimal ), 4326) ,0.02171302174) ORDER BY ST_Distance(ST_SetSRID(ST_MakePoint(p.longitudedecimal,p.latitudedecimal ), 4326),Cposition) limit 1;


-
    IF (InsideCircle >0 and (InsideCircle =InsideSquare or InsideSquare is null))THEN 
    select startDate INTO startDate  from myTable where id=NEW.userid and num = InsideCircle and startDate =Adate; 
        IF (InsideSquare >0)then 
            if (Cdate is not null )then 
                if (insidePoint is null)THEN 
                    diff = NEW.time_stamp  -startDate;
                    str= split_part(diff,' ',2);
                    IF(str = '')then
                    hours= split_part(diff,':',1);
                    days = '0';
                    ELSE
                    str= split_part(diff,' ',3);
                    IF(str = '') then
                    hours ='00';
                    days = split_part(diff,' ',1);
                    ELSE
                    hours= split_part( split_part(diff,' ',3),':',1);
                    days = split_part(diff,' ',1);
                    END IF;
                    END IF;
                    inAction = days || ',' ||hours;
                Update myTable SET notifDate = NEW.time_stamp , time_inAction=inAction WHERE id=NEW.userid  and num =InsideSquare ;
                END IF;

    END IF;
Return Null;
END;

Best Answer

The fragment you posted so far can be simplified to:

INSERT INTO table2 (id, name, date)  -- why "date" if you insert a timestamp?
SELECT NEW.id, t1.name, NEW.timestamp
FROM   table1 t1
WHERE  ST_DWithin(NEW.position
                , ST_SetSRID(ST_MakePoint(t1.longitudedecimal, t1.latitudedecimal), 4326)
                , 0.01447534783)
AND    t1.id > 0;  -- probably redundant!

IF NOT FOUND THEN ...
  • It is very inefficient to run separate queries with assignments in plpgsql instead of a single query.

  • Basic type names like date or timestamp lead to confusing error messages and other conflicts, id and name are the worst possible column names, non-descriptive and with countless duplicates all over your tables. Revisit your naming convention ...

You probably do not need any of this. Inserting millions of rows shouldn't be handled by triggers which are fired for each row. Extremely expensive. You need a set-based solution without triggers.

Probably best to COPY to a tmeporary staging table and INSERT / UPDATE from there. But basic information is missing.

Related answers: