Postgresql – select duplicated rows by modified_date – PGSQL 9.2

postgresqlselect

Query:

SELECT j.id INTO junk.ja_jobs
FROM (
  SELECT time_job, recurrenceid, clientid, creatortype, deleted, MIN(id) AS id
  FROM ja_jobs
  WHERE clientid = 33731 AND creatortype = 'legacyrec' AND deleted = false
  GROUP BY time_job, recurrenceid, clientid, creatortype, deleted
  HAVING count(*) > 1
  ) jd
JOIN ja_jobs j USING (time_job, recurrenceid, clientid, creatortype, deleted)
WHERE j.id <> jd.id;
  • The query above gets all duplicated rows by time_job,recurrenceid and put the them into the temp table. Only the oldest ID stays in the ja_jobs table.

ja_jobs table has created_date and modified_date column, as you can see below:
enter image description here

Question:

  • Find duplicate recurring jobs with identical recurrencid and time_job
    and creator_type = 'legacyrec'
  • If created_userid = modified_userid AND the difference between
    modified_date and created_date is smaller than 10 seconds (since
    created_date is not necessarily smaller for strange reasons), we
    treat them as not modified
  • If there are matches in step 2, we keep all of them and delete the rest, and if there is no match, we keep the oldest one and delete the rest

How can I do that? How would the select statement be?

Best Answer

Here is what I did:

CREATE TABLE junk.ja_jobs_WM (
        recurrenceid integer,
        time_job int8,
        time_job_stamp timestamp,
        dupcount integer,
        most_recent_date timestamp
);

INSERT INTO junk.ja_jobs_WM
    SELECT
                recurrenceid,
                time_job,
                TO_TIMESTAMP(time_job),
                COUNT(id) as dupcount,
                MAX(modified_date) AS most_recent_date
        FROM
                ja_jobs
        WHERE
                (NOT (recurrenceid IS NULL))
        AND
                deleted = false
        AND
                (clientid = 15288)
        AND
                creatortype = 'legacyrec'
        AND
                (time_job >= EXTRACT('epoch' FROM statement_timestamp()))
        GROUP BY
                recurrenceid,
                time_job
        HAVING
                COUNT(id) > 1;

This question is now closed.