Postgresql – generate_series for multiple record types in postgresql. Fill missing with last previous value

postgresql

my question is closely related to the following question asked in this thread generate_series for multiple record types in postgresql. The only difference is, that instead of only inserting missing values, I want to fill the gaps with the last previous value. To illustrate it with the same example. I have two tables.

CREATE TABLE pests(id,name)
AS VALUES
  (1,'Thrip'),
  (2,'Fungus Gnosts');

CREATE TABLE pest_counts(id,pest_id,date,count)
AS VALUES
  (1,1,'2015-01-01'::date,14),
  (2,2,'2015-01-02'::date,5);

The desired output should be nearly identical. However, I want to fill missing values with previous values, if possible. So the second entry for Thrip should be 14, since a previous value exists and the first value of Fungus Gnats should be 0, since no previous value exists. Is it possible to do this using Postgres?

expected results

name         | date       | count
-------------+------------+-------
Thrip        | 2015-01-01 | 14
Thrip        | 2015-01-02 | **14** <- fill with existing previous value.
....
Fungus Gnats | 2015-01-01 | 0
Fungus Gnats | 2015-01-02 | 5
...

Best Answer

Example data with two rows added:

insert into pest_counts(pest_id, date, count) values
    (1, '2015-01-03', 10),
    (2, '2015-01-04', 7);

select * from pest_counts;

 id | pest_id |    date    | count 
----+---------+------------+-------
  1 |       1 | 2015-01-01 |    14
  2 |       2 | 2015-01-02 |     5
  3 |       1 | 2015-01-03 |    10
  4 |       2 | 2015-01-04 |     7
(4 rows)

Use count() as a window function to generate groups. Each not null value starts a new group:

select name, day, count, count(count) over (partition by name order by day) as grp
from (
    select id, name, day::date
    from pests
    cross join generate_series('2015-01-01'::timestamp, '2015-01-04', '1 day') day
) d
left join pest_counts on d.id = pest_id and day = date;

     name     |    day     | count | grp 
--------------+------------+-------+-----
 fungus gnats | 2015-01-01 |       |   0
 fungus gnats | 2015-01-02 |     5 |   1
 fungus gnats | 2015-01-03 |       |   1
 fungus gnats | 2015-01-04 |     7 |   2
 thrip        | 2015-01-01 |    14 |   1
 thrip        | 2015-01-02 |       |   1
 thrip        | 2015-01-03 |    10 |   2
 thrip        | 2015-01-04 |       |   2
(8 rows)

Now you can use sum() as a window function over partitions by names and generated groups:

select name, day, coalesce(count, sum(count) over (partition by name, grp order by day), 0) as count
from (
    select name, day, count, count(count) over (partition by name order by day) as grp
    from (
        select id, name, day::date
        from pests
        cross join generate_series('2015-01-01'::timestamp, '2015-01-04', '1 day') day
    ) d
    left join pest_counts on d.id = pest_id and day = date
) s;

     name     |    day     | count 
--------------+------------+-------
 fungus gnats | 2015-01-01 |     0
 fungus gnats | 2015-01-02 |     5
 fungus gnats | 2015-01-03 |     5
 fungus gnats | 2015-01-04 |     7
 thrip        | 2015-01-01 |    14
 thrip        | 2015-01-02 |    14
 thrip        | 2015-01-03 |    10
 thrip        | 2015-01-04 |    10
(8 rows)

Related Solutions

PostgreSQL – Using generate_series for Multiple Record Types

I usually solve such problems by setting up a table for all the possible data points (here the pests and dates). This is easily achieved by a CROSS JOIN, see the WITH query below.

Then, as the finishing step, I just (outer) join the existing measurements, based on the pest ID and date - optionally giving a default for the missing values via COALESCE().

So, the whole query is:

WITH data_points AS (
    SELECT id, name, i::date
    FROM pests
    CROSS JOIN generate_series('2015-01-01'::date, '2015-01-05', '1 day') t(i)
) 
SELECT d.name, d.i, COALESCE(p.cnt, 0) 
FROM data_points AS d 
LEFT JOIN pest_counts AS p 
    ON d.id = p.pest_id 
    AND d.i = p.count_date;

Check it at work on SQLFiddle.

Note: when either the table(s) or the generated series are big, doing the CROSS JOIN inside a CTE might be a bad idea. (It has to materialize all the rows, regardless of there is data for a given day or not). In this case one should do the same in the FROM clause, as a parenthesized sub-join instead of the current reference to data_points. This way the planner has a better understanding about the rows affected and the possibilities for using indexes. I use the CTE in the example because it looks cleaner for the sake of the example.

Filling Missing Rows in PostgreSQL Database from Backup

You can restore all affected tables into a different schema, there are a variety of ways in which you can do that (several good ideas in https://stackoverflow.com/questions/4191653/i-want-to-restore-the-database-with-a-different-schema).

Once the table is in a separate schema, you can compare them by primary key, and insert the missing ones, like so for each affected table:

INSERT INTO production.table
SELECT b.*
FROM backup.table b
LEFT JOIN production.table p ON (p.pkey = b.pkey)
WHERE p.pkey IS NULL;

You will just need to find out the list of affected tables, restore their backup in a separate schema, determine what are their primary keys, create the script with the inserts, and run it.

Best Answer

Related Solutions

PostgreSQL – Using generate_series for Multiple Record Types

Filling Missing Rows in PostgreSQL Database from Backup

Related Question