PostgreSQL – Storing Historical Records and Extracting Data from a Given Period

database-designerdpostgresqlrelational-theory

This a question about database modeling and PostgreSQL. This post partially answers the question but I need advice on more technical stuff: columns and query to extract the data.

I need to store administrative boundaries history within a given territory in order to be able to track boundaries movements back over the passed years.

Here is an example on how boundaries move:

  • Martin Land area has always existed
  • Bulwers Land area has always existed too
  • Those areas were merged together in 2015:
    • The new area was renamed as Martin-Bulwers
    • The new area kept Bulwers Land code (Martin Land code becoming deprecated)

Here is the way I store the data:

gid  | code   | name            | change_date
-----+--------+-----------------+-------------
1    | 86001  | Martin Land     | 2000-01-01
2    | 86002  | Bulwers Land    | 2000-01-01
3    | 86002  | Martin-Bulwers  | 2015-01-01

I have set a default change_date to 2000-01-01 for historical situation of an area, i.e. before the first changement occurs.

Then, I have many different cases: areas that have remained unmerged, areas that have merged with others at different dates. This can produce the following example:

CREATE TEMPORARY TABLE foo AS
SELECT gid,code,area,change_date::date FROM ( VALUES
  ( 1,86001,'Martin Land'   ,'2000-01-01' ),
  ( 2,86002,'Bulwers Land'  ,'2000-01-01' ),
  ( 3,86002,'Martin-Bulwers','2015-01-01' ),
  ( 4,86003,'Coveral Land'  ,'2000-01-01' ),
  ( 5,86004,'Big Tom Area'  ,'2000-01-01' ),
  ( 6,86005,'Small Tom Area','2000-01-01' ),
  ( 7,86004,'Tom Land'      ,'2016-01-01' )
) AS t(gid,code,area,change_date);

Then I have difficulties writing a query that could return the list of areas of a given year. I tried to rely on DISTINCT ON() clause but this does not the job I need…

For instance, the following query will return the following table:

SELECT DISTINCT ON (code) code, area, change_date

  FROM myTable WHERE change_date < '2016-01-01'

  ORDER BY code, change_date DESC ;

-- Result:

code   | area            | change_date
-------+-----------------+-------------
86001  | Martin Land     | 2000-01-01
86002  | Martin-Bulwers  | 2015-01-01
86003  | Coveral Land    | 2000-01-01
86004  | Big Tom Area    | 2000-01-01
86005  | Small Tom Area  | 2000-01-01

This is partially right as before 2016-01-01, Big Tom Area and Small Tom Area had not merged together but Martin Land and Bulwers Land had! There were 5 areas in 2014, then 4 areas in 2015 and 3 in 2016.

In fact, I would need the following result:

code   | area            | change_date
-------+-----------------+-------------
86002  | Martin-Bulwers  | 2015-01-01
86003  | Coveral Land    | 2000-01-01
86004  | Big Tom Area    | 2000-01-01
86005  | Small Tom Area  | 2000-01-01

Is there another way to store these informations or to write the query?

Best Answer

The problem is that when an area is merged and its code is reused, you have that information, but when an area is merged and its code is deprecated, you don't have it. In other words, you don't have the information that an area is deprecated.

Now, I see two solutions. First, keep the schemas as is and whenever an area is deprecated, add a new row as it was a new area and just states it is deprecated, e.g.

INSERT INTO myTable(code, area, change_date)
VALUES(86001, 'deprecated', '2015-01-01')

So in your query it will show as 'deprecated' or you filter it out.

A second, and possible best option, is to have a new column that states when (and if) an area was deprecated:

ALTER TABLE myTable ADD deprecated DATE;
UPDATE myTable SET deprecated = '2015-01-01' WHERE gid = 1;

So, on your query you can simple add the filter:

(deprecated IS NULL OR deprecated >= '2016-01-01')

The full code:

SELECT DISTINCT ON (code) code, area, change_date
FROM myTable
WHERE change_date < '2016-01-01'
    AND (deprecated IS NULL OR deprecated >= '2016-01-01')
ORDER BY code, change_date DESC;

As another option for non-deprecated areas, you can set them as 'infinity' instead of NULL, so the query does not need the deprecated IS NULL condition, as 'infinity' will always be higher or equal than any other non-null value.