PostgreSQL – Design database to avoid frequent update on large dataset

database-design, performance, postgresql

We are trying to optimize the performance of a database with these numbers and query requirements:

  1. 200-400k network segments identified by a unique ID
  2. each network segment has a state with a limited number of dynamic attributes (e.g. an average speed). A single state of dynamic attributes can be stored in 8 bytes
  3. the segment state changes every 3 minutes, 24 hours a day, 7 days a week
  4. it must be possible to query the state of a group of segments (sometimes all segments) in a particular date range, or just the current situation
  5. a spatial query may be requested to find all segments "around me" at a particular date (normally "now")

With these requirements we arrived at the following solution (with a strong drawback, explained below).

-- [A]
CREATE TABLE main_segments_history( 
      id_segment integer NOT NULL,
      day date NOT NULL,    
      day_slices bigint[],
      CONSTRAINT main_segments_history_pk PRIMARY KEY (id_segment, day)
);

-- [B]
CREATE TABLE current_segment_release_state(
      id_segment integer NOT NULL,
      release_date timestamptz,
      -- ... all other attributes ...
      CONSTRAINT currsegm_release_state_pk PRIMARY KEY (id_segment, release_date)
);

Explaining the [A] table:

  1. It is partitioned on the field "day" with the partition manager (pg_partman). Each partition covers one month (see the sketch after this list)
  2. The day_slices array is a one-dimensional array of 480 elements, one for each 3-minute slice of the full day
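
For reference, on a recent PostgreSQL (11+) with native partitioning and pg_partman 4.x, the monthly layout could be configured roughly as below. This is only an illustration of the layout, not the exact calls of the original setup (which may use trigger-based partitioning on older versions), and it assumes the [A] parent table is declared with PARTITION BY RANGE (day):

    -- Sketch only: assumes pg_partman 4.x and a parent table
    -- declared with PARTITION BY RANGE (day)
    CREATE SCHEMA IF NOT EXISTS partman;
    CREATE EXTENSION IF NOT EXISTS pg_partman SCHEMA partman;

    -- Let pg_partman create and maintain one child partition per month
    SELECT partman.create_parent(
        p_parent_table := 'public.main_segments_history',
        p_control      := 'day',
        p_type         := 'native',
        p_interval     := 'monthly'
    );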

Explaining the [B] table:

  1. It's just the current release state for each segment

There is a back-end process that processes the network: it inserts or updates the state of each segment every 3 minutes.
In other words, this process inserts new rows at the start of each day and then updates the inner array every 3 minutes.
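
To give an idea, the write path looks roughly like this (a simplified sketch: the parameter names and the slice-index arithmetic are illustrative, not the exact code of the back-end):

    -- Simplified sketch of the 3-minute write path
    -- $1 = segment id, $2 = the 8-byte state packed into a bigint

    -- At day start: one row per segment with a pre-sized 480-element array
    INSERT INTO main_segments_history (id_segment, day, day_slices)
    VALUES ($1, current_date, array_fill(NULL::bigint, ARRAY[480]))
    ON CONFLICT (id_segment, day) DO NOTHING;

    -- Every 3 minutes: overwrite the slice for the current time of day
    -- (seconds since midnight / 180 = 0-based slot, +1 for the 1-based array)
    UPDATE main_segments_history
    SET    day_slices[floor(extract(epoch FROM now()::time) / 180)::int + 1] = $2
    WHERE  id_segment = $1
    AND    day = current_date;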

The advantages of this solution:

  1. A limited number of rows for each month partition table
  2. Good performance when joining the static data of the segment (i.e. the geometry)
  3. The small redundancy of the current release table is very good for answering real-time requests (see the example after this list)
  4. Space-saving: only ~12 GB of stored data for each partition (1 month)
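
For example, a typical real-time "around me" request looks more or less like this. The segments_static table, its geom column and the 500 m radius are only illustrative names and values (the question only mentions "the geometry" of a segment); ST_DWithin comes from PostGIS:

    -- Illustrative only: segments_static(id_segment, geom, ...) holds the
    -- static attributes of each segment, with a PostGIS geometry column
    SELECT c.*
    FROM   current_segment_release_state c
    JOIN   segments_static s ON s.id_segment = c.id_segment
    WHERE  ST_DWithin(
               s.geom::geography,
               ST_SetSRID(ST_MakePoint($1, $2), 4326)::geography,  -- $1=lon, $2=lat
               500  -- metres "around me"
           );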

The drawbacks:

  1. constraint_exclusion. When querying on a date range we depend on this PostgreSQL feature/parameter, and it only kicks in when the planner sees constant values; a query whose date bounds are not constants is executed against every partition. Example:

    constraint exclusion OK (will only search the January and February 2017 partitions):

    SELECT * FROM main_segments_history 
             WHERE day BETWEEN '2017-01-01' AND '2017-02-03'
    

    constraint exclusion KO (will search all partition tables):

    SELECT * FROM main_segments_history m 
             JOIN sometable s ON s.id=m.id_segment 
             WHERE day BETWEEN s.day_from AND s.day_to
    
  2. The UPDATE is evil.

We turned autovacuum OFF for performance reasons and run a vacuum batch every night instead.
Consider that this way we have between 90M and 190M UPDATEs each day (200-400k segments × 480 slices); this is also the number of rows completely rewritten by PostgreSQL every day (as you certainly know, an UPDATE flags the old row as dead and then inserts a new row).

Moreover, the UPDATE is a very time-consuming operation and often delays the writes.

We first investigated the possibility of a link-to-data design, for example using a table as a container of segment IDs with a data_id pointing to a big table of segment states, but we discarded it as soon as we counted the number of rows to handle each month: ~2,880,000,000, taking up ~3 GB of space a day. Not so good.
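
Just to be explicit about what we mean by link-to-data, it would have been something like the layout below (table and column names are only illustrative):

    -- Illustrative link-to-data layout (discarded):
    -- one narrow row per segment per 3-minute slice
    CREATE TABLE segment_states
    (
         id_state bigserial PRIMARY KEY,
         state    bigint NOT NULL          -- the 8-byte packed state
    );

    CREATE TABLE segment_state_links
    (
         id_segment integer     NOT NULL,
         at_time    timestamptz NOT NULL,
         id_state   bigint      NOT NULL REFERENCES segment_states (id_state),
         PRIMARY KEY (id_segment, at_time)
    );
    -- ~200k segments x 480 slices/day x 30 days ≈ 2,880,000,000 rows/month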

What do you think about it? Do you have any suggestions to optimize this system?

Best Answer

The only way to avoid updates to your tables is just not to do them. Given the fact that you're just logging data, I would suggest a daily table, where you store the data for one day. Instead of having 190 M updates/day, you may have 190 M (smaller) inserts/day; and as many updates as there are segments.

 CREATE TABLE main_segments_history
 ( 
       id_segment integer NOT NULL,
       day date NOT NULL,    
       day_slices bigint[],
       CONSTRAINT main_segments_history_pk PRIMARY KEY (id_segment, day)
 ) ;

 CREATE TABLE dayly_segments
 (
      id_segment integer NOT NULL,
      day date NOT NULL,
      id_slice integer NOT NULL,
      slice bigint,
      PRIMARY KEY (id_segment, day, id_slice)
 ) ;

This would simulate the data for one day (and only 200 segments, instead of the hundreds of thousands you have):

 INSERT INTO 
     dayly_segments
     (id_segment, day, id_slice, slice)
 SELECT
     id_segment, '2017-01-01', id_slice, (random()*1e7)::bigint
 FROM
     generate_series (1, 200) AS s1(id_segment)
     CROSS JOIN generate_series (1, 20*24) AS s2(id_slice) ;

I assume you have some period of low activity when you run the vacuuming processes. At that time, you can move the data from the dayly_segments table to main_segments_history, packing each day's slices into an array. This way, you're never updating main_segments_history.

This is, basically, what you would do:

 -- Move segments from dayly_segments to main_segment_history
 INSERT INTO
     main_segments_history
     (id_segment, day, day_slices)
 SELECT
     id_segment, day, 
     (SELECT array_agg(slice) 
      FROM (SELECT slice 
              FROM dayly_segments s1 
             WHERE s1.id_segment = s0.id_segment AND s1.day = s0.day 
          ORDER BY id_slice) AS s2)
 FROM
     (SELECT DISTINCT 
         id_segment, day 
     FROM
         dayly_segments s0
     WHERE 
         day = '2017-01-01'
     ) AS s0 ;

 -- Delete them from original
 DELETE FROM
     dayly_segments
 WHERE
     day = '2017-01-01' ;

 -- At this point, you should also...
 VACUUM dayly_segments ;

dbfiddle here


Assumptions:

  1. You never "miss" any $current_value. That is, there are no holes in your arrays.
  2. id_slice can be assigned the result of any increasing sequence. You could also use, instead of an integer value, just a time value.
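
For example (just an illustration of assumption 2, not something you must copy literally), the 3-minute writer could derive id_slice directly from the clock, so that each cycle is a plain insert:

    -- Illustrative 3-minute write path for the proposed design:
    -- one small INSERT per segment per slice, no UPDATE at all
    INSERT INTO dayly_segments (id_segment, day, id_slice, slice)
    VALUES (
        $1,                                                      -- segment id
        current_date,
        floor(extract(epoch FROM now()::time) / 180)::int + 1,   -- slice 1..480
        $2                                                       -- packed state
    );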