Postgresql – Refresh Materialized View impact on the DB

amazon-rdsmaterialized-viewpostgresqlpostgresql-9.6

Hi we are running an PostgreSQL 9.6 database in Amazon RDS under a m4.large(2cpu 8gb) and a provisioned IOPS of 1000. The use case is the following: We have a table with several million of registries (4M more or less) and we have created a materialized view with a subset of this table (2M aprox) changing some columns types to be more efficient in querying. Our pg_conf has not been changed, being RDS Postgres' default.

This is our view definition:

CREATE MATERIALIZED VIEW public.customers_mv as
SELECT 
    id,
    gender,
    contact_info,
    location,
    social,
    categories,
    (social ->> 'follower_count')::integer AS social_follower_count,
    (social ->> 'following_count')::integer AS social_following_count,
    (social ->> 'peemv')::float AS social_emv,
    (social ->> 'engagement')::float AS social_engagement,
    (social ->> 'v')::boolean AS social_validated,
    search_vector,
    flags,
    to_tsvector('english',concat_ws(' ','aal0_'||(customers.location ->> 'aal0'),
      'aal1_'||(customers.location ->> 'aal1'),
      'aal2_'||(customers.location ->> 'aal2'),
      'frequent_location_aal0_'||(customers.location -> 'frequent_location' ->> 'aal0'),
      'frequent_location_aal1_'||(customers.location -> 'frequent_location' ->> 'aal1'),
      'frequent_location_aal2_'||(customers.location -> 'frequent_location' ->> 'aal2'),
      'last_post_location_aal0_'||(customers.location -> 'last_post_location' ->> 'aal0'),
      'last_post_location_aal1_'||(customers.location -> 'last_post_location' ->> 'aal1'),
      'last_post_location_aal2_'||(customers.location -> 'last_post_location' ->> 'aal2'),
      'admin_location_aal0_'||(customers.location -> 'admin_location' ->> 'aal0'),
      'bio_location_aal0_'||(customers.location -> 'bio_location' ->> 'aal0'))) as loc_vector
FROM public.customers
WHERE (customers.social -> 'follower_count') > '5000'
AND customers.social ? 'last_posts' 
AND (customers.flags IS NULL OR NOT customers.flags @> '{"destroy": true}'::jsonb);


CREATE INDEX customers_mv_followerc_idx ON customers_mv USING BTREE (social_follower_count);
CREATE INDEX customers_mv_folling_idx ON customers_mv USING BTREE (social_following_count);
CREATE INDEX customers_mv_emv_idx ON customers_mv USING BTREE (social_emv);
CREATE INDEX customers_mv_gin_social_idx ON customers_mv USING GIN (social jsonb_path_ops);
CREATE INDEX customers_mv_partial_social_validated_idx ON customers_mv (social_validated) WHERE social_validated = FALSE;
CREATE INDEX customers_mv_categories_idx ON customers_mv USING gin (categories);
CREATE INDEX customers_mv_gin_location_idx ON customers_mv USING GIN (location jsonb_path_ops);
CREATE INDEX customers_mv_gin_loc_vector ON customers_mv USING gin(loc_vector);
CREATE UNIQUE INDEX customers_mv_uniq_id_idx ON customers_mv (id);

Our view has some columns just for accessing data like location or social (both jsonb type) and some other like loc_vector for querying faster.

Now the problem:
The problem occurs when we try to refresh, refresh concurrently or create a new materialized view, when we try to launch the refresh command the CPU or the write IOPS crash the DB.

Here we can se how hard it hits the DB
CPU Usage

enter image description here

What would be the ideal setting given our requirements?

Its a problem of increasing infrastructure or maybe a matter of configuration/optimization?

Best Answer

Your problem is likely the indexes.

IIRC, refreshing a materialized view drops the existing data and create a new "table" with the current data.

What this does for your indexes is re-index the entire subset of data, which based on your indexes send like a significant workload.

My recommendation would be to not use the built-in materialized views and to instead roll your own equivalent, incrementally updating data so your are only writing stale data/indexes.

The trade off is that this will be much more complex to build out than simply creating a materialized view, and you'll need to consider how you want to track invalidation of stale data.