Azure PostgreSQL – Handling 100% Memory Usage Without Easy Server Restart

azurepostgresql

I have recently noticed an issue when using Azure Database for PostgreSQL where my memory usage keeps growing, and upon hitting 100%, the server will stop responding.

This database server is used exclusively for development, so it has a lot of short-lived connections that are frequently closed forcibly (as people restart their application to fix an error here or there.)

In the logs, before this happens, there are two patterns I can see, autovacuum errors:

2018-05-22 11:16:13 UTC-5ae5085b.20-LOG:  CreateProcess call failed: No error (error code 1455)
2018-05-22 11:16:13 UTC-5ae5085b.20-LOG:  could not fork autovacuum worker process: No error
2018-05-22 11:16:14 UTC-5ae5085b.20-LOG:  CreateProcess call failed: A blocking operation was interrupted by a call to WSACancelBlockingCall.

 (error code 1455)

And what appears to be a memory usage dump, followed by a warning that a request failed:

TopMemoryContext: 143584 total in 6 blocks; 68072 free (43 chunks); 75512 used
  TopTransactionContext: 8192 total in 1 blocks; 7960 free (0 chunks); 232 used
  CFuncHash: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
  Type information cache: 24472 total in 2 blocks; 2840 free (0 chunks); 21632 used
  Record information cache: 24576 total in 2 blocks; 15072 free (5 chunks); 9504 used
  Operator lookup cache: 24576 total in 2 blocks; 10976 free (5 chunks); 13600 used
  TableSpace cache: 8192 total in 1 blocks; 2312 free (0 chunks); 5880 used
  MessageContext: 8192 total in 1 blocks; 6968 free (0 chunks); 1224 used
  Operator class cache: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
  smgr relation table: 24576 total in 2 blocks; 8896 free (4 chunks); 15680 used
  TransactionAbortContext: 32768 total in 1 blocks; 32728 free (0 chunks); 40 used
  Portal hash: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
  PortalMemory: 8192 total in 1 blocks; 7880 free (0 chunks); 312 used
    PortalHeapMemory: 1024 total in 1 blocks; 672 free (0 chunks); 352 used
      ExecutorState: 4218936 total in 10 blocks; 1970640 free (11 chunks); 2248296 used
        printtup: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        Table function arguments: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        ExprContext: 1040384 total in 7 blocks; 168 free (3 chunks); 1040216 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  Relcache by OID: 24576 total in 2 blocks; 12960 free (4 chunks); 11616 used
  CacheMemoryContext: 1040384 total in 7 blocks; 422592 free (14 chunks); 617792 used
    CachedPlan: 31744 total in 5 blocks; 13368 free (0 chunks); 18376 used
    CachedPlanSource: 3072 total in 2 blocks; 496 free (0 chunks); 2576 used
      unnamed prepared statement: 57344 total in 3 blocks; 31952 free (1 chunks); 25392 used
    pg_shdepend_reference_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_class_tblspc_relfilenode_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_depend_depender_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
    pg_depend_reference_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
    pg_stat_statements: 15360 total in 4 blocks; 664 free (0 chunks); 14696 used
    pg_settings: 23552 total in 5 blocks; 2040 free (0 chunks); 21512 used
    pg_toast_2619_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_stat_database: 15360 total in 4 blocks; 728 free (1 chunks); 14632 used
    pg_stat_bgwriter: 7168 total in 3 blocks; 568 free (1 chunks); 6600 used
    pg_toast_2618_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_stat_activity: 23552 total in 5 blocks; 6464 free (0 chunks); 17088 used
    pg_extension_name_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    EventTriggerCache: 8192 total in 1 blocks; 8152 free (2 chunks); 40 used
      Event Trigger Cache: 8192 total in 1 blocks; 2840 free (0 chunks); 5352 used
    pg_index_indrelid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_db_role_setting_databaseid_rol_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_opclass_am_name_nsp_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
    pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_enum_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_class_relname_nsp_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_foreign_server_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_statistic_relid_att_inh_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
    pg_cast_source_target_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_language_name_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_transform_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_collation_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_amop_fam_strat_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
    pg_index_indexrelid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_ts_template_tmplname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_ts_config_map_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_opclass_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_event_trigger_evtname_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_ts_dict_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_event_trigger_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_conversion_default_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_operator_oprname_l_r_n_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
    pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_enum_typid_label_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_ts_config_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_user_mapping_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_opfamily_am_name_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_foreign_table_relid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_type_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_aggregate_fnoid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_constraint_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_ts_parser_prsname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_ts_config_cfgname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_ts_parser_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_operator_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_namespace_nspname_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_ts_template_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_amop_opr_fam_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
    pg_default_acl_role_nsp_obj_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_collation_name_enc_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_range_rngtypid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_ts_dict_dictname_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_type_typname_nsp_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_opfamily_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_class_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_proc_proname_args_nsp_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
    pg_transform_type_lang_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_proc_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_language_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_namespace_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_amproc_fam_proc_index: 3072 total in 2 blocks; 1976 free (0 chunks); 1096 used
    pg_foreign_server_name_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 264 free (0 chunks); 760 used
    pg_conversion_oid_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_user_mapping_user_server_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_conversion_name_nsp_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_authid_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_auth_members_member_role_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_tablespace_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_shseclabel_object_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_replication_origin_roname_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_database_datname_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_replication_origin_roiident_index: 1024 total in 1 blocks; 448 free (0 chunks); 576 used
    pg_auth_members_role_member_index: 1024 total in 1 blocks; 312 free (0 chunks); 712 used
    pg_database_oid_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
    pg_authid_rolname_index: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
  WAL record construction: 49768 total in 2 blocks; 6584 free (0 chunks); 43184 used
  PrivateRefCount: 8192 total in 1 blocks; 2840 free (0 chunks); 5352 used
  MdSmgr: 8192 total in 1 blocks; 7512 free (1 chunks); 680 used
  LOCALLOCK hash: 8192 total in 1 blocks; 776 free (0 chunks); 7416 used
  Timezones: 104120 total in 2 blocks; 2840 free (0 chunks); 101280 used
  ErrorContext: 8192 total in 1 blocks; 8152 free (4 chunks); 40 used
Grand total: 7133648 bytes in 175 blocks; 2710008 free (99 chunks); 4423640 used
2018-05-22 11:33:03 UTC-5b0107ed.badd4-ERROR:  out of memory
2018-05-22 11:33:03 UTC-5b0107ed.badd4-DETAIL:  Failed on request of size 132.

This picture shows how the memory growth has scaled and never dropped down over the last month.

The staging installation of our application is on AWS RDS, and memory usage is fairly flat there – as you can see on this picture, it has mostly stayed static over the last two weeks.

The amount of options available to me on Azure are very limited to control this. There's no way to restart the server other than scaling it to a higher pricing tier and lowering it again.

Still, I would rather avoid having to do this. Are there any tools I can use to figure out why this might be happening? Is there a better option to control the memory in Azure?

Thank you.

Best Answer

AWS RDS has multiple options in a default parameter group that are calculated from DBInstanceClassMemory variable, for example max_connections parameter which might be {DBInstanceClassMemory/31457280}, which for 8GB server would calculate to 8*1024*1024*1024/31457280=273.

This allows the parameters to scale up and down while changing the database instance RAM size.

I have no idea how is it configured on Azure, but you also need to make sure that those parameters are reasonably configured for your RAM. For the start I'd check which parameters are dependent on DBInstanceClassMemory on AWS RDS and then set those parameters the same on Azure (calculate manually if there's no similar mechanism to automatically scale them).

Try to set them so the query below will return value smaller than available memory for Postgres:

select
    pg_size_pretty(
        shared_buffers::bigint*block_size
        + max_connections*work_mem*1024
        + autovacuum_max_workers*(
            case when autovacuum_work_mem=-1
            then maintenance_work_mem
            else autovacuum_work_mem
            end
        )*1024
    ) as estimated_max_ram_usage
from (
    select
    (select setting::bigint from pg_settings where name='block_size') as block_size,
    (select setting::bigint from pg_settings where name='shared_buffers') as shared_buffers,
    (select setting::bigint from pg_settings where name='max_connections') as max_connections,
    (select setting::bigint from pg_settings where name='work_mem') as work_mem,
    (select setting::bigint from pg_settings where name='autovacuum_max_workers') as autovacuum_max_workers,
    (select setting::bigint from pg_settings where name='autovacuum_work_mem') as autovacuum_work_mem,
    (select setting::bigint from pg_settings where name='maintenance_work_mem') as maintenance_work_mem
) as _
;

But you can still kill the server, for example with a pg_restore or reindex with too many parallel workers (they can use up to maintenance_work_mem per worker) or too complicated queries (they can use up to work_mem per query execution node, and there can be multiple in a single query).

There's no single setting for the server to tell it to use no more than X RAM.