PostgreSQL – Group and Aggregate JSONb Arrays and Non-JSONb Properties

Tags: aggregate, json, postgresql, query, query-performance

Version
PostgreSQL 11.7 on x86_64-pc-linux-gnu, compiled by gcc, 64-bit

I am trying to build a query which groups distinct values from jsonb columns, and also sums values from other columns, both jsonb and non-jsonb.

Table Definition (I have removed other columns that are not relevant)

id varchar(255) NOT NULL,
casualties jsonb NOT NULL,
involved_parties jsonb NULL,
affected_sectors jsonb NULL,
tags jsonb NULL,
reported_at int8 NULL,

casualties

Each row has a single object which represents the count of casualties across categories, for the incident.

{"police_deaths": 0, "civilian_deaths": 0, "criminal_deaths": 0, "military_deaths": 0, "police_injuries": 0, "emergency_deaths": 0, "civilian_injuries": 1, "criminal_injuries": 1, "military_injuries": 0, "emergency_injuries": 0}

involved_parties

This is an array of objects. Each row shows zero or more involved parties (who was involved in the incident). The data looks a little misleading at first, as each entry within the array carries an ID for the involved party/incident relationship. This doesn't really give us anything and I don't need it, but it is in the data currently.

[
  {"id": "2a0fd9dc-40bd-40dc-88ce-bc819fe9cdd8", "type": "group", "group": {"id": "6d342bfc-72c4-4588-ab95-1b3bdfb4881a", "name": "Naxals"}, "involvement": "Actor"}, 
  {"id": "dafc4726-3d3d-40cb-bbaf-63fa57250b44", "type": "group", "group": {"id": "18c6d3f6-c3eb-45db-9a02-26606f85d7eb", "name": "Indian Security Forces"}, "involvement": "Directly Targeted"}
]

It's the group and involvement data I am interested in here.
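For reference, a minimal sketch (using the table and column names above) of how the group name and involvement can be pulled out of the array, one row per party:

```sql
-- Sketch: unnest involved_parties and project only the fields of interest
select i.id,
       p.elem -> 'group' ->> 'name' as group_name,
       p.elem ->> 'involvement'     as involvement
from incidents i,
     jsonb_array_elements(i.involved_parties) as p(elem);
```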

affected_sectors

This structure is much like the involved parties.

[
  {"id": "fcb952ef-3139-4fe7-ba15-7d800bdc60ae", "sector": {"id": "668d330e-aee5-4291-be98-df9c32b5b420", "name": "Military"}}, 
  {"id": "d1b71bae-29ac-48a2-ab41-a6979d720171", "sector": {"id": "550a4aa0-6d6f-4be2-ba33-f35d159ee686", "name": "Police/Law"}}
]

Here it's the sector I am interested in.

reported_at

This is the epoch representation of when the incident was reported by our analysts.
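When a human-readable value is needed, the epoch can be converted back on the fly; a sketch, assuming the value is stored in seconds rather than milliseconds:

```sql
-- Sketch: epoch seconds back to a timestamptz
select id, to_timestamp(reported_at) as reported_at_ts
from incidents;
```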

Desired Output

For the records in the query I want a single row.
The single row has the following columns:

incident_count,
casualties,
involved_parties,
tags,
min_reported_at,
max_reported_at

The incident count should be just that: the number of rows represented.

The casualties object always has the same properties in the JSON and I want to sum them. So, there will be a single object with the sum of all police_deaths, civilian_deaths etc.
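One way to sum homogeneous JSON counters like this is to explode each object into key/value rows, sum per key, and fold the result back into a single object; a sketch, assuming every row's object carries the same keys:

```sql
-- Sketch: explode casualties, sum per key, rebuild one summary object
select jsonb_object_agg(key, total) as casualties
from (
    select key, sum(value::numeric) as total
    from incidents, jsonb_each_text(casualties)
    group by key
) s;
```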

For the involved parties and affected sectors, each should have a single array containing a unique set of the parties/sectors from the rows.

Min/Max reported at should be the min/max over all rows.

I tried from this starting point:

select 
    jsonb_agg(incidents.affected_sectors) as affected_sectors,
    jsonb_agg(incidents.involved_parties) as involved_parties
from incidents

But this was painfully slow (9 seconds). So I then tried expanding each object out to a row and collapsing it back, but I got lost and ended up with bad results.
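For what it's worth, the expand-then-collapse idea can be written quite compactly for a single column; a sketch for affected_sectors:

```sql
-- Sketch: unnest the array, keep only the sector sub-object,
-- and de-duplicate while re-aggregating
select jsonb_agg(distinct e.elem -> 'sector') as unique_sectors
from incidents,
     jsonb_array_elements(affected_sectors) as e(elem);
```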

I'd appreciate any pointers here

Thanks,

Mark.

Best Answer

OK, so I've got a working query that runs in an acceptable timeframe. It feels ugly, so if there are obvious ways I can improve it, please do let me know.

with base_data as (
        /*This is where the query for incidents/static assets goes*/
        select affected_sectors, involved_parties, reported_at, tags, casualties 
        from incidents
        ------------------------------------------------------------
)
select  /*unique affected_sectors*/ 
        (
            select jsonb_agg(ssect.sector)
            from (
                select sect.sector
                from base_data,
                 jsonb_to_recordset(base_data.affected_sectors) as sect(id varchar, sector jsonb)
                group by sect.sector
                ) ssect
        ) unique_sectors,
        /*unique involved parties*/
        (
            select jsonb_agg(spart.group)
            from    (
                select grp."group"
                from base_data,
                jsonb_to_recordset(base_data.involved_parties) as grp(id varchar, "type" varchar, "group" jsonb, involvement varchar)
                group by grp."group"
            ) spart
        ) unique_groups,
        /*min reported at date*/
        (
            select min(reported_at) from base_data 
        ) min_reported_at,
        /*max reported at date*/
        (
            select max(reported_at) from base_data 
        ) max_reported_at,
        /*unique tags*/
        (
            select jsonb_agg(stags.tags)
            from    (
                select value as tags 
                from base_data, 
                jsonb_array_elements(base_data.tags) 
                group by value
            ) stags
        ) unique_tags,
        /*summary casualty counts*/
        (
            select json_object_agg(key, val)
            from (
                select key, sum(value::numeric) val
                from base_data cas, jsonb_each_text(cas.casualties)
                group by key
                ) scas
        ) casualty_counts,
        /*Incident Count*/
        (
            select count(1) from base_data
        ) incident_count

In our database this runs in around 700ms for 10,000 incidents on a cold cache. I'd have liked this to be sub-200ms and will continue to hack at it. If I come up with anything more useful I'll add a comment.
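One obvious candidate for trimming: the count, min and max subqueries each scan base_data separately, and they can be folded into a single pass; a sketch:

```sql
-- Sketch: one scan instead of three separate scalar subqueries
with base_data as (
    select reported_at from incidents
)
select count(*)         as incident_count,
       min(reported_at) as min_reported_at,
       max(reported_at) as max_reported_at
from base_data;
```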