PostgreSQL – Group and Aggregate JSONb Arrays and Non-JSONb Properties

Tags: aggregate, json, postgresql, query, query-performance

Version
PostgreSQL 11.7 on x86_64-pc-linux-gnu, compiled by gcc, 64-bit

I am trying to build a query which groups distinct values from jsonb columns, and also sums values from other columns, both jsonb and non-jsonb.

Table Definition (I have removed other columns that are not relevant)

id varchar(255) NOT NULL,
casualties jsonb NOT NULL,
involved_parties jsonb NULL,
affected_sectors jsonb NULL,
tags jsonb NULL,
reported_at int8 NULL,

casualties

Each row has a single object which represents the count of casualties across categories, for the incident.

{"police_deaths": 0, "civilian_deaths": 0, "criminal_deaths": 0, "military_deaths": 0, "police_injuries": 0, "emergency_deaths": 0, "civilian_injuries": 1, "criminal_injuries": 1, "military_injuries": 0, "emergency_injuries": 0}

involved_parties

This is an array of objects. Each row shows zero or more involved parties (who was involved in the incident). The data looks a little misleading at first, as each entry within the array carries an ID for the involved party/incident relationship. This doesn't really give us anything and I don't need it, but it is in the data currently.

[
  {"id": "2a0fd9dc-40bd-40dc-88ce-bc819fe9cdd8", "type": "group", "group": {"id": "6d342bfc-72c4-4588-ab95-1b3bdfb4881a", "name": "Naxals"}, "involvement": "Actor"}, 
  {"id": "dafc4726-3d3d-40cb-bbaf-63fa57250b44", "type": "group", "group": {"id": "18c6d3f6-c3eb-45db-9a02-26606f85d7eb", "name": "Indian Security Forces"}, "involvement": "Directly Targeted"}
]

It's the group and involvement data I am interested in here.
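For reference, a minimal sketch (using the table and column names above) of how the group name and involvement can be pulled out of the array, one row per party:

```sql
-- Sketch: unnest involved_parties and project only the fields of interest
select i.id,
       p.elem -> 'group' ->> 'name' as group_name,
       p.elem ->> 'involvement'     as involvement
from incidents i,
     jsonb_array_elements(i.involved_parties) as p(elem);
```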

affected_sectors

This structure is much like the involved parties.

[
  {"id": "fcb952ef-3139-4fe7-ba15-7d800bdc60ae", "sector": {"id": "668d330e-aee5-4291-be98-df9c32b5b420", "name": "Military"}}, 
  {"id": "d1b71bae-29ac-48a2-ab41-a6979d720171", "sector": {"id": "550a4aa0-6d6f-4be2-ba33-f35d159ee686", "name": "Police/Law"}}
]

Here it's the sector I am interested in.

reported_at

This is the epoch representation of when the incident was reported by our analysts.
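When a human-readable value is needed, the epoch can be converted back on the fly; a sketch, assuming the value is stored in seconds rather than milliseconds:

```sql
-- Sketch: epoch seconds back to a timestamptz
select id, to_timestamp(reported_at) as reported_at_ts
from incidents;
```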

Desired Output

For the records in the query I want a single row.
The single row has the following columns:

incident_count,
casualties,
involved_parties,
tags,
min_reported_at,
max_reported_at

The incident count should be just that: the number of rows represented.

The casualties object always has the same properties in the JSON and I want to sum them. So, there will be a single object with the sum of all police_deaths, civilian_deaths etc.
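One way to sum homogeneous JSON counters like this is to explode each object into key/value rows, sum per key, and fold the result back into a single object; a sketch, assuming every row's object carries the same keys:

```sql
-- Sketch: explode casualties, sum per key, rebuild one summary object
select jsonb_object_agg(key, total) as casualties
from (
    select key, sum(value::numeric) as total
    from incidents, jsonb_each_text(casualties)
    group by key
) s;
```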

For the involved parties and affected sectors, each should have a single array containing a unique set of the parties/sectors from the rows.

Min/Max reported at should be the min/max over all rows.

I tried from this starting point:

select 
    jsonb_agg(incidents.affected_sectors) as affected_sectors,
    jsonb_agg(incidents.involved_parties) as involved_parties
from incidents

But this was painfully slow (9 seconds). So I then tried expanding each object out to a row and collapsing it back, but I got lost and ended up with bad results.
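For what it's worth, the expand-then-collapse idea can be written quite compactly for a single column; a sketch for affected_sectors:

```sql
-- Sketch: unnest the array, keep only the sector sub-object,
-- and de-duplicate while re-aggregating
select jsonb_agg(distinct e.elem -> 'sector') as unique_sectors
from incidents,
     jsonb_array_elements(affected_sectors) as e(elem);
```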

I'd appreciate any pointers here

Thanks,

Mark.

Best Answer

OK, so I've got a working query that runs in an acceptable timeframe. It feels ugly, so if there are obvious ways I can improve it, please do let me know.

with base_data as (
        /*This is where the query for incidents/static assets goes*/
        select affected_sectors, involved_parties, reported_at, tags, casualties 
        from incidents
        ------------------------------------------------------------
)
select  /*unique affected_sectors*/ 
        (
            select jsonb_agg(ssect.sector)
            from (
                select sect.sector
                from base_data,
                 jsonb_to_recordset(base_data.affected_sectors) as sect(id varchar, sector jsonb)
                group by sect.sector
                ) ssect
        ) unique_sectors,
        /*unique involved parties*/
        (
            select jsonb_agg(spart.group)
            from    (
                select grp."group"
                from base_data,
                jsonb_to_recordset(base_data.involved_parties) as grp(id varchar, "type" varchar, "group" jsonb, involvement varchar)
                group by grp."group"
            ) spart
        ) unique_groups,
        /*min reported at date*/
        (
            select min(reported_at) from base_data 
        ) min_reported_at,
        /*max reported at date*/
        (
            select max(reported_at) from base_data 
        ) max_reported_at,
        /*unique tags*/
        (
            select jsonb_agg(stags.tags)
            from    (
                select value as tags 
                from base_data, 
                jsonb_array_elements(base_data.tags) 
                group by value
            ) stags
        ) unique_tags,
        /*summary casualty counts*/
        (
            select json_object_agg(key, val)
            from (
                select key, sum(value::numeric) val
                from base_data cas, jsonb_each_text(cas.casualties)
                group by key
                ) scas
        ) casualty_counts,
        /*Incident Count*/
        (
            select count(1) from base_data
        ) incident_count

In our database this runs in around 700ms for 10,000 incidents on a cold cache. I'd have liked this to be sub-200ms and will continue to hack at it. If I come up with anything more useful I'll add a comment.
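One obvious candidate for trimming: the count, min and max subqueries each scan base_data separately, and they can be folded into a single pass; a sketch:

```sql
-- Sketch: one scan instead of three separate scalar subqueries
with base_data as (
    select reported_at from incidents
)
select count(*)         as incident_count,
       min(reported_at) as min_reported_at,
       max(reported_at) as max_reported_at
from base_data;
```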