I'm trying to write a query that returns data from multiple tables and aggregates it in one nested JSON field. I feel like this would perform great on SqlServer but, as Brent Ozar wrote in this post, the Postgres optimizer walls the CTE queries together. This forces me to use a WHERE
statement at the first CTE's level as it would otherwise load the entire dataset every times. That and the specific JSON functions which I'm not really used to make me wonder if this could perform better.
I've tried to write this without a CTE but was unsure how to nest subqueries.
Is there any postgres tricks I'm missing here ? Are those indexes effective ?
The output looks like this :
[{
"item_property_id": 1001010,
"property_name": "aadb480d8716e52da33ed350b00d6cef",
"values": [
"1f64450fae03b127cf95f9b06fca4bca",
"9a6883b8a87a5028bf7dfc27412c2de8"
]
},{
"item_property_id": 501010,
"property_name": "e870e8d81e16ee46c75493856b4c6b66",
"values": [
"a6bed25b407c515bb8a55f2e239066ec",
"feb10299fd6408e0d37a8761e334c97a"
]
},{
"item_property_id": 1010,
"property_name": "f2d7b27c50a059d9337c949c13aa3396",
"values": [
"56674c1c3d66c832abf87b436a4fd095",
"ff88fe69f4438a6277c792faaf485368"
]
}]
Here's the script to generate the schema and test data
--create schema
drop table if exists public.items;
drop table if exists public.items_properties;
drop table if exists public.items_properties_values;
create table public.items(
item_id integer primary key,
item_name varchar(250));
create table public.items_properties(
item_property_id serial primary key,
item_id integer,
property_name varchar(250));
create table public.items_properties_values(
item_property_value_id serial primary key,
item_property_id integer,
property_value varchar(250));
CREATE INDEX items_index
ON public.items USING btree
(item_id ASC NULLS LAST,item_name asc nulls last)
TABLESPACE pg_default;
CREATE INDEX properties_index
ON public.items_properties USING btree
(item_property_id ASC NULLS LAST,item_id asc nulls last,property_name asc nulls last)
TABLESPACE pg_default;
CREATE INDEX values_index
ON public.items_properties_values USING btree
(item_property_value_id ASC NULLS LAST,item_property_id asc nulls last,property_value asc nulls last)
TABLESPACE pg_default;
--insert dummy data
insert into public.items
SELECT generate_series(1,500000),md5(random()::text);
insert into public.items_properties (item_id,property_name)
SELECT item_id,md5(random()::text) from public.items;
insert into public.items_properties (item_id,property_name)
SELECT item_id,md5(random()::text) from public.items;
insert into public.items_properties (item_id,property_name)
SELECT item_id,md5(random()::text) from public.items;
insert into public.items_properties_values (item_property_id,property_value)
select item_property_id,md5(random()::text) from public.items_properties;
insert into public.items_properties_values (item_property_id,property_value)
select item_property_id,md5(random()::text) from public.items_properties;
--Query returned successfully in 22 secs 704 msec.
Here's the SQL command
Without the where on the third line it takes ~15 seconds to load. I understand this is loading thousands of records so maybe it's performing just fine but I'd REALLY like a second opinion.
with cte_items as (
select item_id,item_name from public.items
--where item_id between 1000 and 1010
),cte_properties as (
select ip.item_id,ip.item_property_id,ip.property_name from public.items_properties ip
inner join cte_items i on i.item_id=ip.item_id
),cte_values as (
select ipv.item_property_value_id,ipv.item_property_id,ipv.property_value from public.items_properties_values ipv
inner join cte_properties p on ipv.item_property_id=p.item_property_id
)
select i.item_id,i.item_name,json_agg(json_build_object('item_property_id',prop.item_property_id,'property_name',prop.property_name,'values',prop.values))
from cte_items i
left join (
select cp.item_id,cp.item_property_id,cp.property_name,json_agg(to_json(cv.property_value)) "values"
from cte_properties cp
left join ( select val.item_property_id,val.property_value from cte_values val ) cv on cv.item_property_id=cp.item_property_id
group by cp.item_id,cp.item_property_id,cp.property_name
) prop
on i.item_id=prop.item_id
group by i.item_id,i.item_name
Best Answer
What @jjanes wrote about CTEs acting as optimization fence.
Your particular query does not need CTEs to begin with - nor most of the other included noise. What I see can be reduced to a
SELECT
with two levels of nested subqueries:db<>fiddle here
Was more than twice as fast in my quick test.
While querying whole tables it is also much faster to aggregate first and join later. Even more so when you have more than just 2 or 3 rows per aggregate like in your demo - which may be over-simplifying.
Related: