I have two tables in one-to-many rel, I want to run a query efficiently returning data from the base table and some aggregates from the joined subtable. Adding in some indices, I still did not manage to realize how to get this baby going. So:
I have two tables:
CREATE TABLE public.person (
id serial NOT NULL,
"name" varchar(60) NULL,
"nationality" varchar(60) NULL,
-- ...
CONSTRAINT person_pkey PRIMARY KEY (id)
);
CREATE TABLE public.vehicle (
id serial NOT NULL,
person_id int4 NOT NULL, -- owner
vehicle_type_id int4 NOT NULL, -- 0=car, 1=bike, 2=...
"name" varchar(60) NULL,
"data" text NULL,
-- ...
CONSTRAINT vehicle_pkey PRIMARY KEY (id),
CONSTRAINT owner FOREIGN KEY (person_id) REFERENCES person(id)
);
I want to run a query like
SELECT
id, name, -- ...
COALESCE(v.vehicle_count, 0) vehcount,
COALESCE(v.has_car, false) has_car,
COALESCE(v.has_bike, false) has_bike
FROM person p
LEFT JOIN (
SELECT
person_id,
COUNT(1) AS vehicle_count,
bool_or(vehicle_type_id = 0) AS has_car,
bool_or(vehicle_type_id = 1) AS has_bike
FROM
vehicle
GROUP BY person_id
) v ON v.person_id = p.id
limit 100;
I generated some sample data with
INSERT INTO person (id,name)
SELECT
id, md5(random()::text)
FROM generate_series(1,1000000) id;
INSERT INTO vehicle (id, person_id, vehicle_type_id, data)
SELECT
id, (id-1)/4+1, cast(random()*1 as int),
substring(repeat(md5(random()::text), 32), 1, cast(random()*1000 as int))
FROM generate_series(1,4000000) id where random() < 0.5;
Tried adding the indices
CREATE INDEX person_name ON public.person USING btree (name);
CREATE INDEX vehicle_person ON public.vehicle USING btree (person_id, vehicle_type_id);
Explain analyze comes up with a plan that merges vehicle with person through person_id, and then performs sorting by p.name.
Limit (cost=175683.15..175694.82 rows=100 width=80) (actual time=1746.010..1750.880 rows=100 loops=1)
-> Gather Merge (cost=175683.15..272912.24 rows=833334 width=80) (actual time=1746.008..1750.854 rows=100 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=174683.12..175724.79 rows=416667 width=80) (actual time=1742.265..1742.274 rows=79 loops=3)
Sort Key: p.name
Sort Method: top-N heapsort Memory: 49kB
Worker 0: Sort Method: top-N heapsort Memory: 48kB
Worker 1: Sort Method: top-N heapsort Memory: 48kB
-> Merge Left Join (cost=0.85..158758.41 rows=416667 width=80) (actual time=1.541..1629.871 rows=333333 loops=3)
Merge Cond: (p.id = vehicle.person_id)
-> Parallel Index Scan using person_pkey on person p (cost=0.42..28484.09 rows=416667 width=37) (actual time=0.059..108.234 rows=333333 loops=3)
-> GroupAggregate (cost=0.43..115317.40 rows=834915 width=14) (actual time=0.075..1269.030 rows=937271 loops=3)
Group Key: vehicle.person_id
-> Index Only Scan using vehicle_person on vehicle (cost=0.43..76972.43 rows=1999721 width=8) (actual time=0.068..523.452 rows=1999350 loops=3)
Heap Fetches: 5998049
Planning Time: 0.181 ms
Execution Time: 1750.966 ms
What I would like to happen is the db to use person_name
index and for the 100 records use the vehicle_person
index to look up and aggregate the has_car, has_bike, vehicle_count fields.
I understand that btree is not optimal for vehicle_person
, and tried to use a hash index, but that cannot include non-indexed fields, thus it would still require looking up the record from disk.
Running the query takes 2seconds on my box, while without the left join part 3ms.
NOTE: My example is not perfect. With the real data, the query runs way over 30seconds.
The real tables person has ~20 varchar/int fields, vehicle has one text field with an average of 500 byte length, ~5M records each.
(tried adding in a data field into both tables to simulate this, but that changed the query plan)
Best Answer
The planner will alas not rewrite subqueries into lateral queries. You need to do that yourself:
OF course if it weren't for the small LIMIT, the other way of writing it would end up being faster, so it is a shame the planner won't interconvert between them automatically in a context-aware way.