PostgreSQL Lateral Left Join Query Performance Issues – Takes a Long Time

postgresql, postgresql-performance, query-performance

I am trying to run the query below on a database with hundreds of thousands of rows, but it has been running for over 12 hours so far. It is the lateral left join that is taking so long; I need it because I am trying to get the most recently updated category for each transaction (hence the ORDER BY c.updated_at DESC). Does anyone have suggestions on how to speed this query up? I suspect I need to create indexes, but I am unsure which indexes to create or how to create them.

SELECT *
FROM tp_base_fi_account_transactions t
LEFT JOIN LATERAL (
        SELECT category_id
        FROM tp_base_fi_account_transaction_categories c
        WHERE c.transaction_id = t.id
        ORDER BY c.updated_at DESC
        LIMIT 1
) category ON TRUE
INNER JOIN tp_base_transaction_categories cats ON cats.id = category.category_id
INNER JOIN tp_base_vendor_associations va ON va.id = t.vendor_association_id
INNER JOIN tp_base_vendors v ON v.id = va.vendor_id
INNER JOIN tp_base_userprofile up ON up.user_id = t.assigned_user_id
INNER JOIN tp_base_organizations o ON o.id = up.organization_id

The schema is as follows:

**tp_base_fi_account_transaction_categories c** columns: ['id', 'created_at', 'updated_at', 'transaction_id', 'category_id', 'user_confirmed']
**tp_base_fi_account_transactions t** columns: ['id', 'transaction_id', 'reference_number', 'amount', 'date', 'ppd_id', 'primary_type', 'is_pending', 'created_at', 'updated_at', 'organization_fi_account_id', 'vendor_association_id', 'receipt_id', 'group_id', 'changed_by_id', 'payment_method_id', 'user_confirmed', 'acct_class_id', 'department_id', 'memo', 'location_id', 'settlement_date']
**tp_base_vendor_associations va** columns: ['id', 'name', 'mid', 'mcc', 'created_at', 'updated_at', 'vendor_id']
**tp_base_vendors v** columns: ['id', 'name', 'domain', 'created_at', 'updated_at', 'is_matchable']
**tp_base_userprofile up** columns: ['id', 'approval_group_id', 'user_id', 'avatar', 'display_name', 'organization_id']
**tp_base_organizations o** columns: ['id', 'name', 'created_by_id']

When I run EXPLAIN ANALYZE on the query with the results limited to 5, this is the query plan I get:

Limit  (cost=28872.80..138364.49 rows=5 width=1461) (actual time=231.504..647.825 rows=5 loops=1)
  ->  Nested Loop  (cost=28872.80..17255720972.54 rows=787991 width=1461) (actual time=231.503..647.803 rows=5 loops=1)
        ->  Nested Loop  (cost=28872.38..17255320333.89 rows=787991 width=1375) (actual time=231.488..647.740 rows=5 loops=1)
              ->  Nested Loop  (cost=28871.96..17254951399.67 rows=787991 width=1286) (actual time=231.471..647.663 rows=5 loops=1)
                    ->  Nested Loop  (cost=28871.54..17248296815.67 rows=787991 width=1224) (actual time=231.443..647.520 rows=5 loops=1)
                          ->  Hash Join  (cost=6982.90..221853.70 rows=787991 width=1220) (actual time=114.346..114.996 rows=5 loops=1)
                                Hash Cond: (up.organization_id = o.id)
                                ->  Hash Join  (cost=6969.09..210746.32 rows=856978 width=1001) (actual time=114.117..114.748 rows=5 loops=1)
                                      Hash Cond: (t.assigned_user_id = up.user_id)
                                      ->  Seq Scan on tp_base_fi_account_transactions t  (cost=0.00..27129.78 rows=856978 width=747) (actual time=0.012..0.020 rows=7 loops=1)
                                      ->  Hash  (cost=3546.93..3546.93 rows=73293 width=254) (actual time=114.010..114.010 rows=73293 loops=1)
                                            Buckets: 16384  Batches: 8  Memory Usage: 2360kB
                                            ->  Seq Scan on tp_base_userprofile up  (cost=0.00..3546.93 rows=73293 width=254) (actual time=0.024..51.276 rows=73293 loops=1)
                                ->  Hash  (cost=10.58..10.58 rows=258 width=219) (actual time=0.200..0.200 rows=258 loops=1)
                                      Buckets: 1024  Batches: 1  Memory Usage: 60kB
                                      ->  Seq Scan on tp_base_organizations o  (cost=0.00..10.58 rows=258 width=219) (actual time=0.006..0.086 rows=258 loops=1)
                          ->  Limit  (cost=21888.65..21888.65 rows=1 width=12) (actual time=106.499..106.500 rows=1 loops=5)
                                ->  Sort  (cost=21888.65..21888.65 rows=2 width=12) (actual time=106.495..106.495 rows=1 loops=5)
                                      Sort Key: c.updated_at DESC
                                      Sort Method: quicksort  Memory: 25kB
                                      ->  Seq Scan on tp_base_fi_account_transaction_categories c  (cost=0.00..21888.64 rows=2 width=12) (actual time=48.199..106.454 rows=1 loops=5)
                                            Filter: (transaction_id = t.id)
                                            Rows Removed by Filter: 1029685
                    ->  Index Scan using tp_base_transaction_categories_pkey on tp_base_transaction_categories cats  (cost=0.42..8.44 rows=1 width=62) (actual time=0.020..0.020 rows=1 loops=5)
                          Index Cond: (id = c.category_id)
              ->  Index Scan using tp_base_vendor_associations_pkey on tp_base_vendor_associations va  (cost=0.42..0.46 rows=1 width=89) (actual time=0.010..0.010 rows=1 loops=5)
                    Index Cond: (id = t.vendor_association_id)
        ->  Index Scan using tp_base_vendors_pkey on tp_base_vendors v  (cost=0.42..0.50 rows=1 width=86) (actual time=0.008..0.009 rows=1 loops=5)
              Index Cond: (id = va.vendor_id)
Planning time: 3.422 ms
Execution time: 651.187 ms

Best Answer

I don't think the LEFT JOIN here makes any sense: the very next join is an inner join keyed on equality with category.category_id, which is NULL in exactly the rows the LEFT would preserve, so those rows can't be returned anyway. Based on the plan, though, the planner is smart enough to realize this and ignore the LEFT.
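To make that explicit (a sketch, not a change in behavior): with the inner join to cats in place, the first two joins collapse to a plain CROSS JOIN LATERAL. If you actually wanted to keep transactions that have no category row, the cats join would have to become a LEFT JOIN as well.

Select *
From tp_base_fi_account_transactions t
cross join lateral (
        select category_id
        from tp_base_fi_account_transaction_categories c
        where c.transaction_id = t.id
        order by c.updated_at DESC
        LIMIT 1
) category
inner join tp_base_transaction_categories cats on cats.id = category.category_id
inner join tp_base_vendor_associations va ON va.id = t.vendor_association_id
inner join tp_base_vendors v ON v.id = va.vendor_id
inner join tp_base_userprofile up ON up.user_id = t.assigned_user_id
inner join tp_base_organizations o ON o.id = up.organization_id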

You might need an index on transaction_id. Creating it is nothing special, just:

create index on tp_base_fi_account_transaction_categories (transaction_id);

It might be slightly faster to do this instead:

create index on tp_base_fi_account_transaction_categories (transaction_id, updated_at);
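Either index can be built without blocking concurrent writes by adding CONCURRENTLY (one caveat: it cannot run inside a transaction block, and the build takes longer):

create index concurrently on tp_base_fi_account_transaction_categories (transaction_id, updated_at);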

It looks like transaction_id is mostly unique within the table (based on a sample size of 5 from your LIMITed EXPLAIN ANALYZE), so adding updated_at to the index will probably not help a huge amount.
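You can check that directly rather than guessing from a 5-row sample (a quick diagnostic; if the two counts are close, the wider index buys little):

select count(*) as total_rows,
       count(distinct transaction_id) as distinct_transaction_ids
from tp_base_fi_account_transaction_categories;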

Since you are selecting all of tp_base_fi_account_transactions (unless those inner joins serve to filter rows out, rather than just to bring in extra columns), a potentially faster alternative is a sub-select that fetches the newest row for each transaction_id in bulk, then a join to that without the lateral. That join can proceed as a hash join rather than a nested loop:

Select *
From tp_base_fi_account_transactions t
join (
        select distinct on (transaction_id) transaction_id, category_id
        from tp_base_fi_account_transaction_categories        
        order by transaction_id, updated_at DESC
) category on category.transaction_id = t.id -- qualified, since t also has a transaction_id column
inner join tp_base_transaction_categories cats on cats.id=category.category_id
inner join tp_base_vendor_associations va ON va.id = t.vendor_association_id
inner join tp_base_vendors v ON v.id = va.vendor_id
inner join tp_base_userprofile up ON up.user_id = t.assigned_user_id
inner join tp_base_organizations o ON o.id = up.organization_id
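If you want the DISTINCT ON pass itself to be index-assisted, the index has to match its sort order exactly (a sketch; whether the planner prefers it over a sequential scan plus sort depends on table size and statistics):

create index on tp_base_fi_account_transaction_categories (transaction_id, updated_at DESC);

Whichever variant you try, compare plans with EXPLAIN (ANALYZE, BUFFERS) on the full query rather than a LIMIT 5 run, since the LIMIT changes the plan shape considerably.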