PostgreSQL Optimization – Improving Query Performance with Many Permutations

amazon-rdsoptimizationperformancepostgresqlpostgresql-performance

I'm trying to figure out the optimal solution for filtering a result set by every permutation of a given input criteria. We're using Postgresql 11 on AWS RDS

I've created a sql fiddle here that outlines the schema I'm dealing with. Below is a copy of the sample data in that problem.

In this example we use "investors" with "holdings" (positions) in certain "securities". Securities have 3 attributes we care about:

Market Cap
Sector
Country

I'd like to be able to filter investors on one or more of these 3 criteria. So for instance I'd like to ask a question like "show me investors that focus in small-cap, canadian, mining companies" and get only those investors with holdings in securities matching those attributes. Or, "show me investors that focus in small-cap, mid-cap and large-cap companies"

For each of these attributes, I'd like to be able to query by one or many values, and I'd want to see investors that hold all permutations of those types of securities. So "show me investors in small-cap, mid-cap, canadian, american, materials" companies means:

Show me investors with:

at least one holding in small-cap, canadian, materials AND
at least one holding in mid-cap, canadian, materials AND
at least one holding in small-cap, american, materials AND
at least one holding in mid-cap, american, materials

The naive solution I've come up with is something along the lines of:

SELECT * from investors i
-- small-cap, canadian, materials holdings
WHERE EXISTS (
  SELECT 1 FROM holdings h
    JOIN securities s on h.security_id = s.id
  WHERE s.market_cap = 'SM'
    AND s.country = 'CA'
    AND s.sector = 'materials'
  AND h.investor_id = i.id
)
-- mid-cap, canadian, materials holdings
AND EXISTS (...)
-- and so-on for each permutation of the criteria

While this works, it's definitely not scalable. I'm pretty sure there's a way to improve upon this situation so it's not exponential in cost, but the formula eludes me at this moment.

Our system has hundreds of millions of holdings for hundreds of thousands of investors so this naive solution simply won't scale.

Can someone point me in the right direction here that would yield an optimal solution and avoid the exponential sub-selects that this naive solution would lead to?

Sample Data

Securities

| id | name         | sector      | market_cap | country |
|----+--------------+-------------+------------+---------|
| 1  | 'Mining ABC' | 'materials' |  'SM'      | 'CA'    |
| 2  | 'SilverFox'  | 'materials' |  'MD'      | 'CA'    |
| 3  | 'Big Coppa'  | 'materials' |  'LG'      | 'CA'    |
| 4  | 'Golds R Us' | 'materials' |  'LG'      | 'US'    |
| 5  | 'Weedly'     | 'pharma'    |  'SM'      | 'CA'    |
| 6  | 'HazeMaker'  | 'pharma'    |  'MD'      | 'US'    |
| 7  | 'StickyIcky' | 'pharma'    |  'LG'      | 'US'    |

Investors

| id | name   |
|----+--------|
| 11 | 'john' |
| 22 | 'bill' |
| 33 | 'susan'|
| 44 | 'jill' |

Holdings

| security_id | investor_id | shares |
|-------------+-------------+--------|
| 5           | 11          | 1      | -- john,  small-cap canadian pharma
| 7           | 11          | 12     | -- john,  large-cap american pharma
| 2           | 11          | 13     | -- john,  mid-cap   canadian materials
| 3           | 11          | 514    | -- john,  large-cap canadian materials
| 7           | 22          | 15     | -- bill,  large-cap american pharma
| 5           | 22          | 16     | -- bill,  small-cap canadian pharma
| 1           | 22          | 117    | -- bill,  small-cap canadian materials
| 2           | 33          | 18     | -- susan, mid-cap   canadian materials
| 3           | 33          | 919    | -- susan, large-cap canadian materials
| 4           | 33          | 20     | -- susan, large-cap american materials
| 1           | 44          | 21     | -- jill,  small-cap canadian materials
| 3           | 44          | 22     | -- jill,  large-cap canadian materials
| 4           | 44          | 123    | -- jill,  large-cap american materials
| 5           | 44          | 456    | -- jill,  small-cap canadian pharma
| 6           | 44          | 20     | -- jill,  mid-cap   american pharma
| 7           | 44          | 3      | -- jill,  large-cap american pharma

Best Answer

If they have to have one from each, then the total number of distinct holdings they have, when distinction is based only on those 3 properties, must be equal to the number of "permutations" which exist:

select investor_id from holdings h join securities s on (h.security_id = s.id) 
   where market_cap  in ('SM', 'MD') and country in ('CA','US') and sector in ('materials')
   group by investor_id
   having count(distinct (market_cap, country, sector)) = 2*2*1

This is probably not going to scale awesome, but probably better than your initial idea.

Partial index

If you need it faster still, and if there are more than a few rows with type <> 'VerifiedAddress' or state = 'manual_verification', a partial index would help some more. The gain grows with the percentage of rows you can exclude from the index.

CREATE INDEX addresses_selection_idx ON addresses (user_id)
WHERE  type  =  'VerifiedAddress'
AND    state <> 'manual_verification';

Be sure to use matching WHERE conditions in your query to allow Postgres to use this index. And weigh the benefit of tailored indexes against their cost (maintenance, extra disk space, small penalty for write operations).

PostgreSQL – Optimizing a Query with a Large IN Clause

Instead of using a huge IN-list, join on a VALUES expression, or if the list is large enough, use a temp table, index it, then join on it.

It'd be nice if PostgreSQL could do this internally & automatically but at this point the planner doesn't know how.