Postgresql – An alternative to querying with multiple JOIN over the same table

postgresqlquery-performancerelational-division

I have a Postgresql 11 database.
Let's say I have a table called houses. It should have hundreds of thousands of records.

CREATE TABLE houses (
  pkid serial primary key,
  address varchar(255) NOT NULL,
  rent float NOT NULL
);

Now, my houses have features I want to register in the database. As the list of possible features will be quite long (several dozens) and will evolve over time, as I don't want add a long list of columns to the table houses and change the table constantly with 'ALTER TABLE', I thought of having a separate table for these features :

CREATE TABLE house_features (
   pkid serial primary key,
   house_pkid integer NOT NULL,
   feature_name varchar(255) NOT NULL,
   feature_value varchar(255)
);
CREATE INDEX ON house_features (feature_name, feature_value);
ALTER TABLE house_features ADD CONSTRAINT features_fk FOREIGN KEY (house_pkid) REFERENCES houses (pkid) ON DELETE CASCADE;

In average, each house record will have 10-20 records in the house_features table.

So far, this seems a simple efficient model : I can add as many different features, controlling the possible values of feature_name and feature_value in the upper layers (the applicative layer and/or the GUI). I don't have to alter the database each time the application evolves and I need a new type of feature.

For the example, let's say I have the following features :

feature_name : 'rooftype' with possible feature_value : 'flat' or 'inclined'
feature_name : 'wallcolors' with possible feature_value : 'white', 'beige', 'blue', 'green', etc.. (15 different possible values)
feature_name : 'has_basement' with possible feature_value : 'True' or 'False'.
feature_name : 'number_of_doors' with possible feature_value any integer coded as a string (so '0', '1', '2', …).
feature_name : 'floor_surface' with possible feature_value any given float coded as a string (e.g.: '155.2')

Obviously, storing booleans, integers and floats as strings is not very efficient and this is also something I will need to take care of. I was thinking of having a separate house_features_XXX table for each XXX type (string, boolean, float, integer).

But that is not even my problem.

My problem is : how do I search for houses that have certain features ?

For the example, let's say I want to search the houses with a basement, white walls and an inclined rooftype.
I could dynamically create in the application layer a query like :

SELECT sq1.* FROM 
( SELECT house_pkid FROM house_features WHERE feature_name = 'has_basement' AND feature_value = 'True' ) AS sq1
JOIN
( SELECT house_pkid FROM house_features WHERE feature_name = 'wallcolors' AND feature_value = 'white' ) AS sq2
ON sq1.house_pkid = sq2.house_pkid
JOIN
( SELECT house_pkid FROM house_features WHERE feature_name = 'rooftype' AND feature_value = 'inclined' ) AS sq3
ON sq1.house_pkid = sq3.house_pkid
;

But that seems not so efficient, especially considering that there may be several dozens of conditions on house_features.

Is there a better way to do this ?

Best Answer

You could try to aggregate the features into a JSON value, then searching for a combination of multiple features is quite easy:

select h.*, hf.features
from houses
  join (
    select house_id, jsonb_object_agg(feature_name, feature_value) as features
    from house_features
    group by house_id
  ) hf on hf.house_pkid = h.pkid 
where hf.features @> '{"rooftype": "flat", "has_basement", "true", "wallcolors": "white"}';

Performance can be improved by adding a WHERE clause to the sub-select which repeats the feature names, e.g:

where feature_name in ('rooftype', 'has_basement', 'wallcolors')

or even

where (feature_name, feature_value) in (('rooftype', 'flat') ('has_basement', 'true'), ('wallcolors', 'white'))

The outer condition is still necessary, because the inner where will include houses that don't have all the features.

This also has the advantage (in my eyes) that you only get one row with all the features, rather then one row for each feature.

Unless you remove, add and change features for a house very frequently, storing them as a single JSONB column on the house table (features) and getting rid of the house_features table, might be an alternative. In that case you could create an index on the column to speed up the search.

Related Solutions

Postgresql – Postgres multiple joins slow query, how to store default child record

You write:

Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.

Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?

From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?

Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.

If this query is your bottleneck, I would suggest you implement a materialized view.

After you provided more information I find that my diagnosis pretty much holds.

The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows, that would solve most of it, even if it means to execute the whole query every time.

If you go the route with a materialized view like I proposed you could add a serial number without gaps to the table plus index on it - by adding a column row_number() OVER (<your sort citeria here>) AS mv_id.

Then you can query:

SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;

This will perform very fast. LIMIT / OFFSET cannot compete, that needs to compute the whole table before it can sort and pick 100 rows.

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server you need to use SQL EXPLAIN ANALYZE or the built in Shift + F7keyboard shortcut or Query -> Explain analyze. Then, at the bottom of the explain output you get something like this:

Total runtime: 0.269 ms

Mysql – Identical query, tables, but different EXPLAIN and performance

Just make a simple join. Sub-queries does not provide the best result quite often

EXPLAIN SELECT l.id, l.level_name, l.date_published, l.rating
FROM levels AS l
INNER JOIN users_favorites AS uf 
ON uf.level_id = l.id
WHERE l.user_id = 2;

Best Answer

Related Solutions

Postgresql – Postgres multiple joins slow query, how to store default child record

pgAdmin timing

Mysql – Identical query, tables, but different EXPLAIN and performance

Related Question