Postgresql – An alternative to querying with multiple JOIN over the same table

postgresqlquery-performancerelational-division

I have a Postgresql 11 database.
Let's say I have a table called houses. It should have hundreds of thousands of records.

CREATE TABLE houses (
  pkid serial primary key,
  address varchar(255) NOT NULL,
  rent float NOT NULL
);

Now, my houses have features I want to register in the database. As the list of possible features will be quite long (several dozens) and will evolve over time, as I don't want add a long list of columns to the table houses and change the table constantly with 'ALTER TABLE', I thought of having a separate table for these features :

CREATE TABLE house_features (
   pkid serial primary key,
   house_pkid integer NOT NULL,
   feature_name varchar(255) NOT NULL,
   feature_value varchar(255)
);
CREATE INDEX ON house_features (feature_name, feature_value);
ALTER TABLE house_features ADD CONSTRAINT features_fk FOREIGN KEY (house_pkid) REFERENCES houses (pkid) ON DELETE CASCADE;

In average, each house record will have 10-20 records in the house_features table.

So far, this seems a simple efficient model : I can add as many different features, controlling the possible values of feature_name and feature_value in the upper layers (the applicative layer and/or the GUI). I don't have to alter the database each time the application evolves and I need a new type of feature.

For the example, let's say I have the following features :

  • feature_name : 'rooftype' with possible feature_value : 'flat' or 'inclined'
  • feature_name : 'wallcolors' with possible feature_value : 'white', 'beige', 'blue', 'green', etc.. (15 different possible values)
  • feature_name : 'has_basement' with possible feature_value : 'True' or 'False'.
  • feature_name : 'number_of_doors' with possible feature_value any integer coded as a string (so '0', '1', '2', …).
  • feature_name : 'floor_surface' with possible feature_value any given float coded as a string (e.g.: '155.2')

Obviously, storing booleans, integers and floats as strings is not very efficient and this is also something I will need to take care of. I was thinking of having a separate house_features_XXX table for each XXX type (string, boolean, float, integer).

But that is not even my problem.

My problem is : how do I search for houses that have certain features ?

For the example, let's say I want to search the houses with a basement, white walls and an inclined rooftype.
I could dynamically create in the application layer a query like :

SELECT sq1.* FROM 
( SELECT house_pkid FROM house_features WHERE feature_name = 'has_basement' AND feature_value = 'True' ) AS sq1
JOIN
( SELECT house_pkid FROM house_features WHERE feature_name = 'wallcolors' AND feature_value = 'white' ) AS sq2
ON sq1.house_pkid = sq2.house_pkid
JOIN
( SELECT house_pkid FROM house_features WHERE feature_name = 'rooftype' AND feature_value = 'inclined' ) AS sq3
ON sq1.house_pkid = sq3.house_pkid
;

But that seems not so efficient, especially considering that there may be several dozens of conditions on house_features.

Is there a better way to do this ?

Best Answer

You could try to aggregate the features into a JSON value, then searching for a combination of multiple features is quite easy:

select h.*, hf.features
from houses
  join (
    select house_id, jsonb_object_agg(feature_name, feature_value) as features
    from house_features
    group by house_id
  ) hf on hf.house_pkid = h.pkid 
where hf.features @> '{"rooftype": "flat", "has_basement", "true", "wallcolors": "white"}';

Performance can be improved by adding a WHERE clause to the sub-select which repeats the feature names, e.g:

where feature_name in ('rooftype', 'has_basement', 'wallcolors')

or even

where (feature_name, feature_value) in (('rooftype', 'flat') ('has_basement', 'true'), ('wallcolors', 'white'))

The outer condition is still necessary, because the inner where will include houses that don't have all the features.

This also has the advantage (in my eyes) that you only get one row with all the features, rather then one row for each feature.


Unless you remove, add and change features for a house very frequently, storing them as a single JSONB column on the house table (features) and getting rid of the house_features table, might be an alternative. In that case you could create an index on the column to speed up the search.