The problem is that you cannot use parameters as object names (tables, columns, etc.) in straight SQL statements, like your ALTER TABLE
above.
I use 'straight' here as an opposite of 'dynamic' - the thing you need:
...
LOOP
   EXECUTE format($$ALTER TABLE %I SET SCHEMA data_archived$$
                , table_rec.table_name);
END LOOP;
...
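For context, a self-contained variant of that loop might look like the following. The catalog query selecting the tables, the `LIKE` filter, and the source schema name are my assumptions, not part of the original; only the target schema data_archived is from the question:

```sql
DO
$do$
DECLARE
   table_rec record;
BEGIN
   FOR table_rec IN
      SELECT tablename AS table_name
      FROM   pg_tables
      WHERE  schemaname = 'public'            -- hypothetical source schema
      AND    tablename LIKE 'archive\_%'      -- hypothetical table filter
   LOOP
      EXECUTE format($$ALTER TABLE %I SET SCHEMA data_archived$$
                   , table_rec.table_name);
   END LOOP;
END
$do$;
```

The inner $$ quotes nest inside the outer $do$ quotes without conflict because the tags differ. The schema data_archived must exist before running this.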
Notes:
- Here I am using dollar quoting when building the dynamic query. This allows 'normal' syntax, as opposed to doubling single quotes, for example (not needed in this particular example). This way, most editors also highlight the statement nicely.
- I also use format() with the %I format specifier to have the object name quoted properly where necessary. This approach is far more readable than building the query by concatenating string constants with quote_ident() calls. format() is available in Postgres 9.1 and newer.
- I have a feeling that you compare stuff like quote_ident('This_table') IN ('This_table'), which won't be true. Either use quote_ident() on both sides, or, easier, just omit the quoting and compare the plain string values.
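To illustrate the pitfall:

```sql
SELECT quote_ident('This_table');  -- returns "This_table" (double-quoted: mixed case)
SELECT quote_ident('this_table');  -- returns this_table (no quoting needed)

-- Hence this comparison is false: '"This_table"' <> 'This_table'
SELECT quote_ident('This_table') IN ('This_table');  -- f
```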
Test setup
Your original setup in the fiddle leaves room for improvement. I kept asking for your setup for a reason.
You have these indexes on film_actor:
"film_actor_pkey" PRIMARY KEY, btree (actor_id, film_id)
"idx_fk_film_id" btree (film_id)
Which is pretty helpful already. But to best support your particular query, you would have a multicolumn index on (film_id, actor_id), columns in this order. A practical solution: replace idx_fk_film_id with an index on (film_id, actor_id), or create the PK on (film_id, actor_id) for the purpose of this test, like I do below.
In a read-only table (or mostly read-only, or generally when VACUUM can keep up with write activity), it also helps to have an index on (title, film_id) to allow index-only scans. My test case is now highly optimized for read performance.
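The suggested index changes could be sketched like this; the index names (apart from idx_fk_film_id, which is from your setup) are my own:

```sql
-- replace the single-column index with a multicolumn one
DROP INDEX idx_fk_film_id;
CREATE INDEX film_actor_film_id_actor_id_idx ON film_actor (film_id, actor_id);

-- covering index to allow index-only scans for the test query
CREATE INDEX film_title_film_id_idx ON film (title, film_id);
```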
There is a type mismatch between film.film_id (integer) and film_actor.film_id (smallint). While that works, it makes queries slower and can lead to various complications. It also makes FK constraints more expensive. Never do this if it can be avoided. If you are not sure, pick integer over smallint. While smallint can save 2 bytes per field (often consumed by alignment padding anyway), there are more complications than with integer.
To optimize the performance of the test itself, create indexes and constraints after bulk-inserting lots of rows. It is substantially slower to add tuples incrementally to existing indexes than to create them from scratch with all rows present.
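Schematically, with a made-up table and generated data (not your actual setup):

```sql
CREATE TABLE film_actor_big (actor_id int, film_id int);

-- bulk-insert first ...
INSERT INTO film_actor_big (actor_id, film_id)
SELECT (random() * 199)::int + 1, g       -- 200 hypothetical actors
FROM   generate_series(1, 100000) g;      -- 100k rows

-- ... then build indexes and constraints in one pass over all rows
ALTER TABLE film_actor_big ADD PRIMARY KEY (film_id, actor_id);
CREATE INDEX ON film_actor_big (actor_id);
```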
Unrelated to this test:
Free-standing sequences plus column defaults instead of much simpler and more reliable serial (or IDENTITY) columns. Don't.
timestamp without time zone is typically unreliable for a column like last_update. Use timestamptz instead. And note that a column default does not actually cover the "last update", strictly speaking.
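A column default only fires on INSERT. To actually maintain last_update, you would need a trigger along these lines; the function and trigger names are my own, and EXECUTE FUNCTION requires Postgres 11 (use EXECUTE PROCEDURE in older versions):

```sql
CREATE OR REPLACE FUNCTION touch_last_update()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   NEW.last_update := now();  -- overwrite on every UPDATE
   RETURN NEW;
END
$func$;

CREATE TRIGGER film_touch_last_update
BEFORE UPDATE ON film
FOR EACH ROW EXECUTE FUNCTION touch_last_update();
```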
The length modifier in character varying(255) indicates that the test case was not designed for Postgres to begin with, because the odd length limit is pretty pointless here. (Or the author is clueless.)
Consider the audited test case in the fiddle:
db<>fiddle here - building on your fiddle, optimized and with added queries.
A test setup with 1000 films and 200 actors has limited validity. The most efficient queries take < 0.2 ms; planning time is more than execution time. A test with 100k or more rows would be more revealing.
Why retrieve only the first names of actors? Once you retrieve multiple columns, you already have a slightly different situation.
ORDER BY title makes no sense while filtering for a single title with WHERE title = 'ACADEMY DINOSAUR'. Maybe ORDER BY film_id?
And for total runtime, rather use EXPLAIN (ANALYZE, TIMING OFF) to reduce (potentially misleading) noise from per-node timing overhead.
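For example, applied to the test query's filter:

```sql
EXPLAIN (ANALYZE, TIMING OFF)
SELECT f.film_id, f.title
FROM   film f
WHERE  f.title = 'ACADEMY DINOSAUR';
```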
Answer
It's hard to form a simple rule of thumb, because total performance depends on many factors. Very basic guidelines:
Aggregating all rows in sub-tables carries less overhead but only pays when you actually need all rows (or a very large part).
For selecting few rows (your test!), different query techniques yield better results. That's where LATERAL comes in. It carries more overhead, but only reads required rows from sub-tables. A big win if only a (very) small fraction is needed.
For your particular test case, I would also test an ARRAY constructor in the LATERAL subquery:
SELECT f.film_id, f.title, a.actors
FROM   film f
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT a.first_name
      FROM   film_actor fa
      JOIN   actor a USING (actor_id)
      WHERE  fa.film_id = f.film_id
      ) AS actors
   ) a ON true
WHERE  f.title = 'ACADEMY DINOSAUR';
-- ORDER BY f.title; -- redundant while we filter for a single title
When only aggregating a single array in the lateral subquery, a simple ARRAY constructor performs better than the aggregate function array_agg().
Or with a lowly correlated subquery for the simple case:
SELECT f.film_id, f.title
     , ARRAY (SELECT a.first_name
              FROM   film_actor fa
              JOIN   actor a USING (actor_id)
              WHERE  fa.film_id = f.film_id) AS actors
FROM   film f
WHERE  f.title = 'ACADEMY DINOSAUR';
Or, very basically, just 2x LEFT JOIN
and then aggregate:
SELECT f.film_id, f.title, array_agg(a.first_name) AS actors
FROM film f
LEFT JOIN film_actor fa USING (film_id)
LEFT JOIN actor a USING (actor_id)
WHERE f.title = 'ACADEMY DINOSAUR'
GROUP BY f.film_id;
These three seem fastest in my updated fiddle (planning + execution time).
Your first attempt (only slightly modified) is typically fastest to retrieve all or most films, but not for a small selection:
SELECT f.film_id, f.title, a.actors
FROM film f
LEFT JOIN (
SELECT fa.film_id, array_agg(first_name) AS actors
FROM actor
JOIN film_actor fa USING (actor_id)
GROUP BY fa.film_id
) a USING (film_id)
WHERE f.title = 'ACADEMY DINOSAUR'; -- not good for a single (or few) films!
Tests with much bigger cardinalities will be more revealing. And don't generalise results lightly; there are many factors for total performance.
Best Answer
Most common ...
In statistics, this is called the mode. Yes, PostgreSQL has an aggregate function called mode(). I haven't messed with it in PostgreSQL.
Retrieves those not wanted ...
You want the resulting set of all movies for a genre, minus the data set containing all movies rented by a customer for that genre.
In PostgreSQL, you'll want to say EXCEPT instead of MINUS.
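A sketch of that set operation, assuming the usual film/rental join path; the inventory, rental, and film_category table names are my assumptions about your schema:

```sql
-- all films of a given genre ...
SELECT f.film_id, f.title
FROM   film f
JOIN   film_category fc USING (film_id)
WHERE  fc.category_id = 1              -- the genre in question

EXCEPT                                 -- ... minus those the customer rented

SELECT f.film_id, f.title
FROM   film f
JOIN   film_category fc USING (film_id)
JOIN   inventory  i USING (film_id)
JOIN   rental     r USING (inventory_id)
WHERE  fc.category_id = 1
AND    r.customer_id  = 1;             -- the customer in question
```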