MySQL – Aggregate Function, JOINS, and GROUP BY in Sakila Database Project

aggregategroup byjoin;MySQL

The current pandemic is making it very difficult to contact my teachers for assistance, so I'm hoping I can describe my problem well enough to try getting some help here.

The database used is the sakila example database
(I tried to create a fiddle, but the tables are just too big)

My data will be grouped by each store (only 2 video stores), and I will be getting

  1. The total sales of each store (teachers hint: store table has inventory. inventory can be rented, rental table has payments.
  2. count of customers per store
  3. inventory count per store
  4. count of movies per store
  5. count of movies rented that are not yet returned

Here is my incorrect output:

--------------------------------------------------------------------------------------------------------------------------------
| Jordan_Rasmussen | store_id | total_sales | num_customers | count_inventory | count_titles | inventory_cost | num_rentals_out |
--------------------------------------------------------------------------------------------------------------------------------
| (date&time)      | 1        | 68359569.18 | 326           | 2270            | 759          | 952923.30      | 29992 |
----------------------------------------------------------------------------------------------------------------------------
| (date&time)      | 2        | 56966647.92 | 273           | 2311            | 762          | 970134.69      | 24843 |              |
--------------------------------------------------------------------------------------------------------------------------------

Here is the correct output:

--------------------------------------------------------------------------------------------------------------------------------
| First_and_last N | store_id | total_sales | num_customers | count_inventory | count_titles | inventory_cost | num_rentals_out |
--------------------------------------------------------------------------------------------------------------------------------
| (date&time)      | 1        | 209691.93   | 326           | 2270            | 759          | 46205.30       | 92              |
--------------------------------------------------------------------------------------------------------------------------------
| (date&time)      | 2        | 208669.04   | 273           | 2311            | 762          | 46415.89       | 91              |
--------------------------------------------------------------------------------------------------------------------------------

Here is my code for trying to get the above output:

SELECT NOW() AS 'Jordan_Rasmussen',
s.store_id,

-- Get the total sales
SUM(p.amount) AS total_sales,

-- Get the number of customers
COUNT(DISTINCT c.customer_id) AS num_customers,

-- Get the inventory count
COUNT(DISTINCT i.inventory_id) AS inventory_count,

-- Get the number of movie titles
COUNT(DISTINCT f.title) AS num_titles,

-- Get the inventory value 
SUM(DISTINCT f.replacement_cost) * COUNT(DISTINCT i.inventory_id)   AS inventory_value,

-- Get the number of movies rented that have not yet been returned 
COUNT(r.rental_date) AS num_rentals_out 

FROM store AS s 
LEFT JOIN inventory AS i ON s.store_id = i.store_id 
LEFT JOIN customer AS c ON s.store_id = c.store_id 
INNER JOIN rental AS r ON i.inventory_id = r.inventory_id
INNER JOIN payment AS p ON r.rental_id = p.rental_id
INNER JOIN film AS f ON i.film_id = f.film_id 


GROUP BY store_id;

I can get the correct results by themselves, but I notice the more I JOIN tables together, the more the results change. I've been scratching my head about this for awhile now, but I'm just not sure what it is I'm missing.

Should I be using subqueries? Or am I just joining improperly?

Sorry for such a big question, but I'm clueless right now.

Best Answer

The problem is you are joining unrelated tables. If you don't join them, as in

SELECT NOW() AS 'Dorjan Masrussen',
        s.store_id,
        (select sum(p.amount)
           from payment p
           where p.rental_id in (
             select r.rental_id
               from rental r
               join inventory i
                 on i.inventory_id = r.inventory_id
               where i.store_id = s.store_id)) total_sales,
        (select count(*)
           from customer c
           where c.store_id = s.store_id) num_customers,
        (select count(*)
           from inventory i
           where i.store_id = s.store_id) count_inventory,
        (select count(distinct film_id)
           from inventory i
           where i.store_id = s.store_id) count_titles,
        (select sum(f.replacement_cost)
           from inventory i
           join film f on f.film_id = i.film_id
           where i.store_id = s.store_id) inventory_cost,
        (select count(*)
           from rental r
           join inventory i
             on i.inventory_id = r.inventory_id
           where i.store_id = s.store_id) num_rentals_out
  from store as s

you get:

+---------------------+----------+-------------+---------------+-----------------+--------------+----------------+-----------------+
| Dorjan Masrussen    | store_id | total_sales | num_customers | count_inventory | count_titles | inventory_cost | num_rentals_out |
+---------------------+----------+-------------+---------------+-----------------+--------------+----------------+-----------------+
| 2020-03-31 17:32:18 |        1 |    33679.79 |           326 |            2270 |          759 |       46205.30 |            7923 |
| 2020-03-31 17:32:18 |        2 |    33726.77 |           273 |            2311 |          762 |       46415.89 |            8121 |
+---------------------+----------+-------------+---------------+-----------------+--------------+----------------+-----------------+

Total sales and num_rentals_out differ from your correct output. I wonder who is right. The Sakila structure diagram, if I'm not mistaken, has it's crow's feet inverted.

The same result can be obtained, while querying inventory only once, using a CTE, the Common Table Expression "WITH", as follows:

WITH idata as (
  select i.store_id,
         count(*) num,
         count(distinct i.film_id) num_films,
         sum(f.replacement_cost) repl_cost,
         sum((select count(*)
                from rental r
                where r.inventory_id = i.inventory_id)) num_rentals,
         sum((select sum(p.amount)
                from payment p
                where p.rental_id in (
                  select r.rental_id
                    from rental r
                    where r.inventory_id = i.inventory_id
                ))) sum_rentals
    from inventory i
    join film f on f.film_id = i.film_id
    group by i.store_id
)
SELECT NOW() AS 'Dorjan Masrussen',
        s.store_id,
        idata.sum_rentals total_sales,
        (select count(*)
           from customer c
           where c.store_id = s.store_id) num_customers,
        idata.num count_inventory,
        idata.num_films count_titles, 
        idata.repl_cost inventory_cost,
        idata.num_rentals num_rentals_out
  from store as s
  join idata on idata.store_id = s.store_id

Querying the store table can also be dropped, since all we need is the store_id, which we also find in inventory.