PostgreSQL Database Design – Creating Index Over Two Tables

database-designindex-tuningpostgresqlpostgresql-9.6

Using PostgreSQL (currently 9.6, but upgrades are possible), I currently have the following database layout where customers can order products, which are themselves sorted into categories (products may be in multiple categories):

Orders
id -- PRIMARY KEY
customer_id -- FOREIGN KEY (Customer - id)
product_id -> FOREIGN KEY (Product - id)

Products
id -- PRIMARY KEY

Categories
id -- PRIMARY KEY

Product_Categories
product_id -- FOREIGN KEY (Product - id)
category_id -- FOREIGN KEY (Category - id)

Data volume

Now, I have a fairly large amount of orders (~30M) and a reasonable number of categories (~1K) and customers (~10K). There is around 30K Products, with an average of 3 products by category. Products may be moved from a category to another occasionally (let's say a once per month shuffle)

Query tendencies

My problem is that I want to have the following type of query run fast: "Get all Orders for customer whose product is in Category C". That would look like:

SELECT * FROM Orders 
JOIN Product_Categories ON Orders.product_id = Product_Categories.product_id
WHERE Orders.customer_id = X AND Product_Categories.category_id = Y

Indexing considerations

The best index I can think of is an index on customer_id in Orders, supported by a secondary index on Product_Categories.product_id. This leads to the following plan (not a real plan since the design I showed above is a very large simplification of the actual case):

 - Index Scan on Orders using index on customer_id ---> Returns ~10K Rows
 - 10K Joins done by Index Lookup on the product_id index of Product_Categories (MAIN TIME CONSUMER)
 - 9990 Rows Filtered Out.
 - 10 Rows Returned

I would like to have an index over (customer_id, category_id), but I haven't been able to find a way to do this. The best solutions I can think of is to add a column categories_id INTEGER[] and then either:

  1. Add a GIN index using categories_id and customer_id with the inclusion in list operator.
  2. Create 1000 Partial indexes on order_id.

In both cases, I would have to synchronize categories_id with the updates in the categoryproduct association tables, which is unfortunate.

Questions

My questions are:

  • Am I overthinking? Is the "filtering out 10k" rows not that bad of a
    problem and any solution I can think of will make the problem worse?
  • Am I missing something? Can I be efficient without changing my database schema?
  • Assuming I should change my database schema, what is the best way to do so?

Best Answer

If you have an index on product_categories (category_id), as well as the one you already have on orders (customer_id) then this type of query should be very fast. You can do a highly specific index scan on each table separately, then hash join the results.

https://explain.depesz.com/s/JEpZ

If that isn't fast enough for you, or you can't get it to use such a plan even when you have indexes in place, then I'm afraid you will have to give us a lot more info, like the actual query plan including timing, and what time you hope to achieve.