PostgreSQL Database Design – Creating Index Over Two Tables

database-design, index-tuning, postgresql, postgresql-9.6

Using PostgreSQL (currently 9.6, but upgrades are possible), I currently have the following database layout where customers can order products, which are themselves sorted into categories (products may be in multiple categories):

Orders
id -- PRIMARY KEY
customer_id -- FOREIGN KEY (Customer - id)
product_id -- FOREIGN KEY (Product - id)

Products
id -- PRIMARY KEY

Categories
id -- PRIMARY KEY

Product_Categories
product_id -- FOREIGN KEY (Product - id)
category_id -- FOREIGN KEY (Category - id)
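
For readers who want to reproduce the layout, here is a minimal DDL sketch of it. The column types, the customers table, and the composite primary key on product_categories are assumptions, since the question gives only key relationships.

```sql
-- Sketch of the layout above. Types and the customers table are assumptions.
CREATE TABLE customers (
    id integer PRIMARY KEY
);

CREATE TABLE products (
    id integer PRIMARY KEY
);

CREATE TABLE categories (
    id integer PRIMARY KEY
);

CREATE TABLE orders (
    id          bigint  PRIMARY KEY,
    customer_id integer NOT NULL REFERENCES customers (id),
    product_id  integer NOT NULL REFERENCES products (id)
);

-- Association table: a product may appear in several categories.
CREATE TABLE product_categories (
    product_id  integer NOT NULL REFERENCES products (id),
    category_id integer NOT NULL REFERENCES categories (id),
    PRIMARY KEY (product_id, category_id)  -- assumption: each pair is stored once
);
```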

Data volume

Now, I have a fairly large number of orders (~30M) and a reasonable number of categories (~1K) and customers (~10K). There are around 30K products, with an average of 3 products per category. Products may occasionally be moved from one category to another (say, a once-a-month shuffle).

Query tendencies

My problem is that I want the following type of query to run fast: "Get all orders for customer X whose product is in category Y". That would look like:

SELECT * FROM Orders
JOIN Product_Categories ON Orders.product_id = Product_Categories.product_id
WHERE Orders.customer_id = X AND Product_Categories.category_id = Y;

Indexing considerations

The best index I can think of is an index on customer_id in Orders, supported by a secondary index on Product_Categories.product_id. This leads to the following plan (not a real plan since the design I showed above is a very large simplification of the actual case):

 - Index Scan on Orders using index on customer_id ---> Returns ~10K Rows
 - 10K Joins done by Index Lookup on the product_id index of Product_Categories (MAIN TIME CONSUMER)
 - 9990 Rows Filtered Out.
 - 10 Rows Returned
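
The two indexes described above, and a way to capture the real plan with timings, might look like the sketch below. The index names and the literal values 42 and 7 (standing in for X and Y) are placeholders, not taken from the actual system.

```sql
-- Indexes the question describes; the names are hypothetical.
CREATE INDEX orders_customer_id_idx
    ON orders (customer_id);
CREATE INDEX product_categories_product_id_idx
    ON product_categories (product_id);

-- Run the query with timing and buffer statistics to see where time goes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*
FROM orders o
JOIN product_categories pc ON pc.product_id = o.product_id
WHERE o.customer_id = 42      -- placeholder for X
  AND pc.category_id = 7;     -- placeholder for Y
```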

I would like to have an index over (customer_id, category_id), but I haven't been able to find a way to do this. The best solution I can think of is to add a column categories_id INTEGER[] and then either:

 - Add a GIN index on categories_id and customer_id and query it with the array containment operator.
 - Create 1000 partial indexes on order_id, one per category.

In both cases, I would have to synchronize categories_id with the updates in the category ↔ product association tables, which is unfortunate.
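
A rough sketch of the GIN variant follows, assuming the array column lives on orders and that the btree_gin contrib extension is available (it is what lets a plain integer column such as customer_id take part in a GIN index). The index name and literal values are placeholders, and the backfill shown here would have to be rerun, or replaced by triggers, whenever product_categories changes.

```sql
-- btree_gin supplies GIN operator classes for scalar types such as integer.
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- Denormalized copy of each order's product categories (assumed to live on orders).
ALTER TABLE orders ADD COLUMN categories_id integer[];

-- Backfill from the association table; on ~30M rows this is a heavy one-off update.
UPDATE orders o
SET categories_id = ARRAY(
    SELECT pc.category_id
    FROM product_categories pc
    WHERE pc.product_id = o.product_id
);

-- Multicolumn GIN index covering both the customer and the category array.
CREATE INDEX orders_customer_categories_gin
    ON orders USING gin (customer_id, categories_id);

-- The original query, rewritten against the denormalized column.
SELECT *
FROM orders
WHERE customer_id = 42              -- placeholder for X
  AND categories_id @> ARRAY[7];    -- placeholder for Y
```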

Questions

My questions are:

 - Am I overthinking? Is the "filtering out 10k" rows not that bad of a problem and any solution I can think of will make the problem worse?
 - Am I missing something? Can I be efficient without changing my database schema?
 - Assuming I should change my database schema, what is the best way to do so?

Best Answer

If you have an index on product_categories (category_id), as well as the one you already have on orders (customer_id), then this type of query should be very fast. The planner can do a highly selective index scan on each table separately, then hash join the results.

https://explain.depesz.com/s/JEpZ
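
As a sketch of what the answer suggests, the additional index and a check of the resulting plan could look like this. The index name and the literal values are placeholders, and whether the planner actually picks two index scans plus a hash join depends on the real table statistics.

```sql
-- orders (customer_id) is already indexed according to the question.
-- New index suggested by the answer; the name is hypothetical.
CREATE INDEX product_categories_category_id_idx
    ON product_categories (category_id);

-- Refresh statistics so the planner sees the current data distribution.
ANALYZE orders;
ANALYZE product_categories;

-- Inspect the plan actually chosen, with timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*
FROM orders o
JOIN product_categories pc ON pc.product_id = o.product_id
WHERE o.customer_id = 42      -- placeholder for X
  AND pc.category_id = 7;     -- placeholder for Y
```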

If that isn't fast enough for you, or you can't get it to use such a plan even when you have indexes in place, then I'm afraid you will have to give us a lot more info, like the actual query plan including timing, and what time you hope to achieve.

Related Solutions

SQL Server – How to model and enforce constraints on a categorical item contained in a collection

Categories and Buckets.

create table categories
(
  category_id int primary key
);

create table buckets
(
  bucket_id int primary key
);

Depending on how you interpret "an item may belong to at most 1 bucket", you need to do different things with Items.

An item has to belong to one bucket.

create table items
(
  item_id int primary key,
  category_id int not null references categories(category_id),
  bucket_id int not null references buckets(bucket_id),
  unique (category_id, bucket_id)
);

If an item may belong to one bucket but does not have to, you need to allow null values in bucket_id and use a filtered index as the unique constraint.

create table items
(
  item_id int primary key,
  category_id int not null references categories(category_id),
  bucket_id int references buckets(bucket_id)
);

create unique index uq_items on items(category_id, bucket_id) where bucket_id is not null;

There is a collection of items where each item belongs to 1 category.

category_id as a foreign key in items.

An item may belong to at most 1 bucket.

bucket_id as a foreign key in items.

There is a collection of buckets which may contain at most 1 item from each category.

A unique constraint on bucket_id and category_id in items.
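
To see the constraint at work, a few inserts against the tables defined above can be tried. The id values are arbitrary sample data.

```sql
insert into categories (category_id) values (1), (2);
insert into buckets (bucket_id) values (10);

-- First item of category 1 in bucket 10: accepted.
insert into items (item_id, category_id, bucket_id) values (100, 1, 10);

-- Item of a different category in the same bucket: accepted.
insert into items (item_id, category_id, bucket_id) values (101, 2, 10);

-- Second item of category 1 in bucket 10: rejected, because it duplicates
-- the (category_id, bucket_id) pair (1, 10).
insert into items (item_id, category_id, bucket_id) values (102, 1, 10);
```

With the filtered-index variant, any number of category 1 items can additionally be stored with a null bucket_id, since rows without a bucket are excluded from the unique index.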

SQL Server – Database Design for an Online Baby Shop and Toy Shop

Your design generally looks good, except for a few mistakes.

1) You used the wrong field to establish the relation between ProductCategory and Product. Use ProductCategoryID in the Product table instead of ProductCategoryName. However, some products may belong to more than one product category.
If that is a valid case for your project, add an additional table, something like "ProductVsProductCategory", which lets you establish a many-to-many relation between the Product and ProductCategory tables.

The basic structure of ProductVsProductCategory might look like the following:

ProductVsProductCategoryID (primary key)

ProductID (refers to Product table)

ProductCategoryID (refers to ProductCategory table)
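
In T-SQL, that junction table might be declared as in the sketch below. It assumes the Product and ProductCategory tables have primary keys named ProductID and ProductCategoryID. The unique constraint is an addition to the structure listed above, there to keep the same pair from being stored twice.

```sql
CREATE TABLE ProductVsProductCategory
(
    ProductVsProductCategoryID int IDENTITY(1,1) PRIMARY KEY,
    -- Assumed key names on the referenced tables.
    ProductID         int NOT NULL REFERENCES Product (ProductID),
    ProductCategoryID int NOT NULL REFERENCES ProductCategory (ProductCategoryID),
    -- Not in the list above: prevents linking a product to a category twice.
    CONSTRAINT UQ_ProductVsProductCategory UNIQUE (ProductID, ProductCategoryID)
);
```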

2) The OrderProductName field in the OrderDetail table is unnecessary. You have a relation to the Product table and can read ProductName from there anyway. You can drop it.

3) OrderNumber in the Order table is unnecessary unless it carries some special value beyond being a unique identity number. You can use the OrderID field as the identity of the order.

4) Similarly, the OrderQuantity and OrderTotalPrice fields might be unnecessary unless you have a special reason to keep them there. You can calculate the order quantity and total price with basic SQL statements that sum the related OrderDetail rows.
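
A query along those lines could look like the sketch below. The OrderDetail column names Quantity and UnitPrice are assumptions, since the original schema is not shown here. Order is a reserved word in SQL Server, hence the brackets.

```sql
-- Derive per-order totals from the detail rows instead of storing them.
SELECT o.OrderID,
       SUM(od.Quantity)                AS OrderQuantity,
       SUM(od.Quantity * od.UnitPrice) AS OrderTotalPrice
FROM [Order] AS o
JOIN OrderDetail AS od ON od.OrderID = o.OrderID
GROUP BY o.OrderID;
```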

Many additional tables can be added depending on your project requirements. I also recommend following MSDN articles, and especially CodeProject, to improve your abstraction skills.