PostgreSQL – DB design advice – shortcut foreign keys vs long queries

database-design, postgresql

I've designed quite a few databases, but there's a problem I come across time and time again: the more information you want to pack into your database, and the more fields and properties you add to your entities, the more tables you end up with that separate objects which are semantically very closely related. Example:

[Diagram: simple db schema]

The Events have an M2M relationship to Identification, which is e.g. an email address, a phone number, etc. that identifies a Person. All Events happen through an Identification. Not all Identifications have an attached Person (if nothing is known about the Person, the Person row isn't needed), and not all Persons are Agents (but some of them are).
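Since the diagram doesn't embed here, here is a rough SQL sketch of the structure described above; all table and column names are my approximations, not necessarily the real ones:

CREATE TABLE team (
    team_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name    text NOT NULL
);

CREATE TABLE person (
    person_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name      text NOT NULL
);

CREATE TABLE agent (
    agent_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    person_id bigint NOT NULL REFERENCES person,
    team_id   bigint NOT NULL REFERENCES team
);

CREATE TABLE identification (
    identification_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    kind      text NOT NULL,             -- e.g. 'email', 'phone'
    value     text NOT NULL,
    person_id bigint REFERENCES person   -- nullable: the Person may be unknown
);

CREATE TABLE event (
    event_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    happened_at timestamptz NOT NULL
);

-- the M2M relationship between Event and Identification
CREATE TABLE event_identification (
    event_id          bigint NOT NULL REFERENCES event,
    identification_id bigint NOT NULL REFERENCES identification,
    PRIMARY KEY (event_id, identification_id)
);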

A commonplace query on that database would be to get all Events related to Agents, or all Events for one Team. With this schema, however, you have to run those queries over 4-5 tables, which is both cumbersome and probably quite slow once the row counts reach the millions (you're welcome to comment on the performance as well). If I wanted to simplify things and speed queries up, I could connect Event directly to Agent/Team/Person via another M2M table or a Postgres FK array, but that would duplicate the relationships.
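For illustration, here is what "all Events for one Team" looks like against the sketched tables (same assumed names as above); note the chain of four joins:

SELECT DISTINCT e.*    -- DISTINCT: an event can be linked to several identifications
FROM event e
JOIN event_identification ei ON ei.event_id = e.event_id
JOIN identification i ON i.identification_id = ei.identification_id
JOIN agent a ON a.person_id = i.person_id
JOIN team t ON t.team_id = a.team_id
WHERE t.name = 'Some team';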

1) Is there a better design pattern for this database?
2) What are the best practices when it comes to shortcut FKs for tables that are related, but only through n+1 other tables?
3) Is this even a legitimate concern or am I optimizing this prematurely? Maybe this schema is fine and I'm worrying needlessly?

PS. This is a redesign of an existing database that I know will have millions of rows, so the "don't waste energy on optimizing stuff that may never be necessary" point is not valid here.

Thanks!

Best Answer

The table structure you show is fine from a database point of view, and it should work well for an OLTP workload. Using nested loop joins, it is no problem to fetch a few rows by joining a number of tables.
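As a sketch (using the table names assumed in your question): once the join columns are indexed, a selective query like "all Events of one Person" should be resolved with nested loop joins that touch only a handful of rows per table.

-- indexes on the join columns (the primary keys are indexed already)
CREATE INDEX ON event_identification (identification_id);
CREATE INDEX ON identification (person_id);

-- a selective OLTP-style lookup; EXPLAIN should show nested loops
EXPLAIN
SELECT e.*
FROM event e
JOIN event_identification ei ON ei.event_id = e.event_id
JOIN identification i ON i.identification_id = ei.identification_id
WHERE i.person_id = 42;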

If the schema is properly normalized, data modifications will be as localized as possible, and consistency is guaranteed.

Things become different when you want to perform analytical queries. Then joining many tables really hurts. This is why for OLAP workloads, you tend to denormalize the data and put them in a star schema or a similar structure. Such a “data warehouse” also usually doesn't hold the data in detail, but pre-aggregated in a way that makes the queries fast and the database not too large.
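To illustrate with the assumed tables from your question, pre-aggregation can be as simple as a materialized view that flattens the join chain once and stores daily event counts per team:

CREATE MATERIALIZED VIEW team_event_daily AS
SELECT t.team_id,
       t.name AS team_name,
       date_trunc('day', e.happened_at) AS day,
       count(*) AS event_count
FROM event e
JOIN event_identification ei ON ei.event_id = e.event_id
JOIN identification i ON i.identification_id = ei.identification_id
JOIN agent a ON a.person_id = i.person_id
JOIN team t ON t.team_id = a.team_id
GROUP BY t.team_id, t.name, date_trunc('day', e.happened_at);

-- needed for REFRESH ... CONCURRENTLY, and useful for lookups anyway
CREATE UNIQUE INDEX ON team_event_daily (team_id, day);

Analytical queries then read this one small table instead of joining five large ones.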

If you want to run both short transactions and analytical queries on the same database, it is hard or impossible to create a data model that fits both. Either you are ready to suffer with your analytical queries (possibly off-loading them to a streaming replication standby), or you create a second database with an analytical schema and an ETL process that regularly updates it from the live database.
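If the analytical structure lives in the same cluster, the lightest-weight version of that ETL process is just a periodic refresh of such a materialized view, e.g. from cron; CONCURRENTLY requires the unique index above and doesn't block readers:

REFRESH MATERIALIZED VIEW CONCURRENTLY team_event_daily;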