PostgreSQL – DB design advice – shortcut foreign keys vs long queries

database-design, postgresql

I've designed quite a few databases, but there's a problem I come across time and time again: the more information you want to pack into your database, and the more fields and properties you add to your entities, the more tables you end up with that separate objects which are semantically very closely related. Example:

[Diagram: simple db schema]

The Events have an M2M relationship to Identification, which is e.g. an email address, a phone number, etc. that identifies a Person. All Events happen through an Identification. Not all Identifications have an attached Person (if nothing is known about the Person, the Person row isn't needed), and not all Persons are Agents (but some of them are).
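Since the diagram doesn't embed here, here is a rough SQL sketch of the structure described above; all table and column names are my approximations, not necessarily the real ones:

CREATE TABLE team (
    team_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name    text NOT NULL
);

CREATE TABLE person (
    person_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name      text NOT NULL
);

CREATE TABLE agent (
    agent_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    person_id bigint NOT NULL REFERENCES person,
    team_id   bigint NOT NULL REFERENCES team
);

CREATE TABLE identification (
    identification_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    kind      text NOT NULL,             -- e.g. 'email', 'phone'
    value     text NOT NULL,
    person_id bigint REFERENCES person   -- nullable: the Person may be unknown
);

CREATE TABLE event (
    event_id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    happened_at timestamptz NOT NULL
);

-- the M2M relationship between Event and Identification
CREATE TABLE event_identification (
    event_id          bigint NOT NULL REFERENCES event,
    identification_id bigint NOT NULL REFERENCES identification,
    PRIMARY KEY (event_id, identification_id)
);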

A commonplace query on that database would be to get all Events related to Agents, or all Events for one Team. With this schema, however, you have to run those queries over 4-5 tables, which is both cumbersome and probably quite slow once the row counts reach the millions (you're welcome to comment on the performance as well). If I wanted to simplify things and speed queries up, I could connect Event directly to Agent/Team/Person via another M2M table or a Postgres FK array, but that would duplicate the relationships.
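For illustration, here is what "all Events for one Team" looks like against the sketched tables (same assumed names as above); note the chain of four joins:

SELECT DISTINCT e.*    -- DISTINCT: an event can be linked to several identifications
FROM event e
JOIN event_identification ei ON ei.event_id = e.event_id
JOIN identification i ON i.identification_id = ei.identification_id
JOIN agent a ON a.person_id = i.person_id
JOIN team t ON t.team_id = a.team_id
WHERE t.name = 'Some team';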

1) Is there a better design pattern for this database?
2) What are the best practices when it comes to shortcut FKs for tables that are related, but only through n+1 other tables?
3) Is this even a legitimate concern or am I optimizing this prematurely? Maybe this schema is fine and I'm worrying needlessly?

PS. This is a redesign of an existing database that I know will have millions of rows, so the "don't waste energy on optimizing stuff that may never be necessary" point is not valid here.

Thanks!

Best Answer

The table structure you show is fine from a database point of view, and it should work well for an OLTP workload. Using nested loop joins, it is no problem to fetch a few rows by joining a number of tables.
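As a sketch (using the table names assumed in your question): once the join columns are indexed, a selective query like "all Events of one Person" should be resolved with nested loop joins that touch only a handful of rows per table.

-- indexes on the join columns (the primary keys are indexed already)
CREATE INDEX ON event_identification (identification_id);
CREATE INDEX ON identification (person_id);

-- a selective OLTP-style lookup; EXPLAIN should show nested loops
EXPLAIN
SELECT e.*
FROM event e
JOIN event_identification ei ON ei.event_id = e.event_id
JOIN identification i ON i.identification_id = ei.identification_id
WHERE i.person_id = 42;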

If the schema is properly normalized, data modifications will be as localized as possible, and consistency is guaranteed.

Things become different when you want to perform analytical queries. Then joining many tables really hurts. This is why for OLAP workloads, you tend to denormalize the data and put them in a star schema or a similar structure. Such a “data warehouse” also usually doesn't hold the data in detail, but pre-aggregated in a way that makes the queries fast and the database not too large.
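To illustrate with the assumed tables from your question, pre-aggregation can be as simple as a materialized view that flattens the join chain once and stores daily event counts per team:

CREATE MATERIALIZED VIEW team_event_daily AS
SELECT t.team_id,
       t.name AS team_name,
       date_trunc('day', e.happened_at) AS day,
       count(*) AS event_count
FROM event e
JOIN event_identification ei ON ei.event_id = e.event_id
JOIN identification i ON i.identification_id = ei.identification_id
JOIN agent a ON a.person_id = i.person_id
JOIN team t ON t.team_id = a.team_id
GROUP BY t.team_id, t.name, date_trunc('day', e.happened_at);

-- needed for REFRESH ... CONCURRENTLY, and useful for lookups anyway
CREATE UNIQUE INDEX ON team_event_daily (team_id, day);

Analytical queries then read this one small table instead of joining five large ones.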

If you want to run both short transactions and analytical queries on the same database, it is hard or impossible to create a data model that fits both. Either you are ready to suffer with your analytical queries (possibly off-loading them to a streaming replication standby), or you create a second database with an analytical schema and an ETL process that regularly updates it from the live database.
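If the analytical structure lives in the same cluster, the lightest-weight version of that ETL process is just a periodic refresh of such a materialized view, e.g. from cron; CONCURRENTLY requires the unique index above and doesn't block readers:

REFRESH MATERIALIZED VIEW CONCURRENTLY team_event_daily;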