SQL Server Database Design – When to Duplicate Data for Querying

database-design, sql-server

I have the following table design.

An Event table, with helper tables to hold the event metadata.

Each event row has an event type column (regular_event, recurring_route, timed_messages, and more).

Each event type has its own tables that define different things about the event, including users. The catch is that users might be linked to an event directly through a many-to-many table (users_regular_events), or only through a third or fourth table (events -> recurring_routes -> recurring_routes_stations -> recurring_stations -> recurring_stations_users -> users).

Would it make sense to create a users_events table that duplicates the relations between users and certain events? Currently, if I want to show each user their events, I have to join multiple tables. Or maybe I need to rethink my entire design?

Best Answer

I agree with Aaron Bertrand that premature optimization is not a good idea. Relational database management systems are built for joining data. Don't presume a performance problem; observe it, preferably through thorough load testing before moving to production, and then deal with it if necessary.
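For the schema in the question, the joins can be hidden behind a view instead of being copied into a new table. The following is only a sketch: the table names come from the question, but the view name and every column name (user_id, event_id, route_id, station_id) are assumptions, since the question does not list its columns.

```sql
-- Hypothetical column names; only the table names come from the question.
CREATE VIEW dbo.v_user_events
AS
    -- Events linked to users directly through the many-to-many table.
    SELECT ure.user_id, ure.event_id
    FROM dbo.users_regular_events AS ure

    UNION

    -- Events linked to users through routes and stations.
    SELECT rsu.user_id, rr.event_id
    FROM dbo.recurring_routes AS rr
    JOIN dbo.recurring_routes_stations AS rrs
        ON rrs.route_id = rr.route_id
    JOIN dbo.recurring_stations_users AS rsu
        ON rsu.station_id = rrs.station_id;
```

Showing a user their events is then a single query, such as SELECT event_id FROM dbo.v_user_events WHERE user_id = @user_id, and each further event type becomes one more branch of the UNION. Whether this performs well enough is exactly what load testing should tell you.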

As to your question about when it makes sense to duplicate data for querying, the classic scenario is a data warehouse. Ideally, you want the data to be static, in other words, read-only. This can be the case for historical transactional data, for example. In a data warehouse the data is written as close to once as possible and is read many, many times. If this is your scenario, then denormalization for reporting may be a reasonable design choice.

However, any time you introduce redundancy in your data you open yourself to the risk of loss of data quality. This is what the Normal Forms are designed to prevent. Redundancy introduces risk that you need to manage.
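One way to manage that risk is to check the copy against its source regularly. A rough sketch, assuming a hypothetical duplicated table dbo.users_events(user_id, event_id) and a hypothetical view dbo.v_user_events that derives the same pairs from the normalized tables:

```sql
-- Both object names and their columns are assumptions for illustration.

-- Pairs in the copy that the normalized tables no longer support.
SELECT user_id, event_id FROM dbo.users_events
EXCEPT
SELECT user_id, event_id FROM dbo.v_user_events;

-- Pairs the copy is missing.
SELECT user_id, event_id FROM dbo.v_user_events
EXCEPT
SELECT user_id, event_id FROM dbo.users_events;
```

If either query returns rows, the duplicated table has drifted from the data it is supposed to mirror.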

As long as you go in with your eyes open, you will be OK.

Related Solutions

ERD Review Please

Usually, events are not infinite. There's more than one way to represent this in your database: store either an endTime or a duration. You could make it optional if you don't want to force it on people, but if the event starts at 4 pm and someone shows up at 11 pm, they will be disappointed that they missed the event and that your app did not make it clear the event ended at 9 pm. You might become the subject of their criticism (whether or not it's rational is a different matter).
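As a sketch of the optional end time, written in SQL Server syntax and assuming a hypothetical dbo.events table that already has a start_time column (neither name is confirmed by the diagram):

```sql
-- Assumes a hypothetical dbo.events table with a start_time column.
ALTER TABLE dbo.events
    ADD end_time DATETIME2(0) NULL;  -- NULL means no end time was given
GO  -- batch separator used by SSMS and sqlcmd

-- Reject events that would end before they start.
ALTER TABLE dbo.events
    ADD CONSTRAINT CK_events_end_after_start
        CHECK (end_time IS NULL OR end_time > start_time);
```

Storing a duration instead would work equally well; an end time just makes overlap checks a direct comparison.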

You might also want to add a notes field that's just some text that anyone can add for special info for the event.

Also, this design does not allow for One Event Can Host Many Book Authors. From your diagram, it looks like one event can have exactly one author. You might need a new table such as this:

event_authors
-------------
   event_id
   author_id

If you make this table have a composite key of event_id and author_id, it will allow you to have multiple records with the same event and different authors. Of course, you'll have to remove the author_id reference from events.
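A minimal sketch of that table in SQL Server syntax, assuming hypothetical parent tables dbo.events(event_id) and dbo.authors(author_id) with integer keys:

```sql
-- Parent table and column names are assumptions based on the outline above.
CREATE TABLE dbo.event_authors
(
    event_id  INT NOT NULL REFERENCES dbo.events (event_id),
    author_id INT NOT NULL REFERENCES dbo.authors (author_id),
    -- The composite key allows many authors per event, but each pair only once.
    CONSTRAINT PK_event_authors PRIMARY KEY (event_id, author_id)
);
```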

As for One Author Can Only Attend One Event at a Time, this one is a little trickier. Your current schema does not enforce this, but I suppose you could do it in your application. Or you could have an on-insert trigger on event_authors that first checks that the time span of the new event does not overlap any other event already assigned to the same author.
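The trigger idea could look roughly like this in SQL Server (2012 or later, because of THROW). It assumes hypothetical tables dbo.events(event_id, start_time, end_time) and dbo.event_authors(event_id, author_id), with end_time always filled in; none of these column names are confirmed by the diagram.

```sql
-- All table and column names are assumptions for illustration.
CREATE TRIGGER dbo.trg_event_authors_no_overlap
ON dbo.event_authors
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    IF EXISTS
    (
        SELECT 1
        FROM inserted AS i
        JOIN dbo.events AS new_ev
            ON new_ev.event_id = i.event_id
        JOIN dbo.event_authors AS ea
            ON ea.author_id = i.author_id
           AND ea.event_id <> i.event_id
        JOIN dbo.events AS old_ev
            ON old_ev.event_id = ea.event_id
        -- Two time spans overlap when each starts before the other ends.
        WHERE new_ev.start_time < old_ev.end_time
          AND old_ev.start_time < new_ev.end_time
    )
    BEGIN
        ROLLBACK TRANSACTION;
        THROW 50001, 'Author is already booked for an overlapping event.', 1;
    END
END;
```

This only guards inserts into event_authors. Changing an event's times later could still create an overlap, so a similar check would be needed wherever those times are updated.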

SQL Server 2008 – Table Design for Tracking Significant Events

Always normalize unless you have a very good reason not to. It should not be a coin flip.

Neither of your proposed designs seems optimal to me for a variable / unknown number of events, since when "that number may increase or decrease," in both cases, you have to change the schema and the code.

I also don't believe that you have to base your core table design around the format of the input files. This is why we have staging tables, ETL processes, SSIS, C#, etc. There are many ways to transform data from flat files into a format that is more appropriate for a relational database. Don't design your database because of what your flat files look like.

So you could have staging tables that look like the input files, but then you would load the data into real tables that look perhaps like this:

CREATE TABLE dbo.Events -- lookup table describing events
(
  EventID INT PRIMARY KEY,
  Description VARCHAR(32) NOT NULL UNIQUE
);

CREATE TABLE dbo.OrderEventLog -- actual event data
(
  EventID INT FOREIGN KEY REFERENCES dbo.Events(EventID),
  OrderID INT FOREIGN KEY REFERENCES dbo.Orders(OrderID), -- guess
  ExpectedDate DATETIME2(7),
  ActualDate DATETIME2(7)
);

Now when you have a new type of event, you just insert it into the dbo.Events table, and then you can insert related events into dbo.OrderEventLog without having to change the schema of either table, and without having to change the interface to your stored procedures or bulk inserts into the staging tables either. This also keeps the history table (and all of its indexes) nice and narrow, and allows you to build indexes that favor searches for specific types of events.
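For illustration, adding a new event type and logging an occurrence might look like this. The ID, description, order number and dates are invented, and the order is assumed to already exist in dbo.Orders:

```sql
-- Hypothetical values; only the table and column names come from the design above.
INSERT INTO dbo.Events (EventID, Description)
VALUES (4, 'Delivered');

INSERT INTO dbo.OrderEventLog (EventID, OrderID, ExpectedDate, ActualDate)
VALUES (4, 1001, '2024-03-01', '2024-03-02');

-- One possible index that favors searches for a specific type of event.
CREATE INDEX IX_OrderEventLog_EventID_ActualDate
    ON dbo.OrderEventLog (EventID, ActualDate)
    INCLUDE (OrderID, ExpectedDate);
```

No ALTER TABLE is involved at any point, which is the main benefit of this layout.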

You can always flatten the data later - which is typically what you do when you get to a point where you are building a data warehouse around the data.
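One common way to flatten is conditional aggregation, which turns rows back into one column per event. A sketch assuming hypothetical event IDs 1 and 2:

```sql
-- Event IDs 1 and 2 and the output column names are assumptions.
SELECT
    l.OrderID,
    MAX(CASE WHEN l.EventID = 1 THEN l.ExpectedDate END) AS Event1Expected,
    MAX(CASE WHEN l.EventID = 1 THEN l.ActualDate   END) AS Event1Actual,
    MAX(CASE WHEN l.EventID = 2 THEN l.ExpectedDate END) AS Event2Expected,
    MAX(CASE WHEN l.EventID = 2 THEN l.ActualDate   END) AS Event2Actual
FROM dbo.OrderEventLog AS l
GROUP BY l.OrderID;
```

The result has one row per order, much like the flat input files, while the stored data stays narrow and normalized.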
