SQL Server – Database Design for Preserving Values and Additional Columns

database-design, eav, sql server

My colleague and I are having some trouble coming up with a database design, and I am doing my best to stay away from EAV for now.

We have an entity called Preparation that stores configuration data for machines. There are two types of processes per machine, let's call these process1 and process2, and a machine can only have one process. Both processes share a lot of attributes so setting up one table seemed logical.

However, process2 has a lot of extra data and is even subdivided into process2a and process2b, each with attributes specific to that subcategory (process1 has only one extra attribute for now). A single table would therefore have NULLs in the columns that do not apply to a row's process.

Splitting the table for each process was another option. However, we need to preserve every data change, and splitting the tables will bring more work (almost all transactions will be INSERTs with a timestamp into a history table, followed by an UPDATE on the live table if they are updating a record). For example, if a machine changes from process1 to process2, then a record will have to be inserted into the history table from process1, deleted from process1, and then inserted into process2. We are using the Standard edition of SQL Server, so no CDC to help with logging changes.

I am trying to keep this fairly relational, and I would hate to create one table with expected NULLs (maybe create SPARSE columns if the NULL percentage is high?).

I am hoping someone has experience with this type of situation. Thank you.

Best Answer

Treat the various processes as sub-types. There will then be an entity process_base, which contains all the common attributes, plus entities process1, process2, process2a, etc. for the process-specific attributes.

Implement each of these as its own table. A view that combines them all may simplify usage:

create view process_all as
select <whatever>
from process_base
inner join process1 <etc>

union all

select <whatever>
from process_base
inner join process2 <etc>
...

This way you minimise the NULL columns (if that is desirable to you) but maintain the unity of "process" as a single idea.
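As a rough illustration of the sub-type layout, here is a minimal sketch. It runs the SQL through Python's built-in sqlite3 module purely so that it is self-contained; the DDL would need adjusting for SQL Server. The column names (machine_id, process_type, shared_attr, p1_extra, p2_extra) are hypothetical stand-ins, not taken from the question.

```python
import sqlite3

# Hypothetical schema: column names are illustrative, not from the question.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE process_base (
    machine_id   INTEGER PRIMARY KEY,   -- a machine has exactly one process
    process_type TEXT NOT NULL CHECK (process_type IN ('process1', 'process2')),
    shared_attr  TEXT NOT NULL          -- stands in for the common attributes
);
-- Each sub-type table shares the base table's key and holds only its own columns.
CREATE TABLE process1 (
    machine_id INTEGER PRIMARY KEY REFERENCES process_base (machine_id),
    p1_extra   TEXT NOT NULL
);
CREATE TABLE process2 (
    machine_id INTEGER PRIMARY KEY REFERENCES process_base (machine_id),
    p2_extra   TEXT NOT NULL
);
-- The view reassembles the single "process" idea; NULLs appear only here.
CREATE VIEW process_all AS
SELECT b.machine_id, b.process_type, b.shared_attr,
       p1.p1_extra, NULL AS p2_extra
FROM process_base AS b
INNER JOIN process1 AS p1 ON p1.machine_id = b.machine_id
UNION ALL
SELECT b.machine_id, b.process_type, b.shared_attr,
       NULL AS p1_extra, p2.p2_extra
FROM process_base AS b
INNER JOIN process2 AS p2 ON p2.machine_id = b.machine_id;
""")

conn.execute("INSERT INTO process_base VALUES (1, 'process1', 'common-a')")
conn.execute("INSERT INTO process1 VALUES (1, 'only-for-p1')")
conn.execute("INSERT INTO process_base VALUES (2, 'process2', 'common-b')")
conn.execute("INSERT INTO process2 VALUES (2, 'only-for-p2')")

rows = conn.execute("SELECT * FROM process_all ORDER BY machine_id").fetchall()
for row in rows:
    print(row)
```

The stored tables carry no inapplicable columns; the NULLs exist only in the view's output. A process2a or process2b table could hang off process2 in the same way, keyed on the same machine_id.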

Related Solutions

Database Design: New Table versus New Columns

What you are wrestling with is vertical partitioning. This is a physical database design technique to improve performance. As with any physical database design technique, its applicability depends on the specific queries you are trying to optimize and if this technique will optimize them. From a logical standpoint, if these new fields depend upon the candidate key for your entity then they are facts about it that belong with it. First you should make sure you fully understand the functional dependence of these new fields on your candidate keys to verify they really are facts about daily page views. If they are, deciding to partition them into another table is a performance optimization that should only be done if it achieves your performance goals.

In general, vertical partitioning is useful if you will query these new columns infrequently and separately from the other columns in the original table. By placing those columns in another table that shares the same PK as your existing table, you can query that table directly when you want the new columns and get much greater throughput: the new table fits many more rows per page on disk, because the columns from the original table are not stored in those rows. However, if you will always query these columns along with the columns in the original table, then a vertical partition wouldn't make much sense, as you will always have to outer join to get them. Pages from tables on disk come into the buffer pool of a DBMS independently, never pre-joined, so that join will have to happen with every query execution even if the data is pinned in the buffer pool. In this scenario, making them NULLABLE columns on the original table would enable the DBMS storage engine to store them efficiently when NULL and eliminate the need to join on retrieval.

It sounds to me like your use case is the latter and adding them as NULLABLE to your original table is the way to go. But as with everything else in database design, it depends, and in order to make the right decision you need to know your expected workload and the factors a good choice depends on. One good example of a proper use case for vertical partitioning would be a person search panel, where your application has some very rarely populated information about a person that someone might want to search on but rarely does. If you put that information into a different table, you have some good options for performance. You can write the search so that you have two queries: one that uses only the main, always populated information to search (like last name or SSN), and one that outer joins the very infrequently populated information only when it is requested for search. Or you could take advantage of the DBMS optimizer, if it is smart enough to recognize for a given set of host variables that the outer join is not needed and skip it, so that you only have to create one query.
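The two-query version of the person search might look roughly like the sketch below. It uses Python's sqlite3 module only to stay self-contained; the table names (person, person_rare), the nickname column, and the search function are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    last_name TEXT NOT NULL
);
-- Rarely populated attributes, split out under the same primary key.
CREATE TABLE person_rare (
    person_id INTEGER PRIMARY KEY REFERENCES person (person_id),
    nickname  TEXT
);
INSERT INTO person VALUES (1, 'Smith'), (2, 'Smith'), (3, 'Jones');
INSERT INTO person_rare VALUES (2, 'Smitty');
""")


def search(last_name, nickname=None):
    """Touch person_rare only when the rare criterion is actually supplied."""
    if nickname is None:
        # Common case: main table only, no join at all.
        sql = "SELECT person_id FROM person WHERE last_name = ? ORDER BY person_id"
        params = (last_name,)
    else:
        # Rare case: outer join to the side table. The filter on r.nickname
        # discards people who have no row in person_rare.
        sql = """SELECT p.person_id
                 FROM person AS p
                 LEFT JOIN person_rare AS r ON r.person_id = p.person_id
                 WHERE p.last_name = ? AND r.nickname = ?
                 ORDER BY p.person_id"""
        params = (last_name, nickname)
    return [row[0] for row in conn.execute(sql, params)]


common = search("Smith")            # main-table query only
rare = search("Smith", "Smitty")    # joins the side table
print(common, rare)
```

Whether the split pays off still depends on how often the rare path runs and on how the platform stores NULLs, so this is something to measure rather than assume.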

What DBMS platform are you using? The way in which the platform handles NULL column storage, optimizes your query, as well as the availability of sparse column support (SQL Server has this) will impact the decision. Ultimately I would recommend trying out both designs in a test environment with production sized data and workload and seeing which better achieves your performance objectives.

Database Design for updatable sequential records

Your design seems reasonable to me. While you do have to update all subsequent records when processes are added or deleted, that is easy to accomplish. For an insert, you just issue an update like:

UPDATE ProcessOrder
SET ProcessOrder = ProcessOrder+1
WHERE ProcessOrder >= [step# where you want to insert]

and then do your insert. For a delete, remove the row first and then decrement (ProcessOrder - 1) the rows that came after it.
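A minimal sketch of the renumbering approach, run through Python's sqlite3 module so it is self-contained. The table name ProcessStep, the step_name column, and the helper functions are hypothetical; only the ProcessOrder column comes from the statement above. Wrapping each shift and its insert or delete in one transaction is an assumption about how you would want to keep the sequence consistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: step_name is illustrative; ProcessOrder is the position.
conn.execute(
    "CREATE TABLE ProcessStep (step_name TEXT NOT NULL, ProcessOrder INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO ProcessStep VALUES (?, ?)", [("a", 1), ("b", 2), ("c", 3)]
)


def insert_step(name, position):
    with conn:  # one transaction: commits on success, rolls back on error
        # Open a gap by shifting this position and everything after it.
        conn.execute(
            "UPDATE ProcessStep SET ProcessOrder = ProcessOrder + 1 "
            "WHERE ProcessOrder >= ?",
            (position,),
        )
        conn.execute("INSERT INTO ProcessStep VALUES (?, ?)", (name, position))


def delete_step(position):
    with conn:
        conn.execute("DELETE FROM ProcessStep WHERE ProcessOrder = ?", (position,))
        # Close the gap by shifting the later steps back down.
        conn.execute(
            "UPDATE ProcessStep SET ProcessOrder = ProcessOrder - 1 "
            "WHERE ProcessOrder > ?",
            (position,),
        )


def ordered_steps():
    sql = "SELECT step_name, ProcessOrder FROM ProcessStep ORDER BY ProcessOrder"
    return conn.execute(sql).fetchall()


insert_step("x", 2)
after_insert = ordered_steps()
delete_step(1)
after_delete = ordered_steps()
print(after_insert, after_delete)
```

Reading the steps back is a plain ORDER BY, which is the main attraction of this design. If you add a unique constraint on ProcessOrder, check how your DBMS validates it during the shifting UPDATE.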

The only other way I can think of would be to design the schema to store the next process id on the row. Something like:

ProcessID | ParentProcessID | NextId
--------------------------------------------------
UUID2     | UUID1           | UUID3
UUID3     | UUID1           | UUID4
UUID4     | UUID1           | NULL

Then if you insert a new step - say between UUID3 and UUID4, you perform more of a linked list operation which will update UUID3|UUID1's NextId to UUID5 and then just insert the new UUID5 with a NextId of UUID4.

This will reduce the UPDATEs to 1 in most cases, but it will make querying the process more difficult as now you have to walk the list from top to bottom to list out step by step.
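To make the trade-off concrete, here is a sketch of the linked-list variant, again using Python's sqlite3 module only so it runs standalone. The table name Process and the helpers insert_after and walk are hypothetical; the columns and sample rows mirror the table above. Walking the list is done here with a recursive CTE, which is one possible way to do it, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Process ("
    "ProcessID TEXT PRIMARY KEY, ParentProcessID TEXT NOT NULL, NextId TEXT)"
)
conn.executemany(
    "INSERT INTO Process VALUES (?, ?, ?)",
    [("UUID2", "UUID1", "UUID3"), ("UUID3", "UUID1", "UUID4"), ("UUID4", "UUID1", None)],
)


def insert_after(prev_id, new_id):
    """Linked-list insert: one INSERT plus one UPDATE, whatever the list length."""
    with conn:
        parent, old_next = conn.execute(
            "SELECT ParentProcessID, NextId FROM Process WHERE ProcessID = ?",
            (prev_id,),
        ).fetchone()
        # The new row takes over the predecessor's old successor...
        conn.execute("INSERT INTO Process VALUES (?, ?, ?)", (new_id, parent, old_next))
        # ...and the predecessor now points at the new row.
        conn.execute("UPDATE Process SET NextId = ? WHERE ProcessID = ?", (new_id, prev_id))


def walk(parent_id):
    """List the steps in order by following NextId from the head."""
    sql = """
    WITH RECURSIVE chain (ProcessID, NextId, depth) AS (
        -- Head of the list: the row that no other row points to.
        SELECT p.ProcessID, p.NextId, 0
        FROM Process AS p
        WHERE p.ParentProcessID = ?
          AND NOT EXISTS (SELECT 1 FROM Process AS q WHERE q.NextId = p.ProcessID)
        UNION ALL
        SELECT n.ProcessID, n.NextId, c.depth + 1
        FROM chain AS c
        JOIN Process AS n ON n.ProcessID = c.NextId
    )
    SELECT ProcessID FROM chain ORDER BY depth
    """
    return [row[0] for row in conn.execute(sql, (parent_id,))]


insert_after("UUID3", "UUID5")
steps = walk("UUID1")
print(steps)
```

The insert touches two rows no matter how long the list is, while listing the steps needs the recursive walk instead of a simple ORDER BY.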

You need to decide which process you want to favor - inserting and updating or retrieving. If you favor retrieval (which you might if changes are infrequent and reporting is frequent, and the lists are short), then go with your original design. If you favor insert and update (which you might if changes are happening all the time and reporting is infrequent, or lists are really really long), then go with the linked list approach.

I hope this helps. Interested in what other solutions the community might come up with as I'd love to broaden my knowledge around this!
