Database Design: New Table versus New Columns

database-design

(It was suggested that I repost this here from StackOverflow)

I currently have a table .. and need to start adding new data columns to it.
Not every record (even going forward, after the new columns are added) will have data for them. So I am wondering whether this is better suited to a new table, since it is really an extension of some of the rows and not applicable to every row.

In other words, since there will be a lot of unused columns for those new data elements, it seems like this would be more suited to a new table?

The existing table is a record of page views (currently 2 million records):

- id
- IP address
- times viewed
- created_at timestamp
- date

For every IP address, one record is created per day, and subsequent page views that day are added to the times-viewed count.
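In case it helps, the table looks roughly like this (a simplified sketch; the real column names and types differ slightly):

```sql
CREATE TABLE page_views (
    id           INTEGER PRIMARY KEY,
    ip_address   VARCHAR(45) NOT NULL,    -- long enough for IPv6 text form
    times_viewed INTEGER NOT NULL DEFAULT 1,
    created_at   TIMESTAMP NOT NULL,
    view_date    DATE NOT NULL,
    UNIQUE (ip_address, view_date)        -- one row per IP per day
);
```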

The additional field(s) would be for point-of-origin tracking (i.e. Google Analytics source/medium/campaign).

Not every visit will have that info.
I would assume about 10% of the rows will have the data (as it is usually only attributed on the first visit).

The main use for the data would be to attribute where people came from. It may wind up being queried more frequently than expected (which then seems to lend itself to the single-table approach).

Appreciate the feedback – can add more if needed

Best Answer

What you are wrestling with is vertical partitioning. It is a physical database design technique for improving performance. As with any physical design technique, its applicability depends on the specific queries you are trying to optimize and on whether this technique actually optimizes them. From a logical standpoint, if these new fields depend on the candidate key of your entity, then they are facts about that entity and belong with it. So first make sure you fully understand the functional dependence of the new fields on your candidate keys, to verify that they really are facts about daily page views. If they are, splitting them into another table is purely a performance optimization, and it should only be done if it achieves your performance goals.

In general, vertical partitioning is useful if you will query these new columns infrequently and separately from the other columns in the original table. By placing them in another table that shares the same PK as your existing table, you can query that table directly when you want only the new columns and get much greater throughput: its rows are narrow because none of the original table's columns sit on them, so many more rows fit on each page on disk. However, if you will always query these columns along with the columns in the original table, then a vertical partition doesn't make much sense, because you will always have to outer join to get them. Pages from tables on disk come into the DBMS buffer pool independently, never pre-joined, so that join has to happen on every query execution even if the data is pinned in the buffer pool. In that scenario, making them NULLABLE columns on the original table lets the storage engine store them efficiently when NULL and eliminates the need to join on retrieval.
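As a rough sketch (the table and column names here are my own invention based on the question, not a prescription), the vertically partitioned design and its queries could look like this:

```sql
-- Attribution data kept in its own table, sharing the page-view PK.
-- Only ~10% of page views would get a row here, so the table stays small
-- and its narrow rows pack many per page on disk.
CREATE TABLE page_view_attribution (
    page_view_id INTEGER PRIMARY KEY REFERENCES page_views (id),
    source       VARCHAR(255),
    medium       VARCHAR(255),
    campaign     VARCHAR(255)
);

-- Attribution can be queried on its own, without touching page_views:
SELECT source, medium, campaign, COUNT(*) AS visits
FROM page_view_attribution
GROUP BY source, medium, campaign;

-- But any query that also needs page-view columns pays for an outer join:
SELECT pv.ip_address, pv.view_date, pv.times_viewed,
       a.source, a.medium, a.campaign
FROM page_views pv
LEFT JOIN page_view_attribution a ON a.page_view_id = pv.id;
```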

It sounds to me like your use case is the latter, and adding the new fields as NULLABLE columns on your original table is the way to go. But as with everything else in database design, it depends, and to make the right decision you need to know your expected workload. One good example of a proper use case for vertical partitioning is a person search panel, where your application holds some very rarely populated information about a person that someone might want to search on but rarely does. If you put that information into a different table, you have some good options for performance. You can write the search as two queries: one that searches only on the main, always-populated information (like last name or SSN), and one that joins in the rarely populated information only when it is actually part of the search criteria. Or you can rely on the DBMS optimizer, if it is smart enough to recognize for a given set of host variables that the outer join is not needed and to skip it, so you only have to write one query.
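To make that concrete with assumed names (the page_views columns as before, and a hypothetical person / person_rare_details pair for the search example), the single-table option and the two-query pattern might look something like this:

```sql
-- Single-table alternative: nullable attribution columns on page_views
-- (ADD COLUMN syntax varies slightly between DBMS platforms).
ALTER TABLE page_views ADD COLUMN source   VARCHAR(255) NULL;
ALTER TABLE page_views ADD COLUMN medium   VARCHAR(255) NULL;
ALTER TABLE page_views ADD COLUMN campaign VARCHAR(255) NULL;

-- Attribution reporting then needs no join at all:
SELECT source, medium, campaign, COUNT(*) AS visits
FROM page_views
WHERE source IS NOT NULL
GROUP BY source, medium, campaign;

-- Two-query pattern from the person-search example:
-- query 1 uses only the always-populated columns...
SELECT person_id, last_name, ssn
FROM person
WHERE last_name = :last_name;

-- ...and query 2 runs only when a rarely populated criterion is supplied
-- (a plain join suffices here, since rows lacking that data cannot match).
SELECT p.person_id, p.last_name, p.ssn
FROM person p
JOIN person_rare_details d ON d.person_id = p.person_id
WHERE d.security_clearance = :security_clearance;
```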

What DBMS platform are you using? How the platform stores NULL columns and optimizes your queries, as well as whether it supports sparse columns (SQL Server does), will affect the decision. Ultimately I would recommend trying out both designs in a test environment with production-sized data and a realistic workload, and seeing which better achieves your performance objectives.
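For example, if you are on SQL Server, the sparse-column route might look like the following sketch (worth verifying against your version's documentation):

```sql
-- SQL Server: SPARSE columns store NULLs at no space cost, in exchange for
-- slightly more expensive non-NULL storage; a reasonable fit when only
-- ~10% of rows will ever have values.
ALTER TABLE page_views ADD
    source   VARCHAR(255) SPARSE NULL,
    medium   VARCHAR(255) SPARSE NULL,
    campaign VARCHAR(255) SPARSE NULL;
```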