Postgresql – How to handle a single `reference` to multiple different types of data

postgresqlschema

I host a file sharing site. The idea is to take a short string (e.g. abcd) and do one of the following:

  • serve an image
  • display some text
  • redirect to another url

Historically, I've hosted this site using non-relational databases, where I'd store it as something like:

{
  short: "abcd",
  type: "url", // or image or text
  // type specific data, e.g. for a url
  url: "https://google.com"
  // or for a file
  store: "/cool_store",
  filesize: 10,
  etc.
}

I've redesigned the platform and decided to use SQLite / Postgres instead. I'm not that used to schema design and my first design of (not actual code, heavily simplified):

CREATE TABLE Items (
    short varchar(64),
    store varchar(255),
    filesize int
);

Which worked great for the image hosting side of things. Now I want to expand to the other types and can't decide the best approach. My two main ideas are:

  1. Replace the singular table with one 'reference' table and separate 'item' tables for each different type, e.g. References { id: 'abcd', type: 'url', ref: 1 } (where the ref represents the primary key of a different table). Items { id: 1, store: "/cool_store", filesize: 10 }. Urls { id: 2, url: "https://google.com" }, etc.
  2. Store all the properties required on a single table with a large proportion of the columns being null. In total we require ~50 columns, with ~30 being null at any point in time.

The database is exceptionally read-heavy, maybe 99.99% reads.

Best Answer

You should group the fields closest related to each other in the same Table and create additional Tables for the remaining fields as they relate to each other as well. This is called database normalization and is effective in reducing data redundancy, improving maintainability, and helps optimize performance. Your additional Tables (kind of reference Tables such as your first option describes) will store the primary key(s) of the Table(s) they relate to. This is known as a foreign key relationship. Your second design option is called denormalization and is not recommended in most cases (there are some specific cases where some denormalization is helpful).

I don't know your data so I'll use an example for a Car Dealer to convey the idea of normalization. Image a Car Dealer business who sells different brands of Cars across multiple Locations (notice how my objects are already taking shape of the Tables I need). So I'd expect their to be a Locations table, with columns regarding each Location, such as its Address, PhoneNumber and maybe ManagerName, in addition to its primary key column LocationId. I'd also expect a Cars table with columns like Manufacturer, Model, ManufacturedYear, with a primary key column of CarId.

Cool, so far so good. But what if we wanted to know more information about the Managers of each Location and more information about the Manufacturers of each Car. To store that in the database, if we added those fields to the Locations and Cars tables respectively would be bad design because that's denormalized. Information about a Manager object doesn't make a lot of sense in a Locations table, same for information about a Manufacturer in the Car Dealership's Cars table. So we should create two more tables, Managers and Manufacturers.

Managers would now store the ManagerName field instead of the Locations table. (Realistically you'd like store ManagerName in two fields, one for FirstName and one for LastName but this is just a basic example, so bear with me :). Managers would also store things like the Manager's Address and PhoneNunber, and maybe even their YearlySalary, all things that wouldn't make sense in a Locations table. Finally, the primary key would be ManagerId which is what the Locations table would store as a foreign key reference to the Managers table, and this is how you relate the two tables.

Same design as above for the Manufacturers table. It would have columns like ManufacturerName, Address, PhoneNumber, maybe a field called NumberOfCarModels etc, and finally a primary key of ManufacturerId which would be a foreign key in the Cars table.

This is loosely a first level of normalization and hopefully conveys how to structure things. Essentially every main object gets it's own table. You can actually further normalize this (though not always necessary) by having more generic tables for your objects. For example a Contacts table could store all the Addresses and PhoneNumbers of the Locations, Managers, and Manufacturers tables in one place (with a column to denote the ContactType) instead of on each individual table. And the Managers table would likely be an Employees table with a field IsManager to denote which ones are Managers. This next level of normalization reduces redundancy and lightens the other tables even more so (but again it isn't a necessity and just depends on your situation).

Ideally, the performance benefits of normalization come from the fact that when you query a table, you're not getting back 30 extra columns of data you don't need and take longer to load off disk, etc. By designing your tables centralized around your primary objects, and storing only what's truly relevant to those objects in those tables, breaking out the additional fields into their own object tables, you will maximize the performance of querying those tables.