PostgreSQL – SQL Database Design: Balance between table scans and row duplication (Address example)

database-design, postgresql

This question is easily illustrated with an example. Suppose I have an Address table, with the columns: address_id, address_line_1,…,address_line_4, locality, region, postcode, country.

This table will be referenced by:

  • Users (home, default billing, default shipping etc.)
  • Venues
  • Orders (billing, shipping)
  • Bookings

Any Address referenced more than once will not be updatable/deletable. For updates in this case, a new Address will be created and linked to as necessary.
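
For concreteness, here is a rough sketch of what I have in mind (the serial keys, types and NOT NULL choices are just placeholders, and only the Users table is shown as a referencer):

    CREATE TABLE address (
        address_id     bigserial PRIMARY KEY,
        address_line_1 text NOT NULL,
        address_line_2 text,
        address_line_3 text,
        address_line_4 text,
        locality       text NOT NULL,
        region         text,
        postcode       text NOT NULL,
        country        text NOT NULL
    );

    -- One of the referencing tables; Venues, Orders and Bookings are analogous.
    CREATE TABLE app_user (
        user_id                     bigserial PRIMARY KEY,
        home_address_id             bigint REFERENCES address,
        default_billing_address_id  bigint REFERENCES address,
        default_shipping_address_id bigint REFERENCES address
    );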

I can think of three reasonable ways of storing information in the Address table:

  1. Addresses are unique. Every time a new Address is created (e.g., by a User), the entire table must be scanned to see if the Address already exists. I imagine this solution having an index supporting address_id, and an index supporting the actual Address columns (address_line_1,…,country).
  2. Addresses are unique at the level of the Users or Venues. If a new Address is added for a User or Venue, only the Addresses for that User or Venue need to be checked before adding another. Orders/Bookings reference these.
  3. Addresses are not shared at all (i.e., any row in the Address table has precisely one Owner; a User or an Order, etc.). Every time there is a new Order, the appropriate User Addresses are duplicated in the Address table and referenced by the Order.

I don't really like the third solution, but choosing between the other two is harder.

Questions

  • Is it practical, at scale, to scan an entire table to see if an Address exists before adding it (Option 1)?
  • Is an index supporting a combination of eight columns reasonable (Option 1)?
  • Or is there value in striking a balance between row uniqueness and the amount of data to be scanned (as in Option 2)?
  • Is there a point at which one option becomes better than the other?

I would appreciate any advice.

Best Answer

Ensuring that addresses are unique is not a trivial problem. You have to assume that people enter their addresses in slightly different ways, abbreviating some parts sometimes, or omitting optional information. Just checking for exact duplicates would probably fail pretty often.

You should consider whether trying to keep addresses unique is worth the effort and complexity.

Addresses of users and addresses of orders are pretty different concepts. A user can change their current address. An order should never be retroactively changed in that way; the address on an order should always remain the address it was actually shipped to.

You could consider making addresses immutable. Changing an address would always mean creating a new address. This means you can safely point orders to the common address table, as the addresses will never change.

If a user changes their address, you simply add a new address and link it to the user. The old address can be hidden in the UI, soft-deleted or just shown as an address history. But all orders could still reference the old address because it never changes.
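
A sketch of what that might look like (the table and column names follow the sketch in the question, and the literal values are made up):

    -- "Changing" a user's home address: insert a brand-new address row and
    -- repoint the user's FK at it. The old address row is left untouched, so
    -- any orders that still reference it are unaffected.
    WITH new_addr AS (
        INSERT INTO address (address_line_1, locality, postcode, country)
        VALUES ('1 New Street', 'Springfield', 'SP1 2AB', 'GB')
        RETURNING address_id
    )
    UPDATE app_user
    SET    home_address_id = (SELECT address_id FROM new_addr)
    WHERE  user_id = 42;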

  • Is it practical, at scale, to scan an entire table to see if an Address exists before adding it?

That depends entirely on what you consider "at scale": it's not an issue with a few tens of thousands of rows, but it's completely impractical with a billion rows.
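
Note that with the composite index you describe in Option 1, the check itself is an index lookup rather than a true scan of every row; roughly this (parameter placeholders $1..$8 assumed):

    -- With a composite index over these columns, checking for an existing
    -- address is an index lookup, not a sequential scan of the whole table.
    -- Plain '=' never matches NULL, so the NULLable address lines need extra care.
    SELECT address_id
    FROM   address
    WHERE  address_line_1 = $1
    AND    address_line_2 = $2
    AND    address_line_3 = $3
    AND    address_line_4 = $4
    AND    locality       = $5
    AND    region         = $6
    AND    postcode       = $7
    AND    country        = $8;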

  • Is an index supporting a combination of eight columns reasonable (as in Option 1)?

B-tree index entries are limited to roughly 2700 bytes (about a third of a page). Indexing very large values is typically not a good idea anyway. In Postgres 10, which isn't released yet, you could consider a hash index for this: it stores a fixed-size hash of the value, so as far as I understand it should have an advantage for large columns.
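
A sketch of what that could look like once Postgres 10 is out (hash indexes cover a single key, so hashing one concatenated expression is my own workaround; the separator and coalesce details are assumptions, and a query has to repeat the exact same expression for the index to be used):

    -- Hash index on a single expression built from all the address columns.
    -- The index stores a fixed-size hash code, so long text values don't run
    -- into the B-tree entry size limit. Equality lookups only; no uniqueness.
    CREATE INDEX address_hash_idx ON address USING hash (
        (address_line_1 || '|' || coalesce(address_line_2, '') || '|' ||
         coalesce(address_line_3, '') || '|' || coalesce(address_line_4, '') || '|' ||
         locality || '|' || coalesce(region, '') || '|' || postcode || '|' || country)
    );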

But I really think you're asking the wrong question here. Your application should already know when an address is in the DB, because the user didn't enter it from scratch but, for example, selected it from the list of addresses already associated with their account.

If I understand your problem correctly, you don't actually need to enforce uniqueness in the DB at all. And if you did need to, you should use a unique constraint, which is automatically backed by an index.
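
For completeness, enforcing it would look something like this (column list taken from your question; note the NULL caveat in the comment):

    -- A unique constraint over all eight columns; PostgreSQL backs it with a
    -- unique B-tree index automatically, so the duplicate check is an index lookup.
    ALTER TABLE address
        ADD CONSTRAINT address_all_columns_unique
        UNIQUE (address_line_1, address_line_2, address_line_3, address_line_4,
                locality, region, postcode, country);

    -- Caveats: NULLs are never considered equal, so two rows identical except for a
    -- NULL address_line_2 are both accepted; and the ~2700-byte B-tree entry limit
    -- still applies if the combined values are very long.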