Postgresql – Array column vs another table when the relationship is quite small

array normalization postgresql

The Data

I am creating a table of cities and their zip codes for the United States. Some zip codes have a list of acceptable cities, as designated by the USPS. For example, the zip code 36613 is associated with Eight Mile, AL, but the University of Mobile is in 36613 and its address is in Mobile, AL. Looking at the USPS data, I see that Mobile is on the acceptable cities list for zip code 36613.

The Question

Now, I have a few routes that I can take:

1. Create a new table called acceptable_city with a FK/PK relationship on city.
2. Create a column in city called acceptable_city that is defined as an array (acceptable_city text[]), and create an index on this column.

So, in a case like this which option is more advantageous? What are the pros and cons of each approach?
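
For concreteness, the two routes might look roughly like the sketch below. The layout of city (keyed on zip, with name and state columns) and the index name are assumptions made for illustration, not something fixed by the USPS data. The two options are alternatives; they are shown side by side only for comparison.

```sql
-- Hypothetical base table: one row per zip code with its primary city.
CREATE TABLE IF NOT EXISTS city (
    zip   text PRIMARY KEY,
    name  text NOT NULL,
    state char(2) NOT NULL
);

-- Option 1: a separate table, one row per acceptable city for a zip code.
CREATE TABLE IF NOT EXISTS acceptable_city (
    zip  text NOT NULL REFERENCES city (zip),
    name text NOT NULL,
    PRIMARY KEY (zip, name)
);

-- Option 2: an array column on city, with a GIN index for containment searches.
ALTER TABLE city ADD COLUMN acceptable_city text[];
CREATE INDEX city_acceptable_city_idx ON city USING gin (acceptable_city);

-- Finding zip codes that accept 'Mobile' under each option.
SELECT c.zip, c.name, c.state
FROM   city c
JOIN   acceptable_city a USING (zip)
WHERE  a.name = 'Mobile';

SELECT zip, name, state
FROM   city
WHERE  acceptable_city @> ARRAY['Mobile'];
```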

Best Answer

There is no clear answer to that. Both variants have pros and cons. Many here will advise properly normalizing the data with a separate table.

You might get the best of both worlds with a MATERIALIZED VIEW:
Properly normalize your data model, and offer an aggregated view on the data in the materialized view.
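
A minimal sketch of that idea, assuming a hypothetical city table keyed on zip and a child table acceptable_city (all table, column, and view names here are illustrative). Materialized views need PostgreSQL 9.3 or later, and the FILTER clause needs 9.4 or later.

```sql
-- Normalized tables (hypothetical layout).
CREATE TABLE IF NOT EXISTS city (
    zip   text PRIMARY KEY,
    name  text NOT NULL,
    state char(2) NOT NULL
);

CREATE TABLE IF NOT EXISTS acceptable_city (
    zip  text NOT NULL REFERENCES city (zip),
    name text NOT NULL,
    PRIMARY KEY (zip, name)
);

-- Aggregated, read-optimized view: one row per zip with its acceptable cities as an array.
-- FILTER drops the NULL produced by the LEFT JOIN when a zip has no acceptable cities.
CREATE MATERIALIZED VIEW city_with_acceptable AS
SELECT c.zip,
       c.name,
       c.state,
       array_agg(a.name ORDER BY a.name) FILTER (WHERE a.name IS NOT NULL) AS acceptable_cities
FROM   city c
LEFT   JOIN acceptable_city a USING (zip)
GROUP  BY c.zip;

-- A unique index speeds up lookups by zip and is also a prerequisite for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX city_with_acceptable_zip_idx ON city_with_acceptable (zip);

-- The view is a snapshot; refresh it after the base tables change.
REFRESH MATERIALIZED VIEW city_with_acceptable;
```

The trade-off is that the view is only as fresh as its last refresh, which tends to be acceptable for slowly changing data such as zip code assignments.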

Related Solutions

Sql-server – What are the pros and cons of using the customer codes as a primary key

Pros - It is the natural key, it makes sense and it will likely be searched on, I presume?

Cons - The default behavior (which is totally changeable) is for a primary key to be the clustered index. An alphanumeric column doesn't make the best candidate, because inserts can cause page splits: the values aren't ever-increasing like those of an identity column. An int identity column takes less space (4 bytes) than the character data (40+ bytes for Unicode). A wide clustered key also makes your other indexes larger, since the clustered key is part of them. If you ever change how you identify your customers and make customer codes, this all breaks; going with a surrogate key insulates you from those kinds of changes.

In this situation, I tend to optimize for insert performance and, more often than not, go with an identity column as both the clustered key and the primary key. I really like integer clustered indexes. (I know your question was about the primary key, not the clustered index. You could still choose some other column as the clustered index and make this your primary key, or you could put a unique constraint on it and treat it as a natural key without making it your primary key.)

I would at the very least index this with a unique constraint and treat it like a natural key. I just don't know if you really need to make it your primary key.
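
As a rough T-SQL sketch of that arrangement: an identity surrogate as the clustered primary key, with the customer code kept unique as a natural key. The table name, column names, and sizes are hypothetical.

```sql
-- Hypothetical customer table; names and column sizes are assumptions.
CREATE TABLE dbo.Customer (
    CustomerId   int IDENTITY(1,1) NOT NULL,  -- 4-byte, ever-increasing surrogate key
    CustomerCode nvarchar(20)      NOT NULL,  -- natural key, up to 40 bytes of Unicode data
    Name         nvarchar(100)     NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerId),
    CONSTRAINT UQ_Customer_CustomerCode UNIQUE NONCLUSTERED (CustomerCode)
);
```

Searches by customer code can still use the unique index, while other nonclustered indexes carry only the narrow integer key.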

Kimberly Tripp is a trusted resource who has a lot to say about primary keys and (more so) clustered keys on her blog - https://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/

This is all just my opinion - YMMV.

Postgresql – Postgres 9.1 query plan not using check_constraints on partitions after large data load

A couple of comments, given the answer here. First, PostgreSQL currently does not check the functions in the WHERE clause to see whether all equality constraints can be met properly on all tables. So you often have to specify the condition for the partition manually in the WHERE clause.

A major point I would make is that, if you are not doing frequent bulk operations, partitioning may not be a net win here. Partial indexes may get you what you want in a more flexible manner. For example, you could have a series of partial indexes:

CREATE INDEX markets_0_idx ON markets (id) WHERE zip BETWEEN '00000' AND '09999';

and so forth. However, for a single zip-code lookup you are probably best off with just an index on zip. Premature optimization is the root of all evil.