SQL Server – Choosing a Primary Key: VARCHAR(2) vs SMALLINT(2)

database-designsql server

Today we were discussing the following:

Brazil has 27 states and each one has its own abbreviation (just like USA).
So we've got RJ for Rio de Janeiro, SP for São Paulo, MG for Minas Gerais and so on.

One of our programmers proposed that we should use those abbreviations (RJ, SP, MG, etc.) as PK on the States table that we are planning to add to a new project.

Extrapolating the use of our database, I countered his argument saying that if we were to – someday – expand our services to other countries, we would have a problem with repeated abbreviations, for example: in the USA there is SC for South Carolina and on Brazil we've got SC for Santa Catarina; the same happens for MT, PA and MA. Counting on this we've agreed that there should be a ID column as PK IDENTITY.

Now, supposing that we do not expand our services to other countries and stay only in Brazil I started considering the idea of using a VARCHAR(2) column as PK. In this scenario it doesn't sound like a totally bad idea.
Is it? Why? In which cases this could be applied? Should memory be considered in order to choose from one to another?

Best Answer

In most cases I agree with @Aaron's approach, but this just happens to be one of the few instances of there being a truly natural key to use: ISO codes (ISO 3166 specifically). For those who are unfamiliar with ISO, they are, in their own words (from the main http://www.iso.org/ page):

We're ISO, the International Organization for Standardization. We develop and publish International Standards.

ISO 3166-1 describes the country codes. While there are a few choices of codes to use (2 character, 3 character, and numeric), the 2 character codes are the recommended choice and the most widely used (including for most country-based top-level domain names).

ISO 3166-2 describes each country's subdivisions (e.g. states). This standard is divided into country-based sections (e.g. ISO 3166-2:BR for Brazil) and the codes are 1, 2, or 3 alphanumeric characters.

So, you could do something like:

CREATE TABLE dbo.Country
(
  CountryCode CHAR(2) NOT NULL
                      COLLATE Latin1_General_100_BIN2
                      PRIMARY KEY,
  CountryName VARCHAR(50) NOT NULL
);

CREATE TABLE dbo.CountrySubdvision
(
  CountrySubdvisionCode VARCHAR(3) NOT NULL
                                   COLLATE Latin1_General_100_BIN2,
  CountryCode CHAR(2) NOT NULL
                      COLLATE Latin1_General_100_BIN2
                      CONSTRAINT [FK_CountrySubdvision_Country]
                                 FOREIGN KEY
                                 REFERENCES dbo.Country(CountryCode)
                                 ON UPDATE CASCADE,
  CountrySubdvisionName VARCHAR(50) NOT NULL,
  --  LocalizedSubdvisionName NVARCHAR(50) NOT NULL, -- ??
  CONSTRAINT [PK_CountrySubdvision]
             PRIMARY KEY (CountryCode, CountrySubdvisionCode)
);

The idea here is that you would place the CountryCode and CountrySubdvisionCode fields in any tables that needed the "state" value, and FK back to dbo.CountrySubdvision on (CountryCode, CountrySubdvisionCode).

While this method, using 3 - 5 bytes depending on the Country / Subdvision, is not as compact as using a 2-byte SMALLINT, it does have the advantage of placing human-readable / meaningful values in those related tables. This could easily reduce some number of JOINs (when you only need the 2-character code anyway) and can (at least slightly) reduce time spent debugging certain problems.

Please note that I specified a _BIN2 (i.e. binary) collation for both of the "Code" fields to assist performance. Even with an extra couple of bytes, that should be nearly as fast as comparing two SMALLINT values. The only downside there is that you need to be consistent in adding codes in all upper-case and people need to remember that they need to use all upper-case when filtering on these codes.

Please also note that I added the ON UPDATE CASCADE clause to the FK on dbo.CountrySubdvision to handle the off-chance that a CountryCode is changed by ISO (extremely unlikely). Along those same lines, the FKs created on tables where the CountryCode and CountrySubdvisionCode fields have been placed will also need to be set to ON UPDATE CASCADE. This will propagate changes made to CountryCode in dbo.Country to the tables that have both of the fields. This also will propagate any changes made to the CountrySubdvisionCode to those related tables. This is also highly unlikely to happen (especially for where most of our systems would be dealing with) though more likely than any changes being made to the CountryCode.


UPDATE

Now, supposing that we do not expand our services to other countries and stay only in Brazil I started considering the idea of using a VARCHAR(2) column as PK. In this scenario it doesn't sound like a totally bad idea.

Just to have this stated, if using this method (i.e. ISO codes) to denote "States" / "Provinces" and possibly Countries, then you do technically have the ability to start out handling only "States" for a single country. In this configuration you would simply remove the dbo.Country table and the CountryCode field (and associated FK) from dbo.CountrySubdvision. Then you would only have CountrySubdvisionCode to place in any related tables.

Now, if you later find that you need to expand the system to handle other countries, you can, at that time, do the following steps (none of which change any existing CountrySubdvisionCode values in any of the related tables):

  1. Create the dbo.Country table
  2. Populate the dbo.Country table
  3. Drop any FKs from related tables that reference dbo.CountrySubdvision
  4. Drop the PK on dbo.CountrySubdvision
  5. Add the CountryCode field to dbo.CountrySubdvision, making it NOT NULL and with a DEFAULT of the CountryCode for the only country that you have been using so far (this DEFAULT can be removed at the end of this process, if you like)
  6. Recreate the PK on dbo.CountrySubdvision to be on (CountryCode, CountrySubdvisionCode)
  7. Create the FK on dbo.CountrySubdvision for the CountryCode field to reference the dbo.Country table
  8. Add the CountryCode field to all of the related tables that already have the CountrySubdvisionCode field, making it NOT NULL and with a DEFAULT of the CountryCode for the only country that you have been using so far (this DEFAULT can be removed at the end of this process, if you like)
  9. Recreate the FKs on those related tables for both fields to reference the dbo.CountrySubdvision table
  10. Remove the Default Constraints on the CountryCode fields from dbo.CountrySubdvision and the related tables