Sql-server – NVARCHAR column as PRIMARY KEY or as UNIQUE column

primary-keysql serverunique-constraint

I'm developing a SQL Server 2012 database and I have a doubt about nvarchar columns as primary keys.

I have this table:

CREATE TABLE [dbo].[CODES]
(
    [ID_CODE] [bigint] IDENTITY(1,1) NOT NULL,
    [CODE_LEVEL] [tinyint] NOT NULL,
    [CODE] [nvarchar](20) NOT NULL,
    [FLAG] [tinyint] NOT NULL,
    [IS_TRANSMITTED] [bit] NOT NULL DEFAULT 0,
     CONSTRAINT [PK_CODES] PRIMARY KEY CLUSTERED 
    (
        [CODE_LEVEL] ASC,
        [CODE] ASC
    )
)

But now I want to use [CODE] column as primary key and remove [ID_CODE] column.

Is there any issue or penalty if I have a NVARCHAR column as PRIMARY KEY?

[CODE] column value must be unique, so I've thought that I can set an UNIQUE constraint to that column.

Do I have to use [CODE] as primary key or it is better if I set an UNIQUE constraint on [CODE] column?

Best Answer

Yes, absolutely there are negative consequences for using a string instead of a numeric type for a Primary Key, and even more so if that PK is Clustered (which it indeed is in your case). However, the degree to which you see the effect(s) of using a string field is a function of a) how many rows are in this table, and b) how many rows in other tables are Foreign Keyed to this PK. If you only have 10k rows in this table and 100k rows in a few other tables that FK to this table via that field, then maybe it won't be so noticeable. But those effects certainly become more noticeable as row counts increase.

You need to consider that the fields in a Clustered Index are carried over to Non-Clustered Indexes. So you aren't just looking at up to 40 bytes per row, but (40 * some_number) bytes. And in any FK tables you have those same 40 bytes in the row plus more often than not there will be a Non-Clustered index on that field as it is being used in JOINs, so now it is really doubled in any tables that FK to this one. If one is inclined to think that 40 bytes * 1 million rows * 10 copies of it is nothing to be concerned about, please see my article Disk Is Cheap! ORLY? which details all (or at least most) of the areas impacted by this decision.

The other thing to consider is that filtering and sorting on strings, especially when not using a binary Collation (I assume you are using the database default which is typically case-insensitive) is far less efficient (i.e. takes longer) than when using INT / BIGINT. This impacts all queries that filter / join / sort on this field.

Hence, using something like CHAR(5) would probably be OK for a Clustered PK, but mostly if it was also defined with COLLATE Latin1_General_100_BIN2 (or something like that).

And can the value of [CODE] ever change? If yes then that is even more reason to not use it as a PK (even if you do set the FKs to ON UPDATE CASCADE). If it can't or won't ever change that is fine, but still there is more than enough reason already to not use it as a Clustered PK.

Of course, the question might be incorrectly phrased as it appears that you currently already have this field in your PK.

Regardless, your best option, by far, is to use [ID_CODE] as the Clustered PK, use that field in related tables as the FK, and keep [CODE] as a UNIQUE INDEX (which means it is an "alternate key").


Update
A little more info based on this question in a comment on this answer:

Is [ID_CODE], as PRIMARY KEY, the best option if I use [CODE] column to look up the table?

This all depends on a great many factors, some of which I have already mentioned but will restate:

A Primary Key is how the individual row is identified, whether or not it is referenced by any Foreign Keys. How your system internally identifies the row is related to, but not necessarily the same as, how your users identify themselves / that row. Any NOT NULL column with unique data could work, but there are practicality issues to consider, especially if the PK is, in fact, referenced by any FKs. For example GUIDs are unique and some people really like using them for various reasons, but they are quite bad for Clustered Indexes (NEWSEQUENTIALID is better, but not perfect). On the other hand, GUIDs are just fine as alternate keys and used by the app to look up the row, but the JOINs are still done using an INT (or similar) PK.

So far you haven't told us how the [CODE] field fits into the system from all angles, outside of now mentioning that this is how you look up rows, but is that for all queries or just some? Hence:

  • Regarding the [CODE] value:

    • How is it generated?
    • Is it incremental or psuedo-random?
    • Is it uniform length or varying length?
    • What characters are used?
    • If using alphabetical characters: is it case-sensitive or insensitive?
    • Can it ever change after being inserted?
  • Regarding this table:

    • Do any other tables FK to this table? Or are these fields ([CODE] or [ID_CODE]) used in other tables, even if not explicitly Foreign Keyed?
    • If [CODE] is the only field used to get individual rows, then what purpose does the [ID_CODE] field serve? If it isn't used, why have it in the first place (which might depend on the answer to "Can the [CODE] field ever change?")?
    • How many rows in this table?
    • If other tables to reference this table, how many and how many rows in each of them?
    • What are the indexes for this table?

This decision can't be made purely on the question of "NVARCHAR yes or no?". I again will say that generally speaking I do not find it to be a good idea, but there are certainly times when it is fine. Given so few fields in this table it is not likely that there are any more, or at least not many, indexes. So you might be fine either way to have [CODE] as the Clustered Index. And if no other tables reference this table then you might also be fine making it the PK. But, if other tables do reference this table then I would opt for the [ID_CODE] field as the PK, even if Non-Clustered.