SQL Server – How to Find Minimum Set of Fields for Unique Composite Key

database-design, primary-key, sql-server, unique-constraint

I am importing flat files from different sources into tables in SQL Server. I am creating a composite primary key using a combination of fields from the extracts that will give me a unique key for every row.

The way I do it now is to start with one field and keep concatenating fields until I find a combination that is unique for all records. This can be time-consuming, and I might end up concatenating more columns than I really need in order to obtain the unique key.
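
For illustration (the table and column names here are placeholders, not my real schema), the check I repeat for each candidate combination looks something like this:

    -- Test one candidate combination; if any duplicate groups come back,
    -- I add another column and re-run.
    SELECT [memberid], [claimid], COUNT(*) AS [DupCount]
    FROM dbo.[Extract]
    GROUP BY [memberid], [claimid]
    HAVING COUNT(*) > 1;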

Is there some sort of SQL script I can run on a table that will give me the minimum set of fields (names) I would need to concatenate in order to obtain a unique key? So if there is one field in the table that is unique for all records, then that one field name would be returned. If I needed to concatenate [memberid], [claimid], and [date of service] to obtain a unique key, then those three field names would be the result of the script.

Best Answer

I am creating a composite primary key using a combination of fields from the extracts that will give me a unique key for every row.

Um, this is not exactly what a Primary Key is for. Yes, Primary Keys do uniquely identify each row, but they are also the basis for supporting relationships to sibling and child tables.

Is there some sort of SQL script I can run on a table that will give me the minimum set of fields (names) I would need to concatenate in order to obtain a unique key?

Not outside of what you are already doing, although maybe in a slightly different form, such as loading the data into a table with no keys, unique indexes, or unique constraints defined, and then trying to create the PK or Unique (Index | Constraint) on various combinations of fields. In either case, you probably shouldn't be doing this in the first place.
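
For example, a rough sketch of that "load it bare, then try the constraint" idea, using placeholder table and column names:

    -- Attempt to add a UNIQUE constraint on a candidate combination.
    -- If duplicates exist, the ALTER TABLE fails and TRY/CATCH reports
    -- it; if it succeeds, drop the constraint before testing the next
    -- combination.
    BEGIN TRY
        ALTER TABLE dbo.[Staging]
            ADD CONSTRAINT [UQ_Staging_Candidate]
            UNIQUE ([memberid], [claimid], [dateofservice]);
        PRINT 'Candidate combination is unique.';
        ALTER TABLE dbo.[Staging] DROP CONSTRAINT [UQ_Staging_Candidate];
    END TRY
    BEGIN CATCH
        PRINT 'Not unique: ' + ERROR_MESSAGE();
    END CATCH;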

There are a few problems with this approach in general:

  • What if there are several options for uniqueness? Any combination of one or more individual fields and/or one or more sets of multiple fields? FieldA could be unique, and FieldD + FieldH could also be unique. What then? (A sketch comparing several candidates at once follows this list.)
  • What if no individual fields or combinations of fields are unique? What is the plan if using all of the fields still doesn't get you uniqueness?
  • What about bad data and/or errors in your import process? Scanning the incoming data is highly error-prone because you don't know how much you can trust the data to begin with. Your system needs some way of knowing whether the incoming data is good or bad, not merely assuming it to be good and basing business rules around what could just as easily be a bug in the external system's export process (cuz that neeeeeeever happens ;-).
  • Should any of these import tables relate to each other? If so, finding uniqueness programmatically might pick different fields for tables that should relate but then can't, because at least one of the tables is unique across multiple combinations of fields and the process found an incompatible combination before the relatable one.
  • How are you deciding what the types are? What about 0x02FB4C97? Is that a VARBINARY value or a string of hex bytes? What about 123456? Is that an INT, a BIGINT, a VARCHAR, a DATETIME (in Julian format), or a VARBINARY (without the leading 0x, and with no A-F digits to help decide)?
  • What happens when the "data" in one or more columns radically changes, but for good/natural reasons? What if a field that is currently empty, and that you think is a string, is really a date field that hasn't been used yet, or is naturally NULL for the particular export you are looking at? What about max lengths for variable-length columns? What about a "comments" field that is currently only used for a 5-digit number, but later starts holding actual comments?
  • As @MaxVernon mentioned in a comment on the Question, the uniqueness of the fields can change as new data comes in later since you don't really know the rules that govern how the data truly exists.
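
To illustrate the first point above, here is a sketch (with hypothetical table and column names) that compares several candidate combinations in a single scan; a candidate is unique only when its distinct count equals [TotalRows]:

    -- CONCAT requires SQL Server 2012 or newer. The '|' separator guards
    -- against false matches from naive concatenation (e.g., 'ab' + 'c'
    -- vs. 'a' + 'bc'). COUNT(DISTINCT ...) ignores NULLs, so NULLable
    -- fields need extra care.
    SELECT COUNT(*)                                        AS [TotalRows],
           COUNT(DISTINCT [FieldA])                        AS [Distinct_A],
           COUNT(DISTINCT CONCAT([FieldD], '|', [FieldH])) AS [Distinct_D_H]
    FROM dbo.[Import];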

So this comes down to: what is the actual goal of defining the PKs to begin with? What are you trying to accomplish by doing this? Is there a reason you don't just add an IDENTITY field and then remove duplicates across all of the imported fields (all of them minus the IDENTITY field)?
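
A minimal sketch of that alternative (the table and field names are hypothetical): give every row a surrogate IDENTITY key, then delete duplicates across the imported fields:

    CREATE TABLE dbo.[Import]
    (
        [ImportID] INT IDENTITY(1, 1) NOT NULL
            CONSTRAINT [PK_Import] PRIMARY KEY,
        [FieldA] VARCHAR(50) NULL,
        [FieldB] VARCHAR(50) NULL,
        [FieldC] VARCHAR(50) NULL
    );

    -- After loading, remove duplicates across all imported fields,
    -- keeping the lowest [ImportID] in each duplicate group.
    ;WITH dupes AS
    (
        SELECT ROW_NUMBER() OVER
               (
                   PARTITION BY [FieldA], [FieldB], [FieldC]
                   ORDER BY [ImportID]
               ) AS [RowNum]
        FROM dbo.[Import]
    )
    DELETE FROM dupes
    WHERE [RowNum] > 1;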

You really need to find out more about the true nature of the data first, and then build a table to hold the data with keys and constraints that match how the data should exist, not necessarily how it does exist.