Sql-server – Best type to use in a composite key when one of the values varies in type

database-designsql serversql-server-2008unique-constraint

I am building a data store for the content of multiple blog sites which have been scraped. Each of these sites is going to have an entry in a Blog table

BlogId                  Url                         Version
int, PK, identity       varchar(2500), not null     int, not null
-----------------       -----------------------     -------------
                1       'http://site1.com/                      1
                2       'http://site2.com/                      1
                3       'http://site2.com/                      2

Then there will be a Post table:

BlogId                Url                        PostId
int, FK, not null     varchar(2500) not null     ?

For most blogs that I'm looking at, I can easily discern a PostId value that is unique for them.

For all blogs, the Url will be unique for a BlogId value. However, the Url is frequently subject to change.

However, there are a few blogs that I can't discern a good PostId value when scraping the HTML. In those cases, I can get the datetimeoffset that a post was published on. The accuracy of the publishing frequency is on the minute level.

The question is, I need to build a composite natural key (I won't use it as a primary key because I'll need to relate to other tables and don't want to carry this along to those tables) using the BlogId and the PostId but I'm not sure what I should type the PostId as. I've been thinking of two options:

  • sql_variant – I could use an int for most of my sites, and that would be fine and then use datetimeoffset for the ones where I can't easily discern a unique numeric PostId
  • bigint – The PostId would fit in here and I can encode the datetimeoffset as a bigint (2011-11-11 13:59 would become 201111111359)
  • Or some other approach I haven't thought about.

Some other key facts:

  • The Post table will only have a few thousand records in it initially (if there are more than five thousand, I'd be surprized)
  • There are related tables for tags, categories, etc, which will be joined with this table (maybe a few thousand records at most)
  • The growth of any of these tables won't be astronomical. Maybe a another five thousand records a year (that's being conservative on my part but you never know)
  • This key will only be used to create an index/constraint, currently, I don't foresee the use of it in queries (there are other attributes that I am linking to which are of more concern).

Best Answer

I wouldn't even consider forcing data of different types into one field.

Another option:

  • synthetic post_id and
  • child tables for each 'type' of 'native_post_id'

There are various ways to go about enforcing the subset relationship between these child tables and the parent if necessary