I am building a data store for the content of multiple blog sites which have been scraped. Each of these sites is going to have an entry in a Blog
table
BlogId Url Version
int, PK, identity varchar(2500), not null int, not null
----------------- ----------------------- -------------
1 'http://site1.com/ 1
2 'http://site2.com/ 1
3 'http://site2.com/ 2
Then there will be a Post
table:
BlogId Url PostId
int, FK, not null varchar(2500) not null ?
For most blogs that I'm looking at, I can easily discern a PostId
value that is unique for them.
For all blogs, the Url
will be unique for a BlogId
value. However, the Url
is frequently subject to change.
However, there are a few blogs that I can't discern a good PostId
value when scraping the HTML. In those cases, I can get the datetimeoffset
that a post was published on. The accuracy of the publishing frequency is on the minute level.
The question is, I need to build a composite natural key (I won't use it as a primary key because I'll need to relate to other tables and don't want to carry this along to those tables) using the BlogId
and the PostId
but I'm not sure what I should type the PostId
as. I've been thinking of two options:
sql_variant
– I could use anint
for most of my sites, and that would be fine and then usedatetimeoffset
for the ones where I can't easily discern a unique numericPostId
bigint
– ThePostId
would fit in here and I can encode thedatetimeoffset
as abigint
(2011-11-11 13:59
would become201111111359
)- Or some other approach I haven't thought about.
Some other key facts:
- The
Post
table will only have a few thousand records in it initially (if there are more than five thousand, I'd be surprized) - There are related tables for tags, categories, etc, which will be joined with this table (maybe a few thousand records at most)
- The growth of any of these tables won't be astronomical. Maybe a another five thousand records a year (that's being conservative on my part but you never know)
- This key will only be used to create an index/constraint, currently, I don't foresee the use of it in queries (there are other attributes that I am linking to which are of more concern).
Best Answer
I wouldn't even consider forcing data of different types into one field.
Another option:
There are various ways to go about enforcing the subset relationship between these child tables and the parent if necessary