Best practice for database where the values can’t change

schema

Suppose I have a table Inspection with a field for Place (the location where the Inspection occurred). The field is a foreign key to a Place table, where each place has a Name, among other things. The Place names can change over time (for instance the Place could be a restaurant where the name changes when its ownership changes), but we would like the Inspection to continue to hold the original value it was given when it was created.

Is there a best practice for something like this? One thought is to have two fields, one to store the foreign key, regardless of whether it's been updated, and another to hold a string value for the original Place name. Another thought is to simply make Places immutable. You would have to create a new Place every time the name changed, rather than updating existing Places. Perhaps you could even related one or more Places somehow so that you know that multiple rows in the table actually relate to the same physical location.

Best Answer

There's a logical difference between a business and its current location.

When you're talking about inspections, it usually makes sense to store both. The inspection history of an address is meaningful, regardless of who occupies that address. And the inspection history of a business is meaningful, too, regardless of the location that business operates in.

Referenced tables not shown, but should be obvious.

create table inspections (
  address_id integer not null references addresses (address_id),
  inspection_date date not null default current_date,
  -- Instead of business name, you could (should?) store the licensed business
  -- identifier, if the local government issues one.
  business_name varchar(35) not null default 'Unoccupied',
  inspector_id integer not null references inspectors (inspector_id),
  primary key (address_id, inspection_date)
);

I'd probably want to revoke permissions on this table, and allow inserts only through a stored procedure. The goal would be to make sure that application code didn't accidentally omit the business name (or the licensed business identifier, whichever).

You probably can't have a foreign key based on business name; business names aren't guaranteed unique over time, and foreign keys have to reference columns that have a unique constraint. (There are ways to work around that--adding date columns and additional, overlapping unique constraints, but that might be more trouble than it's worth.) In any case, you can't cascade updates. As a historical record, the data in this table has to express the facts at the time of the inspection.

Related Solutions

Sql-server – SQL Server 2008 R2 : A Possible indexing strategy for the given schema

The way this is designed you only have suboptimal choices. Random GUIDs are not well suited as clustered index keys, since they are neither small (which affects the size of all secondary indexes) nor sequential (unless you can use NEWSEQUENTIALID()) which leads to index fragmentation, which leads to wasted space, slower insert performance and slower query performance through more I/O.

On the other hand, if your normalized tables are linked by such a GUID then each join depends on them and you will have to bite the bullet and use them as primary keys with clustered index anyway. Just create the PRIMARY KEY constraint and the clustered index in separate steps so you can define PAD_INDEX = ON and FILLFACTOR=50 to slow down the fragmentation somewhat. Still, expect to do regular, expensive index REBUILDs to reduce the inevitable fragmentation.

Your secondary indexes must not start with the id, because that renders them useless! Imagine a telephone book, where each entry is given a random or running id, then the phone book is sorted by that id plus the name. Have fun searching a given name in that. A useable index must start with the column that is used in the where- or join clause.

So, with the clustered indexes created so far you cover queries of the type

SELECT p.productname, s.name as StoreName 
FROM Products p
INNER JOIN Store s ON p.storeid = s.id

The query runs through the products, can efficiently look up the store ids and has immediate access to the store name, since the store id index is clustered.

Now you want to do this:

SELECT p.productname, s.name as StoreName 
FROM Products p
INNER JOIN Store s ON p.storeid = s.id
WHERE p.productname LIKE 'A%'

For this you need a nonclustered index with just productname as the key column (and optionally storeid as included column, if you do frequent range searches on productname).

OK, what about the reverse case?

SELECT p.productname, s.name as StoreName 
FROM Store s
INNER JOIN Products p ON p.storeid = s.id
WHERE s.name = 'My little cornershop'

For this, you need two additional indexes: One nonclustered on store with the name column and one nonclustered on products with storeid as the column. SQL Server can efficiently find the store record (expecting only one record), then through the second index can find all product entries for this store (still only a few compared to all entries in product), then for each of these products go through the clustered index (the clustered index key is automatically part of each nonclustered index) to get to the productname column.

I hope you see the pattern here. Create a nonclustered index for each column that gets queried with a high selectivity (meaning that only a small subset of all the rows will be selected).

The row-columns are completely useless in this scenario, just drop them to save space.

Using client generated GUIDs is attractive from the client point of view. You can create coherent datasets (such as a new customer including his first order) and push them to the database without caring for the correct INSERT order and without having to read database generated ids afterwards to update your object model. But you pay a nontrivial performance price for this when it comes to getting the data back from the database, as I hopefully made clear above. The large primary key (8 bytes) gets added to each nonclustered index and blows up its size, and you get a heavily fragmented clustered index which is never good.

Using IDENTITY values for primary keys has disadvantages at INSERT time, but pays off every time after that.

Postgresql – Schema for lists of relationship differences

If I understand your question correctly you are looking to store, let's say, a Person entity with Name and DateOfBirth attributes. You will also have an Employee entity which is defined as everything a Person has plus EmploymentNumber plus DeskNumber but doesn't have DateOfBirth. I'm guessing you want to be able to add a column to a base type and automatically see it showing up in the sub-type(s), and sub-sub-type(s). In short you are looking for an object-oriented data store. RDBMSs in general are not will suited to deliver this featrure.

If you know what these differences are and they are stable, at least between schema relases, you can define a type/ sub-type model. The base table has the columns which are common to all types. It has a surrogate primary key. From there you define further tables, strictly adding columns as you go. The same ID is carried through this hierarchy. The case where you need to remove columns from one type to a descendant is handled by defining an abstract common ancestor and placing the removed column in one branch but not the other.

For my example above we would end up with these tables

Human
    ID     PrimaryKey
    Name

Person
    ID     Primary key and also foreign key to Human.ID
    DateOfBirth

Employee
    ID     Primary key and also foreign key to Human.ID
    EmploymentNumber
    DeskNumber

You could add an column to the base table (Human) to indicate which of the sub-types any particular occurance writes to. I don't think that is necessary because the process which writes to this database has to know which type it is dealing with in order to capture the correct values. Even if that process dynamically builds its list of attributes by examining the DB, it has to know which ultimate sub-type it is looking for to bootstrap the process.

It is tempting to have, say, EmployeeID and a separate foreign key HumanID. This is unnecessary.

When retrieving values you will either be interested in a known occurrance of a sub-type or will be looking for all available values for an instance i.e. "Human 99 as Employee" or "All about occurrance 99". The former can be had by INNER JOINing Employee up to its ultimate ancestor using the defined keys. The latter by OUTER JOINing all tables in the model.

If your attributes can vary at run time you will be forced into using some variant of an entity-attribute-value (EAV) model. This has been well documented in many, many blogs, papers, forums and SO questions. While being fractious, conceptually challenging and non-performant these models can be made to work (for an appropriate definiton of "work"). Your tables will be a lot like this:

ItemType
    ItemTypeID    primary key
    Name
    ParentTypeID  fk to ItemTypeID

Attribute
    AttributeID   primarykey
    Name
    AddedOrDeletedInd
    ItemTypeID    fk to ItemTypeID

Item
    ItemID
    ItemTypeID    fk to ItemTypeID
    ParentItemID  fk to ItemID

ItemValue
    AttributeID   fk to Attribute
    ItemID        fk to Item
    Value

ItemValue.Value could be a catch-all varchar() column or you could have one for each datatype you want i.e. ValueInt, ValueDate, ValueChar etc. and and indicator in Attribute to say which is populated. As you can see reading any one item's full definition will require recursion through a tree. Good luck.

Best Answer

Related Solutions

Sql-server – SQL Server 2008 R2 : A Possible indexing strategy for the given schema

Postgresql – Schema for lists of relationship differences

Related Question