Sql-server – Schema design: Use of association (aka: bridge/junction) table vs foreign key constraint with composite index containing a non-key field

Tags: database-design, foreign-key, index, sql-server-2008

This is an inventory database for IT assets. The models used are trimmed in order to focus on the problem at hand. Using SQL Server 2008. Thanks for taking the time to read and for any input you can provide.

My design includes a Device table which holds the various devices that can be entered into inventory. Each device has a boolean flag, CanNetwork, which states whether a device has network capability, e.g., for most computers CanNetwork = true, for hard drives CanNetwork = false; some printers will be true, and others will be false. You get the idea.

The CanNetwork field determines if network-related information is relevant when an inventory record is created.

Design 1

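My first design uses a unique index on (Device.DeviceID, Device.CanNetwork) as the target of a foreign key constraint from the Asset table. (SQL Server requires a foreign key to reference a primary key, a unique constraint, or a unique index.)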

Schema #1: Foreign key constraint with index

The NetworkStatus table looks like this in this setup:

+----------------------------------------------------------------------+
| NetworkStatusID | NetworkStatus  | NetworkStatusDescription          |
|----------------------------------------------------------------------|
| 1               | Connected      | Device connected to network.      |
| 2               | Not Connected  | Device not connected to network.  |
| 3               | Do Not Connect | Do not connect device to network. |
+----------------------------------------------------------------------+

I put check constraints on the Asset table as follows to ensure a network status and network information can only be provided if the device is capable of connecting to a network.

-- Asset table check constraint: CK_CanNetwork
-- If can't network, then network fields are null
(CanNetwork = 0
    AND NetworkStatusID IS NULL
    AND Hostname IS NULL
    AND IPAddress IS NULL
    AND MACAddress IS NULL)
OR CanNetwork = 1

-- Asset table check constraint: CK_NetworkStatus
-- If network status is "Connected", then must provide
-- a hostname or ip address
(NetworkStatusID = 1 AND Hostname IS NOT NULL)
OR (NetworkStatusID = 1 AND IPAddress IS NOT NULL)
OR NetworkStatusID <> 1

The issue I have with this design is I'm not sure if the relationship with Asset and Device.DeviceID/Device.CanNetwork is a good or bad design decision. Is propagating a non-key field like CanNetwork to other tables a bad design? I don't have enough experience with database design to make an informed decision.
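For readers who want to poke at Design 1, here is a minimal sketch of it. It uses SQLite through Python's sqlite3 module purely as a stand-in for SQL Server 2008, and the column types, the sample rows, and the helper names make_design1_db and asset_accepted are assumptions for illustration, not part of the original schema. Only CK_CanNetwork is reproduced; the NetworkStatus table and CK_NetworkStatus are left out to keep it short.

```python
import sqlite3

# Sketch of Design 1 (SQLite stand-in for SQL Server; types are assumptions).
DESIGN1_DDL = """
CREATE TABLE Device (
    DeviceID   INTEGER PRIMARY KEY,
    CanNetwork INTEGER NOT NULL CHECK (CanNetwork IN (0, 1)),
    UNIQUE (DeviceID, CanNetwork)      -- target of the composite foreign key
);
CREATE TABLE Asset (
    AssetID         INTEGER PRIMARY KEY,
    DeviceID        INTEGER NOT NULL,
    CanNetwork      INTEGER NOT NULL,  -- the propagated non-key column
    NetworkStatusID INTEGER,
    Hostname        TEXT,
    IPAddress       TEXT,
    MACAddress      TEXT,
    FOREIGN KEY (DeviceID, CanNetwork)
        REFERENCES Device (DeviceID, CanNetwork),
    CONSTRAINT CK_CanNetwork CHECK (
        (CanNetwork = 0
            AND NetworkStatusID IS NULL
            AND Hostname IS NULL
            AND IPAddress IS NULL
            AND MACAddress IS NULL)
        OR CanNetwork = 1
    )
);
"""


def make_design1_db():
    """Create an in-memory database with one networkable and one plain device."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(DESIGN1_DDL)
    conn.executemany("INSERT INTO Device VALUES (?, ?)", [(1, 1), (2, 0)])
    return conn


def asset_accepted(conn, device_id, can_network, hostname=None, status=None):
    """Return True if the Asset row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO Asset (DeviceID, CanNetwork, NetworkStatusID, Hostname) "
            "VALUES (?, ?, ?, ?)",
            (device_id, can_network, status, hostname),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

In this sketch the composite foreign key stops an Asset row from claiming CanNetwork = 1 for a device recorded as CanNetwork = 0, and the CHECK constraint then rejects network fields on non-networkable assets.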

Design 2

In this design I thought I would use a bridge/association/junction table to decide which network statuses are valid for a device. It looks like this:

Schema #2: DeviceNetworkStatus association table

The NetworkStatus table looks like this in this setup (notice the addition of the record with id #4, and the ForCanNetwork field, which specifies whether a status is for use with devices that can connect to a network):

+--------------------------------------------------------------------------------------+
| NetworkStatusID | NetworkStatus  | NetworkStatusDescription          | ForCanNetwork |
|--------------------------------------------------------------------------------------|
| 1               | Connected      | Device connected to network.      | True  (1)     |
| 2               | Not Connected  | Device not connected to network.  | True  (1)     |
| 3               | Do Not Connect | Do not connect device to network. | True  (1)     |
| 4               | Incapable      | Cannot connect to networks.       | False (0)     |
+--------------------------------------------------------------------------------------+

Because this design is so granular, I could theoretically allow any mix of statuses per device, but I wanted to control it, so I wrote triggers that insert only the correct mix of statuses depending on whether the device is network capable. The triggers are as follows:

-- NetworkStatus table on INSERT trigger
-- Adds a record for each device that
-- matches the capability of the network status
INSERT INTO DeviceNetworkStatus
SELECT i.NetworkStatusID, dev.DeviceID
FROM Device dev
CROSS JOIN
inserted i
WHERE dev.CanNetwork = i.ForCanNetwork

-- Device table on INSERT trigger
-- Adds a record for each network status that
-- matches the capability of the new device
INSERT INTO DeviceNetworkStatus
SELECT ns.NetworkStatusID, i.DeviceID
FROM NetworkStatus ns
CROSS JOIN
inserted i
WHERE ns.ForCanNetwork = i.CanNetwork

I used the following CHECK constraint on the Asset table:

-- Asset table check constraint: CK_NetworkStatus
-- If network status is "Connected", then must provide
-- a hostname or ip address
-- If network status is "Incapable", then network fields
-- must be null
(NetworkStatusID = 1 AND Hostname IS NOT NULL)
OR (NetworkStatusID = 1 AND IPAddress IS NOT NULL)
OR (NetworkStatusID = 4 AND Hostname IS NULL
    AND IPAddress IS NULL AND MACAddress IS NULL)
OR (NetworkStatusID <> 1 AND NetworkStatusID <> 4)

This design eliminates the need to propagate CanNetwork across the tables. The issue I see with this design is that every device that has network capability will have records in DeviceNetworkStatus paired with NetworkStatus ids 1, 2 and 3, while devices that can't connect to a network will be paired only with NetworkStatus id 4. It seems like a lot of extra records that all mean the same thing: devices that can be networked can only use statuses 1, 2 and 3, and devices that can't network only use 4. This design seems to be more "relationally correct", but also smells a bit.
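A rough sketch of Design 2 makes the "lots of extra records" concern easy to see. It again uses SQLite via Python's sqlite3 as a stand-in for SQL Server. The column types, the trigger name TR_Device_Insert, and the helper names are assumptions; only the Device-side trigger is shown; and SQLite triggers fire per row with NEW, whereas the SQL Server triggers above work on the inserted pseudo-table.

```python
import sqlite3

# Sketch of Design 2 (SQLite stand-in; names and types are assumptions).
DESIGN2_DDL = """
CREATE TABLE NetworkStatus (
    NetworkStatusID INTEGER PRIMARY KEY,
    NetworkStatus   TEXT NOT NULL,
    ForCanNetwork   INTEGER NOT NULL
);
CREATE TABLE Device (
    DeviceID   INTEGER PRIMARY KEY,
    CanNetwork INTEGER NOT NULL
);
CREATE TABLE DeviceNetworkStatus (
    NetworkStatusID INTEGER NOT NULL REFERENCES NetworkStatus (NetworkStatusID),
    DeviceID        INTEGER NOT NULL REFERENCES Device (DeviceID),
    PRIMARY KEY (NetworkStatusID, DeviceID)
);
CREATE TABLE Asset (
    AssetID         INTEGER PRIMARY KEY,
    DeviceID        INTEGER NOT NULL,
    NetworkStatusID INTEGER NOT NULL,
    FOREIGN KEY (NetworkStatusID, DeviceID)
        REFERENCES DeviceNetworkStatus (NetworkStatusID, DeviceID)
);
-- Row-level counterpart of the "Device table on INSERT trigger" above.
CREATE TRIGGER TR_Device_Insert AFTER INSERT ON Device
BEGIN
    INSERT INTO DeviceNetworkStatus (NetworkStatusID, DeviceID)
    SELECT ns.NetworkStatusID, NEW.DeviceID
    FROM NetworkStatus ns
    WHERE ns.ForCanNetwork = NEW.CanNetwork;
END;
"""


def make_design2_db():
    """Load the four statuses first, then two devices so the trigger can pair them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(DESIGN2_DDL)
    conn.executemany(
        "INSERT INTO NetworkStatus VALUES (?, ?, ?)",
        [(1, "Connected", 1), (2, "Not Connected", 1),
         (3, "Do Not Connect", 1), (4, "Incapable", 0)],
    )
    conn.executemany("INSERT INTO Device VALUES (?, ?)", [(1, 1), (2, 0)])
    return conn


def status_allowed(conn, device_id, status_id):
    """Return True if an Asset with this device/status pair is accepted."""
    try:
        conn.execute(
            "INSERT INTO Asset (DeviceID, NetworkStatusID) VALUES (?, ?)",
            (device_id, status_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this sample data the networkable device gets three bridge rows and the other device gets one, which is exactly the repetition described above.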

Update

The following update proposes variations on Design 1. I come across situations like this often, where there are many ways to achieve the same end result. I never know how to tell if there are hidden problems with the designs, and I can't judge when to normalize or denormalize. Is one of these designs preferred over the other and why?

Design 1.1

Schema #1.1: NetworkAsset subtype

-- NetworkAsset table check constraint: CK_CanNetwork
CanNetwork = 1

-- NetworkAsset table check constraint: CK_NetworkStatus
-- If network status is "Connected", then must provide
-- a hostname or ip address
NetworkStatusID = 1 AND Hostname IS NOT NULL
OR NetworkStatusID = 1 AND IPAddress IS NOT NULL
OR NetworkStatusID <> 1

Design 1.2

Schema #1.2: NetworkAsset and NetworkDevice subtype

For the record, this design seems a bit absurd even to me, but this is my thought process.
In this one, the presence of a DeviceID in the NetworkDevice table is equivalent to saying CanNetwork = true in Design 1. The NetworkAsset.NetworkDeviceID column has a foreign key constraint and is used to ensure only networkable devices are entered. This can be done using a CHECK constraint (see below), and by making NetworkDeviceID a computed column that is equal to DeviceID.

-- NetworkAsset table check constraint: CK_IsNetworkDevice
NetworkDeviceID = DeviceID

Best Answer

I have a surprise for you: you don't need CanNetwork at all!

You have two sets of devices: all devices and devices that can be attached to the network. Only the latter have a network status, and that status can be only one of two values: connected and not connected. (There may be administrative values, too, such as what network it's allowed to attach to, or its netmask or something. Those attributes also do not apply to non-networked devices.)

The presence of a device in the NetworkedDevices table indicates that it can be attached to the network. Absence from that table means it cannot; that is what's known as the Closed World Assumption.

In general, a boolean-valued column that indicates whether or not the thing has a relationship to another table is a hint that you need two tables instead of a flag.
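One way to picture the two-table idea is the sketch below, written against SQLite through Python's sqlite3 rather than SQL Server. The table name Devices, the columns Name, Connected, Hostname and IPAddress, and the helper names are assumptions made for the example; the answer itself only names NetworkedDevices. CanNetwork is no longer stored anywhere and is derived from whether a matching row exists.

```python
import sqlite3

# Sketch of "two tables instead of a flag" (SQLite stand-in; names are assumptions).
TWO_TABLE_DDL = """
CREATE TABLE Devices (
    DeviceID INTEGER PRIMARY KEY,
    Name     TEXT NOT NULL
);
-- A row here means the device can be attached to the network.
CREATE TABLE NetworkedDevices (
    DeviceID  INTEGER PRIMARY KEY REFERENCES Devices (DeviceID),
    Connected INTEGER NOT NULL CHECK (Connected IN (0, 1)),
    Hostname  TEXT,
    IPAddress TEXT,
    -- A connected device must have a hostname or an IP address.
    CHECK (Connected = 0 OR Hostname IS NOT NULL OR IPAddress IS NOT NULL)
);
"""


def make_two_table_db():
    """Create the schema with a laptop (networkable) and a hard drive (not)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(TWO_TABLE_DDL)
    conn.executemany(
        "INSERT INTO Devices VALUES (?, ?)", [(1, "Laptop"), (2, "Hard drive")]
    )
    conn.execute("INSERT INTO NetworkedDevices VALUES (1, 1, 'lt-01', NULL)")
    return conn


def devices_with_capability(conn):
    """Derive CanNetwork from the presence of a NetworkedDevices row."""
    return conn.execute(
        "SELECT d.DeviceID, d.Name, nd.DeviceID IS NOT NULL AS CanNetwork "
        "FROM Devices d "
        "LEFT JOIN NetworkedDevices nd ON nd.DeviceID = d.DeviceID "
        "ORDER BY d.DeviceID"
    ).fetchall()
```

Tables that only make sense for networkable devices can then point their foreign keys at NetworkedDevices instead of Devices.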

HTH.

Related Solutions

Database normalization: Describing trait – foreign key to table, or varchar field with value

The design choices you describe are not directly related to normalization.

I agree there should be a lookup table.

I think an OrderStatusID value would increase redundancy. The status (text) value presumably already satisfies many of the qualities of a good key: unique, stable, narrow, familiar to users, etc. Referential integrity can be applied to VARCHAR columns, of course! Each application that uses the key can assign it an enum as required and would be responsible for mapping enum values to status (text) values. This would presumably make the lookup table a single column, 'all-key' table (and therefore would satisfy 6NF, the highest normal form ;)

[If OrderStatusID is an attribute in the Order table then it would not be in 6NF but, as I say, I don't think you are actually asking about normalization at all.]

Inventory database structure when inventory items have varying attributes

Supertype/Subtype

How about looking into the supertype/subtype pattern? Common columns go in a parent table. Each distinct type has its own table with the ID of the parent as its own PK and it contains unique columns not common to all subtypes. You can include a type column in both parent and children tables to ensure each device can't be more than one subtype. Make an FK between the children and the parent on (ItemID, ItemTypeID). You can use FKs to either the supertype or subtype tables to maintain the desired integrity elsewhere. For example, if the ItemID of any type is allowed, create the FK to the parent table. If only SubItemType1 can be referenced, create the FK to that table. I would leave the TypeID out of referencing tables.
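The type-column trick described above can be sketched roughly as follows, using SQLite through Python's sqlite3 as a stand-in for SQL Server. The Printer and Router subtypes, their columns, the type codes 1 and 2, and the helper names are all made up for illustration. The subtype keys are named after the subtype (PrinterID, RouterID) rather than ItemID.

```python
import sqlite3

# Sketch of supertype/subtype with a type column (SQLite stand-in; names are assumptions).
SUBTYPE_DDL = """
CREATE TABLE Item (
    ItemID     INTEGER PRIMARY KEY,
    ItemTypeID INTEGER NOT NULL,
    Name       TEXT NOT NULL,
    UNIQUE (ItemID, ItemTypeID)        -- lets subtypes reference both columns
);
CREATE TABLE Printer (
    PrinterID      INTEGER PRIMARY KEY,
    ItemTypeID     INTEGER NOT NULL CHECK (ItemTypeID = 1),
    PagesPerMinute INTEGER,
    FOREIGN KEY (PrinterID, ItemTypeID) REFERENCES Item (ItemID, ItemTypeID)
);
CREATE TABLE Router (
    RouterID   INTEGER PRIMARY KEY,
    ItemTypeID INTEGER NOT NULL CHECK (ItemTypeID = 2),
    PortCount  INTEGER,
    FOREIGN KEY (RouterID, ItemTypeID) REFERENCES Item (ItemID, ItemTypeID)
);
"""


def make_subtype_db():
    """Create the schema and one Item of type 1 (a printer)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SUBTYPE_DDL)
    conn.execute("INSERT INTO Item VALUES (10, 1, 'LaserJet')")
    return conn


def row_accepted(conn, sql, params):
    """Return True if the statement succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

Because each subtype table pins ItemTypeID with a CHECK and the foreign key includes that column, an item can appear in only the one subtype table that matches its type.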

Naming

When it comes to naming, you have two choices as I see it (since the third choice of just "ID" is in my mind a strong anti-pattern). Either call the subtype key ItemID like it is in the parent table, or call it the subtype name such as DoohickeyID. After some thought and some experience with this, I advocate calling it DoohickeyID. The reason for this is that even though there could be confusion about the subtype table really containing Items in disguise (rather than Doohickeys), that is a small negative compared to creating an FK to the Doohickey table and having the column names not match!

To EAV or not to EAV - My experience with an EAV database

If EAV is what you truly have to do, then it's what you have to do. But what if it weren't what you had to do?

I built an EAV database that is in use in a business. Thank God, the set of data is small (though there are dozens of item types) so the performance is not bad. But it would be bad if the database had more than a few thousand items in it! Additionally, the tables are so HARD to query. This experience has led me to really desire to avoid EAV databases in the future if at all possible.

Now, in my database I created a stored procedure that automatically builds PIVOTed views for each and every subtype that exists. I can just query from AutoDoohickey. My metadata about the subtypes has a "ShortName" column containing an object-safe name suitable for use in view names. I even made the views updateable! Unfortunately, you cannot update them on a join, but you CAN insert to them an already-existing row, which will be converted to an UPDATE. Unfortunately, you cannot update only a few columns, because there is no way to indicate to the VIEW which columns you want to update with the INSERT-to-UPDATE conversion process: a NULL value looks like "update this column to NULL" even if you wanted to indicate "Don't update this column at all."

Despite all this decoration to make the EAV database easier to use, I still don't use these views in most normal querying because it is SLOW. Query predicates are not pushed all the way down to the Value table, so the view has to build an intermediate result set of all the items of that view's type before filtering. Ouch. So I have many, many queries with many, many joins, each one going out to get a different value and so on. They perform relatively well, but ouch! Here's an example. The SP that creates this (and its update trigger) is one giant beast, and I'm proud of it, but it is not something you want to ever try to maintain.

CREATE VIEW [dbo].[AutoModule]
AS
--This view is automatically generated by the stored procedure AutoViewCreate
SELECT
   ElementID,
   ElementTypeID,
   Convert(nvarchar(160), [3]) [FullName],
   Convert(nvarchar(1024), [435]) [Descr],
   Convert(nvarchar(255), [439]) [Comment],
   Convert(bit, [438]) [MissionCritical],
   Convert(int, [464]) [SupportGroup],
   Convert(int, [461]) [SupportHours],
   Convert(nvarchar(40), [4]) [Ver],
   Convert(bit, [28744]) [UsesJava],
   Convert(nvarchar(256), [28745]) [JavaVersions],
   Convert(bit, [28746]) [UsesIE],
   Convert(nvarchar(256), [28747]) [IEVersions],
   Convert(bit, [28748]) [UsesAcrobat],
   Convert(nvarchar(256), [28749]) [AcrobatVersions],
   Convert(bit, [28794]) [UsesDotNet],
   Convert(nvarchar(256), [28795]) [DotNetVersions],
   Convert(bit, [512]) [WebApplication],
   Convert(nvarchar(10), [433]) [IFAbbrev],
   Convert(int, [437]) [DataID],
   Convert(nvarchar(1000), [463]) [Notes],
   Convert(nvarchar(512), [523]) [DataDescription],
   Convert(nvarchar(256), [27991]) [SpecialNote],
   Convert(bit, [28932]) [Inactive],
   Convert(int, [29992]) [PatchTestedBy]
FROM (
   SELECT
      E.ElementID + 0 ElementID,
      E.ElementTypeID,
      V.AttrID,
      V.Value
   FROM
      dbo.Element E
      LEFT JOIN dbo.Value V ON E.ElementID = V.ElementID
   WHERE
      EXISTS (
         SELECT *
         FROM dbo.LayoutUsage L
         WHERE
            E.ElementTypeID = L.ElementTypeID
            AND L.AttrLayoutID = 7
      )
) X
PIVOT (
   Max(Value)
   FOR AttrID IN ([3], [435], [439], [438], [464], [461], [4], [28744], [28745], [28746], [28747], [28748], [28749], [28794], [28795], [512], [433], [437], [463], [523], [27991], [28932], [29992])
) P;

Here's another type of automatically-generated view created by another stored procedure from special metadata to help find relationships between items that can have multiple paths between them (Specifically: Module->Server, Module->Cluster->Server, Module->DBMS->Server, Module->DBMS->Cluster->Server):

CREATE VIEW [dbo].[Link_Module_Server]
AS
-- This view is automatically generated by the stored procedure LinkViewCreate
SELECT
   ModuleID = A.ElementID,
   ServerID = B.ElementID
FROM
   Element A
   INNER JOIN Element B
      ON EXISTS (
         SELECT *
         FROM
            dbo.Element R1
         WHERE
            A.ElementID = R1.ElementID1
            AND B.ElementID = R1.ElementID2
            AND R1.ElementTypeID = 38
      ) OR EXISTS (
         SELECT *
         FROM
            dbo.Element R1
            INNER JOIN dbo.Element R2 ON R1.ElementID2 = R2.ElementID1
         WHERE
            A.ElementID = R1.ElementID1
            AND R1.ElementTypeID = 40
            AND B.ElementID = R2.ElementID2
            AND R2.ElementTypeID = 38
      ) OR EXISTS (
         SELECT *
         FROM
            dbo.Element R1
            INNER JOIN dbo.Element R2 ON R1.ElementID2 = R2.ElementID1
         WHERE
            A.ElementID = R1.ElementID1
            AND R1.ElementTypeID = 38
            AND B.ElementID = R2.ElementID2
            AND R2.ElementTypeID = 3122
      ) OR EXISTS (
         SELECT *
         FROM
            dbo.Element R1
            INNER JOIN dbo.Element R2 ON R1.ElementID2 = R2.ElementID1
            INNER JOIN dbo.Element C2 ON R2.ElementID2 = C2.ElementID
            INNER JOIN dbo.Element R3 ON R2.ElementID2 = R3.ElementID1
         WHERE
            A.ElementID = R1.ElementID1
            AND R1.ElementTypeID = 40
            AND C2.ElementTypeID = 3080
            AND R2.ElementTypeID = 38
            AND B.ElementID = R3.ElementID2
            AND R3.ElementTypeID = 3122
      )
WHERE
   A.ElementTypeID = 9
   AND B.ElementTypeID = 17

The Hybrid Approach

If you MUST have some of the dynamic aspects of an EAV database, you could consider creating the metadata as if you had such a database, but instead actually using the supertype/subtype design pattern. Yes, you would have to create new tables, and add and remove and modify columns. But with the proper pre-processing (like I did with my EAV database's Auto views) you could have real table-like objects to work with. Only, they wouldn't be as gnarly as mine and the query optimizer could push predicates down to the base tables (read: perform well with them). There would be just one join between the supertype table and the subtype table. Your application could be set to read the metadata to discover what it is supposed to do (or it can use the auto-generated views in some cases). This protects your application code from having to be touched extensively just to add or modify things.

Or, if you had a multi-level set of subtypes, just a few joins. By multi-level I mean when some subtypes share common columns, but not all, you could have a subtype table for those that is itself a supertype of a few other tables. For example, if you are storing information about Servers, Routers, and Printers, an intermediate subtype of "IP Device" could make sense.

I will give the caveat that I haven't yet built such a hybrid supertype/subtype EAV-metatable-decorated database to try out in the real world. But the problems I've experienced with EAV are not small, and doing something is probably an absolute must if your database is going to be large and you want good performance without some crazy expensive gigantic hardware.

In my opinion, the time spent automating the use/creation/modification of real subtype tables would ultimately be best. Focusing on flexibility driven by data makes the EAV sound so attractive (and believe me I love how when someone asks me for a new attribute on an element type I can add it in about 18 seconds and they can immediately start entering data on the web site). But flexibility can be accomplished in more than one way! Pre-processing is another way to do it. It's such a powerful method that so few people use, giving the benefits of being totally data-driven but the performance of being hard-coded.
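To give a feel for what pre-processing could look like, here is a toy sketch that turns subtype metadata into real subtype tables. It is not the AutoViewCreate procedure mentioned earlier. The SUBTYPE_METADATA dictionary, the Server and Printer subtypes, and the functions subtype_ddl and build_schema are all hypothetical, and SQLite via Python's sqlite3 stands in for SQL Server. Identifiers are spliced into the DDL as text, so the metadata must come from a trusted source.

```python
import sqlite3

# Hypothetical metadata: subtype name -> list of (column name, SQL type).
SUBTYPE_METADATA = {
    "Server": [("Hostname", "TEXT"), ("CpuCount", "INTEGER")],
    "Printer": [("PagesPerMinute", "INTEGER")],
}


def subtype_ddl(name, columns):
    """Build CREATE TABLE text for one subtype table from its metadata."""
    extra = "".join(f",\n    {col} {sql_type}" for col, sql_type in columns)
    return (
        f"CREATE TABLE {name} (\n"
        f"    {name}ID INTEGER PRIMARY KEY REFERENCES Element (ElementID)"
        f"{extra}\n)"
    )


def build_schema(conn, metadata):
    """Create the supertype table, then one real table per subtype in the metadata."""
    conn.execute(
        "CREATE TABLE Element (ElementID INTEGER PRIMARY KEY, Name TEXT NOT NULL)"
    )
    for name, columns in metadata.items():
        conn.execute(subtype_ddl(name, columns))


def column_names(conn, table):
    """List a table's column names in declaration order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
```

Adding an attribute would then mean editing the metadata and running a generated ALTER TABLE, which this sketch does not attempt.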

(Note: Yes those views really are formatted like that and the PIVOT ones really do have update triggers. :) If someone is really that interested in the awful painful details of the long and complicated UPDATE trigger, let me know and I'll post a sample for you.)

And One More Idea

Put all your data in one table. Give columns generic names and then reuse/abuse them for multiple purposes. Create views over these to give them sensible names. Add columns when a suitable-data-type unused column is not available, and update your views. Despite my going on at length about subtype/supertype, this may be the best way.
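A small sketch of that last idea, again with SQLite via Python's sqlite3 standing in for SQL Server. The generic column names Text1, Text2 and Int1, the ItemType discriminator, the view names, and the sample rows are all assumptions for illustration.

```python
import sqlite3

# Sketch of one wide table with generic columns plus per-type views (names are assumptions).
ONE_TABLE_DDL = """
CREATE TABLE Item (
    ItemID   INTEGER PRIMARY KEY,
    ItemType TEXT NOT NULL,
    Text1    TEXT,
    Text2    TEXT,
    Int1     INTEGER
);
-- Each view gives the generic columns sensible names for one item type.
CREATE VIEW ServerView AS
    SELECT ItemID AS ServerID, Text1 AS Hostname,
           Text2 AS OperatingSystem, Int1 AS CpuCount
    FROM Item
    WHERE ItemType = 'Server';
CREATE VIEW PrinterView AS
    SELECT ItemID AS PrinterID, Text1 AS Model, Int1 AS PagesPerMinute
    FROM Item
    WHERE ItemType = 'Printer';
"""


def make_one_table_db():
    """Create the table and views, then load one server and one printer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(ONE_TABLE_DDL)
    conn.executemany(
        "INSERT INTO Item VALUES (?, ?, ?, ?, ?)",
        [(1, "Server", "web-01", "Linux", 8),
         (2, "Printer", "LaserJet", None, 30)],
    )
    return conn
```

Queries then go through ServerView or PrinterView and never need to know that Text1 means Hostname for one type and Model for another.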