How to design a relational DB to capture the same entity as reported by multiple sources

database-design

I need a DB that can capture instruments as reported by a bank and two accounting firms. Let's take Walmart stock as an example. The bank may have an instrument named "Wal-Mart Stock". One of the accounting firms may have it named "Walmart Stock". The other accounting firm may have "Wal-mart common stock".

Here's the diagram that I have so far:

* Table contains five additional columns. These five column names are the same for MasterInstrument and DataSourceInstrument.

The DataSource represents the bank, accounting firms, or any other entity that can provide data.

The DataSourceInstrument represents the instrument data as provided by the DataSource.

MasterInstrument is a version of the instrument specific to the application itself. So, in our example, it may contain "Walmart Common Stock" for the instrument name. It differs from DataSourceInstrument because it is used extensively throughout the application and it must maintain a history of changes.

The Instrument table provides a way to know that several different data sources are referring to the same instrument. A DataSourceInstrument and a MasterInstrument with the same instrumentId refer to the same instrument. (Instrument.id is a surrogate key because some instruments do not have a natural key, and the DataSourceInstruments and a MasterInstrument may need to be manually mapped as referring to the same instrument.) So, using the Walmart example, we can know that the bank, the two accounting firms and the internal system are all referring to the same Walmart instrument.

The Issuer represents the issuer of the instrument such as "Wal-Mart Stores, Inc." Note: It only needs to be captured for MasterInstruments.
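As a rough sketch of the mapping described above, the following uses SQLite through Python so that it runs as-is. The `name` columns, the data source labels and the sample rows are assumptions made for illustration, and MasterInstrument and Issuer are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Instrument (id INTEGER PRIMARY KEY);  -- surrogate key only
CREATE TABLE DataSource (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE                      -- hypothetical column
);
CREATE TABLE DataSourceInstrument (
    id           INTEGER PRIMARY KEY,
    instrumentId INTEGER NOT NULL REFERENCES Instrument(id),
    dataSourceId INTEGER NOT NULL REFERENCES DataSource(id),
    name         TEXT NOT NULL                     -- name as reported by the source
);
""")

conn.execute("INSERT INTO Instrument (id) VALUES (1)")
conn.executemany(
    "INSERT INTO DataSource (id, name) VALUES (?, ?)",
    [(1, "Bank"), (2, "Accounting Firm A"), (3, "Accounting Firm B")],
)
conn.executemany(
    "INSERT INTO DataSourceInstrument (instrumentId, dataSourceId, name) "
    "VALUES (?, ?, ?)",
    [(1, 1, "Wal-Mart Stock"), (1, 2, "Walmart Stock"), (1, 3, "Wal-mart common stock")],
)

# Every source's name for the instrument whose surrogate id is 1.
rows = conn.execute("""
    SELECT ds.name, dsi.name
    FROM DataSourceInstrument AS dsi
    JOIN DataSource AS ds ON ds.id = dsi.dataSourceId
    WHERE dsi.instrumentId = ?
    ORDER BY ds.id
""", (1,)).fetchall()
```

Here the shared `instrumentId` is the only thing tying the three differently named rows together.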

I have a couple of doubts on this design:

Is it poor design to have a table (i.e. Instrument) with only a single surrogate column? If this is a problem, how would it be designed differently?
Are the various *Instrument table names confusing? If so, how could it be more clear?

Overall how should the design account for the need to capture multiple sources referring to the same instrument along with the need to capture an internal version of that instrument with history?

Update

The system's primary function / raison d'être is reporting. It must report, for a given date, what the internal instrument (i.e. MasterInstrument) data was on that date. This is why I have MasterInstrument as a history/temporal table and do not have the columns in Instrument.
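To make the reporting requirement concrete, here is a minimal sketch of an "as of" lookup against a history table, again using SQLite through Python. The `validFrom` and `name` columns, the dates and the helper `master_as_of` are all hypothetical; the real temporal columns may well differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Instrument (id INTEGER PRIMARY KEY);
CREATE TABLE MasterInstrument (
    instrumentId INTEGER NOT NULL REFERENCES Instrument(id),
    validFrom    TEXT NOT NULL,   -- hypothetical: ISO date this version took effect
    name         TEXT NOT NULL,   -- hypothetical attribute column
    PRIMARY KEY (instrumentId, validFrom)
);
INSERT INTO Instrument (id) VALUES (1);
-- Made-up history rows, for illustration only.
INSERT INTO MasterInstrument VALUES
    (1, '2010-01-01', 'Wal-Mart Common Stock'),
    (1, '2012-06-01', 'Walmart Common Stock');
""")

def master_as_of(instrument_id, as_of):
    """Return the MasterInstrument name in effect on ISO date as_of, or None."""
    row = conn.execute("""
        SELECT name
        FROM MasterInstrument
        WHERE instrumentId = ? AND validFrom <= ?
        ORDER BY validFrom DESC
        LIMIT 1
    """, (instrument_id, as_of)).fetchone()
    return row[0] if row else None
```

Because ISO dates sort correctly as text, the latest row whose `validFrom` is on or before the report date is the version that was in effect.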

Best Answer

Is it poor design to have a table (i.e. Instrument) with only a single surrogate column? If this is a problem, how would it be designed differently?

If that is the only constant property of the entity then that is fine, but I would assume that there are other invariant properties such as a name ("Walmart" in your example). Even if the same "instrument" can be referred to by different names by each source and issuer (those names then belong in the relation table) do they not have a canonical name?

Are the various *Instrument table names confusing? If so, how could it be more clear?

It depends a lot on how well you document them and/or the names and naming conventions in your target industry.

Table names should be as descriptive as possible (you can always shorten them in queries using aliases), avoiding generic words where possible. Where a generic name is needed because anything more specific would be too narrow, I would suggest the simplest generic word that fits (item, object, thing) rather than instrument, but that is personal preference.

For junction tables (relation tables, or whatever you call them, as there are several common names for the same concept) I recommend that the name reflect the entity types in the relationship - so IssuerInstrument rather than MasterInstrument in your diagram.

Related Solutions

Ms-access – Database Relational Design for Multiple Categories

I second @FrustratedWithFormsDesigner's suggestions for a unique constraint. See http://office.microsoft.com/en-us/access-help/create-a-constraint-adp-HP003088257.aspx for how to do this in MS Access.

What a Unique Constraint does is state that no two rows can have duplicate data on some portion of the row. So presumably you want each patient to be in a study only once, so you'd want a unique constraint on PatientsInStudy(participantID, study).

This way each individual can only be given one study ID. Similarly, in that design your PatientsInStudy table would have a primary key spanning all three fields, and a second unique constraint on (StudyID, study_code). This way, even if two studies end up with similar semantics or overlapping spaces, the key cannot be reused in the same study.
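A small sketch of how such constraints behave, using SQLite through Python rather than MS Access so that it runs as-is. The table layout is an assumption pieced together from the column names mentioned above (with `study` standing in for StudyID), and `try_insert` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE PatientsInStudy (
    participantID INTEGER NOT NULL,
    study         TEXT    NOT NULL,
    study_code    TEXT    NOT NULL,
    UNIQUE (participantID, study),   -- a patient appears in a study only once
    UNIQUE (study, study_code)       -- a code is never reused within a study
)""")
conn.execute("INSERT INTO PatientsInStudy VALUES (1, 'A', 'A-001')")

def try_insert(row):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO PatientsInStudy VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

same_patient_again = try_insert((1, "A", "A-002"))  # violates (participantID, study)
reused_code = try_insert((2, "A", "A-001"))         # violates (study, study_code)
other_study = try_insert((1, "B", "B-001"))         # allowed: different study
```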

Otherwise, good.

Sql-server – Design concerns for using 3x bit or one char(1) or one integer in table for holding status of item

In my experience, trying to encode multiple data points into a single column always ends up being more trouble than it's worth. Sure, it seems cool and clever to use BITWISE operators, but there are many things that go wrong and it won't always be efficient to test those bits without cumbersome and unintuitive workarounds. It's the same reason we stay away from storing comma-separated lists, JSON strings etc. in a single column - eventually you care about viewing or filtering on those distinct bits which you now have to extract, sometimes expensively.

With the information I have, my vote is for three separate BIT columns. They will still collapse to similar storage patterns as a single column with the three bits on/off, and can be made more efficient individually and across the board in several ways, including:

data compression
sparse columns
filtered indexes (e.g. WHERE allow_returns = 1)

Someone else advocated for three CHAR(1) columns. These do not benefit from storage collapse and also require a check constraint, making them less than ideal in my mind.
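For illustration, a rough sketch of the three-column layout with a filtered index, run against SQLite through Python. SQLite has no BIT type, so INTEGER with a CHECK stands in for it here, and SQLite's partial index is used as the analogue of a SQL Server filtered index. Only `allow_returns` comes from the text above; the other column names and the sample rows are hypothetical. Data compression and sparse columns are SQL Server features and are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Items (
    ItemID        INTEGER PRIMARY KEY,
    allow_returns INTEGER NOT NULL CHECK (allow_returns IN (0, 1)),
    is_taxable    INTEGER NOT NULL CHECK (is_taxable IN (0, 1)),  -- hypothetical
    is_active     INTEGER NOT NULL CHECK (is_active IN (0, 1))    -- hypothetical
);

-- Partial index: only rows with allow_returns = 1 are indexed.
CREATE INDEX ix_items_allow_returns ON Items (ItemID)
    WHERE allow_returns = 1;

INSERT INTO Items VALUES (1, 1, 0, 1), (2, 0, 1, 1), (3, 1, 1, 0);
""")

# Each flag is filtered by name, with no bit masking needed.
returnable = [row[0] for row in conn.execute(
    "SELECT ItemID FROM Items WHERE allow_returns = 1 ORDER BY ItemID")]
```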

Now, my answer might change if you say, "well, what if I might add 15 other attributes in the future?" I certainly don't think it's wise to build the columns this way if they're not relatively static - changing the schema (and therefore all of the code and interfaces to it) for every new or changed attribute is going to be a royal pain. So in that case you might want to consider EAV - where the attributes are not part of the metadata but part of the data. There are a lot of objections to EAV, mostly around performance and the difficulty in enforcing constraints (in this case unlikely to be an issue if all of these attributes are either on or off), but it worked quite well for us at my previous job. You might model it like this:

CREATE TABLE dbo.Attributes
(
  AttributeID TINYINT PRIMARY KEY,
  Name VARCHAR(32) NOT NULL UNIQUE
);

-- Assumes a dbo.Items table keyed on ItemID already exists.
CREATE TABLE dbo.ItemAttributes
(
  ItemID INT NOT NULL
    FOREIGN KEY REFERENCES dbo.Items(ItemID),
  AttributeID TINYINT NOT NULL 
    FOREIGN KEY REFERENCES dbo.Attributes(AttributeID),
  Status BIT NOT NULL,
  PRIMARY KEY(ItemID, AttributeID)
);

And again, you can have filtered indexes here to make certain queries much more efficient, such as (imagine the AttributeID for "allow returns" is 10):

CREATE INDEX optAllowReturns ON dbo.ItemAttributes(ItemID)
  WHERE AttributeID = 10 AND Status = 1;

If you have certain attributes that are not on/off (for example, three states of manufacture or shipping), you can change the Status column to:

Value TINYINT NOT NULL

This can double as an on/off value for attributes that are represented that way, and as tri- or more-state value for attributes that require more than simple on/off. You can also reflect which type is which in the metadata of the dbo.Attributes table.
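A minimal sketch of that variation, using SQLite through Python so it can be run directly. The `MaxValue` metadata column, the "shipping state" attribute and the sample rows are assumptions made for illustration, and the foreign key to the items table is omitted to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Attributes (
    AttributeID INTEGER PRIMARY KEY,
    Name        TEXT NOT NULL UNIQUE,
    MaxValue    INTEGER NOT NULL DEFAULT 1  -- hypothetical metadata: 1 means on/off
);
CREATE TABLE ItemAttributes (
    ItemID      INTEGER NOT NULL,
    AttributeID INTEGER NOT NULL REFERENCES Attributes(AttributeID),
    Value       INTEGER NOT NULL,           -- replaces the Status flag
    PRIMARY KEY (ItemID, AttributeID)
);
INSERT INTO Attributes VALUES (10, 'allow returns', 1), (11, 'shipping state', 2);
INSERT INTO ItemAttributes VALUES (1, 10, 1), (1, 11, 2), (2, 10, 0), (2, 11, 1);
""")

# Multi-state attribute: read the stored state for every item.
shipping = conn.execute("""
    SELECT ia.ItemID, ia.Value
    FROM ItemAttributes AS ia
    JOIN Attributes AS a ON a.AttributeID = ia.AttributeID
    WHERE a.Name = 'shipping state'
    ORDER BY ia.ItemID
""").fetchall()

# The metadata can be used to spot values outside an attribute's allowed range.
out_of_range = conn.execute("""
    SELECT ia.ItemID, ia.AttributeID
    FROM ItemAttributes AS ia
    JOIN Attributes AS a ON a.AttributeID = ia.AttributeID
    WHERE ia.Value > a.MaxValue
""").fetchall()
```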
