I've been working on the following problem for a long time and I can't decide between two solutions. While both would work to solve the problem, neither "feels" exactly right to me.
Any advice on this would be greatly appreciated. Also note that I'm not particularly great with DB design so any feedback is appreciated.
I have the following table:
ProductYearDistrictCrop (+-2 million records)
oid | productId | yearId | districId | cropId | payDate |
1 | 1 | 1 | 1 |1 | 2018-01-01 |
2 | 1 | 1 | 1 |1 | 2018-02-01 |
3 | 2 | 1 | 1 |1 | 2018-03-01 |
4 | 2 | 1 | 1 |1 | 2018-04-01 |
Some background info on the table:
A record is created for each possible combination of Product – Year – District – Crop and there is a unique constraint on those Foreign Keys (the combination must be unique). I opted to use Oid as the PK instead of a Composite Key because Oid can then be used by other tables as a FK and I assumed a lookup on the Oid would be quicker than a composite key.
For each of these records, there are additional options that can be picked from. These options come from the below table
Options (16 records)
oid | description | systemDefaultValue
1 | Option 1 | 0.5
2 | Option 2 | 2.5
3 | Option 3 | 1.5
4 | Option 4 | 1.5
For each of these choices I need to store a custom value or just reuse the systemDefaultValue per record in ProductYearDistrictCrop
The two possible solutions I come up with are
Solution A : Simple add each possible option as column to the table
oid | productId | yearId | districId | cropId | payDate | Option1 | Option2 |
1 | 1 | 1 | 1 |1 | 2018-01-01 | 0.5 | 2.5 |
2 | 1 | 1 | 1 |1 | 2018-02-01 | 0.5 | 2.5 |
3 | 2 | 1 | 1 |1 | 2018-03-01 | 0.5 | 2.5 |
4 | 2 | 1 | 1 |1 | 2018-04-01 | 0.5 | 2.5 |
Which means we'll end up with 16 extra columns in ProductYearDistrictCrop and of course, each time a new option is added to the system, I need to update this table as well. While this doesn't happen often, it can happen.
The other consideration is that there are other tables that will also follow a similar design so that means I would have to update several tables each time a new option is added.
Solution B : Use what I think is an EAV approach for the options by creating a linking table
ProductYearDistrictCropOptionChoices
ProductYearDistrictCrop | optionId | customValue |
1 | 1 | 0.5 |
2 | 2 | 1.5 |
3 | 3 | 100 |
4 | 4 | 25 |
In this scenario I end up with a +- 32 million records. The obvious upside is that no db changes are required each time a new option is added. Also I'm working with worst case scenario so it's not that each record from ProductYearDistrictCrop WILL have 16 records linked to it.
My Feedback
My gut feeling to pick Solution A and add the columns for each option. I can then do a single lookup when I need the data and get everything I need from that row or sets of rows.
The other reason I prefer Solution A is that to my understanding attributes should be columns in a database. It's just that another table is already storing those attributes as options to be used by the system when building UIs and providing hte user with default options and values.
I've been wrestling with this for a long time and would just like to get some external advice and insights. Also it would help to get more opinions as I would like to justify my choice with a stronger reasoning than "I feel this one is good and that one is icky"
Best Answer
Richard,
I really think this is a great question you have posed. My answer got longer than I anticipated so in summary I have 3 possible solutions based on what you asked. If I had to rank my preference I would do:
EVA
tableEVA Approach
Thanks to some insight from Jon the
EVA
is certainly a very viable approach with proper indexing using aPIVOT
For the sake of the example on the
PIVOT
(since I was not familiar with it) I created a simpleEAV
table.I then populated it with 16 unique
ID
's and each configured for 16 differentOptionID
's. The value is populated with the product ofID
andOptionID
If those two values are the same then no value was provided (If ID <= 6 a value ofNULL
was provided, else no row was added (so that we got examples for both aNULL
value and a missing row)).The table can then be queried with a
PIVOT
command. (As I understand it thePIVOT
requires that we use anAggregate
functionSUM
,MIN
,MAX
, etc.... If there is a constraint added so that theID-OptionID
combination is unique we shouldn't get any incorrect outputs. I just arbitrarily choseSUM
)I am not familiar with how well the
PIVOT
function performs but it is a built-in function so I would imagine performance is pretty good.This give you a strong blend of:
VIEW
orCTE
for easy of use. When a newOption
is added you would simply add an additional Column in the final select[17] AS Option17
and Add, [17]
in theFOR [OptionID](...)
piece of thePIVOT
).CASE
statement.EAV
tables are really narrow, so as long as the table is indexed well performance should be really good.Extension Table Approach
When It Comes To The Additional Columns Approach most people's concern is that it isn't normalized out. I hear a comment one time about "Normalize until it hurts. De-Normalize until it works". (There is actually a DBA Stack Exchange topic which covered this idea about how far to go with
Normalization
). With all that in mind if this is going to be a highly utilized part of the database than this may be a candidate where taking a slightly de-normalized design, for performance reasons may be what you want to do.I would not however recommend adding the columns onto your existing table. The wider the table is, the less data you can fit onto the same page which means that everything else using this table will take a performance degradation. One way around that is creating an "Extension"
table
which doesn't do anything but hold these values. You would simplyJOIN
to this one table to get all 16 options (whether they are always written, or the columns areNULLABLE
and only hold have a value if it differs from the default). When an additionalOption
is added, you would just add a new column to this table.This table, space-wise would be really inefficient. But we would be sacrificing that space, for
SELECT
query performance in both instances whereProductYearDistrictCrop
is used without this new extension table and whenProductYearDistrictCrop
is used in unison with it.Multiple Child Table Approach
You can get around the poor space utilization with multiple
child tables
I was reading an article the other day reference in a different DBA Stack Exchange Question on NULLS. This Article suggested that multiple child tables holding values can be used to designate this. Where each
option
has its ownchild table
which holds this data. You would end up creating 16 version of something like:You would then reference everything by using something like
This version is:
ProductYearDistrictCrop
widerProductYearDistrictCrop.Oid
is different than the default. Which allows you to save a lot of spacetable
following the same format which is then supported with code updates in whateverfunctions
/queries
/stored procedures
.That is probably one of the harder ones to support when it comes to new options coming. It will also require the most amount of work in your initial development. But a
VIEW
can probably be used to get around most of that. However I think space and performance wise it will give a lot of benefits, which should be the primary priority when it comes to a database, particularly if you are already in a 2 Million row world with this one table.