I personally would use a model similar to the following:
The product table would be pretty basic, your main product details:
create table product
(
  part_number int primary key,
  name varchar(10),
  price int
);
insert into product values
(1, 'product1', 50),
(2, 'product2', 95.99);
Second, the attribute table stores each of the different attributes:
create table attribute
(
  attributeid int primary key,
  attribute_name varchar(10),
  attribute_value varchar(50)
);
insert into attribute values
(1, 'color', 'red'),
(2, 'color', 'blue'),
(3, 'material', 'chrome'),
(4, 'material', 'plastic'),
(5, 'color', 'yellow'),
(6, 'size', 'x-large');
Finally, create the product_attribute table as the join table that associates each product with its attributes:
create table product_attribute
(
  part_number int,
  attributeid int,
  foreign key (part_number) references product (part_number),
  foreign key (attributeid) references attribute (attributeid)
);
insert into product_attribute values
(1, 1),
(1, 3),
(2, 6),
(2, 2),
(2, 6);
Depending on how you want to use the data, you are looking at two joins:
select *
from product p
left join product_attribute t
on p.part_number = t.part_number
left join attribute a
on t.attributeid = a.attributeid;
See SQL Fiddle with Demo. This returns data in the format:
PART_NUMBER | NAME     | PRICE | ATTRIBUTEID | ATTRIBUTE_NAME | ATTRIBUTE_VALUE
________________________________________________________________________________
          1 | product1 |    50 |           1 | color          | red
          1 | product1 |    50 |           3 | material       | chrome
          2 | product2 |    96 |           6 | size           | x-large
          2 | product2 |    96 |           2 | color          | blue
          2 | product2 |    96 |           6 | size           | x-large
But if you want to return the data in a PIVOT format, where you have one row per product with all of the attributes as columns, you can use conditional aggregation -- in MySQL, an aggregate wrapped around IF() (or an equivalent CASE expression):
SELECT p.part_number,
p.name,
p.price,
MAX(IF(a.ATTRIBUTE_NAME = 'color', a.ATTRIBUTE_VALUE, null)) as color,
MAX(IF(a.ATTRIBUTE_NAME = 'material', a.ATTRIBUTE_VALUE, null)) as material,
MAX(IF(a.ATTRIBUTE_NAME = 'size', a.ATTRIBUTE_VALUE, null)) as size
from product p
left join product_attribute t
on p.part_number = t.part_number
left join attribute a
on t.attributeid = a.attributeid
group by p.part_number, p.name, p.price;
See SQL Fiddle with Demo. Data is returned in the format:
PART_NUMBER | NAME     | PRICE | COLOR | MATERIAL | SIZE
_________________________________________________________
          1 | product1 |    50 | red   | chrome   | null
          2 | product2 |    96 | blue  | null     | x-large
As you can see, the data might be in a better format for you, but if you have an unknown number of attributes it will easily become untenable due to hard-coding the attribute names, so in MySQL you can use prepared statements to create dynamic pivots. Your code would be as follows (See SQL Fiddle With Demo):
SET @sql = NULL;
SELECT
GROUP_CONCAT(DISTINCT
CONCAT(
'MAX(IF(a.attribute_name = ''',
attribute_name,
''', a.attribute_value, NULL)) AS ',
attribute_name
)
) INTO @sql
FROM attribute;
SET @sql = CONCAT('SELECT p.part_number
, p.name
, ', @sql, '
from product p
left join product_attribute t
on p.part_number = t.part_number
left join attribute a
on t.attributeid = a.attributeid
GROUP BY p.part_number
, p.name');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
This generates the same result as the second version with no need to hard-code anything. While there are many ways to model this I think this database design is the most flexible.
There is nothing wrong with nullable columns, even a lot of them, if that is what your data domain calls for.
On the other hand, the fact that you say these columns can be grouped logically implies to me that something else might be going on.
If they can be grouped logically because different sets of columns apply to different sets of rows, then you might have an entity-subtype situation.
Conversely, if columns can be grouped because they apply at different times, then you may have a normalization issue. For example, if your columns are something like "January Sales", "February Sales", etc. these should be rows in a child table.
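For example, a minimal sketch of that child-table shape (the table and column names here are purely illustrative, not taken from your schema):
create table monthly_sales
(
  product_id int,
  sales_month date,
  sales_amount decimal(10,2),
  primary key (product_id, sales_month)
);
Each month becomes a row, so adding a new month means inserting data rather than altering the table.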
While there is nothing innately wrong with nullable columns, neither is there anything wrong with joining. It's what RDBMS does for a living.
UPDATE:
Given additional information about the logical groups of columns:
There are two kinds of sub-typing that can be represented in a database using 1:1 relationships. If the logical groups were mutually exclusive, then the parent entity could have what is known as a partitioning attribute that tells you which one of the subtypes is applicable. However, without a partitioning attribute, it is possible to have zero, one or even multiple subtypes being applicable at the same time.
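A minimal sketch of the mutually exclusive flavor (all names here are placeholders): the partitioning attribute lives on the parent, and each subtype table shares the parent's key via a 1:1 relationship.
create table vehicle
(
  vehicle_id int primary key,
  vehicle_type char(1) not null  -- partitioning attribute: e.g. 'C' = car, 'T' = truck
);
create table car
(
  vehicle_id int primary key,
  trunk_volume int,
  foreign key (vehicle_id) references vehicle (vehicle_id)
);
create table truck
(
  vehicle_id int primary key,
  payload_capacity int,
  foreign key (vehicle_id) references vehicle (vehicle_id)
);
Without the vehicle_type column, the parent row gives no hint of which (if any) subtype rows exist, which is the second situation described above.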
The same fundamental question applies then to what you do with this situation.
A good way to resolve it would be to look at the logical groups of columns. Are the columns in logical group A the same as in logical group B - or are they totally different? If they are different, they might be best modeled in a single table with nullable fields. If they are the same, then this might be a clue that they should be multiple child rows instead.
Another clue to look at is whether it makes sense that a logical group of columns could take on a life of its own and start attracting relationships from other tables. If logical group B might sometime soon find itself with multiple child records from another table, then it might be a sign that it makes sense to promote that group to its own subtype style table.
One last thing to consider is physical implementation. If a logical subgroup is very sparsely populated, you might be able to make a case for segregating these columns into another table to optimize physical storage. This step shouldn't be done proactively. This kind of optimization should be done when performance testing proves it is necessary.
If none of these things are true, then you are probably best off leaving the nullable columns in the original table.
Best Answer
Insofar as this Question is a continuation of Is my implementation of type/subtype design pattern (for mutually exclusive subclasses) correct?, which is itself a continuation of Don't know how to transform variable entity into relational table, I would ask: what exactly are you trying to optimize? Storage? The object model? Query complexity? Query performance? There are trade-offs when optimizing one aspect vs another since you can't optimize all aspects at the same time.
I completely agree with Remus's points.
That said, the choice you face is between the following, arranged in order of least normalization to most normalization (a sketch of the assumed starting layout follows this list):

- moving property E to the base-type Table
- keeping property E in each sub-type Table
- normalizing E out to a new, intermediary sub-class table on the same level as C, that A and B will directly be sub-classes of (@MDCCL's answer)
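For orientation, here is my reading of the layout under discussion, reconstructed from the names used below (Foo, FooTypeCode, D, E, A, B, C); the column types are placeholders and this is not the O.P.'s exact DDL. E currently lives only in the A and B sub-type tables:
create table Foo
(
  FooID int primary key,
  FooTypeCode char(1) not null,  -- which sub-type the row is: A, B, or C
  D int                          -- a property common to all sub-types
);
create table A
(
  FooID int primary key,
  E int,                         -- the property shared by A and B only
  foreign key (FooID) references Foo (FooID)
);
create table B
(
  FooID int primary key,
  E int,
  foreign key (FooID) references Foo (FooID)
);
create table C
(
  FooID int primary key,         -- C-specific properties go here; C has no E
  foreign key (FooID) references Foo (FooID)
);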
Let's look at each option:

Move property E to base-type Table

PROs

- Reduced query complexity for queries that need E but not X, Y, or Z.
- Potentially more efficient for queries that need E but not X, Y, or Z (especially aggregate queries) due to no JOIN.
- Ability to create an index on (D, E) (and if so, potentially a Filtered Index on (D, E) where EntityType <> C, if such a condition is allowed)

CONs

- Could not mark E as NOT NULL
- Need an extra CHECK CONSTRAINT on the base-type table to ensure that E IS NULL when EntityType = C (though this is not a huge issue; a sketch follows this list)
- Need to educate everyone using the data model that E must be NULL, and should even be ignored entirely, when EntityType = C.
- Wastes some space when E is a fixed-length type, a large portion of the rows are for EntityType of C (i.e. not using E hence it is NULL), and you are not using either the SPARSE option on the column or Data Compression on the Clustered Index
- Potentially slower for queries that do not need E, since the presence of E in the base-type table will increase the size of each row, which in turn decreases the number of rows that can fit on a data page. But this is highly dependent on the exact datatype of E, the FILLFACTOR, how many rows are in the base-type table, etc.
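A minimal sketch of this first option, reusing the placeholder names from above (the CHECK constraint name, the filtered index, and its syntax are SQL Server-style assumptions, not the O.P.'s actual DDL):
create table Foo
(
  FooID int primary key,
  FooTypeCode char(1) not null,
  D int,
  E int null,  -- must stay NULLable because sub-type C never uses it
  constraint CK_Foo_E_only_for_A_and_B
    check (FooTypeCode <> 'C' or E is null)
);
create index IX_Foo_D_E
  on Foo (D, E)
  where FooTypeCode <> 'C';  -- the filtered-index variant mentioned in the PROs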
Keep property E in each sub-type Table

PROs

- Cleaner data model (i.e. no need to explain that E in the base-type table shouldn't be used because "it really isn't there")
- Able to mark E as NOT NULL if this is a required property of the entity
- No need for an extra CHECK CONSTRAINT on the base-type table to ensure that E IS NULL when EntityType = C (though this is not a huge gain)
CONs

- Potentially more complex / slower queries for those needing E, due to the JOIN, depending on how many rows of A + B you have as opposed to how many rows of C there are.
- More complex queries when you need to treat A and B (and not C) as being the same "type". Of course, you could abstract this via a View that does a UNION ALL between a SELECT of the JOINed tables for A and another SELECT of the JOINed tables for B (a sketch of such a view follows this list). That will reduce complexity of SELECT queries but not so helpful for INSERT and UPDATE queries.
- Cannot create an index on (D, E), even if doing so would really help one or more frequently used queries, since D and E live in separate tables and hence cannot be indexed together.
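A sketch of the abstraction view mentioned in the second bullet, again with placeholder names and an illustrative column list:
create view FooWithE
as
  select f.FooID, f.FooTypeCode, f.D, a.E
  from Foo f
  inner join A a on a.FooID = f.FooID
  union all
  select f.FooID, f.FooTypeCode, f.D, b.E
  from Foo f
  inner join B b on b.FooID = f.FooID;
SELECTs against FooWithE no longer care which sub-type a row came from, but INSERTs and UPDATEs still have to target the correct underlying tables.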
Normalize E out to intermediary Table between base-class and A & B

(Please note that I do like @MDCCL's answer as a viable alternative, depending on circumstances. The following is not meant as a strict criticism of that approach, but as a means of adding some perspective -- mine, of course -- by evaluating it in the same context as the two options I had already proposed. This will make it easier to clarify what I see as the relative difference between full normalization and the current approach of partial normalization.)

PROs

- A single place to get E for both A and B, but not C (i.e. no need for two queries joined via UNION ALL)
CONs

- slightly more space taken up (the Bar table, sketched after this list, duplicates the ID, and there is a new column, BarTypeCode) [negligible, but something to be aware of]
- an extra JOIN is needed to get to either A or B
- slightly more work for each INSERT (DELETE can be handled implicitly via marking Foreign Keys as ON DELETE CASCADE) since the Transaction will be held open slightly longer on the base-class table (i.e. Foo) [negligible, but something to be aware of]
- no direct knowledge of actual type -- A or B -- within the base-class Table, Foo; it only knows of type Bar which can be either A or B:

  Meaning, if you need to do queries over the general base info but need to either categorize by the entity type or filter out one or more entity types, then the base-class table doesn't have enough information, in which case you need to LEFT JOIN the Bar table. This will also reduce the effectiveness of indexing the FooTypeCode column.

- no consistent approach to interacting with A & B vs C:

  Meaning, if each entity relates directly to the base-class table such that there is only ever that one JOIN to get the full entity, then everyone can more quickly and easily build up familiarity in terms of working with the data model. There will be a common approach to queries / Stored Procedures which makes them quicker to develop and less likely to have bugs. A consistent approach also makes it quicker and easier to add new sub-types in the future.

- potentially less adaptable to business rules that change over time:

  Meaning, things always change, and it is fairly easy to move E up to the base-class Table if it becomes common to all sub-types. It is also easy enough to move a common property out to the sub-types if changes in the nature of the entities make that a worth-while change. It is easy enough to either break a sub-type into two sub-types (just create another SubTypeID value) or to combine two or more sub-types into one. Conversely, what if E later on became a common property of all sub-types? Then the intermediary layer of the Bar table would be meaningless, and the added complexity would not be worth it. Of course, it is impossible to know if such a change would happen in 5 or even 10 years, so the Bar table is not necessarily, nor even highly likely to be, a bad idea (which is why I said "potentially less adaptable"). These are just points to consider; it's a gamble in either direction.

- potentially inappropriate grouping:

  Meaning, just because the E property is shared between entity types A and B does not mean that A and B should be grouped together. Just because things "look" the same (i.e. same properties) does not mean that they are the same.
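To make the shape of this third option concrete, here is a rough sketch using the Foo / Bar / BarTypeCode names from the bullets above (column types and anything beyond those names are assumptions):
create table Bar
(
  FooID int primary key,         -- same ID as Foo: 1:1 with the base row
  BarTypeCode char(1) not null,  -- which of the two sub-types: A or B
  E int not null,
  foreign key (FooID) references Foo (FooID)
);
create table A
(
  FooID int primary key,         -- A-specific properties go here
  foreign key (FooID) references Bar (FooID)  -- A now hangs off Bar, not Foo
);
create table B
(
  FooID int primary key,         -- B-specific properties go here
  foreign key (FooID) references Bar (FooID)
);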
Summary

Just like deciding if/when to denormalize, how to best approach this particular situation depends on considering the following aspects of the usage of the data model and making sure that the benefits outweigh the costs:
- how much of the data will actually make use of E
- which queries need E and how often they will be executed
- which queries do not need E and how often they will be executed
I think I tend to default to keeping E in the separate sub-type tables because it is, at the very least, "cleaner". I would consider moving E to the base-type table IF: most of the rows were not for EntityType of C; and the number of rows was at least in the millions; and I more-often-than-not executed queries that needed E and/or the queries that would benefit from an index on (D, E) either execute very frequently and/or require enough system resources such that having the index reduces overall resource utilization, or at least prevents surges in resource consumption that go above acceptable levels or last long enough to cause excessive blocking and/or increases in deadlocks.

UPDATE
O.P. commented on this answer, describing a change that has since been made to their model.
This change is particularly important because it is exactly what I predicted might happen in the "CONs" subsection of the "Normalize E out to intermediary Table between base-class and A & B" section above (6th bullet point). The specific issue is how easy / difficult it is to refactor the data model when such changes happen (and they always do). Some will argue that any data model can be refactored / changed, so start with the ideal. But while it is true on a technical level that anything can be refactored, the reality of the situation is a matter of scale.

Resources are not infinite, not just CPU / Disk / RAM, but also development resources: time and money. Businesses are constantly setting priorities on projects because those resources are very limited. And quite often (at least in my experience), projects to gain efficiency (even both system performance as well as faster development / fewer bugs) are prioritized below projects that increase functionality. While it is frustrating for us technical folks because we understand what the long-term benefits of refactoring projects are, it is just the nature of business that the less-technical, business folks have an easier time seeing the direct relationship between new functionality and new revenue. What this boils down to is: "we will come back to fix that later" == "that problem will probably be there for the next 5 - 10 years because we will nearly always have more important things to work on (ironically, such as the support cases that keep coming up because we have not fixed it yet)".
With that in mind, if the size of the data is small enough such that changes can be made very quickly, and/or you have a maintenance window that is long enough to not only make the changes but to also roll back if something goes wrong, then normalizing E out to an intermediary Table between the base-class table and the A & B sub-class tables could work (though that still leaves you with no direct knowledge of the specific type (A or B) in the base-class table). BUT, if you have hundreds of millions of rows in these tables, and an incredible amount of code referencing the tables (code that has to be tested when changes are made), then it usually pays to be more pragmatic than idealistic. And this is the environment that I had to deal with for years: 987 million rows & 615 GB in the base-class table, spread across 18 servers. And so much code hit these tables (base-class and sub-class tables) that there was a lot of resistance -- mostly from management but sometimes from the rest of the team -- to making any changes due to the amount of development and QA resources that would need to be allocated.

So, once again, the "best" approach can only be determined situation-by-situation: you need to know your system (i.e. how much data and how the tables and code all relate), how to accomplish the refactoring, and the people that you work with (your team and possibly management -- can you get their buy-in for such a project?). There are some changes that I had been mentioning and planning for 1 - 2 years, and took multiple sprints / releases to get maybe 85% of them implemented. But if you only have < 1 million rows and not a lot of code tied to these tables, then you are probably able to start out on the more ideal / "pure" side of things.
Just remember, whichever way you choose to go, pay attention to how that model works over the next 2 years at least (if possible). Pay attention to what worked and what caused pain, even if it seemed like the greatest idea at the time (which means you also need to allow yourself to accept screwing up -- we all do -- so that you can honestly evaluate pain-points). And pay attention to why certain decisions worked or didn't so that you can make decisions that are more likely to be "better" next time :-).