Database Normalization Review and Suggestions

database-designerdnormalization

  1. User:
    User_ID,user_name,E-mail

  2. Staff: User_ID,Fname,Lname,DOB

  3. Student: User_ID fname Lname DOB

  4. Lecture: User_ID

  5. Admin: User_ID

  6. Position ID : P_ID,ROLE_NAME,

  7. Staff_position :Staff_position_ID,P_ID,user_ID

  8. Programme: Programme_ID P_code P_name

  9. Course: Course_ID**Course_number** C_name C_level User_ID**

  10. Co-programme: Co-programme_id,Course_ID,Programme_ID

  11. Attendence:Attendance_ID,date,time

  12. Attendance_course : Attendance_course_ID,Attendance_ID,Course_ID,date,time

  13. Resources Resource_ID Type,size,Course_ID

  14. Week_resources: Week_resources_ID,Week_number,term,resources_ID

  15. Weekly_upload: Weekly_upload_ID, size,date, time,week_number

  16. Message: Message_ID,user_ID,Body,date,time

Best Answer

Firstly, I must admit that I have been a little disingenuous, I'm afraid: My comment on your question was merely a ruse to trick you into explaining your schema to a degree which would minimize the amount of guesswork that I ( or any other member ) would have to shoulder in attempting to give you feedback of any reasonable quality on your work thus far. You expertly evaded my trap however, leaving me with no recourse but to rebuild your schema from the ground up, along the way explaining what I think you've overlooked or otherwise omitted, in the context of what I perceive your problem to be.

Secondly, with little to no explanation of the notation you are using, it is difficult to immediately discern the relationships, either data-wise or real, between the entities in your table list: Attributes such as Position_ID show up in the Lecture and Admin tables using the ID notation, indicating a key of some kind, but with no obvious source; the description for Course is littered with *'s without any indication as to why, further going on to include a User_ID attribute - This latter curiosity is particularly confusing and leads me to believe the model isn't normalized to any sufficient level at all, though I can hypothesize a ( short ) number of reasons why this relationship exists; the purpose of Co-Programme continues to be evasive; just looking at Consist of causes my brow to furrow violently - Unlike the User_ID in Course conundrum, I cannot even begin to fathom what is going on there.

So now that I've gotten all of that off my chest, we'll start by reverse engineering the structure you've provided into a word problem ( which is what I was trying to get out of you in the beginning ):

In the context of an e-learning system, we require a data model which can track our students as they participate in various courses, earning credits towards their programs. Online courses are administrated by our staff, whom have expressed a desire to allow both themselves and students to upload various digital resources ( such as practice exams in the former case, or homework submission in the latter ) at varying levels of accessibility. An adjudicator, for instance, would necessarily need to access the homework submissions of a class they preside over, whereas the fellow students of a submitter would necessarily not be able access such a resource. We would also like to see some form of audit trail on the access of these resources ( though we don't really know why that's important ).

As @ETL points out in his comment, the very first thing to consider is that candidate keys make piss-poor primary keys, which are far better left to the database for auto-generation. To be a little more clear on this point, despite frequent confusion between the two, there's no such thing as a primary key in relational algebra - this construct is instead referred to as a minimal superkey in model theory. With that in mind, I'll be applying alternate keys ( also known as unique indexes or unique constraints ) on the minimal superkeys of our tables and letting the primary key be a nice, thin data type like INTEGER. This practice will save you a lot of headache when foreign keys are involved and the data model needs updating.

"So, Mr. Words-A-Lot, are you going to help me or what?" or some such similar phrase may have indeed crossed your mind several times at this point. Notice the description of your problem that I created above has a number of bold words. These words, henceforth called nouns, in one way or another, are going to form the basic structure of your schema. Some of them have already been realized in your original list, so some of this work has already been done, but for completeness ( and to illustrate good OOAD via identifying objects ), we will touch on them anyway. Without further ado, here's the general idea of something relatively close to what you'd eventually end up with to solve our little blurb above:

Online Education

Starting with relational model theory, meaning you can forget about things such as primary keys for now, "Noun-Verb" analysis is as follows:

A Student ( S ) accesses your online system with a ( has a ) Username ( UN ) and a Staff member ( which I'll be calling an Adjudicator ( A ), for my purposes ) has a Username as well. Through this similarity, I can establish that there is a User ( U ) entity that has a Username ( UN ). ( A: B, C ), meaning entity A is a "container" ( or a collection, later called a table ) with B, C as "attributes" ( or atomic traits, later called columns ). To make things a little easier on me, we'll expand this notation to also include the minimal superkey in the notation, denoting this ( eventually alternate key ) of a particular container with a subscript Un, where n is merely used to indicate attributes sharing an equal n are part of the same key in that container, separating them from the possible inclusion of an independently unique key, Um, should such a condition be necessary ( protip: It won't be, so don't worry about it ).

User ( U ) has a Username ( UN ), or ( U: UNU1 )

Based on the schema originally presented, I assume these Users ( U ) sign up through email correspondence, indicating there also is a Email ( E ) entity which must be accounted for. While this is presented as attribute of the User ( U ) container, being a member of the information technology crowd, I can't help but wonder who these people are who only have one email address. In my experience, it's generally much, much easier ( and preferable ) to simply under-use the capabilities of a robust data model than it is to work around the limitations of one that has been under-designed, so I find myself insisting on a one-to-many relationship between the two. Email ( E ) thus becomes a container, with an Email Address ( EA ) as an attribute. To illustrate the many-to-one relationship between Email ( E ) and User ( U ), the latter becomes a "compound" attribute of the former, meaning that the User ( U ) container is treated as an attribute of the Email ( E ) container. This begins to illustrate where foreign keys will materialize in the final structure, and can be notated as:

( E: EAU1, ( U: UNU1 ) )

Students ( S ) and Adjudicators ( A ) are similar in other ways as well. With attributes such as First Name ( FN ), Last Name ( LN ) and Date of Birth ( DOB ) in common, from a purely data-centric point of view, it becomes hazy as to whether they are indeed separate entities at all! There is clearly a need to differentiate the two, or group them into their appropriate roles somehow, but the truth of this is as of yet unclear: Do the requirements dictate that a User ( U ) can not both be a Student ( S ) and a member of the Staff as well? Does a Teaching Assistant not fill exactly this sort of Position, briefly alluded to in the original schema?

If a Student ( S ) can indeed also be an Adjudicator ( A ), it follows that there is a genuine one-to-one relationship between a User ( U ) and a Student ( S ) and also another genuine one-to-one relationship between a User ( U ) and an Adjudicator ( A ). That is, a User ( U ) can be a ( is a ) Student ( S ) and a User ( U ) can be an ( is an ) Adjudicator ( A ). A User ( U ) cannot be multiple of either however, barring clever online trickery. Note than an Administrator could also be represented in this manner, through from the relative independence from the rest of the relationships in the model, I've omitted such an entity.

( S: ( U: UNU1 )U1 )

( A: ( U: UNU1 )U1 )

We know that both Students ( S ) and Adjudicators ( A ) have a First Name ( FN ), Last Name ( LN ) and a Date of Birth ( DOB ). As such, we could do something silly, like the following:

( S: ( U: UNU1 )U1, FN, LN, DOB )

( A: ( U: UNU1 )U1, FN, LN, DOB )

But doing so would be in violation of, you guessed it, the Third Normal Form. In case the reason a specific user shouldn't have two First Names ( FN ) isn't immediately obvious, since both Student ( S ) and Adjudicator ( A ) have a minimal superkey ( again, denoted by U1 ) which consists of the entirety of the minimal superkey of User ( U ), non-prime attributes First Name ( FN ), Last Name ( LN ) and Date of Birth ( DOB ) are transitively dependent on User ( U ). The correct arrangement is then to instead associate those attributes with the User ( U ) entity, as follows:

( S: ( U: UNU1, FN, LN, DOB )U1 )

( A: ( U: UNU1, FN, LN, DOB )U1 )

The next piece of the puzzle to tackle involves Courses ( C ) and Programs ( P ). From my understanding, a Program ( P ) is a curriculum offered by a Faculty, which despite probably being the correct word to use, I chose Department ( D ) in my model and am not about to go and change it now. The Department ( D ) of Mathematics can offer a number of different Programs ( P ), such as a Bachelors, a Bachelors with Honors, a Doctorate in Philosophy or a certification for a Chartered Accountant. A simple, one-to-many relationship.

( P: PCU1, PN, ( D: DCU1, DN ) )

I imagine that in order to complete a Program ( P ) and win yourself the associated expensive piece of paper, there will be a number of Program Requirements ( PR ) - at least, that's the way I had to do it. These requirements arise in the form of Courses ( C ), each of which have a ( has a ) Credit Value ( CV ) towards a Program ( P ). As an aside, I've added Course Level ( CL ) as a compound attribute as well, though it has little value in the context of the model relationships.

( C: CCU1, CN, CV, ( CL: CLCU1 ) )

As not all degrees are created equal, the Program Requirements ( PR ) may offer various choices and a particularly keen individual may opt to take a course in cryptology over one in data structures ( and subsequently get stuck listening to me ramble on the internet - not a desirable fate for the faint of heart ). To keep Program Requirements ( PR ) flexible, we assume a one-to-many relationship between a Program ( P ) and Program Requirements ( PR ), with each of those requirements having a minimum number of Required Credits ( RC ) to meet the criteria. The nature of this relationship is different from the other relationships we've seen thus far in that the Program Requirement ( PR ) deviates from the other entities by being a multiset ( there is no reason a Program ( P ) couldn't require 20 credits from two separate Course ( C ) lists ) and thus has no candidate key. While this reality doesn't affect our model notation in any detrimental manner ( we simply leave the subscript off the attribute list ), in terms of implementation on a database, we'll be using a surrogate key. If you noticed above, I'm using a surrogate key ( the respective <TableName>_PK attribute ) on every table anyway.

( PR: RC, ( P: PCU1, PN, ( D: DCU1, DN ) ) )

Using this Program Requirement ( PR ) multiset, we can then define Course ( C ) lists which are able to satisfy those requirements. That is to say, there is a Program Requirement Course ( PRC ) which is able to contribute in some way to the Required Credits ( RC ) of the requirements of a Program ( P ). This construct allows us to set up a list of courses ( in a three of five case, for example ) or specifically require a list of courses to be completed. If for some reason a particular Program ( P ) wished to alter the Credit Value ( CV ) of a particular Course ( C ) for its own purposes, an Alternate Credit Value ( ACV ) could be added as an attribute to the Program Requirment Courses ( PRC ) entity, but it's omitted from the diagram above, pretty much because I just thought of it now. Note that Program Course Requirement ( PRC ) is a many-to-many relationship, allowing a specific Course ( C ) to be a requirement of any number of Program Requirements ( PR ) and Program Requirements ( PR ) to involve the completion of any number of Courses ( C ).

( PRC: ACV, ( PR: RC, ( P: PCU1, PN, ( D: DCU1, DN ) ) )U1, ( C: CCU1, CN, CV, ( CL: CLCU1 ) )U1 )

With the bulk of the work out of the way, it's really a matter of just tying the Courses ( C ) back to the Students ( S ) and Adjudicators ( A ) and continuing on to assert some of the more trivial relationships. Since the definitions are getting a little long, I'm going to stop showing the attributes of the compound attributes and focus on the final structures we'll need instead.

We create a couple of many-to-many relationships in the form of Adjudicator Course ( AC ) and Student Course ( SC ):

( AC: AU1, CU1 )

( SC: SU1, CU1 )

An Adjudicator ( A ) which can instruct / administrate a Course ( C ) can now perform Lectures ( L ) for those Courses ( C ), designating a Start Time ( T1 ) and an End Time ( T2 ). This is a multiset in my schema, though it could certainly be instead constrained to time blocks if desired.

( L: AC, T1, T2 )

A Student ( S ) who logs in to attend a Lecture ( L ) can be noted through a many-to-many Student Attendance ( SA ) relationship.

( SA: SU1, LU1 )

A Resource ( R ) may be uploaded by a User ( U ) in the context of a specific Course ( C ). An Adjudicator ( A ) should be allowed sufficient privileges to view the Resources ( R ) on their own Courses ( C ), whereas a Student ( S ) should only be allowed to see Resources ( R ) for a Course ( C ) as posted by an Adjudicator ( A ) of that course. Such permissions are hand-waved by my data model as "business logic" to be handled through procedures, views and the application. To expand on your comment, you could easily add a Resource Type ( RT ) here.

( R: U, C, RD, ( RT: RTCU1, RTN ) )

Finally, a ResourceUser ( RU ) table is added to simply keep track of the times ( Date Accessed ( DA ) ) which a particular User ( U ) accesses a specific Resource ( R ).

( RU: R, U, DA )

All in all, creating the following structure would solve the blurb:

( U: UNU1, FN, LN, DOB )

( E: U, EAU1 )

( S: UU1 )

( A: UU1 )

( D: DCU1, DN )

( P: D, PCU1, PN )

( CL: CLCU1 )

( C: CL, CCU1, CN, CV )

( PR: P, RC )

( PRC: PRU1, CU1, ACV )

( AC: AU1, CU1 )

( SC: SU1, CU1 )

( L: AC, T1, T2 )

( SA: SU1, LU1 )

( RT: RTCU1, RTN )

( R: RT, U, C, RD )

( RU: R, U, DA )