Optimal design for a large database

database-design performance query-performance

Brand new to databases, using PostgreSQL on Linux. I've done a decent amount of reading on the issue and looked through similar questions on here, but I just want to confirm the most efficient way of organizing everything.

So say for example we have entries for a million people, who each have a name, phone number, address, and between 20 to 50 "skills". Would the correct way be to have one table with the basic person properties along with a unique identifier, and then create a second table with two columns containing the person's ID and skill name?

Seems a little redundant to me, but the other option of having a variable number of skill columns in the person table seems even worse.

Any tips to maximize speed and minimize memory usage?

Best Answer

This is a classic situation where you use a "joining table" (also called a junction or associative table) to model a many-to-many relationship.

You will need three tables.

Person
Skill
Person_Skill

(note that table names are singular - this is a preference of mine, not a rule).

Person
ID integer PRIMARY KEY
<other person details - name, address, phone &c.>
KEY name (and possibly phone?)

Skill
ID integer PRIMARY KEY
<other skill details - name, description, necessary qualification &c.>
KEY name (maybe not even necessary if there are only 50 skills?)    

Person_Skill
Person_ID integer FOREIGN KEY REFERENCES Person(ID)
Skill_ID integer FOREIGN KEY REFERENCES Skill(ID)
Unique Index on (Person_ID, Skill_ID) (to avoid duplicates)
KEY Skill_ID
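
If it helps, here is a minimal PostgreSQL sketch of that schema. The table, column, and index names are just illustrative, and you would add whatever person and skill details you actually need:

CREATE TABLE person (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name    text NOT NULL,
    phone   text,
    address text
);

-- supports lookups by name (and add one on phone if you search by it)
CREATE INDEX person_name_idx ON person (name);

CREATE TABLE skill (
    id   integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL UNIQUE
);

CREATE TABLE person_skill (
    person_id bigint  NOT NULL REFERENCES person (id),
    skill_id  integer NOT NULL REFERENCES skill (id),
    PRIMARY KEY (person_id, skill_id)   -- also prevents duplicate rows
);

-- supports lookups from the skill side ("everyone who has skill X")
CREATE INDEX person_skill_skill_id_idx ON person_skill (skill_id);

The composite primary key on (person_id, skill_id) plays the role of the unique index above, and the extra index on skill_id covers queries that start from a skill rather than a person.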

This way you don't store the skill name, description &c. for every person who has that skill. You store each skill just once and refer to it from the Person_Skill table by its ID. That reduces space and memory requirements and is less error prone, since each Skill datum exists in exactly one place - one of the core features of an RDBMS under Codd's rules.
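
With that in place, the typical queries are simple joins. For example (using the names from the sketch above, with a placeholder person ID and skill name):

-- all skills for a given person
SELECT s.name
FROM   person_skill ps
JOIN   skill s ON s.id = ps.skill_id
WHERE  ps.person_id = 12345;

-- all people who have a given skill
SELECT p.name, p.phone
FROM   person p
JOIN   person_skill ps ON ps.person_id = p.id
JOIN   skill s         ON s.id = ps.skill_id
WHERE  s.name = 'welding';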

That is the way I would do this.