Optimal design for a large database

database-design performance query-performance

Brand new to databases, using PostgreSQL on Linux. I've done a decent amount of reading on the issue and looked through similar questions on here, but I just want to confirm the most efficient way of organizing everything.

So say for example we have entries for a million people, who each have a name, phone number, address, and between 20 to 50 "skills". Would the correct way be to have one table with the basic person properties along with a unique identifier, and then create a second table with two columns containing the person's ID and skill name?

Seems a little redundant to me, but the other option of having a variable number of skill columns in the person table seems even worse.

Any tips to maximize speed and minimize memory usage?

Best Answer

This is a classic situation where you use a "joining table" (also called a junction or associative table) to model a many-to-many relationship.

You will need three tables.

Person
Skill
Person_Skill

(note that table names are singular - this is a preference of mine, not a rule).

Person
ID integer PRIMARY KEY
<other person details - name, address, phone &c.>
KEY name (and possibly phone?)

Skill
ID integer PRIMARY KEY
<other skill details - name, description, necessary qualification &c.>
KEY name (maybe not even necessary if there are only 50 skills?)    

Person_Skill
Person_ID integer FOREIGN KEY REFERENCES Person(ID)
Skill_ID integer FOREIGN KEY REFERENCES Skill(ID)
Unique Index on (Person_ID, Skill_ID) (to avoid duplicates)
KEY Skill_ID
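
If it helps, here is a minimal PostgreSQL sketch of that schema. The table, column, and index names are just illustrative, and you would add whatever person and skill details you actually need:

CREATE TABLE person (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name    text NOT NULL,
    phone   text,
    address text
);

-- supports lookups by name (and add one on phone if you search by it)
CREATE INDEX person_name_idx ON person (name);

CREATE TABLE skill (
    id   integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL UNIQUE
);

CREATE TABLE person_skill (
    person_id bigint  NOT NULL REFERENCES person (id),
    skill_id  integer NOT NULL REFERENCES skill (id),
    PRIMARY KEY (person_id, skill_id)   -- also prevents duplicate rows
);

-- supports lookups from the skill side ("everyone who has skill X")
CREATE INDEX person_skill_skill_id_idx ON person_skill (skill_id);

The composite primary key on (person_id, skill_id) plays the role of the unique index above, and the extra index on skill_id covers queries that start from a skill rather than a person.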

This way you don't store the skill name, description &c. for every person who has that skill. You store each skill just once and refer to it from the Person_Skill table by its ID. That reduces space and memory requirements and is less error prone, since each Skill datum exists in exactly one place - one of the core features of an RDBMS under Codd's rules.
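
With that in place, the typical queries are simple joins. For example (using the names from the sketch above, with a placeholder person ID and skill name):

-- all skills for a given person
SELECT s.name
FROM   person_skill ps
JOIN   skill s ON s.id = ps.skill_id
WHERE  ps.person_id = 12345;

-- all people who have a given skill
SELECT p.name, p.phone
FROM   person p
JOIN   person_skill ps ON ps.person_id = p.id
JOIN   skill s         ON s.id = ps.skill_id
WHERE  s.name = 'welding';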

That is the way I would do this.