Database planning… avoiding parallel fields

database-designgeneralization

this is my first post here.

I'm a programmer and lately I've had an interest in databases. I'm working on a side project and would like some tips as how to best construct my database.

Here's the scenario:
I am attempting to see how the '6 degrees of seperation' theory holds up on social media.
I will have a scripted social media account which will act in much the same way as a web crawler… It will visit a social media account and catalog every UserID and UserName into a database file, UserID acting as the primary key. I would like to have a table (or collection of tables) which will keep track of which UserID's were linked to which account… i.e. UserID 12345 had links to/from UserID's 44444 and 55555 on their page.

The problem is, in my database classes I'm told it is bad practice to have parallel fields or columns that hold more than one attribute. I would like to avoid this.

From what I described UserID 12345 would have either two entries in one column(44444 & 55555), or two parallel fields holding one userID each. What is the best way to work around this?

Best Answer

One row in a relational DB table holds one fact only. That user 12345 is connected to user 44444 and to user 55555 is two facts. Consequently it should be stored as two separate rows.

Columns capture meaning from the real world. Depending on the table, there can be an implicit semantics between the columns. A good example is a departmental reporting hierarchy. Typically this would be modelled:

Employees
   EmployeeID
   Name
   HiredDate
   ManagerID Foreign Key references Employees.EmployeeID

Although each manager is also an employee (and will have a row in the table for her data), for a given row it does not make sense to swap the values in the EmployeeID column and the ManagerID column. The two are not symmetric and there is meaning in the columns.

For your example, however, if A is a friend of B then B is a friend of A. The pairing is completely symmetric. Modelling this in RDBMS can be problematic as it breaks the implicit asymmetry mentioned earlier. There are techniques to do this in RDBMS. One way is to store each pair twice i.e. 12345/44444 and 44444/12345. Another is to query the table twice and swap the meanings assigned to columns between queries.

This can be tricky to understand and is likely to be quite challenging as a first journey into relational database design.