Efficiency for contact-contact relationships schema

join;optimization

I'm migrating data from a schema where people and organisations are separate tables to one where people and organisations are all treated as contacts (they share a lot in common). Currently, the people table has about 90k records with 80k relationships to 10k organisations.

New model:

Contact           Relationship              (Contact table again)
--------          -------------             ---------------------
cid      11----0< cid_a              /---11 cid
name              cid_b          >0-/       name
                  details
                  start_date
                  end_date
                  relationship_type

If I want to be able to query for Wilma's current relationships (let's say Wilma has cid = 2) I can set up 2 keys on relationship, one (cid_a, cid_b) and (cid_b, cid_a).

SELECT friend.name FROM contact friend, relationship 
WHERE 
     (
       ( cid_a = 2 AND cid_b = friend.cid )
       OR
       ( cid_b = 2 AND cid_a = friend.cid )
     ) 
     AND
     ( start_date IS NULL OR start_date <= CURRENT_DATE )
     AND
     ( end_date IS NULL OR end_date >= CURRENT_DATE ) 

But I'm not sure it's efficient as the duplicate keys would be quite long.

A contact will likely have 3, 4 or more relationships to various organisations, other contacts etc. such as

  • Wilma is a student at X University-
  • Wilma is a member of Y organisation
  • Wilma was previously a contact at Z organisation
  • Wilma is married to Fred.

Is this the One True Way? Or nothing like it?!

Best Answer

Since you invoked The One True Way... I'll invoke it. 1NF would insist on "no repeating groups," which is what cid_a and cid_b are... two columns of the same "stuff" (to use the technical term).

You should not have to look at data two different ways to get the correct answer.

contact           relationship              contact_relationship_map
--------          -------------             ----------------------
cid (PK)          relationship_id (PK)      relationship_id (FK) \\ P   
name              details                   cid (FK)             // K
                  start_date                + INDEX(cid,relationship_id)
                  end_date
                  relationship_type

Each relationship gets a record in relationship, which has an ID, which is used to insert two rows into contact_relationship_map -- one for each peer in the relationship.

The PK of this table is both columns combined, and it should be indexed on both columns combined in the opposite order so that searching by relationship_id or cid has the benefit of the index. The latter index doesn't need to be declared as unique because the primary key will enforce that. Neither column allows nulls and deletes from the parent tables cascade to the records of this table.

To find relationships starting with a name from 'contact' and a relationship_type of = 'friend' we look up starting in c1:

SELECT c2.cid as my_friends_cid, c2.name as my_friends_name 
  FROM contact c1
  JOIN contact_relationship_map crm1 on crm1.cid = c1.cid
  JOIN relationship r on r.id = crm1.relationship_id 
  JOIN contact_relationship_map crm2 on crm2.relationship_id = crm1.relationship_id
                                    and crm2.cid != c1.cid
  JOIN contact c2 on c2.cid = crm2.cid
 WHERE c1.name = 'first_contact_name_here'
   AND r.relationship_type = 'friend';

In other words, following:

c1 -> crm1 -> crm2 -> c2
          \-> r

All of these joins are easily satisifed by indexes so the number of joins here should not be any cause for concern.

If you already know the cid from the first contact, that table can be eliminated from the query, and you'd start with WHERE crm1.cid = ?

This also opens up the possibility of relationships with more than two peers, if you ever wanted it.