Efficiency for contact-contact relationships schema

join;optimization

I'm migrating data from a schema where people and organisations are separate tables to one where people and organisations are all treated as contacts (they share a lot in common). Currently, the people table has about 90k records with 80k relationships to 10k organisations.

New model:

Contact           Relationship              (Contact table again)
--------          -------------             ---------------------
cid      11----0< cid_a              /---11 cid
name              cid_b          >0-/       name
                  details
                  start_date
                  end_date
                  relationship_type

If I want to query for Wilma's current relationships (let's say Wilma has cid = 2), I can set up two keys on relationship: one on (cid_a, cid_b) and one on (cid_b, cid_a).

SELECT friend.name FROM contact friend, relationship 
WHERE 
     (
       ( cid_a = 2 AND cid_b = friend.cid )
       OR
       ( cid_b = 2 AND cid_a = friend.cid )
     ) 
     AND
     ( start_date IS NULL OR start_date <= CURRENT_DATE )
     AND
     ( end_date IS NULL OR end_date >= CURRENT_DATE )

But I'm not sure it's efficient as the duplicate keys would be quite long.
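For anyone who wants to try the two-column model, here is a minimal sketch using Python's built-in sqlite3 module rather than the asker's actual database. The column types, the index names rel_a_b and rel_b_a, the helper current_relationship_names and the sample rows are all assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (cid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE relationship (
    cid_a INTEGER NOT NULL REFERENCES contact(cid),
    cid_b INTEGER NOT NULL REFERENCES contact(cid),
    details TEXT,
    start_date TEXT,          -- ISO dates stored as text (assumption)
    end_date TEXT,
    relationship_type TEXT
);
-- The two mirrored keys described in the question (names are hypothetical).
CREATE INDEX rel_a_b ON relationship (cid_a, cid_b);
CREATE INDEX rel_b_a ON relationship (cid_b, cid_a);

INSERT INTO contact VALUES
  (1, 'Fred'), (2, 'Wilma'), (3, 'X University'), (4, 'Z organisation');
INSERT INTO relationship VALUES
  (2, 1, NULL, NULL, NULL, 'married'),
  (2, 3, NULL, '2000-01-01', NULL, 'student'),
  (4, 2, NULL, '1990-01-01', '1999-12-31', 'contact');
""")

def current_relationship_names(conn, cid):
    """Names of contacts currently related to `cid`, on either side."""
    rows = conn.execute("""
        SELECT friend.name
          FROM contact friend
          JOIN relationship r
            ON (r.cid_a = :cid AND r.cid_b = friend.cid)
            OR (r.cid_b = :cid AND r.cid_a = friend.cid)
         WHERE (r.start_date IS NULL OR r.start_date <= CURRENT_DATE)
           AND (r.end_date IS NULL OR r.end_date >= CURRENT_DATE)
         ORDER BY friend.name
    """, {"cid": cid}).fetchall()
    return [name for (name,) in rows]

print(current_relationship_names(conn, 2))
```

The relationship to Z organisation ended in 1999, so it should be filtered out by the date conditions even though Wilma is on the cid_b side of that row.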

A contact will likely have 3, 4 or more relationships to various organisations, other contacts etc. such as

Wilma is a student at X University
Wilma is a member of Y organisation
Wilma was previously a contact at Z organisation
Wilma is married to Fred.

Is this the One True Way? Or nothing like it?!

Best Answer

Since you invoked The One True Way... I'll invoke it. 1NF would insist on "no repeating groups," which is what cid_a and cid_b are... two columns of the same "stuff" (to use the technical term).

You should not have to look at data two different ways to get the correct answer.

contact           relationship              contact_relationship_map
--------          -------------             ----------------------
cid (PK)          relationship_id (PK)      relationship_id (FK) \\ P   
name              details                   cid (FK)             // K
                  start_date                + INDEX(cid,relationship_id)
                  end_date
                  relationship_type

Each relationship gets a record in relationship, which has an ID, which is used to insert two rows into contact_relationship_map -- one for each peer in the relationship.

The PK of this table is both columns combined, (relationship_id, cid), and it should also have an index on both columns in the opposite order, (cid, relationship_id), so that searching by either relationship_id or cid has the benefit of an index. The latter index doesn't need to be declared as unique because the primary key already enforces that. Neither column allows nulls, and deletes from the parent tables cascade to the records of this table.
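As a rough sketch of that table layout, here it is in SQLite via Python's sqlite3 module. The column types, the index name crm_cid_rel and the helper add_relationship are assumptions for illustration; MySQL DDL would differ in detail (for instance, SQLite only enforces foreign keys after PRAGMA foreign_keys = ON).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FKs unless enabled
conn.executescript("""
CREATE TABLE contact (cid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE relationship (
    relationship_id INTEGER PRIMARY KEY,
    details TEXT,
    start_date TEXT,
    end_date TEXT,
    relationship_type TEXT
);
CREATE TABLE contact_relationship_map (
    relationship_id INTEGER NOT NULL
        REFERENCES relationship(relationship_id) ON DELETE CASCADE,
    cid INTEGER NOT NULL
        REFERENCES contact(cid) ON DELETE CASCADE,
    PRIMARY KEY (relationship_id, cid)
);
-- Same two columns in the opposite order, for lookups that start from a cid.
CREATE INDEX crm_cid_rel ON contact_relationship_map (cid, relationship_id);
""")

def add_relationship(conn, relationship_type, *cids):
    """Insert one relationship row plus one map row per peer (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO relationship (relationship_type) VALUES (?)",
        (relationship_type,),
    )
    rel_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO contact_relationship_map (relationship_id, cid) VALUES (?, ?)",
        [(rel_id, cid) for cid in cids],
    )
    return rel_id

conn.executemany(
    "INSERT INTO contact (cid, name) VALUES (?, ?)",
    [(1, "Fred"), (2, "Wilma")],
)
marriage = add_relationship(conn, "married", 1, 2)
```

Because the helper takes any number of cids, the same shape also covers relationships with more than two peers.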

To find relationships starting from a name in 'contact' with a relationship_type of 'friend', we start the lookup in c1:

SELECT c2.cid as my_friends_cid, c2.name as my_friends_name 
  FROM contact c1
  JOIN contact_relationship_map crm1 on crm1.cid = c1.cid
  JOIN relationship r on r.relationship_id = crm1.relationship_id
  JOIN contact_relationship_map crm2 on crm2.relationship_id = crm1.relationship_id
                                    and crm2.cid != c1.cid
  JOIN contact c2 on c2.cid = crm2.cid
 WHERE c1.name = 'first_contact_name_here'
   AND r.relationship_type = 'friend';

In other words, following:

c1 -> crm1 -> crm2 -> c2
          \-> r

All of these joins are easily satisfied by indexes, so the number of joins here should not be any cause for concern.

If you already know the cid from the first contact, that table can be eliminated from the query, and you'd start with WHERE crm1.cid = ?
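Here is a small sketch of that shorter form, again in SQLite through Python's sqlite3 module. The trimmed-down schema, the sample rows and the function name peers_of are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (cid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE relationship (
    relationship_id INTEGER PRIMARY KEY,
    relationship_type TEXT
);
CREATE TABLE contact_relationship_map (
    relationship_id INTEGER NOT NULL REFERENCES relationship(relationship_id),
    cid INTEGER NOT NULL REFERENCES contact(cid),
    PRIMARY KEY (relationship_id, cid)
);
CREATE INDEX crm_cid_rel ON contact_relationship_map (cid, relationship_id);

INSERT INTO contact VALUES
  (1, 'Fred'), (2, 'Wilma'), (3, 'Betty'), (4, 'X University');
INSERT INTO relationship VALUES
  (10, 'married'), (11, 'friend'), (12, 'student');
INSERT INTO contact_relationship_map VALUES
  (10, 1), (10, 2), (11, 2), (11, 3), (12, 2), (12, 4);
""")

def peers_of(conn, cid, relationship_type):
    """Other peers in relationships of the given type that include `cid`."""
    return conn.execute("""
        SELECT c2.cid, c2.name
          FROM contact_relationship_map crm1
          JOIN relationship r
            ON r.relationship_id = crm1.relationship_id
          JOIN contact_relationship_map crm2
            ON crm2.relationship_id = crm1.relationship_id
           AND crm2.cid != crm1.cid
          JOIN contact c2 ON c2.cid = crm2.cid
         WHERE crm1.cid = ?
           AND r.relationship_type = ?
         ORDER BY c2.cid
    """, (cid, relationship_type)).fetchall()

print(peers_of(conn, 2, "friend"))
```

The query gives the same answer whichever side of the relationship you start from, which is the point of storing one map row per peer.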

This also opens up the possibility of relationships with more than two peers, if you ever wanted it.

Related Solutions

Mysql – Optimising MySQL query with lots of self-joins

How to fix it in MySQL

As can be seen from line 3 of the EXPLAIN output, a covering index is not being used here. The join needs both the entity_id and the type, and no key is available for that.

ALTER TABLE entity_relationship ADD KEY e_t_r 
   (entity_id, type, relationship_id);

This makes the key available, but MySQL chooses not to use it. It can be pushed into using it with USE INDEX (e_t_r):

SELECT lt.entity_id entity_id, 
       py.relationship_id relationship_id,
       'implied-constituent' `type` 
FROM entity_relationship lt,
     entity_relationship ly,
     entity_relationship pt USE INDEX (e_t_r),
     entity_relationship py
WHERE lt.type='constituent'
  AND lt.relationship_id = ly.relationship_id
  AND ly.type='constituency'
  AND ly.entity_id = pt.entity_id 
  AND pt.type='constituent' 
  AND pt.relationship_id = py.relationship_id 
  AND py.type='constituency';

This now runs in 0.8s (compared to 17.7s without forcing it to use that index). This is with MySQL 5.1.63.
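To illustrate what "covering" means for an index on (entity_id, type, relationship_id), here is a small sketch in SQLite via Python's sqlite3 module. It is not MySQL, its planner makes its own choices, and the column types are assumptions. It only shows that a lookup on entity_id and type that returns relationship_id can be answered from the index alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity_relationship (
    entity_id INTEGER NOT NULL,
    relationship_id INTEGER NOT NULL,
    type TEXT NOT NULL
);
-- Same column order as the e_t_r key added above.
CREATE INDEX e_t_r ON entity_relationship (entity_id, type, relationship_id);
""")

# Ask SQLite how it would run one leg of the self-join.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT relationship_id
      FROM entity_relationship
     WHERE entity_id = ? AND type = ?
""", (1, "constituent")).fetchall()

# The human-readable description is the fourth column of each plan row.
plan_text = " ".join(row[3] for row in plan)
print(plan_text)
```

The printed plan should mention a covering index on e_t_r, meaning the table rows themselves never need to be read for this lookup.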

How to fix it in MariaDB

Well, you don't have to!

MariaDB executes the query very fast with or without the USE INDEX intervention (0.25s, though it's on a different host, so I would expect it to be nearer the 0.8s of the optimised version above).

Also, MariaDB does not use the new index and is quite happy with the t_e_r index, including using it as a covering index (i.e. it reads everything the query needs from the index alone).

I'm getting more and more impressed with MariaDB and am now considering switching.

Mongodb – Schema design for privacy settings in MongoDB

It's important to remember that since there are multiple ways to design schema in MongoDB for a given set of data, it's critical to consider the types of reads and writes you will be making from your application.

Assuming that you will be querying for a subset of documents that a given user can see, it's likely that you will want to structure the document collection in a way that will give you the ability to query for all "visible" documents in a single query.

Consider a document structure such as:

{ _id: ...,
  public: true,
  groups: ['group1', 'group2',...],
  users: ['user1','user2',...],
}

You could now query for all documents visible to user X if you have a list of groups user X belongs to by querying:

db.documents.find( { $or : [
     { public : true },
     { users  : X },
     { groups : { $in : [list-of-X's-groups] } }
] } )
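If the application is written in Python, the same filter can be built as a plain dict. The sketch below does not touch a database: visibility_filter and is_visible are hypothetical names, and is_visible is only a local stand-in that mirrors what the $or query is meant to match, handy for unit tests. The dict is the kind of filter document a driver's find call would take.

```python
def visibility_filter(user, groups):
    """Build the $or filter for documents visible to `user`."""
    return {"$or": [
        {"public": True},                    # visible to everyone
        {"users": user},                     # user listed explicitly
        {"groups": {"$in": list(groups)}},   # shares at least one group
    ]}

def is_visible(doc, user, groups):
    """Local check that mirrors the intent of visibility_filter."""
    return (
        doc.get("public") is True
        or user in doc.get("users", [])
        or any(g in doc.get("groups", []) for g in groups)
    )

docs = [
    {"_id": 1, "public": True, "groups": [], "users": []},
    {"_id": 2, "public": False, "groups": ["group1"], "users": []},
    {"_id": 3, "public": False, "groups": ["group2"], "users": ["user1"]},
    {"_id": 4, "public": False, "groups": ["group3"], "users": ["user2"]},
]

visible_ids = [d["_id"] for d in docs if is_visible(d, "user1", ["group1"])]
print(visibility_filter("user1", ["group1"]))
print(visible_ids)
```

Keeping the filter construction in one function means the privacy rule lives in a single place if the document structure changes later.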
