I think the trick is that this doesn't have to be real-time, just eventually consistent, in which case it's straightforward enough (I'm using SQL Server, but this applies to any DB). First, a trivial table and some sample data:
create table messages
(message_id integer, sender varchar(20), recipient varchar(20))
go
insert into messages values (1, 'Gaius', 'Octavian')
insert into messages values (2, 'Gaius', 'Octavian')
insert into messages values (3, 'Gaius', 'Octavian')
insert into messages values (4, 'Aurelius', 'Octavian')
insert into messages values (5, 'Aurelius', 'Octavian')
insert into messages values (6, 'Aurelius', 'Gaius')
insert into messages values (7, 'Aurelius', 'Gaius')
insert into messages values (8, 'Octavian', 'Gaius')
go
This logs, for every message, who sent it and to whom (assuming for simplicity that the message body is stored in another table). From the sample data we can see that the top sender to Octavian is Gaius (3 messages of 5), and the top sender to Gaius is Aurelius (2 messages of 3). To query that using a CTE:
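-- Count messages per (recipient, sender) pair and rank the senders within each recipient
with q1 as (
    select recipient, sender, count(sender) as num_messages_from_sender,
           rank() over (partition by recipient order by count(sender) desc) as priority
    from messages
    group by recipient, sender)
-- Keep only the highest-ranked sender for each recipient
select recipient, sender as top_sender, num_messages_from_sender
from q1
where priority = 1
go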
In practice you would have a job that runs every minute (or at whatever interval suits you) and refreshes a lookup table mapping each user to their top sender, or to their top n senders using where priority <= n. In your case you would also record, in another column, which senders the user replies to, and filter on that.
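A minimal sketch of what such a refresh could look like, assuming a lookup table called top_senders (the table name and its columns are made up for illustration) and the messages table from above. The batch would be run on a schedule, for example from a SQL Server Agent job:

```sql
-- Hypothetical lookup table; the name and columns are assumptions for illustration.
create table top_senders
(recipient varchar(20), top_sender varchar(20), num_messages_from_sender integer)
go

-- Body of the periodic refresh: empty and refill the lookup table
-- inside a single transaction so the change is applied as one unit.
begin transaction;

delete from top_senders;

with q1 as (
    select recipient, sender, count(sender) as num_messages_from_sender,
           rank() over (partition by recipient order by count(sender) desc) as priority
    from messages
    group by recipient, sender)
insert into top_senders (recipient, top_sender, num_messages_from_sender)
select recipient, sender, num_messages_from_sender
from q1
where priority = 1;  -- use priority <= n to keep the top n senders instead

commit transaction;
go
```

Note that rank() gives tied senders the same priority, so a recipient can end up with more than one row in top_senders; swap in row_number() if exactly one row per recipient is needed.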
For the sake of simplicity I have left off indexes and partitioning, which would be the key to the performance of this solution. You could certainly scale this to many billions of messages on any modern DB/hardware. Gmail most likely has a custom solution, but with 20,000 engineers Google can do that!
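As a rough starting point only, an index on the grouping columns might look like the following. The index name is made up, and whether this is the right index depends on the real data volumes and query mix:

```sql
-- Assumed index (hypothetical name): its key columns match the
-- "group by recipient, sender" in the top-sender query.
create index ix_messages_recipient_sender on messages (recipient, sender)
go
```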
You can find the latest sent message in each thread and then join that with your tables:
SELECT M.*, R.*
FROM messages M
JOIN recipients R
    ON M.Message_ID = R.Recipient_Message_ID
JOIN (
    SELECT Message_Root_ID, MAX(Message_Sent_Time) AS Message_Sent_Time
    FROM messages
    GROUP BY Message_Root_ID
) AS X
    ON M.Message_Root_ID = X.Message_Root_ID
    AND M.Message_Sent_Time = X.Message_Sent_Time
Best Answer
Why not use this query (sqlfiddle):
Basically, if a MessageID also appears as a MessageParentID, that message has at least one reply. The query therefore only looks for messages without replies (i.e. messages whose MessageID never appears as a MessageParentID).
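A minimal sketch of that idea (not necessarily the query in the fiddle), assuming a hypothetical table named Messages with MessageID and MessageParentID columns:

```sql
-- Hypothetical table name; only the MessageID and MessageParentID columns are assumed.
select m.*
from Messages m
where not exists (
    select 1
    from Messages r
    where r.MessageParentID = m.MessageID)  -- no row names m as its parent, so m has no replies
```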
With this limited sample, it gives the correct output:
I don't understand why: