Mongodb – Scalability: Desiging a message table

database-designmongodbnosqlscalability

This question might be deadly simple but I wanted to make sure I am going through the correct path and not making any simple and irrevertible mistakes.

My project will be based on messaging and it won't be as less-used as a side tool like in Twitter or Facebook (not that they are not used but it's not the key option presented to users)

I was wondering will a simple table design like;

from (id)
to (id)
content (text)
created_at (date)
isRead (boolean)

will suffice because, even though it is planned to be a small project, I do not believe that all the messages should be stored in one table and for each user when they want to view their messaging history, I shouldn't be scanning the entire table to do so. I am thinking about how scalable would this structure be and how would I benefit using a NoSQL DB like mongoDB.

Best Answer

You can go through the following embedded json document to implement your message collection and can ensure index on "mail_id" key to avoid scanning all documents. For your information here collection serve as table and each document serve as row of a table.

{
mail_id:string
sentMails:{
            [
                {
                sent_at: datetime,
                subject: string,
                sent_to: string,
                cc_to:[array of mailid strings],
                bcc_to:[array of mailid strings],
                attachments:[
                                {
                                attachment_name:string,
                                attachment_type:strng
                                }
                            ],
                sent_mail_content: string               
                }
            ]
            }

receivedMails:{
                [
                    {
                    received_at: datetime,  
                    subject: string,                
                    received_from: string,
                    cc_to:[array of mailid strings],
                    attachments:[
                                    {
                                    attachment_name:string,
                                    attachment_type:strng
                                    }
                                ],
                    received_mail_content: string
                    }
                ]   
            }
}

I hope it will help.

Related Solutions

Priority Inbox Database Design

I think the trick to this is that it doesn't have to be real time, just eventually consistent, in which case it's straightforward enough (using SQL Server, but this applies in any DB). First a trivial table and some sample data:

create table messages 
(message_id integer, sender varchar(20), recipient varchar (20))
go

insert into messages values (1, 'Gaius', 'Octavian')
insert into messages values (2, 'Gaius', 'Octavian')
insert into messages values (3, 'Gaius', 'Octavian')
insert into messages values (4, 'Aurelius', 'Octavian')
insert into messages values (5, 'Aurelius', 'Octavian')
insert into messages values (6, 'Aurelius', 'Gaius')
insert into messages values (7, 'Aurelius', 'Gaius')
insert into messages values (8, 'Octavian', 'Gaius')
go

This is logging for every message, who sent it and who to (assuming for simplicity that the message body is stored in another table). So we can see that the top sender to Octavian is Gaius (3 messages of 5), and the top sender to Gaius is Aurelius (2 messages of 3). To query that using a CTE:

with q1 as (
  select recipient, sender, count(sender) as num_messages_from_sender,
  rank() over (partition by recipient order by count(sender) desc) as priority
  from messages group by recipient, sender)
select recipient, sender as top_sender, num_messages_from_sender 
from q1 where priority=1
go

In practice you would have a job that ran every minute (or whatever interval is best) refreshing a lookup table mapping a user to their top sender (or top n senders using where priority <= n) (or in your case, you would be tracking the senders to which they reply with another column and filtering by that).

For the sake of simplicity I have left off indexes and partitioning - they would be the key to performance of this solution. You could certainly scale this to many billions of messages on any modern DB/hardware. GMail most likely has a custom solution tho', but with 20,000 engineers Google can do that!

Sql-server – SQL datatype choice for flagging a read/ unread message

Performance of a particular design depends on the distribution of data to a large extent, and the access path. Are you expecting most messages to have the flag, or most messages not to? From the docs:

My tests show that a table scan often starts to perform better than a nonclustered index access when at least 10 percent of the rows are selected. I also found that the optimizer switches from nonclustered index access to table scan prematurely (i.e., when nonclustered index access still shows better response time than the corresponding table scan). In many cases, the optimizer forces a table scan for queries with result sets of approximately 5 percent, although the table scan becomes more efficient than index access at selectivities of 8 to 10 percent.

And of course, if there are any other predicates on the query, and the clustered index if any. For example, are you likely to want to access all unread messages within a certain timeframe? Or sent to/from a particular user? How big are the message bodies and are they stored inline? These are just rhetorical questions, mind.

So which is better, a BIT and a DATETIME or just a DATETIME for performance? You will have to benchmark with some representative data and access patterns to find out. I'm afraid this one can't be answered with just theory. But you can of course create the table with the boolean in and just not use it if you find the latter is better - it only adds an overhead of 1 byte per row.

If you were asking the opposite question, how to efficiently query the not-NULL rows, I would have said use a filtered index.

Best Answer

Related Solutions

Priority Inbox Database Design

Sql-server – SQL datatype choice for flagging a read/ unread message

Related Question