MySQL – design help for scale

database-design | MySQL | mysql-5.7 | scalability | sharding

I will be using Amazon Web Services' RDS service to host a MySQL 5.7 server (currently AWS does not support 5.7, but they have announced they are working on implementing this version and it will be out soon). I will only have one table, but I expect it to have MANY rows. It is a messaging table which will contain messages from users to other users.

A message can only be sent by one user to one other user. When a message is read for the first time it will need to be updated to state it has been read, so it will require one update in its life if it is ever read. A user must be able to view all their sent messages ordered by newest and all the messages they received ordered by newest.

My initial design looks like this:

Messages Table:

Message_ID (BIGINT) PK auto_increment
FromUser (INT 16)
ToUser (INT 16)
DateCreated (Timestamp)
MessageText (Varchar (500) )
HasRead (TINYINT (0=false, 1=true) )
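The schema above could be sketched as DDL roughly as follows (assumptions: InnoDB, NOT NULL/DEFAULT choices, and plain INT UNSIGNED in place of "INT 16", since display width is cosmetic; the index names are illustrative):

```sql
-- Sketch only; column types and index names are assumptions, not the final design.
CREATE TABLE Messages (
    Message_ID  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    FromUser    INT UNSIGNED    NOT NULL,
    ToUser      INT UNSIGNED    NOT NULL,
    DateCreated TIMESTAMP       NOT NULL DEFAULT CURRENT_TIMESTAMP,
    MessageText VARCHAR(500)    NOT NULL,
    HasRead     TINYINT         NOT NULL DEFAULT 0,  -- 0 = unread, 1 = read
    PRIMARY KEY (Message_ID),
    INDEX idx_from (FromUser, DateCreated),
    INDEX idx_to   (ToUser, DateCreated)
) ENGINE=InnoDB;
```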

The queries which will be run will include:

SELECT * FROM messages WHERE FromUser = '10000000' ORDER BY DateCreated DESC LIMIT [starting pagination value], 10 ; // get all the user's sent messages - newest first, get 10 at a time

SELECT * FROM messages WHERE ToUser = '10000000' ORDER BY DateCreated DESC LIMIT [starting pagination value], 10 ; // get all Bob's received messages - newest first, get 10 at a time

UPDATE messages SET HasRead = 1 WHERE Message_ID = '123456789'; // When a message has been read update it to show it has been read.

As a result I would have the following indexes:

-Message_ID (primary index)
-(FromUser,DateCreated) BTREE DESC
-(ToUser,DateCreated) BTREE DESC

An additional "would like to have, but can live without it if it affects performance considerably" requirement would be to let a user view their messages with another user:

SELECT * FROM messages WHERE (FromUser = '1000000' AND ToUser = '2000000') OR (FromUser = '2000000' AND ToUser = '1000000') ORDER BY DateCreated DESC LIMIT [starting pagination value], 10 ; // Get all the messages sent between user 1000000 and user 2000000 - newest first, fetch 10 at a time.

I was thinking of creating a new column which would be a concatenation of [smaller_userid]-[larger_userid] and searching on that field instead. If I did that, I would add a composite index on this new column + DateCreated:

SELECT * FROM messages WHERE concateduser = '1000000-2000000' ORDER BY DateCreated DESC LIMIT [starting pagination value], 10 ;
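The concatenated-column idea could be kept consistent automatically with a MySQL 5.7 generated column, so the pair key never drifts from FromUser/ToUser. A sketch (the column name PairKey and the index name are illustrative assumptions):

```sql
-- PairKey always stores "[smaller_userid]-[larger_userid]"; STORED so it
-- can be indexed and read cheaply (5.7 can also index VIRTUAL columns).
ALTER TABLE Messages
    ADD COLUMN PairKey VARCHAR(25)
        AS (CONCAT(LEAST(FromUser, ToUser), '-', GREATEST(FromUser, ToUser))) STORED,
    ADD INDEX idx_pair (PairKey, DateCreated);

-- Conversation between two users, newest first, 10 per page:
SELECT * FROM Messages
 WHERE PairKey = '1000000-2000000'
 ORDER BY DateCreated DESC
 LIMIT [starting pagination value], 10;
```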

We believe this will work for now, but we currently have an active community of users and estimate that when we roll out the feature it will be heavily used. Thus we would also like to plan for the future so we can scale (premature now, but we think this is a very simple feature and are hoping to design it well now to save ourselves time in the future).

If in the future we need to scale horizontally, we do not think our design will scale very well. We don't believe an auto-incremented Message_ID PK will work in a multi-node environment. We looked into using a UUID for this column (https://www.percona.com/blog/2007/03/13/to-uuid-or-not-to-uuid/), but have read that this can really hurt performance since the indexes will be large.
Reading this article, we see paging can be an issue too: http://www.psce.com/blog/2012/04/04/how-important-a-primary-key-can-be-for-mysql-performance/

In our current design we don't really see a great shard key which can be used for all our queries. We would like each query to hit a single shard server if possible.

So my question is: what would be an efficient way to implement this basic messaging feature so that in the future it scales well with the queries we need? I have only ever worked with a single instance of MySQL, so I am not an expert on scale-out design with MySQL by any means and am open to ANY ideas (a complete redesign too)! We believe sharding will be inevitable since our instance types are not very large.

PS: We know some may say NoSQL is a great option for this scenario, but we looked into NoSQL options for this feature (Cassandra, DynamoDB, Google Datastore, Azure DocumentDB, file-system-like stores such as AWS S3 or Azure Storage) for a few months. Due to cost for performance (indexes are very expensive in managed NoSQL environments), lack of ACID compliance (we have other ideas which will need true transactions), and more, we decided on MySQL.

Best Answer

Trivial, even at scale.

Some minor, but important, tweaks...

INT UNSIGNED is 4 bytes; the 16 is meaningless. It will hold values up to 4 Billion; you probably won't have more than that many users.

Do not use OFFSET for pagination, instead "remember where you left off", as I discuss here.
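The "remember where you left off" approach might look like the sketch below for the received-messages query: the application keeps the DateCreated and Message_ID of the last row it displayed (shown here as illustrative :last_created / :last_id placeholders) and asks only for rows older than that, so no rows are scanned and thrown away:

```sql
-- Keyset pagination sketch; :last_created and :last_id come from the
-- final row of the previous page. Message_ID breaks ties when two
-- messages share the same DateCreated.
SELECT * FROM Messages
 WHERE ToUser = 10000000
   AND ( DateCreated < :last_created
      OR (DateCreated = :last_created AND Message_ID < :last_id) )
 ORDER BY DateCreated DESC, Message_ID DESC
 LIMIT 10;
```

Since InnoDB secondary indexes implicitly carry the primary key, the existing (ToUser, DateCreated) index effectively already contains Message_ID, so this ordering can be satisfied from the index.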

For the extra feature, turn the OR into a UNION so that you can make better use of another necessary index: INDEX(FromUser, ToUser, DateCreated, Message_ID):

      ( SELECT  *
            FROM  messages
            WHERE  FromUser = '1000000'
              and  ToUser = '2000000'
            ORDER BY  DateCreated DESC
            LIMIT  10 
      )
    UNION  ALL
      ( SELECT  *
            FROM  messages
            WHERE  FromUser = '2000000'
              and  ToUser = '1000000'
            ORDER BY  DateCreated DESC
            LIMIT  10 
       )
    ORDER BY  DateCreated DESC
    LIMIT  10 

It is probably too tricky to "remember where you left off" here, but since this should be a rare query, and one with not many results, go ahead and use OFFSET. However, for, say, page 15, you need LIMIT 160 on the inside and LIMIT 150,10 on the outside. And I will pull another trick:

SELECT  m.*
    FROM  messages AS m
    JOIN  ( 
              ( SELECT  Message_ID
                    FROM  messages
                    WHERE  FromUser = '1000000'
                      and  ToUser = '2000000'
                    ORDER BY  DateCreated DESC
                    LIMIT  160 
              )
            UNION  ALL
              ( SELECT  Message_ID
                    FROM  messages
                    WHERE  FromUser = '2000000'
                      and  ToUser = '1000000'
                    ORDER BY  DateCreated DESC
                    LIMIT  160 
               )
            ORDER BY  DateCreated DESC
            LIMIT  150,10 
          ) AS u ON u.Message_ID = m.Message_ID
    ORDER BY  DateCreated DESC; 

The trick this time is to avoid SELECT * until you have whittled it down to only 10 rows. Note that the proposed index contains the necessary Message_ID, so the inner queries run Using index and are clustered for efficiency. The middle query involves a tmp table and filesort, but those will be in a MEMORY table. The outer query will hit the disk for the 10 rows. Total disk hits: About 12. Very efficient.

(This trick should be added to my other big query; as it stands, it will make about 21 disk hits.)
