Mysql – Sharding AWS MySQL data

MySQLsharding

I am working on a SaaS application using AWS MySQL.

The application is for multiple organizations. We use the same database and tables to store multiple org data. We store org_id in most of the tables and use this in select, update and delete queries as where org_id = ? .

DB Model

  • Organization

    org_id
    
  • User

    user_id
    org_id
    
  • client

    client_id
    org_id
    
  • client_contact

    client_id
    phone_number
    phone_type
    
  • project

    project_id
    org_id
    
  • user_project

    project_id
    user_id
    

The data is growing rapidly and we need to shard the data. We have customers from all over the world.

If I use a different sharding id for different table, then a join might happen across the nodes. For example, for user, sharding can be on shard_id and for client, sharding can be on client_id and for project, sharding can be on project_id. Here, the problem would be joining project and user. The user_id and project_id might be on different nodes and join will happen across nodes.

What is the best approach here? I am thinking of sharding based on org_id as I store org_id in most of the tables.

I see two problems here:

  1. A few child tables don't store org_id as the parent table is storing. Do I need to store the org_id in all the tables?
  2. Some orgs might have more data and load which might lead to a hot spot and more storage on a particular node. Is it possible scale a particular node alone with AWS RDS?

Please suggest the best approach.

Note Each org might have up to 500,000 records in any table with around 10 columns. We could reach more than 5,000 organizations in 6 months. 500,000 multiplied by 5,000 orgs will reach 2.5 billion records.

Best Answer

(Not yet an answer, but some more steps toward understanding the task enough to provide an answer.)

2.5 billion rows will fit on a single machine. (Perhaps 400GB, including ancillary tables.) Have you determined that the activity will exceed a single server's capabilities?

  1. Which table is the biggest? (Please list the 3 largest and their projected sizes. Also, please provide the full SHOW CREATE TABLE for each of them -- We need to consider shrinking their datatypes.)
  2. Which column(s) in that table seem reasonable for sharding?
  3. If you shard on that column, what other tables will also be split up?
  4. Ignoring for the moment, provide the 6 most common SELECTs with JOINs and multi-table Updates/Deletes.
  5. Which tables will be needed for JOINs? Are they small enough to simply maintain copies of them in each shard? How complex will that maintenance be?