Sql-server – Scaling out SQL Server and syncing data across multiple machines

data synchronizationshardingsql server

I don't have expertise in architecting databases, and I've been teaching myself new stuff every day. I'd like to make an Internet-scale application using SQL Server as the data store. I haven't found any good information online with regards to scaling out SQL Server.

My understanding is that scaling out is great for write throughput, but it doesn't necessarily scale reads. A simple example (which is relevant in my case) is, if data is sharded by posting user id, status 1 posted by user X living in shard A will have all its likes and comments across the whole federation. So, if I need to fetch the comments on this status, I need to hit every database and merge and sort/filter results in application memory. This is bad for the databases because they are kept busy and bad for the web servers because I will be using CPU and RAM for post processing the objects. Ideally, I'd like to write to one database and read from one database for maximum scalability.

Now, what I'm thinking of doing is, instead of sharding by posting user id, shard by receiving user id. So, if user X posts status 1, user Y living in shard B can insert a comment in shard A, and I can enforce a parent-child relationship between the status and the comment. User Z living in shard C can insert a like in shard A for the comment, so the comment and the like can constitute a parent-child relationship. The benefit of this approach is I query only one database to get all the comments and likes for a specific status rather than naively querying every single shard.

However, I need to get results like "comments on status 1 by people who are male or 18+ years old". This is a crucial functionality I want to implement. I still have to hit other databases to get information about the users. In order to eliminate this, I'm thinking of creating a sync group where one database (hub) syncs all user deltas to all shards (every 5 minutes). I'm okay with eventual consistency though it has its own problems for example, if a user deletes their account, from the time the account is deleted to the time the delta is persisted to a shard, other users will not see the change potentially adding child objects to objects created by that user. This seems to me a data integrity issue.

I'm also aware of replication and caching to increase read throughput.

My question is, which approach should I pursue? If I choose the second one, will I have trouble syncing data across potentially hundreds or thousands of servers? Not to mention the hub is essentially a single point of failure.

Best Answer

Creating a Scale Out database at Internet scale is pretty huge step. You will face a lot of issues that are not critical on a single big database. From your notes I see that you understand some of the basic issues you face.

Since Microsoft has papers on using SQL Server for scale out, I suggest that you study those first. Your scale out strategy will need to take into account the database server you choose.

For Microsoft SQL Server you should first study: http://msdn.microsoft.com/en-us/library/aa479364.aspx

This paper discusses the decisions that you need to make and why they are important. It offers 5 SQL Server strategies for scaleout:

• Scalable Shared Databases

• Peer-to-Peer Replication

• Linked Servers

• Distributed Partitioned Views

• Data-Dependent Routing

As you go down the stack, things get more complicated, but also provide more powerful ways of scaling out.