Sql-server – Scaling out SQL Server and syncing data across multiple machines

data synchronization, sharding, sql server

I don't have expertise in architecting databases, and I've been teaching myself new things every day. I'd like to build an Internet-scale application using SQL Server as the data store, but I haven't found any good information online about scaling out SQL Server.

My understanding is that scaling out is great for write throughput, but it doesn't necessarily scale reads. A simple example (which is relevant in my case): if data is sharded by posting user id, status 1 posted by user X living in shard A will have its likes and comments scattered across the whole federation. So, to fetch the comments on this status, I need to hit every database and then merge, sort, and filter the results in application memory. This is bad for the databases because every one of them is kept busy by every read, and bad for the web servers because they spend CPU and RAM post-processing the objects. Ideally, I'd like to write to one database and read from one database for maximum scalability.
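To make the cost concrete, here is a minimal sketch of that fan-out read. The in-memory `SHARDS` dictionary, the field names, and the function `comments_for_status` are hypothetical stand-ins for real databases and queries; the point is only that every shard is touched and the merge happens in the application.

```python
import heapq
from operator import itemgetter

# Hypothetical in-memory stand-ins for three shards. Each shard holds the
# comments written by the users who live there, ordered by timestamp.
SHARDS = {
    "A": [{"status_id": 1, "author": "x", "ts": 10, "text": "first"}],
    "B": [{"status_id": 1, "author": "y", "ts": 5, "text": "earlier"},
          {"status_id": 2, "author": "y", "ts": 7, "text": "other status"}],
    "C": [{"status_id": 1, "author": "z", "ts": 8, "text": "middle"}],
}

def comments_for_status(status_id):
    """Query every shard, then merge the per-shard results by timestamp."""
    per_shard = [
        [c for c in rows if c["status_id"] == status_id]  # one query per shard
        for rows in SHARDS.values()
    ]
    # heapq.merge assumes each input list is already sorted by the key.
    return list(heapq.merge(*per_shard, key=itemgetter("ts")))
```

With N shards, a single read costs N queries plus the merge, which is the overhead described above.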

Now, what I'm thinking of doing is, instead of sharding by posting user id, shard by receiving user id. So, if user X posts status 1, user Y living in shard B can insert a comment in shard A, and I can enforce a parent-child relationship between the status and the comment. User Z living in shard C can insert a like in shard A for the comment, so the comment and the like can constitute a parent-child relationship. The benefit of this approach is I query only one database to get all the comments and likes for a specific status rather than naively querying every single shard.
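A rough sketch of this routing rule, assuming a hypothetical `USER_SHARD` directory and in-memory dictionaries in place of real shard databases. Comments and likes are written to the shard of the status owner, so the parent-child checks and the read both stay on one shard.

```python
# Hypothetical directory of which shard each user lives in.
USER_SHARD = {"x": "A", "y": "B", "z": "C"}

# In-memory stand-ins for three shard databases.
SHARDS = {name: {"statuses": {}, "comments": {}, "likes": []} for name in "ABC"}

def shard_of(user):
    return SHARDS[USER_SHARD[user]]

def post_status(status_id, owner):
    shard_of(owner)["statuses"][status_id] = {"owner": owner}

def add_comment(comment_id, status_id, status_owner, author, text):
    # Route by the user who receives the comment, not by the author.
    shard = shard_of(status_owner)
    if status_id not in shard["statuses"]:  # parent must exist on this shard
        raise KeyError(f"status {status_id} not found")
    shard["comments"][comment_id] = {
        "status_id": status_id, "author": author, "text": text,
    }

def add_like(comment_id, status_owner, liker):
    shard = shard_of(status_owner)
    if comment_id not in shard["comments"]:  # parent must exist on this shard
        raise KeyError(f"comment {comment_id} not found")
    shard["likes"].append({"comment_id": comment_id, "liker": liker})

def read_status(status_id, status_owner):
    """Fetch all comments and likes for a status from a single shard."""
    shard = shard_of(status_owner)
    comments = {cid: c for cid, c in shard["comments"].items()
                if c["status_id"] == status_id}
    likes = [l for l in shard["likes"] if l["comment_id"] in comments]
    return comments, likes

post_status(1, "x")                      # user X lives in shard A
add_comment(100, 1, "x", "y", "nice")    # user Y (shard B) writes into shard A
add_like(100, "x", "z")                  # user Z (shard C) writes into shard A
```

The trade-off is that every write must know the owner of the parent object so that it can be routed to the right shard.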

However, I need to get results like "comments on status 1 by people who are male or 18+ years old". This is crucial functionality that I want to implement, and for it I still have to hit other databases to get information about the users. To eliminate this, I'm thinking of creating a sync group where one database (the hub) syncs all user deltas to all shards every 5 minutes. I'm okay with eventual consistency, though it has its own problems. For example, if a user deletes their account, then from the time the account is deleted to the time the delta is persisted to a shard, other users will not see the change and could add child objects to objects created by that user. This seems to me to be a data integrity issue.
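The hub-and-shard idea and its stale window can be sketched as follows. This is not how any SQL Server sync feature is implemented; the versioned delta log `HUB_LOG`, the `ShardReplica` class, and the field names are all assumptions made for illustration.

```python
# Hypothetical delta log kept by the hub: one entry per user change.
HUB_LOG = [
    {"version": 1, "user_id": "y", "gender": "male", "age": 17, "deleted": False},
    {"version": 2, "user_id": "z", "gender": "female", "age": 30, "deleted": False},
    {"version": 3, "user_id": "z", "gender": None, "age": None, "deleted": True},
]

class ShardReplica:
    """A shard's local, eventually consistent copy of the user table."""

    def __init__(self):
        self.users = {}
        self.last_version = 0

    def apply(self, deltas):
        for d in sorted(deltas, key=lambda d: d["version"]):
            if d["version"] <= self.last_version:
                continue  # already applied, so re-delivery is harmless
            if d["deleted"]:
                self.users.pop(d["user_id"], None)
            else:
                self.users[d["user_id"]] = {"gender": d["gender"], "age": d["age"]}
            self.last_version = d["version"]

def sync(hub_log, replica):
    """Pull only the deltas the replica has not seen yet."""
    replica.apply([d for d in hub_log if d["version"] > replica.last_version])

def filter_comments(comments, replica):
    """Comments by authors who are male or at least 18, using local data only."""
    kept = []
    for c in comments:
        user = replica.users.get(c["author"])
        if user and (user["gender"] == "male" or user["age"] >= 18):
            kept.append(c)
    return kept
```

Until a shard has applied the deletion delta (version 3 here), it still treats user `z` as a live account, which is exactly the window in which orphaned child objects can be created.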

I'm also aware of replication and caching to increase read throughput.

My question is, which approach should I pursue? If I choose the second one, will I have trouble syncing data across potentially hundreds or thousands of servers? Not to mention the hub is essentially a single point of failure.

Best Answer

Creating a scale-out database at Internet scale is a pretty huge step. You will face a lot of issues that are not critical on a single big database. From your notes I see that you understand some of the basic issues you face.

Since Microsoft has papers on using SQL Server for scale out, I suggest that you study those first. Your scale out strategy will need to take into account the database server you choose.

For Microsoft SQL Server you should first study: http://msdn.microsoft.com/en-us/library/aa479364.aspx

This paper discusses the decisions that you need to make and why they are important. It offers 5 SQL Server strategies for scaleout:

• Scalable Shared Databases

• Peer-to-Peer Replication

• Linked Servers

• Distributed Partitioned Views

• Data-Dependent Routing

As you go down the list, the strategies get more complicated, but they also provide more powerful ways of scaling out.
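As an illustration of the last strategy, data-dependent routing means the application itself decides which server holds a given piece of data. A minimal sketch, assuming a range-based shard map; the bounds, the shard names, and the function `shard_for` are hypothetical and are not taken from the paper.

```python
import bisect

# Hypothetical shard map: exclusive upper bound of each user-id range,
# paired by position with the shard that owns that range.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARD_NAMES = ["shard-a", "shard-b", "shard-c"]  # placeholder server names

def shard_for(user_id):
    """Return the shard that owns user_id according to the range map."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    idx = bisect.bisect_right(RANGE_UPPER_BOUNDS, user_id)
    if idx >= len(SHARD_NAMES):
        raise ValueError(f"no shard covers user_id {user_id}")
    return SHARD_NAMES[idx]
```

In a real system the map would live in a small, heavily cached configuration store so that ranges can be split or moved without redeploying the application.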

Related Solutions

Sql-server – How does one scale SQL Server 2008 or 2012

SQL Server doesn't scale out as such. It scales up.

There are three areas in which to do this, subject to edition limitations:

• CPU cores
• RAM
• Storage

And of course, use a higher edition, e.g. Enterprise.

SQL Server doesn't shard, and any such solution (you can research MySQL sharding solutions) adds complexity and overhead to a system.

Scaling up one server (plus standby nodes or a mirror) is usually quite straightforward: add RAM, SSDs, more disk volumes to spread IO, separate drives for tempdb and logs, etc.

Also, if you find SQL Server is CPU bound then it's usually poor design and/or indexes and/or poorly written queries unless you have a massive load.

Sql-server – SQL Server – Database per company. How to query across databases

I don't know if this will help, and this may be a little much for your situation, but here is the solution we use today.

All users are grouped into logical entities under a domain (something.com) umbrella. Our particular scenario required an additional layer of Domain->Company->Group->User break out. Not sure if you ever worked with Active Directory or domain trees, but it follows that logic.

Using the domain model, the server itself is at the top, presumably the client of your services would be the administrator. Each company flows like a branch from the server itself. Then flows into user-defined groups which contain users. Each company in this scenario would have a Domain Admin which can administer the xyz.com, abc.com, etc domains, but can't access any other domain.

Each of the service containers (the databases) uses a full trust model with the security database to provide services to the users contained in this security database (think OpenID for perspective). We use a home-brewed compiled C++ module for Apache 2.x to provide an application firewall and session security host.

On each run, the module "asks" the security database to record the page hit, produce a random cookie (a 64-character string) and a session cookie (two cookies in total), and authenticate the session. A new session or the same session is returned depending on whether the cookies match. The session key and anything else relevant is provided to the end-user UI application code over HTTP headers (so the firewalls could sit in front of Google Apps or AppEngine).
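The cookie handshake might look roughly like this. It is not our C++ module; the `SESSIONS` dictionary stands in for the security database, and the names `new_token` and `authenticate` are made up for the sketch.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits

# Stand-in for the security database: session id -> random cookie value.
SESSIONS = {}

def new_token(length=64):
    """Generate a random 64-character string from a CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def authenticate(session_id=None, random_cookie=None):
    """Return (session_id, random_cookie, is_new).

    The same session comes back only when both cookies match the stored
    pair; otherwise a fresh session is issued.
    """
    stored = SESSIONS.get(session_id)
    if (stored is not None and random_cookie is not None
            # Constant-time comparison to avoid leaking match length.
            and secrets.compare_digest(stored.encode(), random_cookie.encode())):
        return session_id, random_cookie, False
    sid, cookie = new_token(), new_token()
    SESSIONS[sid] = cookie
    return sid, cookie, True
```

A production version would also record the page hit and expire old sessions, which this sketch leaves out.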

Once the UI has the token, it can act on behalf of the user; the service database accepts this token and grants permissions directly to the user based on it. Our application provides the username as well, to match a unique user within the service database and grant extended permissions. Since we also use per-transaction random keys, the state cannot be cached.

This model supports isolation (UI developers are unable to eavesdrop), allows central security with a delegation aspect, and makes it possible to provide centralized services, such as a central forum, trending (statistically compiled) data stored centrally (which may be restricted to paying clients), or even weblogs, which we provide to each of our service databases to help gauge whether a checkout request is fraudulent.

    SERVERS TABLE Minimum of id (unique key), domain, adminuser
                |
                | -------> OPTIONAL Access Control List
                |
    SERVER USER CONTROL LIST Minimum of id (unique key), serverid, actual url you wish to protect. 
                OPTIONAL restrictions on AdminOnly, NoSearchEngines, Restricted (Authentication Required)
                |
                |
                | --------> OPTIONAL ENTITY (Sub-Company) Abstraction
                |
                | --------> OPTIONAL GROUPS
                |
                |
    USERS   Minimum of id (unique key), username, serverid (unless you use the entity container which are already tied to the server)
                |
                |
                |
    PERMISSIONS Minimum of id (unique key), server_ucl_id, groupid/entityid/userid (Depending on preference)

    SESSIONS    Minimum of id (unique key), serverid, userid if authenticated

    Provide the sessionid and either userid or username to tamper-resistant proxy code that will provide it to the end-user UI.
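For readers who prefer DDL to a diagram, the minimum tables above could be sketched like this. SQLite is used here only so the example runs anywhere; the column types, the `url` and `restricted` column names, and the helper `may_access` are assumptions, and the optional entity and group layers are omitted.

```python
import sqlite3

SCHEMA = """
CREATE TABLE servers (
    id        INTEGER PRIMARY KEY,
    domain    TEXT NOT NULL,
    adminuser TEXT NOT NULL
);
CREATE TABLE server_ucl (              -- server user control list
    id         INTEGER PRIMARY KEY,
    serverid   INTEGER NOT NULL REFERENCES servers(id),
    url        TEXT NOT NULL,          -- the URL being protected
    restricted INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    serverid INTEGER NOT NULL REFERENCES servers(id)
);
CREATE TABLE permissions (
    id            INTEGER PRIMARY KEY,
    server_ucl_id INTEGER NOT NULL REFERENCES server_ucl(id),
    userid        INTEGER NOT NULL REFERENCES users(id)
);
CREATE TABLE sessions (
    id       INTEGER PRIMARY KEY,
    serverid INTEGER NOT NULL REFERENCES servers(id),
    userid   INTEGER REFERENCES users(id)  -- NULL until authenticated
);
"""

def open_security_db():
    """Create an in-memory security database with the minimum tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def may_access(conn, userid, url):
    """True if the user has a permission row for the protected URL."""
    row = conn.execute(
        "SELECT 1 FROM permissions p "
        "JOIN server_ucl u ON u.id = p.server_ucl_id "
        "WHERE p.userid = ? AND u.url = ?",
        (userid, url),
    ).fetchone()
    return row is not None
```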

We have a lot of bells and whistles in our app to provide a more robust multi-tenant solution, but this is the basics of what we are doing.

The objective is to totally isolate each company into its own container. Proxy code of some sort should help. C++ is not required; you just need code that sits in front of the web application. Also, if you want to provide security in one database and shared services in another, your shared database, which presumably will also be under your client's control, would be allowed to query the security database directly to see whether the user is logged in, based on the tokens provided to it, such as when the user posts.

No error messages are shown to humans in this model. Any errors are instead pushed via some type of messaging queue, email, or log files. The authentication layer sits between the proxy code and the security database. User management lives within an application that you create for the end-user companies.

This model also scales well with SQL Azure in the middle if needed, as it does not need anything like CLR, FT, etc. For the security enthusiast, the only true weak point will always be the proxy code. Physical security and limited user access should lock that down.

Hope I wasn't too verbose and that this helps!
