SQL Server database sharding – what to do with common data / non-sharded data


We have a very large, enterprise-level database. As part of our business model, all web users hit our web servers at the same time each month, which in turn hammers our SQL box. The traffic is very heavy and keeps growing as the company grows. SQL proc optimization has been performed, and the hardware has already been scaled up to a very high level.

We are looking to shard the database now to ensure that we can handle company growth and future loads.

We have decided which particular data should be sharded: it is a highly utilized subset of our database.

However, my question is regarding the non-sharded data, which is common/universal. Examples of this kind of data might be an Inventory table, an Employee table, a User table, etc.

I see two options to handle this common/universal data:

1) Design 1 – Place the common/universal data in an external database. All writes will occur there. This data will then be replicated down to each shard, allowing each shard to read it and inner join to it in T-SQL procs (a sketch of this read path follows these two designs).

2) Design 2 – Give each shard its own copy of all common/universal data. Let each shard write locally to these tables and use SQL Server merge replication to update/sync this data across all other shards.
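To make design #1's read path concrete, here is a minimal sketch of a shard-side proc, assuming hypothetical dbo.Orders (sharded, written locally) and dbo.Employee (common, replicated down as a read-only copy) tables:

```sql
-- Minimal sketch of the design #1 read path on a shard. All object
-- names are illustrative, not from the actual system.
CREATE PROCEDURE dbo.GetOrdersWithEmployee
    @CustomerID int
AS
BEGIN
    SET NOCOUNT ON;

    SELECT o.OrderID,
           o.OrderDate,
           e.FirstName,
           e.LastName
    FROM  dbo.Orders AS o          -- sharded table: writes happen here
    INNER JOIN dbo.Employee AS e   -- common table: written only in the universal DB
        ON e.EmployeeID = o.EmployeeID
    WHERE o.CustomerID = @CustomerID;
END;
```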

Concerns about design #1

1) Transactional issues: if you have a situation in which you must write or update data in a shard and then write/update a common/universal table within one stored proc, for instance, you will no longer be able to do this easily. The data now lives on separate SQL instances and databases. You may need to involve MSDTC to see if you can wrap these writes into a distributed transaction, since they span databases (see the sketch after this list). Performance is a concern here, and rewrites may be needed for procs that write to both sharded and common data.

2) A loss of referential integrity. It is not possible to enforce cross-database referential integrity.

3) Recoding large areas of the system so that it knows to write common data to the new universal database but read common data from the shards.

4) Increased database trips. Like #1 above, when you run into a situation in which you must update both sharded data and common data, you will make multiple round trips to accomplish it, since the data now lives in separate databases. There is some network latency here, but I am not as worried about this issue as about the three above.
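For concern #1, here is a rough sketch of what such a cross-instance write might look like once the common data moves off the shard. The linked server (CommonSrv) and all object names are assumptions for illustration; BEGIN DISTRIBUTED TRANSACTION enlists MSDTC:

```sql
-- Hypothetical sketch: one logical write that now spans two instances.
-- CommonSrv is an assumed linked server pointing at the universal database.
DECLARE @OrderID int = 42,
        @ItemID  int = 7;

BEGIN DISTRIBUTED TRANSACTION;  -- MSDTC coordinates both instances

    UPDATE dbo.Orders                        -- sharded data, local shard
    SET    Status = 'Shipped'
    WHERE  OrderID = @OrderID;

    UPDATE CommonSrv.CommonDB.dbo.Inventory  -- common data, remote instance
    SET    QuantityOnHand = QuantityOnHand - 1
    WHERE  ItemID = @ItemID;

COMMIT TRANSACTION;
```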

Concerns about design #2

In design #2, each shard gets its own copy of all common/universal data. This means that all code that joins to or updates common data continues to work just as it does today, and very little recoding/rewriting is needed from the development team. However, this design depends entirely on merge replication to keep the data in sync across all shards. Our DBAs are highly skilled, and they are very concerned that merge replication may not be able to handle this, and that should merge replication fail, recovery from that failure is painful and could impact us very negatively.

I am curious to know if anyone has gone with design option #2. I am also curious to know whether I am overlooking a third or fourth design option that I do not see.

Thank you in advance.

Best Answer

Your question focused on this:

However, my question is regarding the non-sharded data, which is common/universal. Examples of this kind of data might be an Inventory table, an Employee table, a User table, etc.

When you're doing sharding, and you have data that all of the shards need to see, you have to classify that data with a few attributes:

Does it change frequently? In your examples, you listed Inventory, Employee, and User. Typically inventory changes very fast, but Employee records change only periodically (say, a few hundred updates per day).

How much delay can each shard tolerate? Even though the Inventory may constantly be changing, you can typically tolerate a large amount of delay (minutes or even hours) on a table like that. If you're selling unique items with a very limited quantity that you can never restock (think original artworks), then you don't shard that data at all - you only query the original database. However, in most online stores, you're not selling out of every item every day, and you're going to restock things quickly anyway, so you don't really need up-to-the-millisecond counts of inventory. In fact, in most cases, you only need an In-Stock flag that's either 0 or 1, and a central process updates that flag. That way, you don't have to push every up/down bump of item count out to every shard. Employee or User data, on the other hand, may need to be pushed out to every shard nearly instantaneously.
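For instance, the central In-Stock flag process might boil down to something like this (the dbo.Inventory table and its columns are assumptions, not from the original post):

```sql
-- Hypothetical central job: collapse raw quantities into a 0/1 flag,
-- touching only rows whose flag actually changes, so replication pushes
-- far fewer updates out to the shards than raw count bumps would.
UPDATE dbo.Inventory
SET    InStock = CASE WHEN QuantityOnHand > 0 THEN 1 ELSE 0 END
WHERE  InStock <> CASE WHEN QuantityOnHand > 0 THEN 1 ELSE 0 END;
```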

Will you be joining from the sharded tables to the non-sharded ones? Ideally, the answer here is no - you should make two separate queries to get the data, and then join them on the app side. This gets a lot harder from an app perspective, but it gives you the capability to get the freshest data from each source.
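As a sketch of that app-side join, the one cross-database query becomes two independent round trips; all names here are hypothetical:

```sql
-- Query 1, run against the shard:
DECLARE @CustomerID int = 42;

SELECT OrderID, EmployeeID, OrderTotal
FROM   dbo.Orders
WHERE  CustomerID = @CustomerID;

-- Query 2, run separately against the common database, with an
-- EmployeeID collected from query 1 and passed in by the app:
DECLARE @EmployeeID int = 7;

SELECT EmployeeID, FirstName, LastName
FROM   dbo.Employee
WHERE  EmployeeID = @EmployeeID;

-- The app then matches the two result sets on EmployeeID in memory.
```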

Is this original data, or copied? Another way to think of this question: what do you need to back up, and how frequently? Typically in a high-volume sharding environment, you want the backups to be as fast and as small as possible. (After all, you need to protect each node, and you want all of the shards to fail over to DR at the same point in time - not have some shards with newer data than others.) This means the sharded data and the non-sharded data should be in completely separate databases - even if they're on the same server. I may need constant transaction log backups of my sharded (original) data, but I may not need to back up the non-sharded data at all. It's probably easier for me to just refresh my Employees or Users table from the single source of truth rather than back it up on every shard. If all of my data is in a single database, though, I lose that capability.
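Under that separation, the backup split might be sketched like this (database names and paths are placeholders):

```sql
-- The sharded (original) database gets full plus frequent log backups:
BACKUP DATABASE Shard01
    TO DISK = N'X:\Backups\Shard01_full.bak';

BACKUP LOG Shard01
    TO DISK = N'X:\Backups\Shard01_log.trn';

-- No BACKUP statements for the replicated common database: refresh it
-- from the single source of truth instead of restoring it.
```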

Now, about your concerns:

"Transactional issues...you will no longer be able to do this easily." Correct. In sharded scenarios, throw the concept of a transaction out the window. It gets worse, too - for the sharded data, you could have one shard up and online, and another shard down temporarily due to a cluster instance failover or restart. You need to plan for failure of any part of the system, at any time.

"Not possible to do cross database referential integrity." Correct. When you split a single table out across multiple servers, you're putting your big boy pants on and telling the database server that you're taking over for tough tasks like point-in-time backups, relationships between tables, and combining data from multiple sources. It's on you and your code now.

"Recoding large areas of the system so that it knows to write common data to the new universal database but read common data from the shards." Correct here as well. There's no easy button for this, but once you've built this into the app, you're able to scale like crazy. I'd argue that the easier way to do this is to split the app's connections by reads.

"increased database trips." - Yes, if you break the data into multiple servers, the app is going to have to reach out to the network more. The key is to implement caching as well so that some of this data can be stored in lower-cost, higher-throughput, lock-free systems. The fastest query is the one you never make.

I've also laid out more pros and cons to dividing up multi-tenant databases here, such as performance tuning on individual shards, different backup/recovery strategies per shard, and schema deployment challenges.