SQL Server – Is it OK to create hundreds of databases in SQL Azure versus one big one and run the risk of deadlocking

azure-sql-database | scalability | sql-server

I need to create a highly scalable solution – where field devices at thousands of sites deliver data in real time to a back-end system – and SQL Azure seems to fit the bill nicely in terms of adding SQL databases and application servers.

Each field device effectively sends 400 sensor values every second for about two hours a day, and those same 400 sensor values every 5 minutes at all other times, indefinitely. Additionally, when an error occurs on a field device, it sends up the last minute's data for all 400 sensors as well (400 * 60 readings) – causing a mass flood of data whenever anything goes wrong.

I really want to design the system so that independent field devices and the data they store cannot affect other devices – so that no field device can degrade the performance of another.

I started the design thinking a single database would hold all the devices' data – but deadlocks started occurring when I simulated multiple site devices. Hence, I am in the process of moving to a multiple-database solution, where a master database holds a lookup table for all the devices and returns a connection string to the real database.
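The master-lookup idea can be sketched very simply. This is a minimal, hypothetical model – the dict stands in for a lookup table in the master SQL Azure database, and the device IDs, server names, and `connection_string_for` helper are all invented for illustration; in practice you would cache this map in the application tier rather than hit the master database on every request.

```python
# Hypothetical shard map: in production this would be a table in the master
# database, e.g. (device_id, connection_string), cached by the app servers.
DEVICE_SHARD_MAP = {
    "device-0001": "Server=tcp:shard01.database.windows.net;Database=Telemetry01;",
    "device-0002": "Server=tcp:shard02.database.windows.net;Database=Telemetry02;",
}

def connection_string_for(device_id: str) -> str:
    """Return the connection string for the database holding this device's data."""
    try:
        return DEVICE_SHARD_MAP[device_id]
    except KeyError:
        raise LookupError(f"Unknown device: {device_id}")
```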

At this stage of the project, it is most important that I am able to pass that data back to user interfaces running in web browsers in real time – updating their screens every second.

In future stages of the project it will be necessary to start aggregating data across multiple devices showing statistics such as sum of sensor X in region Y. I can see this will be hard to do with the multiple database approach.

So would value any advice e.g.

Do you think it is sensible to use SQL Azure to host potentially thousands of databases and to use a master database to indirectly point to the real ones?

Will I have a problem with connections to the databases from the applications – with issues around connection pooling, for example?

How will I be able to aggregate data from all these different databases in SQL Azure?

Would be interested in all your comments. Regards, Chris.

Best Answer

Since no one else has answered, I'll share some opinions and do some hand-waving.

As long as you aren't locking common resources, or you always lock resources in the same order, you shouldn't have problems with deadlocks.
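The "same order" rule is the classic deadlock-avoidance technique, and it is easy to illustrate outside of SQL. Here is a minimal sketch using Python locks as stand-ins for two rows (or tables) that a pair of transactions both touch; sorting by `id()` is just one arbitrary way to impose a single global acquisition order.

```python
import threading

# Two locks standing in for two shared resources (e.g. two rows both
# transactions must update).
lock_a = threading.Lock()
lock_b = threading.Lock()

def update_both(first, second):
    # Acquire in one fixed global order regardless of argument order, so two
    # concurrent callers can never each hold a lock the other is waiting on.
    ordered = sorted((first, second), key=id)
    for lock in ordered:
        lock.acquire()
    try:
        pass  # ... work that touches both resources ...
    finally:
        for lock in reversed(ordered):
            lock.release()
```

Without the `sorted(...)` step, two threads calling `update_both(lock_a, lock_b)` and `update_both(lock_b, lock_a)` could deadlock; with it, they cannot.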

I'd look at separate tables before separate databases. Each additional database will definitely cost you more, but additional tables won't necessarily cost you more. You might need to use more than one database because of the sheer volume of data you will store, or because of the rate at which you need to store your burst traffic. If you can manage it, I think that table-level granularity will be more flexible and possibly a good deal cheaper than starting with database-level granularity.

The problem with putting each device's data into its own tables is that it makes the reporting hard, since all of the table names will be different.

I presume that you have some way of detecting when you get a "failure resend" of data. You don't want to put the same value in a table twice, and I'm sure that the devices can fail (local power failure?) in ways that have nothing to do with whether or not earlier values were properly stored.

WAG: Assuming each "value" is 4 bytes, I calculated about 11.5 MB of collected data per device, per day. (This ignores all kinds of stuff, like device identifiers and timestamps, but I think it is OK as a rough estimate.) So, with "thousands" of sites, we are looking at tens of GB, per day. You don't mention any kind of lifetime on that data. The largest Azure database currently maxes out at 150 GB. You could fill those up pretty quickly.
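For what it's worth, that estimate is easy to reproduce. The burst traffic alone (1 reading per second for 2 hours) accounts for the ~11.5 MB; the slow-rate traffic (1 reading per 5 minutes for the other 22 hours) adds under half a megabyte. The 4-bytes-per-value figure is the same assumption as above.

```python
SENSORS = 400
BYTES_PER_VALUE = 4  # assumed, as above; ignores identifiers and timestamps

# Burst: 1 reading/sensor/second for 2 hours.
burst = SENSORS * BYTES_PER_VALUE * 2 * 3600            # 11,520,000 bytes

# Quiet period: 1 reading/sensor per 5 minutes for the remaining 22 hours.
quiet = SENSORS * BYTES_PER_VALUE * (22 * 3600 // 300)  # 422,400 bytes

per_device_per_day = burst + quiet                      # ~11.9 MB/device/day
```

So with, say, 2,000 devices you'd be collecting on the order of 24 GB per day – which is how a 150 GB database fills up in under a week.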

Getting anything to happen in a web browser in a short period of time is iffy. When you are reading from (possibly multiple) databases with GBs of data, continuously inserting lots of new data into the tables you are reading from and interacting with web servers across the open internet, "real time" is wishful thinking. IMO. "Fast enough" is the usual goal.

If you can't keep all of the data you need in a single report in one SQL Azure database, it's a problem. There are no linked servers or distributed views (at this point). There is no simple way to aggregate across many Azure databases. You'd have to pull all of the data to a central location and report from there. I'd guess that the aggregated data would be too large to store in a single SQL Azure database, so you'd have to go on-premise or maybe to EC2. A data mart or warehouse with a star-schema structure would be the classic answer there, but that takes significant processing time, and that means no "real time". Also, that's potentially a lot more data transfer from Azure to wherever it goes, and that will cost you.
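With no linked servers, cross-database aggregation has to happen in the application tier: run the same query against every shard and combine the partial results. This is only a sketch of the fan-out pattern – each callable below stands in for a per-shard query such as `SELECT SUM(value) FROM readings WHERE region = 'Y'`, and SUM happens to be easy because partial sums combine trivially (averages and percentiles do not).

```python
def aggregate_sum(shard_queries):
    """Fan out a per-shard SUM and combine the partial results.

    Each element of shard_queries is a zero-argument callable standing in
    for a query against one device database. SUM composes across shards:
    the total is just the sum of the per-shard sums.
    """
    return sum(query() for query in shard_queries)
```

In a real system you'd run the shard queries concurrently rather than serially, since the slowest shard dominates the total latency.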

I wouldn't commit to this strategy without a pilot program first. The first thing to do would be to build a single instance, see how it handles insertions at your expected rates, and see how the reporting might work. Can it handle 400 sensor values a second? How about 4,000? Is the incoming data a series of rows, a big denormalized row, an XML document, or something else? The format of the incoming data will affect how fast it can be stored. Can you do bulk inserts, or does it have to be row-by-row? It's possible that a single SQL Azure instance might not be able to store that much that quickly. And I'd talk to Microsoft too – just dealing with the billing for hundreds or thousands of separate databases might be quirky.
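On the bulk-insert question: if row-by-row inserts can't keep up, the usual move is to buffer incoming readings and flush them in batches. A minimal batching sketch (the batch size of 1,000 is a guess you'd tune in the pilot, and the flush would be a bulk insert or table-valued parameter call in practice):

```python
def batches(rows, size=1000):
    """Group incoming readings into fixed-size batches for bulk insert.

    Yields lists of up to `size` rows; the final batch may be smaller.
    Batch size is a tuning knob, not a recommendation.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush any trailing partial batch
        yield batch
```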

I don't know if this is applicable to you, but have you looked at Microsoft's "Stream Insight" product? It seems to be aimed at situations like yours. Caveat: I've never used it.

The marketing blurb: "Effectively analyze large amounts of event data streaming in from multiple sources. Derive insights from critical information in near real time by using Microsoft StreamInsight. Monitor, analyze, and act on data in motion and make informed decisions almost instantaneously."

While doing some quick googling, I noticed a blog post which states that StreamInsight was available on SQL Azure as a CTP last year. It might be ready for prime time by now.

Good luck, it sounds like an interesting project.