Index table for horizontal shards

Tags: index, sharding

Let's assume we have a varying number of clients, each of whom manages some data on our site. Each client has their own table used to store that data, for instance: client_1, client_2, …, client_N. The tables contain different numbers of records, and some clients may have more than a hundred thousand records in their table.

The task is to build an aggregate search for the site that displays results from all of these tables for a given search keyword. The problem is that this requires up to N queries against N tables, which is not great.
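For concreteness, a naive fan-out search over the client tables would look roughly like the sketch below; the id and title columns are assumptions for illustration, not part of the original setup.

    -- Assumed schema: each client_X table holds (id, title, ...).
    -- A naive aggregate search has to touch every client table:
    SELECT 1 AS client_id, id, title FROM client_1 WHERE title LIKE '%keyword%'
    UNION ALL
    SELECT 2 AS client_id, id, title FROM client_2 WHERE title LIKE '%keyword%'
    UNION ALL
    SELECT 3 AS client_id, id, title FROM client_3 WHERE title LIKE '%keyword%';
    -- ...and so on up to client_N, so the statement (or the number of
    -- separate queries) grows linearly with the number of clients.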

The idea is to create an index table that collects the searchable information from all client tables and is kept up to date by insert/update/delete triggers on each client table.
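As a rough illustration of that idea, here is a minimal sketch in MySQL-style SQL, using the same hypothetical (id, title) columns as above; the search_index table and trigger names are made up for the example.

    -- Central index table covering all client tables.
    -- (All names here are illustrative; adapt them to the real schema.)
    CREATE TABLE search_index (
        client_id INT NOT NULL,
        record_id INT NOT NULL,
        title     VARCHAR(255),
        PRIMARY KEY (client_id, record_id),
        KEY idx_title (title)
    );

    -- One set of triggers per client table keeps the index in sync.
    CREATE TRIGGER client_1_after_insert AFTER INSERT ON client_1
    FOR EACH ROW
        INSERT INTO search_index (client_id, record_id, title)
        VALUES (1, NEW.id, NEW.title);

    CREATE TRIGGER client_1_after_update AFTER UPDATE ON client_1
    FOR EACH ROW
        UPDATE search_index SET title = NEW.title
        WHERE client_id = 1 AND record_id = NEW.id;

    CREATE TRIGGER client_1_after_delete AFTER DELETE ON client_1
    FOR EACH ROW
        DELETE FROM search_index
        WHERE client_id = 1 AND record_id = OLD.id;

The aggregate search then becomes a single query against search_index (e.g. SELECT client_id, record_id, title FROM search_index WHERE title LIKE '%keyword%'), at the cost of extra write work on every change to a client table.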

I feel uncomfortable with this "index table" idea, because it could contain a huge number of records and add significant overhead to our database. Is there another approach to building an index for a horizontal shard?

Best Answer

There are many general solutions to the problem. Here are three that quickly come to mind.

  1. SSAS
  2. VoltDB
  3. Map Reduce Framework

What you're describing is what would historically fall under an OLAP solution. If you can tolerate some lag in your data, then a solution such as SQL Server Analysis Services (SSAS) could work. You could read data from these tables, perhaps via a secondary reporting server so as not to affect your operational database, and then perform the aggregations and processing you described.
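Independent of the specific tool, the underlying pattern is a periodically refreshed, pre-aggregated copy of the data on a reporting server. A plain-SQL sketch of that pattern follows; the reporting_search table and the client_X columns are assumptions for the example, and SSAS itself would implement this quite differently.

    -- Run on the reporting replica on a schedule (e.g. hourly or nightly).
    -- Table and column names are assumptions for this sketch.
    TRUNCATE TABLE reporting_search;

    INSERT INTO reporting_search (client_id, record_id, title)
    SELECT 1, id, title FROM client_1
    UNION ALL
    SELECT 2, id, title FROM client_2
    UNION ALL
    SELECT 3, id, title FROM client_3;
    -- ...up to client_N. Searches then read one table, and the data is
    -- only as stale as the last refresh.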

However, this problem also has some newer solutions from the Big Data realm, with a specific focus on "Velocity" rather than "Volume" of data. Velocity can mean either processing relatively static data quickly or processing dynamically changing data that is actively being ingested by the DB.

A second option is to let a contemporary "NewSQL" database engine, such as VoltDB, do this for you. VoltDB handles much of this "indexing", as you referred to it, in real time, as opposed to relying on pre-aggregated data that may not be current. There are always performance tradeoffs when aggregating across many tables rather than a single table, but in my experience an in-memory DB engine such as VoltDB can beat traditional OLAP solutions on many metrics, including processing speed and manageability.

Another option is to use the MapReduce framework that many DB engines support today. This implies that you're happy with offline processing of the problem/query as opposed to real-time processing.

This isn't an exhaustive list, but hopefully it gives you some other ideas. I know that the triggers you're describing can easily spiral out of control due to hotspots in your tables, especially if your SQL is not optimized. The one recommendation I will strongly make is to get this processing off your operational server and run it on a secondary replica/reporting server.