Index table for horizontal shards

Tags: index, sharding

Let's assume we have a varying number of clients, each of whom manages some data on our site. Each client has their own table used to store that data, for instance: client_1, client_2, …, client_N. The tables contain different numbers of records, and some clients may have more than a hundred thousand records in their table.

The task is to build an aggregate search for the site that displays results from all of these tables for a given search keyword. The problem is that this requires up to N queries against N tables, which is not great.
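For concreteness, a naive fan-out search over the client tables would look roughly like the sketch below; the id and title columns are assumptions for illustration, not part of the original setup.

    -- Assumed schema: each client_X table holds (id, title, ...).
    -- A naive aggregate search has to touch every client table:
    SELECT 1 AS client_id, id, title FROM client_1 WHERE title LIKE '%keyword%'
    UNION ALL
    SELECT 2 AS client_id, id, title FROM client_2 WHERE title LIKE '%keyword%'
    UNION ALL
    SELECT 3 AS client_id, id, title FROM client_3 WHERE title LIKE '%keyword%';
    -- ...and so on up to client_N, so the statement (or the number of
    -- separate queries) grows linearly with the number of clients.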

The idea is to create an index table that collects the searchable information from all client tables and is kept up to date by insert/update/delete triggers on each client table.
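As a rough illustration of that idea, here is a minimal sketch in MySQL-style SQL, using the same hypothetical (id, title) columns as above; the search_index table and trigger names are made up for the example.

    -- Central index table covering all client tables.
    -- (All names here are illustrative; adapt them to the real schema.)
    CREATE TABLE search_index (
        client_id INT NOT NULL,
        record_id INT NOT NULL,
        title     VARCHAR(255),
        PRIMARY KEY (client_id, record_id),
        KEY idx_title (title)
    );

    -- One set of triggers per client table keeps the index in sync.
    CREATE TRIGGER client_1_after_insert AFTER INSERT ON client_1
    FOR EACH ROW
        INSERT INTO search_index (client_id, record_id, title)
        VALUES (1, NEW.id, NEW.title);

    CREATE TRIGGER client_1_after_update AFTER UPDATE ON client_1
    FOR EACH ROW
        UPDATE search_index SET title = NEW.title
        WHERE client_id = 1 AND record_id = NEW.id;

    CREATE TRIGGER client_1_after_delete AFTER DELETE ON client_1
    FOR EACH ROW
        DELETE FROM search_index
        WHERE client_id = 1 AND record_id = OLD.id;

The aggregate search then becomes a single query against search_index (e.g. SELECT client_id, record_id, title FROM search_index WHERE title LIKE '%keyword%'), at the cost of extra write work on every change to a client table.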

I feel uncomfortable with this "index table" idea, because it could contain a huge number of records and add significant overhead to our database. Is there another approach to building an index for a horizontal shard?

Best Answer

There are many general solutions to the problem. Here are three that quickly come to mind.

  1. SSAS
  2. VoltDB
  3. Map Reduce Framework

What you're describing is what would historically fall under an OLAP solution. If you can tolerate some lag in your data, then a solution such as SQL Server Analysis Services (SSAS) could work. You could read data from these tables, perhaps via a secondary reporting server so as not to affect your operational database, and then perform the aggregations and processing you described.
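Independent of the specific tool, the underlying pattern is a periodically refreshed, pre-aggregated copy of the data on a reporting server. A plain-SQL sketch of that pattern follows; the reporting_search table and the client_X columns are assumptions for the example, and SSAS itself would implement this quite differently.

    -- Run on the reporting replica on a schedule (e.g. hourly or nightly).
    -- Table and column names are assumptions for this sketch.
    TRUNCATE TABLE reporting_search;

    INSERT INTO reporting_search (client_id, record_id, title)
    SELECT 1, id, title FROM client_1
    UNION ALL
    SELECT 2, id, title FROM client_2
    UNION ALL
    SELECT 3, id, title FROM client_3;
    -- ...up to client_N. Searches then read one table, and the data is
    -- only as stale as the last refresh.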

However, this problem also has some newer solutions from the Big Data realm, with a specific focus on "Velocity" rather than "Volume" of data. Velocity can mean either processing relatively static data quickly or processing dynamically changing data that is actively being ingested by the DB.

A second option is to let a contemporary "NewSQL" database engine, such as VoltDB, do this for you. VoltDB handles much of this "indexing", as you referred to it, in real time, as opposed to relying on pre-aggregated data that may not be current. There are always performance tradeoffs when aggregating across many tables rather than a single table, but in my experience an in-memory DB engine such as VoltDB can beat traditional OLAP solutions on many metrics, including processing speed and manageability.

Another option is to use the MapReduce framework that many DB engines support today. This implies that you're happy with offline processing of the problem/query as opposed to real-time processing.

This isn't an exhaustive list, but hopefully it gives you some other ideas. I know that the triggers you're describing can easily spiral out of control due to hotspots in your tables, especially if your SQL is not optimized. The one recommendation I will strongly make is to get this processing off your operational server and run it on a secondary replica/reporting server.