Sql-server – Columnstore index on multi-tenant database

columnstoreindexindex-tuningsql server

I have a sql azure database that is used only for analytics.

It is constantly being updated with data pushed from the OLTP system.

The OLTP inserts/updates/deletes are pushed via webhooks to a service that propagates the changes into the analytics db in real time.

As a result, we have a db that must sustain both a OLTP and analytics work loads, althought the analytics queries performance is more important.

I haven't used columnstore indexes before, but after doing minimal research it seems like this type of payload can be good fit for nonclustered columnstore indexes. The clustered index would stay rowstore, for fast insert/update/delete.

But, I have an extra requirement that I haven't seem mentioned: multi-tenancy.
When users use our analytics app, they are always associated with a single tenant id, and the analytics queries will always have a WHERE clause on the tenant id. The tenant id column is highly selective. The biggest tenant makes up for less than 5% of all rows and most tenants make up for less than 1% of rows.

Is it still a good idea to use NCCI indexes for multitenants database? If so, should I set up my index in a particular way?

Best Answer

Is it still a good idea to use NCCI indexes for multitenants database?

If the alternative is to have a data structure that efficiently filters by tenant, then no.

If TenantId is the leading column in your clustered index, then single-tenant queries will be pretty fast. The NCCI row groups will not be segregated by tenant, and so all the row groups will need to be scanned to find rows for a single tenant.

As rows are inserted, they are inserted into the "middle" of the clustered index, typically at the "end" of the rows for the tenant. But rows are always inserted into the delta store for an CCI/NCCI. And whenever the delta store has 1 million rows in it, it's rebuilt into a columnar row group. So each row group will end up with the last million rows inserted across all tenants.

You can fix this with partitioning by TenantID, which could give each tenant (or group of tenants) a separate physical table (or NCCI). Each partition will have its own delta store, which will fill up with single-tenant rows. If you do partition, each partition might end up too small for a Columnstore to be useful. You'll want a few million rows per partition at least.

In general with multi-tenant data, you want to avoid interleaving tenant data in a way that requires you to read all tenants data to retrieve a single tenant's data.

Related Solutions

Sql-server – index maintenance strategy if few insertion points relative to # of rows

The answer by ThomasStringer is very good and the usage of partitioning here would likely help overall performance and decrease your maintenance cost - however, it won't do anything for your page split scenario.

Can you verify that your FK's are trusted? This won't decrease or impact your page splits but it will help your query execution time.

select 'ALTER TABLE ' + s.name + '.' + OBJECT_NAME(o.object_id) + ' WITH CHECK CHECK     CONSTRAINT ' + i.name + '
GO'
FROM sys.foreign_keys i  INNER JOIN sys.objects o ON i.parent_object_id = o.object_id     INNER JOIN sys.schemas s ON o.schema_id = s.schema_id WHERE     i.is_not_trusted = 1 AND i.is_not_for_replication = 0
-- Foreign Key Check... Will build the statement below to CHECK (validate) the FK for     those FK's that are untrusted.
UNION
select 'ALTER TABLE ' +  s.name + '.' + OBJECT_NAME(o.object_id) + ' WITH CHECK CHECK     CONSTRAINT ' + i.name + '
GO'
from sys.check_constraints i join sys.objects o on i.parent_object_id = o.object_id     join sys.schemas s on o.schema_id = s.schema_id where i.is_not_trusted =     1 and i.is_not_for_replication = 0

-- Check Constraint Check... Will build the statement below to CHECK (validate) the       check constraints for those check constraints that are untrusted.

-- !!!! -- The output (in TEXT) will look similar to what you see below.  once you     have the output, put it into the query window and execute it.

Another idea you already hit on, using fill factor - if you use a low fill factor and combine that with either row or page level compression, your data density, per page will increase. Again, this won't help your page split problem, BUT it can decrease the quantity of page splits (due to having a higher data density per page).

Lastly, if you can, look at changing your index and lead with the log_time column instead of the foreign key column. This change, depending on how the log_time data comes in (hopefully it's more "in order" than the FK column). This could decrease your page splits significantly. If you pair this with both compression (ROW) and a good partition scheme, you might see some significant improvements.

Postgresql – Multicolumn index and performance

Answer

Since you refer to the website use-the-index-luke.com, consider the chapter:

Use The Index, Luke › The Where Clause › Searching For Ranges › Greater, Less and BETWEEN

It has an example that matches your situation perfectly (two-column index, one is tested for equality, the other for range), explains (with more of those nice index graphics) why @ypercube's advice is accurate and sums it up:

Rule of thumb: index for equality first — then for ranges.

Also good for just one column?

What to do for queries on just one column seems to be clear. More details and benchmarks concerning that under these related question:

Less selective column first?

Apart from that, what if you have only equality conditions for both columns?

It doesn't matter. Put the column first that is more likely to receive conditions of its own, which actually matters.

Consider this demo, or reproduce it yourself. I create a simple table of two columns with 100k rows. One with very few, the other one with lots of distinct values:

CREATE TEMP TABLE t AS
SELECT (random() * 10000)::int AS lots
     , (random() * 4)::int     AS few
FROM generate_series (1, 100000);

DELETE FROM t WHERE random() > 0.9;  -- create some dead tuples, more "real-life"

ANALYZE t;

SELECT count(distinct lots)   -- 9999
     , count(distinct few)    --    5
FROM   t;

Query:

SELECT *
FROM   t
WHERE  lots = 2345
AND    few = 2;

EXPLAIN ANALYZE output (Best of 10 to exclude caching effects):

Seq Scan on t  (cost=0.00..5840.84 rows=2 width=8)
               (actual time=5.646..15.535 rows=2 loops=1)
  Filter: ((lots = 2345) AND (few = 2))
  Buffers: local hit=443
Total runtime: 15.557 ms

Add index, retest:

CREATE INDEX t_lf_idx ON t(lots, few);

Index Scan using t_lf_idx on t  (cost=0.00..3.76 rows=2 width=8)
                                (actual time=0.008..0.011 rows=2 loops=1)
  Index Cond: ((lots = 2345) AND (few = 2))
  Buffers: local hit=4
Total runtime: 0.027 ms

Add other index, retest:

DROP INDEX t_lf_idx;
CREATE INDEX t_fl_idx  ON t(few, lots);

Index Scan using t_fl_idx on t  (cost=0.00..3.74 rows=2 width=8)
                                (actual time=0.007..0.011 rows=2 loops=1)
  Index Cond: ((few = 2) AND (lots = 2345))
  Buffers: local hit=4
Total runtime: 0.027 ms