Sql-server – What’s the anatothe of a columnstore index

columnstoreindexsql serversql-server-2012

One of the new features in SQL Server 2012 codenamed Denali is the Columnstore index.

I know a good bit about regular old row-store indexes, like the b-tree structure, differences in storage between the leaf level and b-tree pages, affects of included fields, optimizing to use them, order of keys, etc.

I'm having difficulty getting any good info on the internals of a columnstore index.

  • How is it structured?
  • Is there a b-tree? Some other structure in place?
  • How is data organized?
  • What sorts of specific operators are best suited to use it?
  • Any other anti-patterns to be avoided when using them?

A lot of what I can find out about them are basically the exact opposite of a "normal" index, i.e. no ordering of keys, no included fields, nonclustered ONLY.

Any insights are appreciated.

Best Answer

Columnstore Structure

Columnstore data is physically stored in one or more segments (regular LOB allocation units) per column, and may also be partitioned in the usual way. Each segment contains roughly one million rows of highly-compressed values or value references (several compression techniques are available). A value reference links to an entry in one of up to two hash dictionaries.

Dictionaries are pinned in memory during query execution, with data value IDs from the segment being looked in the dictionary whenever execution requires the actual data value (this lookup is deferred for as long as possible for performance reasons).

Segments also have a header record containing metadata such as the minimum and maximum values stored in the segment. Information from the header can often be used to eliminate complete partitions from processing at execution time. Header record information is stored in the usual LOB data root structure, so eliminating a segment means the Storage Engine can skip reading the LOB data pages from physical storage entirely. Maximizing the potential for elimination can require careful design, including a dependency on the clustered index order at the time the Columnstore index is built.

Specific Plan Operators

SQL Server 2012 introduces a new execution mode called Batch Mode. In this mode, packets of roughly 1000 rows are passed between operators, significantly improving processor utilization efficiency. Within each packet, columnar data is represented as a vector. Not all plan operators support batch mode operation, but examples of those that do include Columnstore Index Scan, Hash Inner Join, Batch Hash Table Build, Bitmap Filter, Hash Aggregate (not scalar aggregates), Filter, and Compute Scalar (for projection and expression evaluation). Query execution plans have been enhanced to show estimated and actual execution mode.

Anti-Patterns

There are a large number of restrictions in the first release, including constraints on allowable data types. Most common types are supported; unsupported data types include DECIMAL with a precision greater than 18 digits, (N)VARCHAR(MAX), UNIQUEIDENTIFIER, CLR types, and (VAR)BINARY.

Use of string types, OUTER JOIN, IN, EXISTS, NOT IN, OR, UNION ALL may result in significantly reduced performance (Row Mode execution), unless workarounds are employed that typically involve unusual syntax rewrites as shown in the linked articles in this section.

More Information

Remus Rusanu has blogged a great overview here.