SQL Server – Composite index performance with varchar columns

composite-types · index · performance · sql-server-2017 · varchar

I have large fact tables with composite non-clustered indexes, something like this:

NONCLUSTERED INDEX (OrderDate, OrderType, ClientKey, ItemKey, CustomerKey)

The columns ending with "Key" are varchar(50) columns, but they store strings of 20-30 characters.

Do I need to worry in the long term that the varchar columns in the index actually vary in length?
Will I get more page splits and slower performance when millions of rows are merged into these tables?
https://stackoverflow.com/questions/59667/what-are-the-use-cases-for-selecting-char-over-varchar-in-sql

Does the SQL Server engine have a harder time figuring out a plan using that index, since the columns vary in length?
I read that CHAR is better since it is fixed size and easier for the database engine to work with, and hence performs better.
Is that difference significant?

Is it worth hashing these "Key" columns individually with, say, the SHA2_256 algorithm, so they become fixed-size BINARY(32) values?

NONCLUSTERED INDEX (OrderDate, OrderType, HashedClientKey, HashedItemKey, HashedCustomerKey)
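
To make that concrete, a persisted computed column is roughly what I have in mind; the table and column names below are just placeholders, showing one of the three keys:

-- Hypothetical sketch: a persisted computed column holding the hash, with the index built over it
-- (HashedItemKey and HashedCustomerKey would be added the same way)
ALTER TABLE dbo.FactOrders
    ADD HashedClientKey AS CAST(HASHBYTES('SHA2_256', ClientKey) AS BINARY(32)) PERSISTED;

CREATE NONCLUSTERED INDEX IX_FactOrders_HashedKeys
    ON dbo.FactOrders (OrderDate, OrderType, HashedClientKey, HashedItemKey, HashedCustomerKey);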

I ran my own tests to answer at least the performance question, and I don't see any performance benefit with hash keys over varchar.
In fact, it is a big overhead to calculate the hashes and to store the extra 32 bytes next to each original "Key" column.

What I cannot test is how these varchar indexes will behave when I have millions of rows merging into the tables each day.

Which one would you choose from both a performance and a maintenance point of view?

Best Answer

While some RDBMSs might be tuned to interact better with fixed-length strings (I seem to recall that maybe DB2/MVS did better with fixed-length strings, but I learned this in 1996 and it was mainframe DB2, so not sure if that even applied to DB2 on Unix or Windows), I would be shocked if using CHAR(50) did anything but decrease performance as compared to VARCHAR(50), given that most values are only using 20-30 bytes. Using CHAR(50) takes up more space, which reduces the number of rows that fit on a data page, which increases the number of data pages, which takes up more space in the buffer pool (i.e. RAM) when querying the data, plus takes longer to back up and restore. AND, if I am not mistaken, it will also increase the memory grants (i.e. RAM requested per query), as it will request 50 bytes per field instead of 25 (I believe it requests 50% of the max size of variable-length fields). For a more thorough look at the downstream effects of using more space than you need, please see: Disk Is Cheap! ORLY?.
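
If you would rather measure this on your own data than take my word for it, you can compare the average index row size and page count of a CHAR(50) test copy against the VARCHAR(50) original; something along these lines (the table name is a placeholder):

-- Compare how wide the index rows actually are and how many pages they consume
SELECT i.name AS index_name, s.index_level, s.page_count, s.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.FactOrders'), NULL, NULL, 'DETAILED') AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id AND i.index_id = s.index_id;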

That being said, between VARCHAR(50) and CHAR(50), I would choose: neither! Given the names of the columns, I am guessing that these are actually dimensions? Filtering on binary/numeric values is much faster than filtering on strings (there is one exception that I will mention in a moment). Depending on how many potential values there will be over time for clients, customers, and items, I would use either INT or, if truly needed, BIGINT as the keys for the Client, Customer, and Item dimensions, and store those numeric key values here.
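
Sketching that out (the table and column names are illustrative, not a prescription): the dimension carries the 20-30 character business key once, and the fact table and its index only carry the small surrogate key.

-- One small surrogate-keyed dimension per "Key" column; Item and Customer would follow the same pattern
CREATE TABLE dbo.DimClient
(
    ClientId  INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    ClientKey VARCHAR(50) NOT NULL UNIQUE   -- the original 20-30 character value, stored once
);

-- The fact table then stores ClientId / ItemId / CustomerId, and the index becomes much narrower
CREATE NONCLUSTERED INDEX IX_FactOrders_Dims
    ON dbo.FactOrders (OrderDate, OrderType, ClientId, ItemId, CustomerId);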

At bare minimum, I would change the collation on the VARCHAR columns to Latin1_General_100_BIN2, since you clearly don't need case or accent insensitivity, else the idea of using a SHA-256 hash would not even be an option. That should at least filter almost as fast as an INT or BIGINT (I say "almost" only due to, again, the effect of storing 20-30 byte values for 3 columns instead of 4 or 8 byte values for those same 3 columns, when accounting for millions/billions of rows).
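
In case it helps, the change itself is a one-liner per column (placeholder names again), though any index that references the column has to be dropped and recreated around it:

-- Switch a key column to the binary collation; repeat for ItemKey and CustomerKey
ALTER TABLE dbo.FactOrders
    ALTER COLUMN ClientKey VARCHAR(50) COLLATE Latin1_General_100_BIN2 NOT NULL;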

I would strongly advise against using any type of hash value here. Not only would you be bloating the table (i.e. decreasing performance), but you would then lose any meaning of those values, and would hence either need to also store the non-hashed values for those 3 columns in the fact table (i.e. crazy) or require a lookup back to the dimension table (in which case you gained nothing over using INT or BIGINT at 4 or 8 bytes plus no calculation overhead!).