MySQL boolean field search vs indexed varchar performance

MySQLmysql-5.7performance

Will a select filtering on a boolean field be faster than a select filtering on an indexed varchar(255) field?

Specifically:
MySQL 5.7

In this case, records of a certain table should never be deleted. Instead they should be marked as inactive if necessary.

A boolean 'is_active' field could be added to that table, or an existing 'status' indexed varchar field could be used (conceptually overloaded) to also store the value 'inactive' if necessary. The varchar is specified as 255 max length, but realistically it is much smaller.

In this case, 90+% of the rows will be considered active, so the boolean is_active would be 0, or alternately the status would have some value other than 'inactive'; and most searches would be for is_active = 1 (ignoring inactive rows).

I have read that there is little value in indexing boolean fields, so whether the boolean option were indexed or not probably wouldn't be a factor (but if it is, please enlighten me!)

Best Answer

Short answer for bool vs varchar: Probably not much diff, but it may depend.

Short answer for bool that is true 90% of the time: The index is useless.

Long answer...

What percentage of the rows are "is_active"? If it is more than about 20%, the index won't be used!.

Use reasonable limits on VARCHAR. Even without that, 'inactive' is only 8 characters, not 255. So, WHERE status = 'inactive' is not much slower than WHERE is_active.

Is that the entire WHERE clause? I ask because Optimization requires looking at the entire WHERE clause. For that matter, then entire SELECT can be important.

Often the query involves

WHERE is_active
  AND other-stuff

In which case, this may be very beneficial, even with 90% being active:

INDEX(is_active, other-column)

But, again, I need to see the details before giving a definitive answer.

And, no, the poor cardinality of is_active is not relevant in a composite index.

TRADEOFF #1

Obviously, VARCHAR holds the advantage since variable-length data would produce smaller rows and, thus, smaller physical files.

TRADEOFF #2

Since CHAR fields require less string manipulation because of fixed field widths, index lookups against CHAR field are on average 20% faster than that of VARCHAR fields. This is not any conjecture on my part. The book MySQL Database Design and Tuning performed something marvelous on a MyISAM table to prove this. The example in the book did something like the following:

ALTER TABLE tblname ROW_FORMAT=FIXED;

This directive forces are VARCHARs to behave as CHARs. I did this at my previous job back in 2007 and took a 300GB table and sped up index lookups by 20%, without changing anything else. It worked as published. However, it did produce a table almost double in size, but that simply goes back to tradeoff #1.

You could analyze the data being stored to see what MySQL recommends for column definition. Just run the following against any table:

SELECT * FROM tblname PROCEDURE ANALYSE();

This will traverse the entire table and recommend column definitions for every column based on the data it contains, the minimum field values, maximum field values, and so forth. Sometimes, you just have to use common sense with planning CHAR vs VARCHAR. Here is a good example:

If you are storing IP addresses, the mask for such a column is at most 15 characters (xxx.xxx.xxx.xxx). I would jump right at CHAR(15) in a heartbeat because the lengths of IP addresses will not vary all that much and the added complexity of string manipulation controlled by an additional byte. You could still do a PROCEDURE ANALYSE() against such a column. It may even recommend VARCHAR. My money would still be on CHAR over VARCHAR in this instance.

CHAR vs VARCHAR issues can be resolved only through proper planning. With great power comes great responsibility (cliche but true)

Sql-server – Most efficient queries on flags in TSQL: bit field or table join

It is very dependent on the makeup of the data, how the data changes over time and what queries are run as to what the most efficient method is.

I would suggest that a filtered index is the solution that is going to be the most efficient without talking about a specific scenario.

Edit:

You should add a filtered index filtered on where optIn = 1 with the fields that are required by the query of consequence as it will use this index to generate the result and not have to refer to the main table at all. To check that it reads from just the index you can analyse the execution plan. I would expect that the query of consequence would not require all of the fields for the user record and thus the fields in the index would be a subset of the fields in the actual user record.

In short the goal is to do a little extra work on inserting the record (updating the indexes) and storing extra data so that the query has to read a significantly smaller amount of data that is in continuous blocks on the disk (assuming we are talking about traditional disk technology) than it would if it had to scan through the whole table picking about the records and fields in those records required by the query.

Best Answer

Related Solutions

Mysql – Performance implications of MySQL VARCHAR sizes

TRADEOFF #1

TRADEOFF #2

Sql-server – Most efficient queries on flags in TSQL: bit field or table join

Related Question