Mysql – Does creating an index on BIT(n) attribute in MySQL make sense

datatypesindexMySQL

I am working on a very specific thing and I need to use BIT(n) attributes and efficiently search through them. n is generally not a power of 2.

I've seen some mentions on the internet that BIT attributes are not index-friendly in MySQL. Without further elaboration though.

So the question is – do MySQL indexes work well with BIT(n) attributes or I better find some other ways, like restrict n to, say, 32 and use INT instead?

Best Answer

Unless you are querying for an exact match on the whole value stored in the bit column, an index will not help, because it can't generally be used. On the remote chance that you are, then it would.

The reasons an index can't be used when storing a bitmapped value and then querying based on the state of one or more bits include the fact that the commonly-available index types in MySQL are only useful for exact-value or range matching, and bitwise comparisons aren't truly either of those things.

Because of the nature of this limitation, it also holds true if you use any of the integer data types or even the SET data type.

Granted, for example, testing the state of the high bit on an 8-bit number is the same as testing whether the number is >= 128, which is indeed a range match and could be accomplished using a b-tree index, but to make use of an index for this test, the optimizer would have to understand that this what you're "really" asking when what you actually asked for is WHERE bin_col & b'10000000' ... the optimizer isn't going to realize that.

My inclination for storing bitmapped values would be to use an UNSIGNED [something]INT column, since BIT columns in MySQL are actually more analogous to CHAR/BINARY columns than they are to integer data, but this will ultimately depend on your application as well as your storage engine.

MyISAM apparently stores different bit columns together in the raw row data, so if you're not using even increments of 8/16/32/64, there might be a small storage advantage there when using BITs vs INTs -- but unless you're already using MyISAM, I wouldn't think this would be of enough of an advantage to make you consider using it.

The MEMORY and InnoDB engines allocate the smallest standard integer size that can hold the required number of bits in a BIT column.

The SET data type also stores the values as an unsigned integer of 1, 2, 3, 4, or 8 bytes, as required by the number of labels you define for the bits. It is easier on the eyeballs when you're looking at your stored bitmapped data, because if you select from it without casting the result back to an integer (either explicitly or implicitly with SELECT column_name + 0), you get back a comma-separated list of the labels of the bits that are set to 'on'... It doesn't offer any optimization in querying but it does give you the enhancement of expanding your bits in to labels without any real penalty compared to using a raw integer column.

Related Solutions

SQL Server Indexing – Separate Indexes vs Multi-Column Index

You are right in that your example query would not use that index.

The query planner will consider using an index if:

all the fields contained in it are referenced in the query
some of the fields starting from the beginning are referenced

It will not be able to make use of indexes that start with a field not used by the query.

So for your example:

SELECT [id], [name], [customerId], [dateCreated]
   FROM Representatives WHERE customerId=1 
   ORDER BY dateCreated

it would consider indexes such as:

[customerId]
[customerId], [dateCreated]
[customerId], [dateCreated], [name]

but not:

[name], [customerId], [dateCreated]

If it found both [customerId] and [customerId], [dateCreated], [name] its decision to prefer one over the other would depend on the index stats which depend on estimates of the balance of data in the fields. If [customerId], [dateCreated] were defined it should prefer that over the other two unless you give a specific index hint to the contrary.

It is not uncommon to see one index defined for every field in my experience either, though this is rarely optimal as the extra management needed to update the indexes on insert/update, and the extra space needed to store them, is wasted when half of them may never get used - but unless your DB sees write-heavy loads the performance is not going to stink badly even with the excess indexes.

Specific indexes for frequent queries that would otherwise be slow due to table or index scanning is generally a good idea, though don't overdo it as you could be exchanging one performance issue for another. If you do define [customerId], [dateCreated] as an index, for example, remember that the query planner will be able to use that for queries that would use an index on just [customerId] if present. While using just [customerId] would be slightly more efficient than using the compound index this may be mitigated by ending up having two indexes competing for space in RAM instead of one (though if your entire normal working set fits easily into RAM this extra memory competition may not be an issue).

Sql-server – Does this computed CHECKSUM() index design make sense

You are right, this is pointless.

Two (of many) reasons that I see it's wrong

it isn't guaranteed unique (CHECKSUM gives int) whereas the GUID is (over the range of GUID). It's a small chance of duplicate but quite possible: like the "birthday problem" somewhat
it's still random order. The main reason IDENTITY is better then GUID for a clustered index is that IDENTITY is monotonically increasing. CHECKSUM(someGUID) is random order too

I'd add a new IDENTITY column, and then start changing dependencies to use this only.

Best Answer

Related Solutions

SQL Server Indexing – Separate Indexes vs Multi-Column Index

Sql-server – Does this computed CHECKSUM() index design make sense

Related Question