Sql-server – Bitmask Flags with Lookup Tables Clarification

database-designindexsql serversql-server-2008

I've received a dataset from an outside source which contains several bitmask fields as varchars. They come in length as low as 3 and as long as 21 values long. I need to be able to run SELECT queries based on these fields using AND or OR logic.

Using a calculated field, where I just convert the bits into an integer value, I can easily find rows that match an AND query, by using a simple WHERE rowvalue = requestvalue, but the OR logic would require using bitwise & in order to find matching records.

Given that I would need to work with several of these columns and select from hundreds of millions of records, I feel that there would be a huge performance hit when doing bitwise & operations to filter my SELECT results.

I came across this answer from searching and it looked like it may fit my needs, but I need some clarification on how it is implemented.

Is this as simple as creating a lookup table that has all possible search conditions?

Example for 3 bits using (a & b) (Edit: Wrong bitwise op)

001,001
001,011
001,101
001,111
010,010
010,011
010,110
011,011
011,111
etc

The author mentions that it's counter-intuitive initially, but I can't help but feel I'm interpreting the solution incorrectly, as this would give me a single lookup table with likely billions of rows.

Any clarifications on the answer I linked above or other suggestions that would preserve the existing database are appreciated.

Edit: A more concrete example using small data.

Four flags, HasHouse,HasCar,HasCat,HasDog, 0000 is has none, 1111 is has all.

Any number of flags, from all to none, can be flipped, and results must be filtered where selection matches all (Using exact value comparison) or at least 1 (Using bitwise &).

Adding a single calculated column for each bitmask is ok, but adding a column for each bit for more than 100 bits, coupled with how to insert/update the data is why I'm trying to find alternative solutions.

Best Answer

I partially agree with Aaron's comment - in the most general case for storing 21 unrelated pieces of information, you'd probably use 21 bit columns. As a general solution, it may well be your best solution. If you had multiple bitmask-ed varchar columns, that would translate to a row with possibly over a hundred bit flags. FYI, 21 bits get stored as 3 bytes when you don't define them as NULLable, removing the necessity for space in the NULL bitmap. Since you have multiple bitmask columns, you'd end up with every 8 bits mashed into a byte.

What SQL Server ends up doing with your multi-column queries is eventually a bunch of bitmasking routines (yes! SQL Server uses bitmasks, so they the concept per se can't be all bad!) but for average use cases, it makes life easier for you.

If we had more information about what types of queries you run, we may be able to better advise, because ultimately the use cases dictate the design.

If you persist with the COMPUTED column, I would persist and index it if you haven't already. It helps some queries, such as

  1. exact matches

    WHERE computedInt = POWER(2, 6) -- bit position 7

  2. AND matching on 15th bit and OR matching on 2 other bits (10th and 7th)

    WHERE computedInt >= Power(2,14) AND computedInt < Power(2,15) AND computedInt & (Power(2,9) + Power(2,6)) > 0

But these are probably exotic samples and yet also real live in some cases. It's certainly not too much worse than 21 individual bit columns, for which yes your statements could be easier to write, but remember that SQL Server has mashed them for storage into 3 bytes and will be doing the bit-unmasking anyway! You would have thought if bit-masking were all bad (without exception) then SQL Server wouldn't be doing it, right?

EDIT

Re the scenario of

Four flags, HasHouse,HasCar,HasCat,HasDog, 0000 is has none, 1111 is has all.

it is more efficient and logically expedient to test all 4 bits at once and do a single integer based operation, e.g.

WHERE computedInt & (POWER(2,10)+POWER(2,5)+POWER(2,3)+POWER(2,1)) = 0 -- has none
WHERE computedInt & (POWER(2,10)+POWER(2,5)+POWER(2,3)+POWER(2,1)) > 0 -- has one or more

Hypothetically, if this were your most exercised query on the table, you might even group the four columns into another computed column and index it separately, making the bitmask unnecessary (just test the resultant int with =0 and >0). You might even go further and just precompute the answer... horses for courses.