MySQL – Planning and Optimizing the Table Structure (InnoDB)


I am planning to have an InnoDB MySql table with tens of millions of rows.

Each row represents a record that can belong to AT LEAST one of 200+ categories. Most of the records will belong to 2 or more categories. All the categories are represented by a numeric id, from 1 to 200+.

What is the most efficient way to store these multiple categories within each record so that lookups are as fast as possible when querying for all the records belonging to one of the categories?

I think the fastest way would be to have a separate column for each of the categories and then just query that column. But the problem is that I would have to make 200+ columns, and new categories will be added in the future. Also, I don't think having 200+ indexes will do my server any good.

Then I thought of creating a separate record for each of the categories, so there would be only one column for categories and the index would work just fine. The problem with this approach is that the table already has tens of millions of rows and will keep growing over time. If I throw in duplicate records for each of the current multi-category records, the table size will quickly become unmanageable, plus it will hurt performance.

The other approach I can think of is to store all the categories related to a record in one column, separated by a comma or some other separator, and then do a LIKE '%...%' search. But if I understand correctly, this method doesn't use an index and will result in slow lookups.

So what is the correct way to approach this? I was even thinking of creating a separate table for each category, but that would be unacceptable when I need to run other types of queries across the whole data set.

Best Answer

Recommendation

"all the records belonging to one of the categories" --> Build a many:many mapping between records and categories. To have that table optimal, follow the tips here . Be sure to use a 2-byte SMALLINT UNSIGNED for the category_id. (You should not trust that you won't go beyond 255.)

Then the query is something like

SELECT r.*
    FROM RecordsCategories AS rc
    JOIN Records AS r  USING(record_id)
    WHERE rc.category_id = 123;

This approach will do an index scan of a portion of rc to find the record_ids, then randomly reach into r via the PK. Yes, there will be "hundreds of millions of rows" in rc, but it will scan only a small part of it, because all the entries for category_id=123 will be 'clustered' together (either in the data or the secondary index).

Critique of alternatives

If you have a column with a commalist of categories, then every row will need to be fetched. That is a full table scan ("tens of millions of rows" = a few gigabytes) of Records. (BTW, don't use LIKE '%123%', use FIND_IN_SET() -- not faster, but less messy for handling edge cases.)
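For example (assuming a hypothetical categories column holding the commalist), LIKE also matches ids that merely contain the digits, while FIND_IN_SET() matches whole list elements only:

    -- Wrong: also matches '1234', '5,1237', etc.
    SELECT * FROM Records WHERE categories LIKE '%123%';

    -- Better: matches only the exact element '123' in the comma-separated list
    SELECT * FROM Records WHERE FIND_IN_SET('123', categories);

Either way, neither predicate can use an index, so both are full table scans.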

"a separate table for each category" -- This is a very common question on this (and other) forum. The answer is "NEVER!".

"separate column for each of the categories" -- Gag me with a spoon! The table will be fatter and, again, you will need a full table scan.

"200+ indexes" -- The hard limit is 64. The practical limit is more like 5.

Building on Akina's and Acidon's comments: With 8.0, you can use TINYBLOB or BINARY(..) and set bits. This is about as clumsy as having multiple SETs. Before 8.0, bitwise ORs (etc) were limited to 64 bits; now BLOBs work. SET/BLOB/BINARY/BIGINT save the most space. But your entire dataset is only tens of GB, so I don't see 'size' as a problem. Even the many:many table I recommend will be only several extra GB.
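A minimal sketch of the bit-set idea, assuming a hypothetical category_bits column on Records: up to 64 categories fit in a plain BIGINT UNSIGNED in any version, and 8.0's binary-string bit operations let the same pattern scale to TINYBLOB/BINARY for 200+ bits:

    -- One bit per category; bit N-1 means "belongs to category N"
    ALTER TABLE Records ADD COLUMN category_bits BIGINT UNSIGNED NOT NULL DEFAULT 0;

    -- Mark record 42 as belonging to category 7 (bit 6)
    UPDATE Records SET category_bits = category_bits | (1 << 6) WHERE record_id = 42;

    -- Fetch everything in category 7; the bit test cannot use an index
    SELECT * FROM Records WHERE category_bits & (1 << 6);

Like the commalist, this saves space but still costs a full table scan per category lookup, which is why the many:many table remains the recommendation.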