MySQL – Planning and Optimizing the Table Structure (InnoDB)


I am planning to have an InnoDB MySql table with tens of millions of rows.

Each row represents a record that can belong to AT LEAST one of 200+ categories. Most of the records will belong to 2 or more categories. All the categories are represented by a numeric id, from 1 to 200+.

What is the most efficient way to store these multiple categories within each record so that lookups are as fast as possible when querying for all the records belonging to one of the categories?

I think the fastest way would be to have a separate column for each of the categories and then just query that column. But the problem is that I would have to make 200+ columns, and new categories will be added in the future. Also, I don't think having 200+ indexes will do my server any good.

Then I thought of creating a separate record for each of the categories, so there would be only one column for categories and the index would work just fine. The problem with this approach is that the table already has tens of millions of rows and will keep growing over time. If I throw in duplicate records for each of the current multi-category records, the table size will quickly become unmanageable, plus it will hurt performance.

The other approach I can think of is to store all the categories related to a record in one column, separated by a comma or some other separator, and then do a LIKE '%...%' search. But if I understand correctly, this method doesn't use an index and will result in slow lookups.

So what is the correct way to approach this? I was even thinking of creating a separate table for each category, but that would be unacceptable when I need to run other types of queries across the whole data set.

Best Answer

Recommendation

"all the records belonging to one of the categories" --> Build a many:many mapping between records and categories. To have that table optimal, follow the tips here . Be sure to use a 2-byte SMALLINT UNSIGNED for the category_id. (You should not trust that you won't go beyond 255.)

Then the query is something like

SELECT r.*
    FROM RecordsCategories AS rc
    JOIN Records AS r  USING(record_id)
    WHERE rc.category_id = 123;

This approach will do an index scan of a portion of rc to find the record_ids, then randomly reach into r via the PK. Yes, there will be "hundreds of millions of rows" in rc, but it will scan only a small part of it, because all the entries for category_id=123 will be 'clustered' together (either in the data or the secondary index).

Critique of alternatives

If you have a column with a commalist of categories, then every row will need to be fetched. That is a full table scan ("tens of millions of rows" = a few gigabytes) of Records. (BTW, don't use LIKE '%123%', use FIND_IN_SET() -- not faster, but less messy for handling edge cases.)
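For example (assuming a hypothetical categories column holding the commalist), LIKE also matches ids that merely contain the digits, while FIND_IN_SET() matches whole list elements only:

    -- Wrong: also matches '1234', '5,1237', etc.
    SELECT * FROM Records WHERE categories LIKE '%123%';

    -- Better: matches only the exact element '123' in the comma-separated list
    SELECT * FROM Records WHERE FIND_IN_SET('123', categories);

Either way, neither predicate can use an index, so both are full table scans.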

"a separate table for each category" -- This is a very common question on this (and other) forum. The answer is "NEVER!".

"separate column for each of the categories" -- Gag me with a spoon! The table will be fatter and, again, you will need a full table scan.

"200+ indexes" -- The hard limit is 64. The practical limit is more like 5.

Building on Akina's and Acidon's comments: With 8.0, you can use TINYBLOB or BINARY(..) and set bits. This is about as clumsy as having multiple SETs. Before 8.0, bitwise ORs (etc) were limited to 64 bits; now BLOBs work. SET/BLOB/BINARY/BIGINT save the most space. But your entire dataset is only tens of GB, so I don't see 'size' as a problem. Even the many:many table I recommend will be only several extra GB.
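A minimal sketch of the bit-set idea, assuming a hypothetical category_bits column on Records: up to 64 categories fit in a plain BIGINT UNSIGNED in any version, and 8.0's binary-string bit operations let the same pattern scale to TINYBLOB/BINARY for 200+ bits:

    -- One bit per category; bit N-1 means "belongs to category N"
    ALTER TABLE Records ADD COLUMN category_bits BIGINT UNSIGNED NOT NULL DEFAULT 0;

    -- Mark record 42 as belonging to category 7 (bit 6)
    UPDATE Records SET category_bits = category_bits | (1 << 6) WHERE record_id = 42;

    -- Fetch everything in category 7; the bit test cannot use an index
    SELECT * FROM Records WHERE category_bits & (1 << 6);

Like the commalist, this saves space but still costs a full table scan per category lookup, which is why the many:many table remains the recommendation.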