If you enable innodb_file_per_table, each InnoDB table gets its own dedicated tablespace file. The system tablespace file, better known as ibdata1, will then grow very slowly.
If innodb_file_per_table is disabled, everything and its grandmother goes into ibdata1, and that file will grow quickly.
If you are not yet using innodb_file_per_table, you will need to rearchitect InnoDB's storage to switch to it. The payoff is a limit of 2TB per table instead of 2TB for all of InnoDB combined. That 2TB figure is for ext3; on ext4, a table can grow up to 16TB.
Please see my past posts on how to do that
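As a quick reminder of the mechanics: the setting itself is just a server variable, but existing tables only move out of ibdata1 after they are rebuilt, and ibdata1 itself never shrinks. A minimal sketch (the database and table names are placeholders):

```sql
-- Enable file-per-table for newly created tables;
-- also put innodb_file_per_table = ON in my.cnf so it survives a restart.
SET GLOBAL innodb_file_per_table = ON;

-- Existing tables stay inside ibdata1 until they are rebuilt.
-- A null ALTER rewrites the table into its own .ibd file:
ALTER TABLE mydb.mytable ENGINE=InnoDB;
```

Note that the space already consumed by ibdata1 is not returned to the filesystem by this conversion alone.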
I can provide a general explanation, but it may not apply to your particular case:
Decision making works by evaluating the cost of each candidate execution plan, then picking what is hopefully the cheapest one. This you already know.
When it comes to indexing, though, things get interesting. The way to evaluate the usefulness or viability of an index is to estimate its selectivity for the given values.
For the moment, forget about your FULLTEXT index, and let's assume a simple index on some column col1, and another index on some column col2. Given the following two queries:
SELECT * FROM t WHERE col1 < 10 AND col2 = 4;
SELECT * FROM t WHERE col1 BETWEEN 100 AND 110 AND col2 = 4;
These two queries may be evaluated differently. Why? It may happen that col2 = 4 returns more rows than col1 < 10, in which case we prefer the index on col1. But col2 = 4 may also return fewer rows than col1 BETWEEN 100 AND 110, in which case we prefer the index on col2.
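You can watch the optimizer make this choice with EXPLAIN; the key column shows which index was picked and the rows column shows the estimated row count. A sketch using the table and columns above (the index names are hypothetical):

```sql
-- Assuming indexes such as: CREATE INDEX idx_col1 ON t (col1);
--                           CREATE INDEX idx_col2 ON t (col2);
EXPLAIN SELECT * FROM t WHERE col1 < 10 AND col2 = 4;
EXPLAIN SELECT * FROM t WHERE col1 BETWEEN 100 AND 110 AND col2 = 4;
-- Compare the 'key' and 'rows' columns of the two plans:
-- the chosen index can differ purely because of the estimated row counts.
```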
Your case is not very different. MySQL estimates the number of rows returned by some index query. When you use more columns, MySQL gets the impression your index is likely to return few rows, so it chooses to start with TableA, then joins what should be very few rows with TableB.
But if MySQL believes the index will return many rows, it may prefer starting with TableB. Why is that? Because you are sorting on indexed columns of TableB, and sorting is a lot of work, too. So MySQL may choose to first read the rows in sorted order, then join to TableA and filter by the fulltext index. That is not necessarily a bad idea if the fulltext search yields many rows anyway.
Best Answer
That sounds like it was taken from my blog posts (or at least those were my numbers and specs): https://jynus.com/dbahire/testing-the-fastest-way-to-import-a-table-into-mysql-and-some-interesting-5-7-performance-results/ and https://jynus.com/dbahire/testing-again-load-data-on-mysql-5-6-5-7-8-0-non-ga-and-mariadb-10-0-10-1-and-10-2-non-ga/
As you can see, my experience is based on actual tests, and not only laboratory tests like the above. I run those because they help me make sure my database backups (and recoveries) are generated correctly and can be performed reliably and quickly, handling daily both logical dumps and snapshots for the half a petabyte of data we store on our MariaDB databases: https://www.slideshare.net/jynus/backing-up-wikipedia-databases
A 50 GB database on a server with 32 GB of memory is a very generous ratio, where 60% of the data could fit in the buffer pool. In that case, throughput can be improved greatly, as long as you set up your VM, OS and MySQL configuration for it (disabling the binary log, increasing the buffer pool and transaction log files, loosening consistency parameters during the import, etc.). You will also want the original format to be optimized for easy loading, so you don't waste CPU cycles parsing or converting it, and you will want to load in large transactions and on several threads in parallel if possible.
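A sketch of import-time settings along these lines; the values are illustrative for a 32 GB server, not tuned recommendations:

```sql
-- Per-session, before the import:
SET sql_log_bin = 0;          -- skip binary logging for this session (needs SUPER;
                              --  do not use on a server whose replicas must stay in sync)
SET unique_checks = 0;        -- defer unique-key checking during bulk load
SET foreign_key_checks = 0;   -- defer FK checking during bulk load

-- Server-wide, loosened durability during the import:
SET GLOBAL innodb_flush_log_at_trx_commit = 2;  -- flush redo log ~once/second
                                                -- instead of at every commit

-- In my.cnf (restart required on older versions for the log size):
-- innodb_buffer_pool_size = 24G   -- most of the 32 GB of RAM
-- innodb_log_file_size    = 4G    -- large redo log for big transactions
```

Remember to restore the consistency settings once the import finishes.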
As an example, my production systems have 1-2 TB databases with billions of rows, which I can recover logically in 6-12 hours on a machine with 512 GB of memory, including building the many indexes.
Under the right circumstances described above, with a mostly in-memory database, I would expect to load a 50 GB database remotely, in parallel, in around 30 minutes; less than 1 hour if the storage is slow. Be careful: those tests assume dedicated resources, and a CPU, memory or IO limitation can create a bottleneck, leading to higher load times.