Use MySQL to regularly do multi-way joins on 100+ GB tables

Tags: mysql, rdbms

Background:
I’ve created a web application that I would like to be able to scale reasonably well. I know I'm not Google or Twitter, but my app stores a fairly large amount of data for each user, so its data requirements are fairly high. I want to be ready to scale without having to re-architect everything later.

I consider myself a software developer, not a database expert. That’s why I am posting here. Hopefully someone with a lot more database expertise can give me advice.

With a relatively large number of users, but nothing like Facebook numbers, I expect to have a DB that looks like this:

One "Big table":

  • 250 million records
  • 20 columns
  • Approximately 100 GB of data
  • Has an indexed bigint(20) foreign key
  • Has an indexed varchar(500) string_id column
  • Has an int(11) "value" column

4 other tables:

  • 10 million records each
  • Approximately 2 – 4 GB of data each
  • each of these tables has 4 – 8 columns
  • one column is datetime date_created
  • one column is the varchar(500) string_id column
  • one or two columns from each of these tables will be selected in a join

One of these tables is used for storing averages — its schema is bigint(20) id, varchar(20) string_id, datetime date_created, float average_value
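
To make this concrete, here is a rough sketch of the big table and the averages table as I picture them. The table and column names (big_table, user_id) are placeholders rather than my real schema, I'm assuming InnoDB, and I've only written out the columns mentioned above.

    -- Rough sketch only: names are placeholders and most of the ~20 columns
    -- on the big table are omitted.
    CREATE TABLE big_table (
        id        BIGINT(20)   NOT NULL AUTO_INCREMENT,
        user_id   BIGINT(20)   NOT NULL,             -- the indexed foreign key
        string_id VARCHAR(500) NOT NULL,
        value     INT(11)      NOT NULL,
        PRIMARY KEY (id),
        KEY idx_user_id (user_id),
        KEY idx_string_id (string_id)
    ) ENGINE=InnoDB;

    CREATE TABLE averages (
        id            BIGINT(20)  NOT NULL AUTO_INCREMENT,
        string_id     VARCHAR(20) NOT NULL,
        date_created  DATETIME    NOT NULL,
        average_value FLOAT       NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;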

What I want to do — two relatively expensive queries (sketched in SQL below this list):

  1. Calculate new average values:

    • Using a foreign key, select up to several million separate records from the big table.
    • Calculate a new average, grouping by the string_id.
    • Insert results into the averages table.
    • As currently constructed, this query uses two joins.
  2. Create de-normalized, read-only records for serving users:

    • Use a foreign key to select anywhere from 1,000 to 40,000 records from the big table.
    • Join each of the other four tables on the string_id column, taking only the newest matching record from each.
    • Insert the results into a de-normalized table.
    • These records are for use by the front-end to display information to users.
    • As currently constructed, this query uses four joins.
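
To show the shape of these two queries, here are stripped-down sketches. denorm_table, table_a, and extra_col are placeholders, and I've collapsed the real joins down to one representative side table.

    -- Query 1, simplified: recompute averages for one foreign-key value.
    -- (My real version involves two joins that I've left out here.)
    INSERT INTO averages (string_id, date_created, average_value)
    SELECT b.string_id, NOW(), AVG(b.value)
    FROM   big_table AS b
    WHERE  b.user_id = 12345               -- the indexed bigint foreign key
    GROUP BY b.string_id;

    -- Query 2, simplified: build read-only rows using the newest matching
    -- record from a side table. (The real query joins all four side tables.)
    INSERT INTO denorm_table (string_id, value, date_created, extra_col)
    SELECT b.string_id, b.value, a.date_created, a.extra_col
    FROM   big_table AS b
    JOIN   table_a   AS a
           ON  a.string_id = b.string_id
           AND a.date_created = (SELECT MAX(a2.date_created)
                                 FROM   table_a AS a2
                                 WHERE  a2.string_id = b.string_id)
    WHERE  b.user_id = 12345;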

I plan to run each of these expensive queries on a batch back-end database that will push its results to a real-time front-end DB server which handles requests from users.
These queries will be run at regular intervals; I haven't decided how often. The averages query could probably run once per day, but the de-normalization query will need to run more often, perhaps every few minutes.
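
For the scheduling itself, one option I'm considering is MySQL's event scheduler on the batch server (a cron job driving a script would work just as well). A minimal sketch for the daily averages job, using the placeholder names from above:

    -- Assumes the event scheduler is enabled (SET GLOBAL event_scheduler = ON).
    CREATE EVENT refresh_averages
    ON SCHEDULE EVERY 1 DAY
    DO
      INSERT INTO averages (string_id, date_created, average_value)
      SELECT string_id, NOW(), AVG(value)
      FROM   big_table
      GROUP BY string_id;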

Each of these queries currently runs in a few seconds in MySQL on a very low-end machine against a dataset of 100K records in the "big table." I am concerned about both my ability to scale and the costs of scaling.

Questions:

  1. Does this approach seem sound? Is there anything obviously wrong with it from a big-picture perspective?
  2. Is an RDBMS the right tool, or should I look at other "big data" solutions like something in the Hadoop family? My inclination is to use an RDBMS because the data is structured and fits nicely into the relational model. At a certain point, though, it is my understanding that I may no longer be able to use an RDBMS. Is that true? When would that switch be needed?
  3. Will it work? Can these queries be run in a reasonable amount of time? I can wait perhaps hours for query #1, but query #2 should finish in minutes.
  4. What should I consider from a hardware perspective? What are my RAM and CPU bottlenecks likely to be? I assume keeping indexes in RAM is important. Is there anything else I should consider?
  5. At some point I will probably have to partition my data and use multiple servers. Does my use case seem like it is already in that category, or will I be able to scale a single machine vertically for a while? Will this work with 10x the data? 100x?

Best Answer

Have you tried piling in more data and benchmarking it? 100K rows is inconsequential. Try 250M or 500M rows, like you're expecting you'll need to handle, and see where the bottlenecks are.
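
A minimal sketch of one way to do that, using the placeholder names from the question's sketch: repeatedly re-insert the table's own rows into a throwaway copy until you reach the target volume.

    -- Each run doubles the row count (fresh ids come from AUTO_INCREMENT), so
    -- roughly 11-12 runs take 100K rows past 250M. Do this on a test copy.
    INSERT INTO big_table (user_id, string_id, value)
    SELECT user_id, string_id, value
    FROM   big_table;

    -- Refresh index statistics before benchmarking the real queries.
    ANALYZE TABLE big_table;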

An RDBMS can do a lot of things if you pay careful attention to its limitations and try to work with the strengths of the system. RDBMSs are exceptionally good at some things, and terrible at others, so you will need to experiment to be sure it's the right fit.

For some batch processing jobs, you really cannot beat flat files, loading the data into RAM, smashing it around using a series of loops and temporary variables, and dumping out the results. MySQL will never, ever be able to match that sort of speed, but if tuned properly and used correctly it can come within an order of magnitude.
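
If you want to experiment with that route, MySQL at least makes the dump-and-reload part cheap. A sketch, assuming the server has the FILE privilege, a permissive secure_file_priv setting, and the question's placeholder names:

    -- Dump the columns you need to a flat file for external processing.
    SELECT user_id, string_id, value
    INTO OUTFILE '/var/lib/mysql-files/big_table_dump.csv'
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    FROM big_table;

    -- Crunch the file in your own code, then bulk-load the results back.
    LOAD DATA INFILE '/var/lib/mysql-files/averages.csv'
    INTO TABLE averages
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
    (string_id, date_created, average_value);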

What you'll want to do is investigate how your data can be partitioned. Do you have one big set of data with too much in the way of cross-links to be able to split it up, or are there natural places to partition it? If you can partition it you won't have one table with a whole pile of rows, but potentially many significantly smaller ones. Smaller tables, with much smaller indexes, tend to perform better.
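
If the bigint foreign key turns out to be a natural split point, MySQL's built-in table partitioning is one way to get those smaller per-partition indexes without touching the application. A sketch using the question's placeholder names (MySQL requires the partitioning column to be part of every unique key, including the primary key):

    -- Make the partitioning column part of the primary key first.
    ALTER TABLE big_table
        DROP PRIMARY KEY,
        ADD PRIMARY KEY (id, user_id);

    -- Then split the table and its indexes into 32 hash partitions.
    ALTER TABLE big_table
        PARTITION BY HASH (user_id)
        PARTITIONS 32;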

From a hardware perspective, you'll need to test to see how your platform performs. Sometimes memory is essential. Other times it's disk I/O. It really depends on what you're doing with the data. You'll need to pay close attention to your CPU usage and look for high levels of IO wait to know where the problem lies.
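
On the memory side, assuming InnoDB, the buffer pool counters give a quick read on whether your working set fits in RAM or you're falling through to disk:

    -- Logical reads vs. reads that actually hit disk; a persistently high miss
    -- ratio means the working set doesn't fit in innodb_buffer_pool_size.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';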

Whenever possible, split your data across multiple systems. You can use MySQL Cluster if you're feeling brave, or simply spin up many independent instances of MySQL where each stores an arbitrary portion of the complete data set using some partitioning scheme that makes sense.