Can MySQL reasonably perform queries on billions of rows?

database-design, MySQL, performance

I am planning on storing scans from a mass spectrometer in a MySQL database and
would like to know whether storing and analyzing this amount of data is remotely
feasible. I know performance varies wildly depending on the environment, but I'm
looking for the rough order of magnitude: will queries take 5 days or 5
milliseconds?

Input format

Each input file contains a single run of the spectrometer; each run comprises
a set of scans, and each scan has an ordered array of datapoints. There is a
bit of metadata, but the majority of the file consists of arrays of 32- or
64-bit ints or floats.

Host system

|----------------+-------------------------------|
| OS             | Windows 2008 64-bit           |
| MySQL version  | 5.5.24 (x86_64)               |
| CPU            | 2x Xeon E5420 (8 cores total) |
| RAM            | 8GB                           |
| SSD filesystem | 500 GiB                       |
| HDD RAID       | 12 TiB                        |
|----------------+-------------------------------|

There are some other services running on the server, but they use negligible
processor time.

File statistics

|------------------+--------------|
| number of files  | ~16,000      |
| total size       | 1.3 TiB      |
| min file size    | 0 bytes      |
| max file size    | 12 GiB       |
| mean file size   | 800 MiB      |
| median file size | 500 MiB     |
| total datapoints | ~200 billion |
|------------------+--------------|

The total number of datapoints is a very rough estimate.

Proposed schema

I'm planning on doing things "right" (i.e. normalizing the data like crazy) and
so would have a runs table, a spectra table with a foreign key to runs,
and a datapoints table with a foreign key to spectra.

The 200 Billion datapoint question

I am going to be analyzing across multiple spectra and possibly even multiple
runs, resulting in queries which could touch millions of rows. Assuming I index
everything properly (which is a topic for another question) and am not trying to
shuffle hundreds of MiB across the network, is it remotely plausible for MySQL
to handle this?
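
To put some flesh on "queries which could touch millions of rows", here is a
sketch of the kind of query I have in mind, written against the proposed
schema described below. The date range and m/z window are made-up placeholder
values, not real analysis parameters.

    -- Hypothetical example: pull every datapoint in a narrow m/z window
    -- across all spectra from a date range of runs, for peak analysis in R.
    SELECT r.name       AS run_name,
           s.id         AS spectrum_id,
           d.mz,
           d.num_counts
    FROM   runs       r
    JOIN   spectra    s ON s.run_id      = r.id
    JOIN   datapoints d ON d.spectrum_id = s.id
    WHERE  r.start_time BETWEEN '2012-01-01' AND '2012-02-01'
      AND  d.mz BETWEEN 123.0 AND 124.0
    ORDER  BY d.mz;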

Additional info

The scan data will be coming from files in the XML-based
mzML format. The meat of this format is in the
<binaryDataArrayList> elements where the data is stored. Each scan produces
two or more <binaryDataArray> elements which, taken together, form a
two-dimensional (or higher) array of the form [[123.456, 234.567, ...], ...].

These data are write-once, so update performance and transaction safety are not
concerns.

My naïve plan for a database schema is:

runs table

| column name | type        |
|-------------+-------------|
| id          | PRIMARY KEY |
| start_time  | TIMESTAMP   |
| name        | VARCHAR     |
|-------------+-------------|

spectra table

| column name    | type        |
|----------------+-------------|
| id             | PRIMARY KEY |
| name           | VARCHAR     |
| index          | INT         |
| spectrum_type  | INT         |
| representation | INT         |
| run_id         | FOREIGN KEY |
|----------------+-------------|

datapoints table

| column name | type        |
|-------------+-------------|
| id          | PRIMARY KEY |
| spectrum_id | FOREIGN KEY |
| mz          | DOUBLE      |
| num_counts  | DOUBLE      |
| index       | INT         |
|-------------+-------------|
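
Spelled out as DDL, that plan would look roughly like the following. This is
a sketch only; the integer widths, VARCHAR lengths, and storage-engine and
index choices are placeholders I haven't settled on. (index has to be
backtick-quoted because it is a reserved word in MySQL.)

    CREATE TABLE runs (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        start_time TIMESTAMP       NOT NULL,
        name       VARCHAR(255)    NOT NULL
    );

    CREATE TABLE spectra (
        id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name           VARCHAR(255)    NOT NULL,
        `index`        INT             NOT NULL,  -- position of the scan within its run
        spectrum_type  INT             NOT NULL,
        representation INT             NOT NULL,
        run_id         BIGINT UNSIGNED NOT NULL,
        FOREIGN KEY (run_id) REFERENCES runs (id)
    );

    CREATE TABLE datapoints (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        spectrum_id BIGINT UNSIGNED NOT NULL,
        mz          DOUBLE          NOT NULL,
        num_counts  DOUBLE          NOT NULL,
        `index`     INT             NOT NULL,     -- position within the scan
        FOREIGN KEY (spectrum_id) REFERENCES spectra (id)
    );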

Is this reasonable?


So, as you may have been able to infer, I am the programmer, not the biologist
in the lab, so I don't know the science nearly as well as the actual scientists.

Here's a plot of a single spectrum (scan) of the kind of data with which I'll be
dealing:

[Viewer screenshot: plot of a single spectrum]

The goal of the software is to figure out where and how significant the peaks
are. We use a proprietary software package to figure this out now, but we want
to write our own analysis program (in R) so we know what the heck is going on
under the sheets. As you can see, the vast majority of the data are
uninteresting, but we don't want to throw out potentially-useful data which our
algorithm missed. Once we have a list of probable peaks with which we're
satisfied, the rest of the pipeline will use that peak list rather than the raw
list of datapoints. I suppose that it would be sufficient to store the raw
datapoints as a big blob, so they can be reanalyzed if need be, but keep only
the peaks as distinct database entries. In that case, there would be only a
couple dozen peaks per spectrum, so the crazy scaling stuff shouldn't be as much
of an issue.
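
A rough sketch of what that blob-plus-peaks variant could look like (the
column names here are my guesses, not settled choices): the spectra table
gains a blob column holding the packed raw arrays, and the huge datapoints
table is replaced by a much smaller peaks table.

    -- Sketch only: keep raw arrays as opaque blobs, store only detected peaks as rows.
    ALTER TABLE spectra
        ADD COLUMN raw_data LONGBLOB NULL;  -- packed m/z + intensity arrays, reparsed only on reanalysis

    CREATE TABLE peaks (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        spectrum_id BIGINT UNSIGNED NOT NULL,
        mz          DOUBLE          NOT NULL,
        intensity   DOUBLE          NOT NULL,
        FOREIGN KEY (spectrum_id) REFERENCES spectra (id)
    );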

Best Answer

I am not very familiar with your needs, but perhaps storing each data point in the database is a bit of overkill. It sounds almost like storing an image library by storing each pixel as a separate record in a relational database.

As a general rule, storing binary data in databases is wrong most of the time; there is usually a better way of solving the problem. While it is not inherently wrong to store binary data in a relational database, the disadvantages often outweigh the gains. Relational databases, as the name alludes to, are best suited to storing relational data, and binary data is not relational. It adds size (often significantly) to databases, can hurt performance, and may lead to questions like this one about maintaining billion-record MySQL instances.

The good news is that there are databases especially well suited to storing binary data. One of them, while not always readily apparent, is your file system! Simply come up with a directory and file naming structure for your binary files, and store those paths in your MySQL DB together with any other data which may yield value through querying.
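
As a rough illustration (the table and column names are only placeholders, not a recommendation of specifics), the MySQL side could then shrink to a small metadata table that points at the files on disk:

    -- Illustrative sketch: MySQL keeps the queryable metadata, while the bulky
    -- scan arrays live on the filesystem and are referenced by path.
    CREATE TABLE scan_files (
        id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        run_id    BIGINT UNSIGNED NOT NULL,
        scan_no   INT             NOT NULL,
        file_path VARCHAR(500)    NOT NULL,  -- e.g. /data/run_0001/scan_000123.bin
        FOREIGN KEY (run_id) REFERENCES runs (id)  -- your existing runs table
    );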

Another approach would be using a document-based storage system for your datapoints (and perhaps spectra) data, and using MySQL for the runs (or perhaps putting the runs into the same DB as the others).