MySQL – How to break a table into two without losing performance

Tags: database-design, MySQL, normalization, optimization, performance

According to https://stackoverflow.com/a/174047/14731, splitting away infrequently-needed columns frees up the cache allowing faster retrieval of the commonly-used columns.

I've got a table whose columns are always retrieved together, but I'd still like to split them for design reasons (reduce duplication across multiple tables, improve code reuse). For example, I've got different tables that use the same permission scheme. Instead of adding permission columns to each table, I'd like to use a foreign key to reference a separate permission-scheme table.

I populated MySQL with 1 million rows, ran queries against both versions, and found that the version with a JOIN is about 3x slower (2.9 seconds for the split tables versus 0.9 seconds for the original table).

Here are my tables:

CREATE TABLE original
(
    id BIGINT NOT NULL,
    first BIGINT NOT NULL,
    second BIGINT NOT NULL,
    third BIGINT NOT NULL
);

CREATE TABLE part1
(
    -- AUTO_INCREMENT is needed because the population script relies on LAST_INSERT_ID()
    id BIGINT NOT NULL AUTO_INCREMENT,
    first BIGINT NOT NULL,
    second BIGINT NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE part2
(
    link BIGINT NOT NULL,
    third BIGINT NOT NULL,
    FOREIGN KEY (link) REFERENCES part1(id)
);

Here are my queries:

SELECT first, second, third FROM original;
SELECT part1.first, part1.second, part2.third FROM part1 JOIN part2 ON part2.link = part1.id;
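
One way to see where the extra time goes is to compare the execution plans of the two queries. The following is only a sketch using MySQL's EXPLAIN statement; the output depends on the server version, storage engine, and indexes, so no expected output is shown here.

```sql
-- Plan for the single-table version: expected to be a plain scan of `original`.
EXPLAIN SELECT first, second, third FROM original;

-- Plan for the split version: shows the join order and which index (if any)
-- is used to match part2.link against part1.id.
EXPLAIN SELECT part1.first, part1.second, part2.third
FROM part1
JOIN part2 ON part2.link = part1.id;
```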

Is there any way to reduce the performance overhead of the split design?


If you want to reproduce this test on your side, you can use the following Java application to generate the SQL script to populate the database:

import java.io.FileNotFoundException;
import java.io.PrintWriter;

public class Main
{
    public static void main(String[] args) throws FileNotFoundException
    {
        final int COUNT = 1_000_000;
        try (PrintWriter out = new PrintWriter("/import.sql"))
        {
            for (int i = 0; i < COUNT; ++i)
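                // original has four columns (id, first, second, third), so pass an explicit id
                out.println("INSERT INTO original VALUES (" + i + ", " + i + ", " + i + ", 0);");
            out.println("INSERT INTO original VALUES (" + COUNT + ", " + (COUNT - 2) + ", " +
                (COUNT - 1) + ", 1);");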
            out.println();
            for (int i = 0; i < COUNT; ++i)
            {
                out.println("INSERT INTO part1 (first, second) VALUES (" + i + ", " + i + ");");
                out.println("INSERT INTO part2 VALUES (LAST_INSERT_ID(), 0);");
            }
            out.println("INSERT INTO part1 (first, second) VALUES (" + (COUNT - 2) + ", " +
                (COUNT - 1) + ");");
            out.println("INSERT INTO part2 VALUES (LAST_INSERT_ID(), 1);");
            out.println();
        }
    }
}

Best Answer

This is exactly why normalization should be applied only to a limited extent, and only after performance testing. Normalization comes at the cost of joins (and the sorting they can involve). The main purpose of a data warehouse (DWH) in 5NF is to store data safely, not to retrieve it quickly.

Alternative 1: There is the concept of a materialized view: a view whose result is saved on disk. MySQL does not provide it out of the box, but this article - Materialized views with MySQL - explains how the functionality can be recreated with a stored procedure (SP) that updates or refreshes a table.

A Materialized View (MV) is the pre-calculated (materialized) result of a query. Unlike a simple VIEW, the result of a Materialized View is stored somewhere, generally in a table. Materialized Views are used when an immediate response is needed and the query on which the Materialized View is based would take too long to produce a result. Materialized Views have to be refreshed once in a while. How often a Materialized View is refreshed, and how current its content is, depends on the requirements. Basically, a Materialized View can be refreshed immediately or deferred, and it can be refreshed fully or up to a certain point in time. MySQL does not provide Materialized Views by itself.
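
As a rough illustration of that idea applied to the tables in the question, the sketch below stores the joined result in an ordinary table and refreshes it fully from a stored procedure. The names part_joined_mv and refresh_part_joined_mv are hypothetical, and this is a minimal full-refresh variant, not the exact procedure from the linked article.

```sql
-- Hypothetical table holding the pre-joined ("materialized") result.
-- No primary key is declared, in case several part2 rows link to one part1 row.
CREATE TABLE part_joined_mv
(
    id BIGINT NOT NULL,
    first BIGINT NOT NULL,
    second BIGINT NOT NULL,
    third BIGINT NOT NULL
);

-- DELIMITER is a mysql command-line client directive, not server-side SQL.
DELIMITER //

CREATE PROCEDURE refresh_part_joined_mv()
BEGIN
    -- Full refresh: discard the old snapshot, then recompute the join once.
    DELETE FROM part_joined_mv;

    INSERT INTO part_joined_mv (id, first, second, third)
    SELECT part1.id, part1.first, part1.second, part2.third
    FROM part1
    JOIN part2 ON part2.link = part1.id;
END //

DELIMITER ;

-- Refresh whenever the data is allowed to be brought up to date ...
CALL refresh_part_joined_mv();

-- ... and read from the snapshot without paying for the join on every query.
SELECT first, second, third FROM part_joined_mv;
```

The cost is moved rather than removed: reads avoid the join, but the snapshot takes extra space and is only as current as the last refresh.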

Alternative 2: You can try to achieve your design the other way round. Instead of splitting the main table, create two or three views or tables derived from it. That way you get normalized tables with distinct values for a star schema, and you also keep the fast main table.
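
A minimal sketch of this approach, assuming the original table from the question stays as the single source of data. The view names third_values and first_second are hypothetical and only illustrate the shape of the idea.

```sql
-- Hypothetical view exposing the distinct values of `third`
-- (for example, the distinct permission schemes).
CREATE VIEW third_values AS
SELECT DISTINCT third
FROM original;

-- Hypothetical view exposing only the columns that formed part1 in the split design.
CREATE VIEW first_second AS
SELECT id, first, second
FROM original;

-- The hot query keeps reading the unsplit table and needs no join.
SELECT first, second, third FROM original;
```

Note that a plain view is re-evaluated on each query, so if the derived data is read often, a derived table that is refreshed periodically may be the better fit.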

Performance tuning is always about the trade-off between CPU (time), RAM, and IO (throughput or space). In this case the trade-off is between CPU and IO.