MySQL/MariaDB – Speed Up Large JOIN with Partitioning or Merging

join;mariadbMySQLpartitioning

I'm trying to run the following join query as part of a more complex one on MariaDB 10.1.26.

select distinct
    project_commits.project_id,
    date_format(created_at, '%x%v1') as week_commit
    from project_commits
    left join commits
    on project_commits.commit_id = commits.id;

Both join fields are indexed. However, the join involves a full scan of project_commits and an index lookup on commits. This is corroborated by the output of EXPLAIN.

+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
| id   | select_type | table           | type   | possible_keys | key     | key_len | ref                                 | rows       | Extra           |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
|    1 | SIMPLE      | project_commits | ALL    | NULL          | NULL    | NULL    | NULL                                | 5417294109 | Using temporary |
|    1 | SIMPLE      | commits         | eq_ref | PRIMARY       | PRIMARY | 4       | ghtorrent.project_commits.commit_id |          1 |                 |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+

The sizes of the two tables are relatively large: project_commits contains 5 billion rows and commits 847 million rows. Also the server's memory size is relatively small (16GB). This probably means that index lookups hit the (unfortunately magnetic) disk, and therefore performance takes a heavy hit. According to the output of pmonitor run on the generated temporary table, the query, which has already run for more than half a day, will take another 373 hours to complete.

/home/mysql/ghtorrent/project_commits#P#p0.MYD 6.68% ETA 373:38:11

Could I somehow increase the query's performance either by partitioning the tables, so that the join can be performed in memory, or by forcing MySQL to perform a sort-merge join? As the time involved for running alternative strategies could be many hours, I'd rather have a suggestion, instead of trying things out.

Best Answer

From the looks of the EXPLAIN plan, you are doing a full table scan on project_commits.

From the looks of the query, I surmise there is a one-to-many relationship from commits to project_commits. The problem I see is that your query is gathering data as many-on-one.

You are also using LEFT JOIN.

Your query is saying:

Show me all project_commits that are connected to or orphaned from commits

Instead, you might want the query to say:

Show me all project_commits that are connected to commits

SUGGESTION #1 : Flip Table Order

EXPLAIN select distinct
    project_commits.project_id,
    date_format(created_at, '%x%v1') as week_commit
    from commits
    left join project_commits
    on commits.id = project_commits.commit_id;

SUGGESTION #2 : Use INNER JOIN

EXPLAIN select distinct
    project_commits.project_id,
    date_format(created_at, '%x%v1') as week_commit
    from project_commits
    inner join commits
    on project_commits.commit_id = commits.id;

SUGGESTION #3 : Add created_at index

If you are already have an index on created_at or if you already have a compound index whose first column is created_at, skip this suggestion.

ALTER TABLE `project_commits` ADD INDEX created_at_ndx (`created_at`);

Adding an index will change EXPLAIN plan to do an index scan instead of a table scan

SUGGESTION #4 : Alter the JOIN behavior

You could manipulate the style of the join operation by tweeking the optimizer

According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins

When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.

For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.

This same page recommends doing this:

mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';

GIVE IT A TRY !!!

Note : I have no idea what to expect from my suggestions. After all, your LEFT JOIN is like an iterative Cartesian Join with the potential of make a temp table that is the following

4.573 Quintillion rows (5.4 Billion X 0.847 Billion) being looked up

Have fun and let us know what you find