I'm trying to run the following join query as part of a more complex one on MariaDB 10.1.26.
select distinct
project_commits.project_id,
date_format(created_at, '%x%v1') as week_commit
from project_commits
left join commits
on project_commits.commit_id = commits.id;
Both join fields are indexed. However, the join involves a full scan of project_commits
and an index lookup on commits
. This is corroborated by the output of EXPLAIN
.
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
| 1 | SIMPLE | project_commits | ALL | NULL | NULL | NULL | NULL | 5417294109 | Using temporary |
| 1 | SIMPLE | commits | eq_ref | PRIMARY | PRIMARY | 4 | ghtorrent.project_commits.commit_id | 1 | |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
The sizes of the two tables are relatively large: project_commits
contains 5 billion rows and commits
847 million rows. Also the server's memory size is relatively small (16GB). This probably means that index lookups hit the (unfortunately magnetic) disk, and therefore performance takes a heavy hit. According to the output of pmonitor run on the generated temporary table, the query, which has already run for more than half a day, will take another 373 hours to complete.
/home/mysql/ghtorrent/project_commits#P#p0.MYD 6.68% ETA 373:38:11
Could I somehow increase the query's performance either by partitioning the tables, so that the join can be performed in memory, or by forcing MySQL to perform a sort-merge join? As the time involved for running alternative strategies could be many hours, I'd rather have a suggestion, instead of trying things out.
Best Answer
From the looks of the EXPLAIN plan, you are doing a full table scan on
project_commits
.From the looks of the query, I surmise there is a one-to-many relationship from
commits
toproject_commits
. The problem I see is that your query is gathering data as many-on-one.You are also using
LEFT JOIN
.Your query is saying:
Instead, you might want the query to say:
SUGGESTION #1 : Flip Table Order
SUGGESTION #2 : Use
INNER JOIN
SUGGESTION #3 : Add
created_at
indexIf you are already have an index on
created_at
or if you already have a compound index whose first column iscreated_at
, skip this suggestion.Adding an index will change
EXPLAIN
plan to do an index scan instead of a table scanSUGGESTION #4 : Alter the
JOIN
behaviorYou could manipulate the style of the join operation by tweeking the optimizer
According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins
This same page recommends doing this:
GIVE IT A TRY !!!
Note : I have no idea what to expect from my suggestions. After all, your
LEFT JOIN
is like an iterative Cartesian Join with the potential of make a temp table that is the followingHave fun and let us know what you find