Mysql – Simple query with JOIN on 60M rows using index takes forever in MariaDB/MySQL

mariadbMySQL

Here's a rather simple SQL query:

select source_id
from data_sources
join sources on sources.id = data_sources.source_id
where log_time + interval sources.retention_days day < current_timestamp;

There are ~800 records in sources and ~60 million records in data_sources.

The DBMS used is Maria DB:

mysqld  Ver 10.3.22-MariaDB-1ubuntu1 for debian-linux-gnu on x86_64 (Ubuntu 20.04)

Indizes are being used apparently:

+--+-----------+------------+-----+-------------------------------+-------------------------------+-------+----------------------+----+-----------+
|id|select_type|table       |type |possible_keys                  |key                            |key_len|ref                   |rows|Extra      |
+--+-----------+------------+-----+-------------------------------+-------------------------------+-------+----------------------+----+-----------+
|1 |SIMPLE     |sources     |index|PRIMARY,idx_sources_1          |index_sources_on_retention_days|4      |NULL                  |757 |Using index|
|1 |SIMPLE     |data_sources|ref  |index_data_sources_on_source_id|index_data_sources_on_source_id|5      |vse_web_dev.sources.id|445 |Using where|
+--+-----------+------------+-----+-------------------------------+-------------------------------+-------+----------------------+----+-----------+

Surprisingly this query takes about an hour on an 8 core Intel i7.
RAM doesn't seem to be a problem: While the MySQL processes are causing a very high CPU load on multiple cores during execution time, there is plenty of free RAM.

Here the DDLs for the tables involved:

create table sources (
    id int auto_increment primary key,
    name varchar(255) null,
    source varchar(255) null,
    start_time varchar(255) null,
    frequency varchar(255) null,
    stop_time varchar(255) null,
    unit varchar(255) null,
    created_at datetime not null,
    updated_at datetime not null,
    type varchar(255) null,
    move_source_file tinyint(1) default 1 null,
    revision int null,
    data_freq_minute int default 1 null,
    shift_right_time_stamp tinyint(1) default 0 null,
    display_name varchar(255) null,
    source_group_id int null,
    retention_days int default 365 not null
);

create index idx_sources_1 on sources (id);
create index index_sources_on_source_group_id on sources (source_group_id);
create index index_sources_on_retention_days on sources (retention_days);

create table data_sources (
    id bigint auto_increment primary key,
    source_id int null,
    log_time datetime null,
    value float null,
    created_at datetime not null,
    updated_at datetime not null
);

create index idx_data_sources_1 on data_sources (id);
create index idx_data_sources_2 on data_sources (id, source_id, log_time);
create index index_data_sources_on_log_time on data_sources (log_time);
create index index_data_sources_on_source_id on data_sources (source_id);
create index index_data_sources_on_updated_at on data_sources (updated_at);


Why is this taking so long and how to speed it up?

Best Answer

Adding an index to data_sources(source_id, log_time) solved the problem.

create index index_data_sources_source_id_log_time on data_sources(source_id, log_time);

This brought the execution time down to a couple of seconds.

Apparently the calculation of interval sources.retention_days day < current_timestamp doesn't have such a massive performance impact, as I suspected.

Thanks for the helpful comments!