Mysql Query Optimisation For Joining a BETWEEN Date and Date query

MySQLmysql-5.5optimizationperformancequery-performance

I have some instagram API data sitting across two tables. One with post_data, and one with profile_data.

The profile_data table has a timestamp of when the account was checked through the API to get the number of followers the account has at a point in time.

The post_data table has the timestamp of a post.

I need to join these two together with a best fit timestamp to calculate an engagement rate (i.e. likes on a post / followers of account).

The profile_data table has just under 13k rows of data and the
post_data table has just under 6k of rows.

There is a third relational table called brands which contains the brand featured in a post_id

My current query looks like this:

SELECT p.ent_id,
       p.post_time,
       p.record_updated,
       p.last_checked,
       p.post_id, 
       p.id, 
       p.image,
       p.caption,
       p.tags,
       p.likes as 'likes',
       p.comment_count as 'comment_count',
       p.comments,
       p.users_in_photo,
       p.post_type,
       p.link,
       p.gender,
       p.model,
       p.content_type,
       p.location,
       p.campaign,
       GROUP_CONCAT(b.brand SEPARATOR',') AS brands, 
       d.followers,
       ((p.likes+p.comment_count)/d.followers)*100 AS 'engagement_rate', 
       p.reach as 'reach', 
       p.impressions as 'impressions', 
       p.saved as 'saved', 
       ((p.reach/d.followers)*100) as 'reach_rate'
FROM      post_data p
LEFT JOIN featured_brands b ON p.post_id = b.post_id
LEFT JOIN profile_data d    ON p.post_time BETWEEN SUBTIME(d.time_checked, '00:30:00') 
                                               AND ADDTIME(d.time_checked, '00:30:00')
WHERE p.post_time > '2019-01-02 00:00:00' 
  AND p.post_time < '2019-05-13 23:59:59'
GROUP BY p.post_id
ORDER BY likes DESC;

This query achieves the goal however for the date range above, the query takes almost 27seconds which is not acceptable. I have identified that changed the SUBTIME and ADDTIME to the equivalent of (d.time_checked - interval 30 minutes) I can get this down to about 15 seconds but this is still too long.

I figure the BETWEEN call is my major problem but I'm unsure of how to achieve the same result more effectively.

Are there any other methods that I could achieve this result in a more efficient manner?

Edit:

Mysql version 5.5.42.

DDL:

CREATE TABLE `featured_brands` (
  `post_id` varchar(255) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
  `brand` varchar(255) CHARACTER SET utf8mb4 NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=latin1;



CREATE TABLE `post_data` (
  `ent_id` int(11) NOT NULL AUTO_INCREMENT,
  `post_time` timestamp NULL DEFAULT NULL,
  `record_updated` timestamp NULL DEFAULT NULL,
  `last_checked` timestamp NULL DEFAULT NULL,
  `post_id` varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL DEFAULT '',
  `image` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `caption` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `tags` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `likes` int(7) NOT NULL,
  `comment_count` int(7) NOT NULL,
  `comments` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `users_in_photo` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `post_type` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `link` text CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `gender` varchar(1000) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
  `model` varchar(1000) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
  `content_type` varchar(255) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
  `location` varchar(255) CHARACTER SET utf8mb4 NOT NULL DEFAULT '',
  `campaign` varchar(255) CHARACTER SET utf8mb4 DEFAULT NULL,
  `id` varchar(50) CHARACTER SET utf8mb4 DEFAULT NULL,
  `ig_id` varchar(50) CHARACTER SET utf8mb4 DEFAULT NULL,
  `engagement` int(11) DEFAULT NULL,
  `reach` int(11) DEFAULT NULL,
  `saved` int(11) DEFAULT NULL,
  `impressions` int(11) DEFAULT NULL,
  `no_advanced_data` int(11) DEFAULT NULL,
  PRIMARY KEY (`ent_id`),
  UNIQUE KEY `post_id` (`post_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


CREATE TABLE `profile_data` (
  `record_id` int(11) NOT NULL,
  `time_checked` timestamp NULL DEFAULT NULL,
  `profile_id` int(7) DEFAULT NULL,
  `username` varchar(255) DEFAULT NULL,
  `profile_picture` varchar(255) DEFAULT NULL,
  `full_name` varchar(255) DEFAULT NULL,
  `bio` text,
  `website` varchar(255) DEFAULT NULL,
  `is_business` int(1) DEFAULT NULL,
  `media_count` int(11) DEFAULT NULL,
  `follows` int(11) DEFAULT NULL,
  `followers` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Best Answer

Some optimizations in these areas have been made since 5.5. I recommend you upgrade.

Don't use LEFT unless you need the semantics of it.

Avoid joining to b in the following manner: Change GROUP_CONCAT(b.brand SEPARATOR',') AS brands to

 ( SELECT GROUP_CONCAT(brand SEPARATOR',')
       FROM featured_brands WHERE post_id = p.post_id
 )  AS brands

and get rid of the LEFT JOIN ...

This may help the "explode-implode" pattern of JOINing, then collapsing via GROUP BY.

You may be able to get rid of the GROUP BY altogether by turning the other LEFT JOIN into a "derived table":

SELECT ...
    FROM ( derived-table (see below) ) AS x
    JOIN post_data AS p  ON p.post_id = x.post_id

The derived table would be a SELECT that JOIN post_data and profile_data with the primary goal of finding which post_ids are interesting.

SELECT p1.post_id, d.followers
    FROM post_id AS p1
    JOIN profile_data AS d ON ...
    WHERE post_time ...

Indexes...

profile_data:  (time_checked, followers)  -- in this order; covering
post_data:  (post_time, post_id)
featured_brands  (post_id, brand)  -- unless you have PRIMARY KEY(post_id)