MySQL Optimization – Range Duplicate Query

datetimeMySQLoptimizationquery-performance

I need to eliminate duplicates from a time sheet. I found this solution and have adapted it for my own needs:

DROP TABLE IF EXISTS `activity`;
CREATE TABLE IF NOT EXISTS `activity` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `planned_start` datetime DEFAULT NULL,
  `planned_end` datetime DEFAULT NULL,
  `actual_start` datetime DEFAULT NULL,
  `actual_end` datetime DEFAULT NULL,
  `code_id` int(11) DEFAULT NULL,
  `setting_id` int(11) DEFAULT NULL,
  `notes` text,
  `travel_distance` decimal(8,2) DEFAULT NULL,
  `created_by` int(11) NOT NULL,
  `updated_by` int(11) DEFAULT NULL,
  `submitted` tinyint(1) DEFAULT NULL,
  `approved` datetime DEFAULT NULL,
  `approved_by` int(11) DEFAULT NULL,
  `created` datetime NOT NULL,
  `updated` datetime NOT NULL,
  `peer_engagement_id` int(11) DEFAULT NULL,
  `person_id` int(11) DEFAULT NULL,
  `travel_notes` varchar(8000) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `id` (`id`),
  KEY `code_id_idx` (`code_id`),
  KEY `setting_id_idx` (`setting_id`),
  KEY `created_by_idx` (`created_by`),
  KEY `updated_by_idx` (`updated_by`),
  KEY `approved_by_idx` (`approved_by`),
  KEY `activity_peer_engagement_id_fk` (`peer_engagement_id`),
  KEY `activity_person_id_fk` (`person_id`),
  KEY `actual_start` (`actual_start`,`actual_end`),
  KEY `created` (`created`),
  KEY `person_id` (`person_id`,`actual_start`,`actual_end`)
) ENGINE=InnoDB  DEFAULT CHARSET=latin1 AUTO_INCREMENT=165796 ;


SELECT
  COUNT(*) as occurrence
  , sub.id
  , SEC_TO_TIME(SUM(
      IF(a2start > a1start, a1end - a2start, a2end - a1start))) as duration
FROM
  (  SELECT
       a1.id
      , UNIX_TIMESTAMP(a1.actual_start) as a1start
      , UNIX_TIMESTAMP(a1.actual_end) as a1end
      , UNIX_TIMESTAMP(a2.actual_start) as a2start
      , UNIX_TIMESTAMP(a2.actual_end) as a2end
    FROM activity a1
    INNER JOIN activity a2
      ON (a1.id <> a2.id and a1.person_id=a2.person_id
      AND NOT(a1.actual_start > a2.actual_end OR a1.actual_end < a2.actual_start))
  ) sub

Problem is I can't even run explain on my query, my mysql server goes into 100% CPU usage and seems to stay there for minutes.

I can run explain on the inner query:

explain SELECT
   a1.id
  , UNIX_TIMESTAMP(a1.actual_start) as a1start
  , UNIX_TIMESTAMP(a1.actual_end) as a1end
  , UNIX_TIMESTAMP(a2.actual_start) as a2start
  , UNIX_TIMESTAMP(a2.actual_end) as a2end
FROM activity a1
INNER JOIN activity a2
  ON (a1.id <> a2.id and a1.person_id=a2.person_id
  AND NOT(a1.actual_start > a2.actual_end OR a1.actual_end < a2.actual_start))

+----+-------------+-------+-------+----------------------------------------------+-----------+---------+--------------------------------------+--------+--------------------------+
| id | select_type | table | type  | possible_keys                                | key       | key_len | ref                                  | rows   | Extra                    |
+----+-------------+-------+-------+----------------------------------------------+-----------+---------+--------------------------------------+--------+--------------------------+
|  1 | SIMPLE      | a1    | index | activity_person_id_fk,actual_start,person_id | person_id | 23      | NULL                                 | 176586 | Using index              |
|  1 | SIMPLE      | a2    | ref   | activity_person_id_fk,actual_start,person_id | person_id | 5       | mabel_mindandbody_co_nz.a1.person_id |  19705 | Using where; Using index |
+----+-------------+-------+-------+----------------------------------------------+-----------+---------+--------------------------------------+--------+--------------------------+
2 rows in set (0.00 sec)

My questions:

  • Why doesn't explain work here?
  • How can I optimize this query to give acceptable speed results?

In regards to optimization – I cannot find anything else than the indexes I already used in my table.

One other option I have thought about is to add an additional field, encoding every day into a number. I do know that a time sheet entry is never longer than 24 hours and I'm sure excluding midnight spanning time sheet entries would be acceptable. So with this I would hope to use a smaller index on this additional column on the inside query.

Best Answer

With the assumption of no timesheet entry spanning midnight, I have added a column

person_date varchar(30) not null

This is being calculated overnight with

update activity set person_date = concat(person_id , '_',date(actual_start) ) where person_date='';

I have also simplified the inside query to:

SELECT a1.id, 
a1.person_id, 
UNIX_TIMESTAMP( a1.actual_start ) AS a1start,
UNIX_TIMESTAMP( a1.actual_end ) AS a1end, 
UNIX_TIMESTAMP( a2.actual_start ) AS a2start, 
UNIX_TIMESTAMP( a2.actual_end ) AS a2end
FROM activity a1
INNER JOIN activity a2
WHERE (
a1.id < a2.id
AND a1.person_date = a2.person_date
AND a1.actual_start < a2.actual_end
AND a1.actual_end > a2.actual_start

Firstly, because my users record their time to hourly boundaries a lot and we have '0:00' overlap then, secondly with a1.id<>a2.id, we find every entry twice, when we need it only one.

I have concluded that there is no good way to optimize for a1.id < a2.id, so with person_date I'm optimizing for something else.