Query Optimization – How to Optimize a Query with Multiple Derived Tables Containing ‘IN’ and ‘GROUP BY’

mariadbperformancequery-performance

I'm collecting nmap data every five minutes and storing it in a database. Information about each scan (e.g. start and end time) is stored in the scans table:

CREATE TABLE `scans` (
  `scan_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `start_time` datetime NOT NULL,
  `end_time` datetime NOT NULL,
  `nmap_version` varchar(20) DEFAULT NULL,
  `nmap_args` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`scan_id`)
) ENGINE=InnoDB AUTO_INCREMENT=34901 DEFAULT CHARSET=utf8

Information about the hosts scanned (e.g. hostname, MAC address) is stored in the hosts table:

CREATE TABLE `hosts` (
  `scan_id` int(10) unsigned NOT NULL,
  `host_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `hostname` varchar(255) NOT NULL,
  `ip_address` int(10) unsigned NOT NULL,
  `mac_address` bigint(20) unsigned DEFAULT NULL,
  `mac_vendor` varchar(255) DEFAULT NULL,
  `status` varchar(20) NOT NULL,
  `hops` int(10) unsigned DEFAULT NULL,
  `last_boot` datetime DEFAULT NULL,
  PRIMARY KEY (`host_id`),
  KEY `scan_id` (`scan_id`),
  KEY `idx_status` (`status`),
  KEY `idx_hostname` (`hostname`),
  CONSTRAINT `hosts_ibfk_1` FOREIGN KEY (`scan_id`) REFERENCES `scans` (`scan_id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2262995 DEFAULT CHARSET=utf8

Now I want to fetch the information from the most recent scan of one or more (up to about a thousand) hosts. The most recent scan will not necessarily be the same for all hosts. I also want to fetch the last time each host was up.

I've been using the following query, but it's slow (it takes about six seconds to fetch the data for three hosts):

SELECT hosts.hostname,
       INET_NTOA(hosts.ip_address) AS ip,
       CONV(hosts.mac_address, 10, 16) AS mac,
       hosts.mac_vendor AS mac_vendor,
       hosts.status AS status,
       scans.start_time AS last_scan,
       u.last_seen AS last_seen
FROM hosts
JOIN (
       -- ID of most recent scan for each host
       SELECT MAX(hosts.scan_id) AS max_scan_id,
              hosts.hostname
       FROM hosts
       WHERE hosts.hostname IN ('foo', 'bar', 'baz')
       GROUP BY hosts.hostname
     ) AS t
  ON (hosts.scan_id = t.max_scan_id AND hosts.hostname = t.hostname)
JOIN scans
  ON scans.scan_id = t.max_scan_id
JOIN (
       -- Last time each host was up
       SELECT MAX(scans.start_time) AS last_seen,
              hosts.hostname
       FROM scans
       JOIN hosts
         ON scans.scan_id = hosts.scan_id
       WHERE hosts.status = 'up'
       GROUP BY hosts.hostname
     ) AS u
  ON hosts.hostname = u.hostname

The EXPLAIN shows that one of the derived queries is doing a table scan:

+------+-------------+------------+--------+----------------------+--------------+---------+--------------------+---------+---------------------------------------------------------------------+
| id   | select_type | table      | type   | possible_keys        | key          | key_len | ref                | rows    | Extra                                                               |
+------+-------------+------------+--------+----------------------+--------------+---------+--------------------+---------+---------------------------------------------------------------------+
|    1 | PRIMARY     | <derived2> | ALL    | NULL                 | NULL         | NULL    | NULL               |   12855 | Using where                                                         |
|    1 | PRIMARY     | scans      | eq_ref | PRIMARY              | PRIMARY      | 4       | t.max_scan_id      |       1 |                                                                     |
|    1 | PRIMARY     | hosts      | ref    | scan_id,idx_hostname | scan_id      | 4       | t.max_scan_id      |      81 | Using where                                                         |
|    1 | PRIMARY     | <derived3> | ref    | key0                 | key0         | 767     | t.hostname         |      10 |                                                                     |
|    3 | DERIVED     | hosts      | ref    | scan_id,idx_status   | idx_status   | 62      | const              | 1136083 | Using index condition; Using where; Using temporary; Using filesort |
|    3 | DERIVED     | scans      | eq_ref | PRIMARY              | PRIMARY      | 4       | wmap.hosts.scan_id |       1 |                                                                     |
|    2 | DERIVED     | hosts      | range  | idx_hostname         | idx_hostname | 767     | NULL               |   12855 | Using index condition                                               |
+------+-------------+------------+--------+----------------------+--------------+---------+--------------------+---------+---------------------------------------------------------------------+

I tried adding the IN clause to the outer query and the second subquery (three IN clauses total), which worked great when searching for a single host, but actually performed worse than the original with a large number of hosts.

I also tried creating a temporary table to hold only the hostnames to be matched and a temporary table to hold the results of each subquery:

CREATE TEMPORARY TABLE hosts_to_match (
    hostname VARCHAR(255) NOT NULL PRIMARY KEY
);

INSERT INTO hosts_to_match
VALUES ('foo'), ('bar'), ('baz');

CREATE TEMPORARY TABLE last_scan (
    hostname VARCHAR(255) NOT NULL PRIMARY KEY, 
    scan_id INT(10) UNSIGNED NOT NULL
);

INSERT INTO last_scan (
    scan_id, 
    hostname
) 
SELECT MAX(hosts.scan_id),
       hosts.hostname 
FROM hosts_to_match 
JOIN hosts 
  ON hosts_to_match.hostname = hosts.hostname 
GROUP BY hosts.hostname;

CREATE TEMPORARY TABLE last_seen (
    hostname VARCHAR(255) NOT NULL PRIMARY KEY,
    last_seen DATETIME NOT NULL
);

INSERT INTO last_seen (
    last_seen, 
    hostname
) 
SELECT MAX(scans.start_time),
       hosts.hostname 
FROM hosts_to_match 
JOIN hosts 
  ON hosts_to_match.hostname = hosts.hostname 
JOIN scans 
  ON scans.scan_id = hosts.scan_id 
WHERE hosts.status = 'UP' 
GROUP BY hosts.hostname;

SELECT hosts.hostname, 
       INET_NTOA(hosts.ip_address) AS ip,
       CONV(hosts.mac_address, 10, 16) AS mac, 
       hosts.mac_vendor,
       hosts.status,
       scans.start_time,
       last_seen.last_seen
FROM last_scan
JOIN hosts
  ON (last_scan.hostname = hosts.hostname AND last_scan.scan_id = hosts.scan_id)
JOIN scans
  ON scans.scan_id = last_scan.scan_id
JOIN last_seen
  ON hosts.hostname = last_seen.hostname;

But populating the last_scan and last_seen temporary tables is slow for a large number of hosts.

EXPLAIN for SELECT used to populate last_scan:

+------+-------------+----------------+-------+---------------+--------------+---------+------------------------------+------+----------------------------------------------+
| id   | select_type | table          | type  | possible_keys | key          | key_len | ref                          | rows | Extra                                        |
+------+-------------+----------------+-------+---------------+--------------+---------+------------------------------+------+----------------------------------------------+
|    1 | SIMPLE      | hosts_to_match | index | PRIMARY       | PRIMARY      | 767     | NULL                         |  166 | Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | hosts          | ref   | idx_hostname  | idx_hostname | 767     | wmap.hosts_to_match.hostname | 2573 |                                              |
+------+-------------+----------------+-------+---------------+--------------+---------+------------------------------+------+----------------------------------------------+

EXPLAIN for SELECT used to populate last_seen:

+------+-------------+----------------+--------+---------------------------------+--------------+---------+------------------------------+------+----------------------------------------------+
| id   | select_type | table          | type   | possible_keys                   | key          | key_len | ref                          | rows | Extra                                        |
+------+-------------+----------------+--------+---------------------------------+--------------+---------+------------------------------+------+----------------------------------------------+
|    1 | SIMPLE      | hosts_to_match | index  | PRIMARY                         | PRIMARY      | 767     | NULL                         |  166 | Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | hosts          | ref    | scan_id,idx_status,idx_hostname | idx_hostname | 767     | wmap.hosts_to_match.hostname | 2573 | Using where                                  |
|    1 | SIMPLE      | scans          | eq_ref | PRIMARY                         | PRIMARY      | 4       | wmap.hosts.scan_id           |    1 |                                              |
+------+-------------+----------------+--------+---------------------------------+--------------+---------+------------------------------+------+----------------------------------------------+

How can I speed this up? I'm using MariaDB 5.5.47.

Best Answer

You can really benefit from an index on hosts (hostname, scan_id) for this query, and possibly another one including status (especially for the second query below). Your query may also benefit from transferring some joins to per-row totals:

CREATE INDEX idx_hostname_scanid ON hosts (hostname, scan_id);
CREATE INDEX idx_hostname_status_scanid ON hosts (hostname, status, scan_id);

SELECT hosts.hostname,
       INET_NTOA(hosts.ip_address) AS ip,
       CONV(hosts.mac_address, 10, 16) AS mac,
       hosts.mac_vendor AS mac_vendor,
       hosts.status AS status,
       scans.start_time AS last_scan,
       (SELECT MAX(scans.start_time)
        FROM hosts
        JOIN scans ON (scans.scan_id = hosts.scan_id)
        WHERE hosts.hostname = t.hostname AND hosts.status = 'up') AS last_seen
FROM (
       -- ID of most recent scan for each host
       SELECT MAX(hosts.scan_id) AS max_scan_id, hosts.hostname
       FROM hosts
       WHERE hosts.hostname IN ('foo', 'bar', 'baz')
       GROUP BY hosts.hostname
     ) t
JOIN hosts ON (hosts.hostname = t.hostname AND hosts.scan_id = t.max_scan_id)
JOIN scans ON (scans.scan_id = t.max_scan_id);

Also, considering that you already trust the last scan to be the one with the highest id, you may speed-up your query by trusting the last_seen time to be the one with the highest id:

CREATE INDEX idx_hostname_scanid ON hosts (hostname, scan_id);
CREATE INDEX idx_hostname_status_scanid ON hosts (hostname, status, scan_id);

SELECT hosts.hostname,
       INET_NTOA(hosts.ip_address) AS ip,
       CONV(hosts.mac_address, 10, 16) AS mac,
       hosts.mac_vendor AS mac_vendor,
       hosts.status AS status,
       scans.start_time AS last_scan,
       lss.start_time AS last_seen
FROM (
       -- ID of most recent scan for each host
       SELECT MAX(hosts.scan_id) AS max_scan_id, hosts.hostname
       FROM hosts
       WHERE hosts.hostname IN ('foo', 'bar', 'baz')
       GROUP BY hosts.hostname
     ) t
JOIN hosts ON (hosts.hostname = t.hostname AND hosts.scan_id = t.max_scan_id)
JOIN scans ON (scans.scan_id = t.max_scan_id)
LEFT JOIN (
       SELECT MAX(hosts.scan_id) AS max_scan_id, hosts.hostname
       FROM hosts
       WHERE hosts.hostname IN ('foo', 'bar', 'baz') AND hosts.status = 'up'
       GROUP BY hosts.hostname
     ) ls ON (ls.hostname = t.hostname)
LEFT JOIN scans lss ON (lss.scan_id = ls.max_scan_id);