I'm collecting nmap data every five minutes and storing it in a database. Information about each scan (e.g. start and end time) is stored in the scans
table:
CREATE TABLE `scans` (
`scan_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`start_time` datetime NOT NULL,
`end_time` datetime NOT NULL,
`nmap_version` varchar(20) DEFAULT NULL,
`nmap_args` varchar(255) DEFAULT NULL,
PRIMARY KEY (`scan_id`)
) ENGINE=InnoDB AUTO_INCREMENT=34901 DEFAULT CHARSET=utf8
Information about the hosts scanned (e.g. hostname, MAC address) is stored in the hosts
table:
CREATE TABLE `hosts` (
`scan_id` int(10) unsigned NOT NULL,
`host_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`hostname` varchar(255) NOT NULL,
`ip_address` int(10) unsigned NOT NULL,
`mac_address` bigint(20) unsigned DEFAULT NULL,
`mac_vendor` varchar(255) DEFAULT NULL,
`status` varchar(20) NOT NULL,
`hops` int(10) unsigned DEFAULT NULL,
`last_boot` datetime DEFAULT NULL,
PRIMARY KEY (`host_id`),
KEY `scan_id` (`scan_id`),
KEY `idx_status` (`status`),
KEY `idx_hostname` (`hostname`),
CONSTRAINT `hosts_ibfk_1` FOREIGN KEY (`scan_id`) REFERENCES `scans` (`scan_id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=2262995 DEFAULT CHARSET=utf8
Now I want to fetch the information from the most recent scan of one or more (up to about a thousand) hosts. The most recent scan will not necessarily be the same for all hosts. I also want to fetch the last time each host was up.
I've been using the following query, but it's slow (it takes about six seconds to fetch the data for three hosts):
SELECT hosts.hostname,
INET_NTOA(hosts.ip_address) AS ip,
CONV(hosts.mac_address, 10, 16) AS mac,
hosts.mac_vendor AS mac_vendor,
hosts.status AS status,
scans.start_time AS last_scan,
u.last_seen AS last_seen
FROM hosts
JOIN (
-- ID of most recent scan for each host
SELECT MAX(hosts.scan_id) AS max_scan_id,
hosts.hostname
FROM hosts
WHERE hosts.hostname IN ('foo', 'bar', 'baz')
GROUP BY hosts.hostname
) AS t
ON (hosts.scan_id = t.max_scan_id AND hosts.hostname = t.hostname)
JOIN scans
ON scans.scan_id = t.max_scan_id
JOIN (
-- Last time each host was up
SELECT MAX(scans.start_time) AS last_seen,
hosts.hostname
FROM scans
JOIN hosts
ON scans.scan_id = hosts.scan_id
WHERE hosts.status = 'up'
GROUP BY hosts.hostname
) AS u
ON hosts.hostname = u.hostname
The EXPLAIN
shows that one of the derived queries is doing a table scan:
+------+-------------+------------+--------+----------------------+--------------+---------+--------------------+---------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------+--------+----------------------+--------------+---------+--------------------+---------+---------------------------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 12855 | Using where |
| 1 | PRIMARY | scans | eq_ref | PRIMARY | PRIMARY | 4 | t.max_scan_id | 1 | |
| 1 | PRIMARY | hosts | ref | scan_id,idx_hostname | scan_id | 4 | t.max_scan_id | 81 | Using where |
| 1 | PRIMARY | <derived3> | ref | key0 | key0 | 767 | t.hostname | 10 | |
| 3 | DERIVED | hosts | ref | scan_id,idx_status | idx_status | 62 | const | 1136083 | Using index condition; Using where; Using temporary; Using filesort |
| 3 | DERIVED | scans | eq_ref | PRIMARY | PRIMARY | 4 | wmap.hosts.scan_id | 1 | |
| 2 | DERIVED | hosts | range | idx_hostname | idx_hostname | 767 | NULL | 12855 | Using index condition |
+------+-------------+------------+--------+----------------------+--------------+---------+--------------------+---------+---------------------------------------------------------------------+
I tried adding the IN
clause to the outer query and the second subquery (three IN
clauses total), which worked great when searching for a single host, but actually performed worse than the original with a large number of hosts.
I also tried creating a temporary table to hold only the hostnames to be matched and a temporary table to hold the results of each subquery:
CREATE TEMPORARY TABLE hosts_to_match (
hostname VARCHAR(255) NOT NULL PRIMARY KEY
);
INSERT INTO hosts_to_match
VALUES ('foo'), ('bar'), ('baz');
CREATE TEMPORARY TABLE last_scan (
hostname VARCHAR(255) NOT NULL PRIMARY KEY,
scan_id INT(10) UNSIGNED NOT NULL
);
INSERT INTO last_scan (
scan_id,
hostname
)
SELECT MAX(hosts.scan_id),
hosts.hostname
FROM hosts_to_match
JOIN hosts
ON hosts_to_match.hostname = hosts.hostname
GROUP BY hosts.hostname;
CREATE TEMPORARY TABLE last_seen (
hostname VARCHAR(255) NOT NULL PRIMARY KEY,
last_seen DATETIME NOT NULL
);
INSERT INTO last_seen (
last_seen,
hostname
)
SELECT MAX(scans.start_time),
hosts.hostname
FROM hosts_to_match
JOIN hosts
ON hosts_to_match.hostname = hosts.hostname
JOIN scans
ON scans.scan_id = hosts.scan_id
WHERE hosts.status = 'UP'
GROUP BY hosts.hostname;
SELECT hosts.hostname,
INET_NTOA(hosts.ip_address) AS ip,
CONV(hosts.mac_address, 10, 16) AS mac,
hosts.mac_vendor,
hosts.status,
scans.start_time,
last_seen.last_seen
FROM last_scan
JOIN hosts
ON (last_scan.hostname = hosts.hostname AND last_scan.scan_id = hosts.scan_id)
JOIN scans
ON scans.scan_id = last_scan.scan_id
JOIN last_seen
ON hosts.hostname = last_seen.hostname;
But populating the last_scan
and last_seen
temporary tables is slow for a large number of hosts.
EXPLAIN
for SELECT
used to populate last_scan
:
+------+-------------+----------------+-------+---------------+--------------+---------+------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------------+-------+---------------+--------------+---------+------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | hosts_to_match | index | PRIMARY | PRIMARY | 767 | NULL | 166 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | hosts | ref | idx_hostname | idx_hostname | 767 | wmap.hosts_to_match.hostname | 2573 | |
+------+-------------+----------------+-------+---------------+--------------+---------+------------------------------+------+----------------------------------------------+
EXPLAIN
for SELECT
used to populate last_seen
:
+------+-------------+----------------+--------+---------------------------------+--------------+---------+------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------------+--------+---------------------------------+--------------+---------+------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | hosts_to_match | index | PRIMARY | PRIMARY | 767 | NULL | 166 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | hosts | ref | scan_id,idx_status,idx_hostname | idx_hostname | 767 | wmap.hosts_to_match.hostname | 2573 | Using where |
| 1 | SIMPLE | scans | eq_ref | PRIMARY | PRIMARY | 4 | wmap.hosts.scan_id | 1 | |
+------+-------------+----------------+--------+---------------------------------+--------------+---------+------------------------------+------+----------------------------------------------+
How can I speed this up? I'm using MariaDB 5.5.47.
Best Answer
You can really benefit from an index on
hosts (hostname, scan_id)
for this query, and possibly another one includingstatus
(especially for the second query below). Your query may also benefit from transferring some joins to per-row totals:Also, considering that you already trust the last scan to be the one with the highest id, you may speed-up your query by trusting the
last_seen
time to be the one with the highest id: