I wrote a query that finds me records in discovery
table that contain a needle from whitelisted
(LIKE %needle%) and don't have a record in logs
table (and some other easy to understand filtering):
SELECT * FROM (
SELECT discovery.id, discovery.resource AS url, discovery.event_datetime, whitelisted.keyword
FROM discovery
INNER JOIN whitelisted
ON discovery.`resource` LIKE concat('%', whitelisted.`keyword`, '%')
LEFT JOIN `logs`
ON `logs`.discovery_id = discovery.id
WHERE
discovery.event_datetime >= NOW()
AND
discovery.provider = 'XXX'
AND
`logs`.id IS NULL
AND
discovery.resource NOT LIKE '%SOMETHING-TO-EXCLUDE%'
LIMIT 0, 20
) logless_resources
GROUP BY logless_resources.url
ORDER BY logless_resources.event_datetime ASC
discovery table: 143k rows
- id UNSIGNED INT AI PRIMARY
- resource VARCHAR(1024) INDEX
- provider ENUM
- event_datetime DATETIME
- created_at DATETIME
whitelisted table: 535 rows (not thousands, just 0.5k)
- id UNSIGNED INT AI PRIMARY
- keyword VARCHAR(128) INDEX
- provider ENUM
- created_at DATETIME
logs: 585k rows
- id UNSIGNED INT AI PRIMARY
- discovery_id UNSIGNED INT FK to discovery.id
Right now this takes 90 seconds – can it be made faster?
Best Answer
The focus needs to be on the "derived" table, since it delivers only 20 rows to the outer parts.
Starting with the
WHERE
clause...So
discovery
needsINDEX(provider, event_datetime)
, in that order.Now to get to the other tables
Logs needs
INDEX(discovery_id)
This is terrible for optimizing. It will need to test every row of either
whitelisted
ordiscovery
. Since I have guessed thatdiscovery
will be the first table, I expect a tedious scan ofwhitelisted
.Do you care that there is no
ORDER BY
? That is, you could be getting a random 20 rows, not the newest 20, not the oldest, etc.(Please use
SHOW CREATE TABLE
when presenting a schema; it is much more precise than prose.)