MySQL: need help making a query faster

MySQLmysql-5.7optimization

I wrote a query that finds me records in discovery table that contain a needle from whitelisted (LIKE %needle%) and don't have a record in logs table (and some other easy to understand filtering):

SELECT * FROM (
    SELECT discovery.id, discovery.resource AS url, discovery.event_datetime, whitelisted.keyword 
    FROM discovery
    INNER JOIN whitelisted
    ON discovery.`resource` LIKE concat('%', whitelisted.`keyword`, '%')
    LEFT JOIN `logs`
    ON `logs`.discovery_id = discovery.id
    WHERE
    discovery.event_datetime >= NOW()
    AND
    discovery.provider = 'XXX'
    AND
    `logs`.id IS NULL
    AND
    discovery.resource NOT LIKE '%SOMETHING-TO-EXCLUDE%'
    LIMIT 0, 20
) logless_resources
GROUP BY logless_resources.url
ORDER BY logless_resources.event_datetime ASC

discovery table: 143k rows

  • id UNSIGNED INT AI PRIMARY
  • resource VARCHAR(1024) INDEX
  • provider ENUM
  • event_datetime DATETIME
  • created_at DATETIME

whitelisted table: 535 rows (not thousands, just 0.5k)

  • id UNSIGNED INT AI PRIMARY
  • keyword VARCHAR(128) INDEX
  • provider ENUM
  • created_at DATETIME

logs: 585k rows

  • id UNSIGNED INT AI PRIMARY
  • discovery_id UNSIGNED INT FK to discovery.id

Right now this takes 90 seconds – can it be made faster?

Best Answer

The focus needs to be on the "derived" table, since it delivers only 20 rows to the outer parts.

        SELECT  d.id, d.resource AS url, d.event_datetime,
                wl.keyword
            FROM  discovery AS d
            INNER JOIN  whitelisted AS wl
                    ON d.`resource` LIKE concat('%', wl.`keyword`, '%')
            LEFT JOIN  `logs`  ON `logs`.discovery_id = d.id
            WHERE  d.event_datetime >= NOW()
              AND  d.provider = 'XXX'
              AND  `logs`.id IS NULL
              AND  d.resource NOT LIKE '%SOMETHING-TO-EXCLUDE%'
            LIMIT  0, 20 

Starting with the WHERE clause...

              d.event_datetime >= NOW() -- last in index (range)
              d.provider = 'XXX'  -- first in the index because of "="
              `logs`.id IS NULL   -- another table, not very selective
              d.resource NOT LIKE '%SOMETHING-TO-EXCLUDE%'  -- probably useless

So discovery needs INDEX(provider, event_datetime), in that order.

Now to get to the other tables

ON `logs`.discovery_id = d.id

Logs needs INDEX(discovery_id)

ON d.`resource` LIKE concat('%', wl.`keyword`, '%')

This is terrible for optimizing. It will need to test every row of either whitelisted or discovery. Since I have guessed that discovery will be the first table, I expect a tedious scan of whitelisted.

Do you care that there is no ORDER BY? That is, you could be getting a random 20 rows, not the newest 20, not the oldest, etc.

(Please use SHOW CREATE TABLE when presenting a schema; it is much more precise than prose.)