Mysql – Optimizing multi-table left joins

join;MySQL

I have four tables which I am trying to join together.

Table event

+------------+------------------+------+-----+---------+-------+
| Field      | Type             | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+-------+
| sid        | int(10) unsigned | NO   | PRI | NULL    |       |
| cid        | int(10) unsigned | NO   | PRI | NULL    |       |
| signature  | int(10) unsigned | NO   | MUL | NULL    |       |
| timestamp  | datetime         | NO   | MUL | NULL    |       |
| is_deleted | tinyint(1)       | NO   | MUL | 0       |       |
+------------+------------------+------+-----+---------+-------+

Indexes – sid,cid,timestamp,is_deleted

Table iphdr

+----------+----------------------+------+-----+---------+-------+
| Field    | Type                 | Null | Key | Default | Extra |
+----------+----------------------+------+-----+---------+-------+
| sid      | int(10) unsigned     | NO   | PRI | NULL    |       |
| cid      | int(10) unsigned     | NO   | PRI | NULL    |       |
| ip_src   | int(10) unsigned     | NO   | MUL | NULL    |       |
| ip_dst   | int(10) unsigned     | NO   | MUL | NULL    |       |
+----------+----------------------+------+-----+---------+-------+

Indexes – sid,cid,(ip_src,ip_dst)

Table tcphdr

+-----------+----------------------+------+-----+---------+-------+
| Field     | Type                 | Null | Key | Default | Extra |
+-----------+----------------------+------+-----+---------+-------+
| sid       | int(10) unsigned     | NO   | PRI | NULL    |       |
| cid       | int(10) unsigned     | NO   | PRI | NULL    |       |
| tcp_sport | smallint(5) unsigned | NO   | MUL | NULL    |       |
| tcp_dport | smallint(5) unsigned | NO   | MUL | NULL    |       |
+-----------+----------------------+------+-----+---------+-------+

Indexes – sid,cid

Table udphdr

+-----------+----------------------+------+-----+---------+-------+
| Field     | Type                 | Null | Key | Default | Extra |
+-----------+----------------------+------+-----+---------+-------+
| sid       | int(10) unsigned     | NO   | PRI | NULL    |       |
| cid       | int(10) unsigned     | NO   | PRI | NULL    |       |
| udp_sport | smallint(5) unsigned | NO   | MUL | NULL    |       |
| udp_dport | smallint(5) unsigned | NO   | MUL | NULL    |       |
+-----------+----------------------+------+-----+---------+-------+

Indexes – sid,cid

The fields 'sid','cid' are common to all the tables and are primary keys in all. The first two tables have about 1M rows while the next two have about 600K and 200K rows respectively.

I need to find distinct IP links for not deleted events. So I write a query like

SELECT 
inet_ntoa(iphdr.ip_src) AS src_ip, iphdr.ip_src AS src_ip_inet, inet_ntoa(iphdr.ip_dst) AS dst_ip, iphdr.ip_dst AS dst_ip_inet, count('*') AS total, count(DISTINCT event.signature) AS count_unique  
FROM 
event
INNER JOIN iphdr ON event.sid = iphdr.sid AND event.cid = iphdr.cid 
LEFT OUTER JOIN tcphdr ON iphdr.sid = tcphdr.sid AND iphdr.cid = tcphdr.cid 
LEFT OUTER JOIN udphdr ON iphdr.sid = udphdr.sid AND iphdr.cid = udphdr.cid 
WHERE event.is_deleted = 0
GROUP BY iphdr.ip_src, iphdr.ip_dst 
order by total desc 
LIMIT 5

The query runs fine and returns result in about 7 seconds.

The MySQL query plan is

-+--------+---------------------------------+
| id | select_type | table  | type   | possible_keys                                  | key     | key_len | ref                               | rows   | Extra                           |
+----+-------------+--------+--------+------------------------------------------------+---------+---------+-----------------------------------+--------+---------------------------------+
|  1 | SIMPLE      | event  | ref    | PRIMARY,index_event_sid,index_event_cid,is_del | is_del  | 1       | const                             | 548283 | Using temporary; Using filesort |
|  1 | SIMPLE      | iphdr  | eq_ref | PRIMARY,index_iphdr_sid,index_iphdr_cid        | PRIMARY | 8       | test.event.sid,test.event.cid |      1 |                                 |
|  1 | SIMPLE      | tcphdr | eq_ref | PRIMARY,index_tcphdr_sid,index_tcphdr_cid      | PRIMARY | 8       | test.iphdr.sid,test.event.cid |      1 | Using index                     |
|  1 | SIMPLE      | udphdr | eq_ref | PRIMARY,index_udphdr_sid,index_udphdr_cid      | PRIMARY | 8       | test.iphdr.sid,test.event.cid |      1 | Using index                     |
+----+-------------+--------+--------+------------------------------------------------+---------+---------+-----------------------------------+--------+---------------------------------+

As can be seen the index iplink (ip_src,ip_dst) is not used.

If I drop the where clause,

SELECT 
inet_ntoa(iphdr.ip_src) AS src_ip, iphdr.ip_src AS src_ip_inet, inet_ntoa(iphdr.ip_dst) AS dst_ip, iphdr.ip_dst AS dst_ip_inet, count('*') AS total, count(DISTINCT event.signature) AS count_unique  
FROM 
iphdr 
LEFT OUTER JOIN tcphdr ON iphdr.sid = tcphdr.sid AND iphdr.cid = tcphdr.cid 
LEFT OUTER JOIN udphdr ON iphdr.sid = udphdr.sid AND iphdr.cid = udphdr.cid 
INNER JOIN event ON event.sid = iphdr.sid AND event.cid = iphdr.cid 
GROUP BY iphdr.ip_src, iphdr.ip_dst 
order by total desc 
LIMIT 5

The query is 2X faster (Runs in about 3.9s) and returns correct result. Also, the query plan is altered.

+----+-------------+--------+--------+-------------------------------------------+---------+---------+-----------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table  | type   | possible_keys                             | key     | key_len | ref                               | rows    | filtered | Extra                                        |
+----+-------------+--------+--------+-------------------------------------------+---------+---------+-----------------------------------+---------+----------+----------------------------------------------+
|  1 | SIMPLE      | iphdr  | index  | PRIMARY,index_iphdr_sid,index_iphdr_cid   | iplink  | 8       | NULL                              | 1065036 |   100.00 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | tcphdr | eq_ref | PRIMARY,index_tcphdr_sid,index_tcphdr_cid | PRIMARY | 8       | test.iphdr.sid,test.iphdr.cid |       1 |   100.00 | Using index                                  |
|  1 | SIMPLE      | udphdr | eq_ref | PRIMARY,index_udphdr_sid,index_udphdr_cid | PRIMARY | 8       | test.iphdr.sid,test.iphdr.cid |       1 |   100.00 | Using index                                  |
|  1 | SIMPLE      | event  | eq_ref | PRIMARY,index_event_sid,index_event_cid   | PRIMARY | 8       | test.iphdr.sid,test.iphdr.cid |       1 |   100.00 |                                              |
+----+-------------+--------+--------+-------------------------------------------+---------+---------+-----------------------------------+---------+----------+----------------------------------------------+

But since I need to filter out deleted events, I add WHERE clause to the INNER JOIN

SELECT 
inet_ntoa(iphdr.ip_src) AS src_ip, iphdr.ip_src AS src_ip_inet, inet_ntoa(iphdr.ip_dst) AS dst_ip, iphdr.ip_dst AS dst_ip_inet, count('*') AS total, count(DISTINCT event.signature) AS count_unique  
FROM 
event
INNER JOIN iphdr ON event.sid = iphdr.sid AND event.cid = iphdr.cid and event.is_deleted = 0
LEFT OUTER JOIN tcphdr ON iphdr.sid = tcphdr.sid AND iphdr.cid = tcphdr.cid 
LEFT OUTER JOIN udphdr ON iphdr.sid = udphdr.sid AND iphdr.cid = udphdr.cid 
GROUP BY iphdr.ip_src, iphdr.ip_dst 
order by total desc 
LIMIT 5

This query runs in about 7s and gives same query plan as query 1

I also tried to use subquery to select only those events that had is_deleted=0 but it gave me an even worse performance (9s)

My questions are –

  1. How do I optimize the first query to make use of the index on (ip_src,ip_dst) (i.e. my group by terms) without sacrificing the where clause?
  2. Why does the where clause (or join with event.is_deleted=0) lead to such a drastic resetting of query plan
  3. Can I re-write the query to make it more efficient?

Best Answer

PLEASE use SHOW CREATE TABLE; DESCRIBE is not descriptive enough.

You seem to be working with IPv4 IP addresses. IPv6 is upon us!

Your query inherently must do a full table scan in order to get the "total" upon which you do the ORDER BY. Since there are a million rows, it will take time.

count('*')

The quotes are not needed.

The two SELECTs are likely to be identical; order across JOIN does not matter. It does matter across LEFT JOIN. (INNER and OUTER add nothing.)

Suggestion...

First, see if this gives the correct information (albeit incomplet info):

SELECT  e.sid, e.cid, COUNT(*) AS total
FROM  event AS e
JOIN  iphdr as h ON e.sid = h.sid
                AND  e.cid = h.cid
WHERE  e.is_deleted = 0
GROUP BY  i.ip_src, i.ip_dst
ORDER BY  total DESC
LIMIT  5;

If so, then use that as a subquery, a la FROM (that-select) and LEFT JOIN to the other two tables to get the rest of the info.

The trick... The subquery is doing less work since it is touching 2 tables (vs 4). The LEFT JOINs only have to do 5 probes each, not a million.