Mariadb – Improve query “JOIN … WHERE NOT EXISTS (SELECT 1 FROM … WHERE … )”

mariadbperformancequery-performance

I'm trying to optimize a query because the plan generates a full table scan for a subquery.

Data

The (simplified) table structure:

CREATE TABLE RECORD (
  ID INTEGER PRIMARY KEY AUTO_INCREMENT
);

CREATE TABLE POINT (
  ID INTEGER PRIMARY KEY AUTO_INCREMENT,
  RECORD_ID INTEGER NOT NULL,
  CONSTRAINT POINT_FK1 FOREIGN KEY (RECORD_ID) REFERENCES RECORD(ID)
);

CREATE TABLE SEGMENT (
  ID INTEGER PRIMARY KEY AUTO_INCREMENT,
  POINT_START INTEGER NOT NULL,
  POINT_END INTEGER NOT NULL,
  RECORD_ID INTEGER NOT NULL,
  CONSTRAINT SEGMENT_FK1 FOREIGN KEY (POINT_START) REFERENCES POINT(ID),
  CONSTRAINT SEGMENT_FK2 FOREIGN KEY (POINT_END) REFERENCES POINT(ID),
  CONSTRAINT SEGMENT_FK3 FOREIGN KEY (RECORD_ID) REFERENCES RECORD(ID)
);

The data (in real life the tables have hundreds of thousands of rows):

INSERT INTO RECORD(ID) VALUES (1), (2), (3);
INSERT INTO POINT(ID, RECORD_ID) VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 2);
INSERT INTO SEGMENT(ID, POINT_START, POINT_END, RECORD_ID) VALUES (1, 1, 2, 1), (2, 3, 4, 1);

Query

The (simplified) original query:

SELECT
  RECORD.ID AS RECORD_ID,
  POINT.ID AS POINT_ID
FROM
  RECORD
  LEFT OUTER JOIN POINT
    ON RECORD.ID = POINT.RECORD_ID
    AND (NOT EXISTS (SELECT 1 FROM SEGMENT WHERE POINT_START = POINT.ID OR POINT_END = POINT.ID))

Results:

+-----------+----------+
| RECORD_ID | POINT_ID |
+-----------+----------+
|         1 |     NULL |
|         2 |        5 |
|         2 |        6 |
|         3 |     NULL |
+-----------+----------+

This has a query plan with a full table scan for the subquery:

+------+--------------------+---------+-------+-------------------------+-----------+---------+----------------+------+--------------------------+
| id   | select_type        | table   | type  | possible_keys           | key       | key_len | ref            | rows | Extra                    |
+------+--------------------+---------+-------+-------------------------+-----------+---------+----------------+------+--------------------------+
|    1 | PRIMARY            | RECORD  | index | NULL                    | PRIMARY   | 4       | NULL           |    3 | Using index              |
|    1 | PRIMARY            | POINT   | ref   | POINT_FK1               | POINT_FK1 | 4       | TEST.RECORD.ID |    1 | Using where; Using index |
|    2 | DEPENDENT SUBQUERY | SEGMENT | ALL   | SEGMENT_FK1,SEGMENT_FK2 | NULL      | NULL    | NULL           |    2 | Using where              |
+------+--------------------+---------+-------+-------------------------+-----------+---------+----------------+------+--------------------------+

The intent of the query is to return at least 1 row per RECORD. If there are POINTs that don't belong to a SEGMENT, return each of those on their own row. The results are processed like this:

cur_record = nil
for row in results
    if (cur_record == nil) or (row['RECORD_ID'] != cur_record.id)
        cur_record = createRecord()
        cur_record.id = row['RECORD_ID']
    end
    if row['POINT_ID'] != nil
        point = createPoint()
        point.id = row['POINT_ID']
        cur_record.addPoint(point)
    end
end

Current improved query

The best query I've come up with so far is:

SELECT
  RECORD.ID AS RECORD_ID,
  POINT.ID AS POINT_ID
FROM
  RECORD
  LEFT OUTER JOIN POINT
    ON RECORD.ID = POINT.RECORD_ID
    AND POINT.ID NOT IN (SELECT POINT_START FROM SEGMENT WHERE POINT_START = POINT.ID)
    AND POINT.ID NOT IN (SELECT POINT_END FROM SEGMENT WHERE POINT_END = POINT.ID)

This has the query plan:

+------+--------------------+---------+-------+---------------+-------------+---------+----------------+------+--------------------------+
| id   | select_type        | table   | type  | possible_keys | key         | key_len | ref            | rows | Extra                    |
+------+--------------------+---------+-------+---------------+-------------+---------+----------------+------+--------------------------+
|    1 | PRIMARY            | RECORD  | index | NULL          | PRIMARY     | 4       | NULL           |    3 | Using index              |
|    1 | PRIMARY            | POINT   | ref   | POINT_FK1     | POINT_FK1   | 4       | TEST.RECORD.ID |    1 | Using where; Using index |
|    3 | DEPENDENT SUBQUERY | SEGMENT | ref   | SEGMENT_FK2   | SEGMENT_FK2 | 4       | TEST.POINT.ID  |    1 | Using where; Using index |
|    2 | DEPENDENT SUBQUERY | SEGMENT | ref   | SEGMENT_FK1   | SEGMENT_FK1 | 4       | TEST.POINT.ID  |    1 | Using where; Using index |
+------+--------------------+---------+-------+---------------+-------------+---------+----------------+------+--------------------------+

This causes it to use the FK indexes, which makes the query a lot faster. However, it has two dependent subqueries.

Is there a better query I could use?

Best Answer

I tried the LEFT JOIN .. IS NULL idea I noted in the comment:

SELECT
  RECORD.ID AS RECORD_ID,
  POINT.ID AS POINT_ID
FROM
  RECORD
  LEFT OUTER JOIN (
    POINT
    LEFT JOIN SEGMENT s1 ON POINT.ID = s1.POINT_START
    LEFT JOIN SEGMENT s2 ON POINT.ID = s2.POINT_END
  ) ON RECORD.ID = POINT.RECORD_ID
      AND s1.POINT_START IS NULL AND s2.POINT_END IS NULL;

http://sqlfiddle.com/#!9/f6e9ca/1

The dependent subqueries are now joins instead, using the same indexes but it should have less overhead - dependent subquery means firing a subquery per main (outer) select row which is not as well optimized as joins are.