Mysql – JOIN with the last id

join;MySQLperformancequery-performancesubquery

I want to select a record from one table (customers table) and join only with the last from other related table (sales table). I know two ways of doing it, but I don't know which one have better performance. What is the best approach?

Note: in my case I'll get only one row because I want the data for one specific customer but what if I want all the main rows, does it change something?

Option 1, using sub-query in WHERE:

SELECT c.id, c.name, s.id, s.cost
FROM
  customers as c
  LEFT JOIN sales as s ON s.customer_id = c.id
WHERE
  c.id = 42
  AND s.id = (SELECT MAX(s2.id) FROM sales as s2 WHERE s2.customer_id = c.id) ;

Option 2, using sub-query in FROM:

SELECT c.id, c.name, s.id, s.cost
FROM
  customers as c
  LEFT JOIN (
    SELECT s2.customer_id as cid, MAX(s2.id) as maxid 
    FROM sales as s2 
    WHERE s2.customer_id = 42 
    GROUP BY s2.customer_id
  ) maxids ON c.id = maxids.cid
  LEFT JOIN sales as s ON s.id = maxids.maxid
WHERE
  c.id = 42 ;

Here are the tables CREATE:

CREATE TABLE `customers` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `name` varchar(120) COLLATE utf8_spanish_ci NOT NULL,
 PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=22149 DEFAULT CHARSET=utf8 COLLATE=utf8_spanish_c

CREATE TABLE `sales` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `customer_id` int(11) DEFAULT NULL,
 `status` tinyint(3) unsigned DEFAULT NULL,
 `buydate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
 `cost` decimal(10,2) NOT NULL,
 PRIMARY KEY (`id`),
 KEY `customer_key` (`customer_id`) USING BTREE,
 KEY `status_key` (`status`) USING BTREE,
 CONSTRAINT `sales_ibfk_1` FOREIGN KEY (`customer_id`) REFERENCES `customers` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=11555 DEFAULT CHARSET=utf8 COLLATE=utf8_spanish_ci

And here their EXPLAINS:

id  select_type     table   partitions  type    possible_keys   key     key_len     ref     rows    filtered    Extra
1   PRIMARY c   NULL    const   PRIMARY PRIMARY 4   const   1   100.00  NULL
1   PRIMARY s   NULL    const   PRIMARY,customer_key    PRIMARY 4   const   1   100.00  NULL
2   DEPENDENT SUBQUERY  NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    Select tables optimized away

id  select_type     table   partitions  type    possible_keys   key     key_len     ref     rows    filtered    Extra
1   PRIMARY c   NULL    const   PRIMARY PRIMARY 4   const   1   100.00  NULL
1   PRIMARY <derived2>  NULL    system  NULL    NULL    NULL    NULL    1   100.00  NULL
1   PRIMARY s   NULL    const   PRIMARY PRIMARY 4   const   1   100.00  NULL
DERIVED s2  NULL    ref customer_key    customer_key    5   const   11  100.00

Best Answer

I sometimes find this technique useful for judging the performance of a query, or for comparing two possible queries. It is especially helpful before the table is large enough to get reliable timings.

FLUSH STATUS;
SELECT ... ;
SHOW SESSION STATUS LIKE 'Handler%';

That will give a few counters. If I see numbers that are about equal to the number of rows in a table I am selecting from, then I know there was a table scan. Numbers near the resultset size tend to be more beneficial. Handler...write... indicates that a temp table needed to be generated, and the number says how big it was.

Even without understanding what the numbers represent, it is often 'obvious' which of two SELECTs is better.