MySQL Optimization – How to Repeat the Same Function in a Query

MySQLoptimization

In the below query, there are repeated calculations such as the three calls to SUM(p.amount). Does MySQL re-calculate for each function call or is there some kind of memoization optimization under the hood? If not, how can this kind of query be optimized for maximum performance?

It seems like it would be faster after the first calculation to get the next one by the alias name, total_payemnts, but that just throws an error.

SELECT LEFT(c.last_name, 1) AS 'last_names',
    SUM(p.amount) AS 'total_payments', 
    COUNT(p.rental_id) AS 'num_rentals',
    SUM(p.amount) / COUNT(p.rental_id) AS 'avg_pay'
FROM customer c
JOIN payment p ON p.customer_id = c.customer_id
GROUP BY LEFT(c.last_name, 1)
ORDER BY SUM(p.amount) DESC;

This query runs on the MySQL Sakila sample database.

Best Answer

A definite answer is really difficult. However, some trials might give us some hints.

We start with a few data:

CREATE TABLE customers
(
    customer_id integer PRIMARY KEY,
    last_name varchar(50) NOT NULL
) ;

CREATE TABLE payments
(
    rental_id integer not null,
    customer_id integer not null, --  references customer(customer_id),
    amount numeric(10,2) not null
) ;

INSERT INTO customers 
VALUES 
  (1, 'A - Customer 1'),
  (2, 'B - Customer 2'),
  (3, 'C - Customer 3'),
  (4, 'D - Customer 4') ;

INSERT INTO payments
VALUES
  (1, 1, 900.0),
  (2, 1, 800.0),
  (3, 1, 500.0),
  (4, 1, 900.0),
  (5, 2, 500.0),
  (6, 2, 500.0) ;

We first define a DETERMINISTIC (although it has side effects) function to log calls:

CREATE FUNCTION log_numeric(_n numeric(10,2), _counter character(2))
    RETURNS numeric(10,2)
    LANGUAGE SQL
    DETERMINISTIC
    BEGIN
      UPDATE log SET n = n + 1 WHERE counter = _counter ;
      RETURN _n ;
    END ;

We have previously defined a log table, to just store counts:

CREATE TABLE log
(
    counter character(2) primary key,
    n integer default 0
) ENGINE=MyISAM;
INSERT INTO log 
VALUES
    ('x', 0),
    ('L', 0),
    ('T', 0),
    ('N', 0),
    ('L2', 0),
    ('T2', 0),
    ('N2', 0) ;

At this point, if MySQL were able to just perform the common subexpression elimination optimization (common for compilers), it would just make ONE call when we issue:

SELECT
    log_numeric(30, 'x') AS c1,
    log_numeric(30, 'x') AS c2 ;

Because log_numeric(30, 'x') would return the same value in both cases (because the function is deterministic).

In fact, it makes two calls to this function, as proven by watching counter 'x';

SELECT
    *
FROM
    log
WHERE
    counter = 'x' ;

[NOTE: Tested with MySQL 5.7.12]

It might be the case that MySQL has decided that the log_numeric function has some side effects, or that User Defined Functions do not work the same way as built-in functions. Built-in functions could be optimized, whereas UDF would not. In any case, if MySQL would have optimized the two expressions with just one call, it would be clear that this optimization is standard. As it did not, we can make an educated guess (which is not the same as a proof) and assume that the functions are called as many times as they're present.

The same can be done with the case posted:

SELECT 
    log_char(LEFT(c.last_name, 1), 'L') AS last_names,
    log_numeric(SUM(p.amount), 'T') AS total_payments, 
    log_numeric(COUNT(p.rental_id), 'N') AS num_rentals,
    log_numeric(SUM(p.amount), 'T') / log_numeric(COUNT(p.rental_id), 'N') AS avg_pay
FROM 
    customers c
    JOIN payments p ON p.customer_id = c.customer_id
GROUP BY 
    log_char(LEFT(c.last_name, 1), 'L')
ORDER BY 
    log_numeric(SUM(p.amount), 'T') DESC;

And the results point to all calls being evaluated as many times as they appear.

You can check all theses tests at http://rextester.com/ZKS91815.


NOTE: For curiosity, postgreSQL (9.6.1) does a little better: it recognizes the ORDER BY and GROUP BY as "already computed" on the SELECT, but it doesn't optimize either for Common subexpression elimination. Room for improvement.

Alternative query avoiding repeating functions

If you're worried things are computed more than necessary, use:

SELECT
    last_names, total_payments, num_rentals,
    total_payments / num_rentals AS avg_pay
FROM
(
    SELECT 
        log_char(LEFT(c.last_name, 1), 'L') AS last_names,
        log_numeric(SUM(p.amount), 'T') AS total_payments, 
        log_numeric(COUNT(p.rental_id), 'N') AS num_rentals
    FROM 
        customers c
        JOIN payments p ON p.customer_id = c.customer_id
    GROUP BY 
        log_char(LEFT(c.last_name, 1), 'L')    
) AS q0
ORDER BY
    total_payments DESC ;

I honestly don't think the difference will be much. The weight of the functions in the time spent for the whole query is probably (ver) small. It might be signficant if the functions were complex. Depending on how MySQL deals with Subqueries (that might be temporary tables), this could actually be worse.