MySQL Optimization – How to Repeat the Same Function in a Query

MySQLoptimization

In the below query, there are repeated calculations such as the three calls to SUM(p.amount). Does MySQL re-calculate for each function call or is there some kind of memoization optimization under the hood? If not, how can this kind of query be optimized for maximum performance?

It seems like it would be faster after the first calculation to get the next one by the alias name, total_payemnts, but that just throws an error.

SELECT LEFT(c.last_name, 1) AS 'last_names',
    SUM(p.amount) AS 'total_payments', 
    COUNT(p.rental_id) AS 'num_rentals',
    SUM(p.amount) / COUNT(p.rental_id) AS 'avg_pay'
FROM customer c
JOIN payment p ON p.customer_id = c.customer_id
GROUP BY LEFT(c.last_name, 1)
ORDER BY SUM(p.amount) DESC;

This query runs on the MySQL Sakila sample database.

Best Answer

A definite answer is really difficult. However, some trials might give us some hints.

We start with a few data:

CREATE TABLE customers
(
    customer_id integer PRIMARY KEY,
    last_name varchar(50) NOT NULL
) ;

CREATE TABLE payments
(
    rental_id integer not null,
    customer_id integer not null, --  references customer(customer_id),
    amount numeric(10,2) not null
) ;

INSERT INTO customers 
VALUES 
  (1, 'A - Customer 1'),
  (2, 'B - Customer 2'),
  (3, 'C - Customer 3'),
  (4, 'D - Customer 4') ;

INSERT INTO payments
VALUES
  (1, 1, 900.0),
  (2, 1, 800.0),
  (3, 1, 500.0),
  (4, 1, 900.0),
  (5, 2, 500.0),
  (6, 2, 500.0) ;

We first define a DETERMINISTIC (although it has side effects) function to log calls:

CREATE FUNCTION log_numeric(_n numeric(10,2), _counter character(2))
    RETURNS numeric(10,2)
    LANGUAGE SQL
    DETERMINISTIC
    BEGIN
      UPDATE log SET n = n + 1 WHERE counter = _counter ;
      RETURN _n ;
    END ;

We have previously defined a log table, to just store counts:

CREATE TABLE log
(
    counter character(2) primary key,
    n integer default 0
) ENGINE=MyISAM;
INSERT INTO log 
VALUES
    ('x', 0),
    ('L', 0),
    ('T', 0),
    ('N', 0),
    ('L2', 0),
    ('T2', 0),
    ('N2', 0) ;

At this point, if MySQL were able to just perform the common subexpression elimination optimization (common for compilers), it would just make ONE call when we issue:

SELECT
    log_numeric(30, 'x') AS c1,
    log_numeric(30, 'x') AS c2 ;

Because log_numeric(30, 'x') would return the same value in both cases (because the function is deterministic).

In fact, it makes two calls to this function, as proven by watching counter 'x';

SELECT
    *
FROM
    log
WHERE
    counter = 'x' ;

[NOTE: Tested with MySQL 5.7.12]

It might be the case that MySQL has decided that the log_numeric function has some side effects, or that User Defined Functions do not work the same way as built-in functions. Built-in functions could be optimized, whereas UDF would not. In any case, if MySQL would have optimized the two expressions with just one call, it would be clear that this optimization is standard. As it did not, we can make an educated guess (which is not the same as a proof) and assume that the functions are called as many times as they're present.

The same can be done with the case posted:

SELECT 
    log_char(LEFT(c.last_name, 1), 'L') AS last_names,
    log_numeric(SUM(p.amount), 'T') AS total_payments, 
    log_numeric(COUNT(p.rental_id), 'N') AS num_rentals,
    log_numeric(SUM(p.amount), 'T') / log_numeric(COUNT(p.rental_id), 'N') AS avg_pay
FROM 
    customers c
    JOIN payments p ON p.customer_id = c.customer_id
GROUP BY 
    log_char(LEFT(c.last_name, 1), 'L')
ORDER BY 
    log_numeric(SUM(p.amount), 'T') DESC;

And the results point to all calls being evaluated as many times as they appear.

You can check all theses tests at http://rextester.com/ZKS91815.

NOTE: For curiosity, postgreSQL (9.6.1) does a little better: it recognizes the ORDER BY and GROUP BY as "already computed" on the SELECT, but it doesn't optimize either for Common subexpression elimination. Room for improvement.

Alternative query avoiding repeating functions

If you're worried things are computed more than necessary, use:

SELECT
    last_names, total_payments, num_rentals,
    total_payments / num_rentals AS avg_pay
FROM
(
    SELECT 
        log_char(LEFT(c.last_name, 1), 'L') AS last_names,
        log_numeric(SUM(p.amount), 'T') AS total_payments, 
        log_numeric(COUNT(p.rental_id), 'N') AS num_rentals
    FROM 
        customers c
        JOIN payments p ON p.customer_id = c.customer_id
    GROUP BY 
        log_char(LEFT(c.last_name, 1), 'L')    
) AS q0
ORDER BY
    total_payments DESC ;

I honestly don't think the difference will be much. The weight of the functions in the time spent for the whole query is probably (ver) small. It might be signficant if the functions were complex. Depending on how MySQL deals with Subqueries (that might be temporary tables), this could actually be worse.

Related Solutions

Mysql – Best solution for search scenario

It depends on the number of articles you expect to have. If you have indices on the columns you are filtering, it will be pretty fast. Joining is very fast and will surely not be the problem.

But your WHERE clause is very intensive. Because you are basically scanning all rows in the table and therefore loading it all from the disk to compare. This cannot be avoided in your case, but it will be the crucial part. Nevertheless, if you aren't going past 100'000 articles I wouldn't be too concerned. Still - test it, if you want to be sure!

However, your use case seems to be indexing document (articles) and get the top x of them, based on a search query. A way to be faster and more efficient is, to create vectors from documents which contain for each word its occurrence. Your search query will be transformed into one too, and then it's all about comparing vectors. (simplified)

You don't have to implement that of course, there are libraries for document indexing and searching. Maybe you want to have a look at lucene (or maybe you can even use solr which should do everything for you). Just to give you some ideas, with those tools you can build your own little 'google'. Lucene would of course build its own (noSql) database / index. So you would maybe need to separate this functionality from the rest of your application (I don't know the context).

ORA-00937 – Calculating Sub-Value in Aggregate SELECT Statement

Your additional calculated value:

(select count(t1.num) from T t1)

Is a scalar subquery, which is a dynamic rather than static expression. As such it's treated the same as a column as far as the aggregate is concerned and needs to be included in the group by clause to avoid the ORA-00937: not a single-group group function error

However, oracle does not allow subqueries as part of the group by clause and trying to include the scalar subquery and/or the whole case statement:

group by (case (select count(*) cnt from t t1) when 0 then 1 else 0 end)

just results in an ORA-22818: subquery expressions not allowed here error.

The only ways around this are to either convert your scalar subquery to an aggregate value like so:

max(case (select count(*) cnt from t t1) when 0 then 1 else 0 end)

(case max((select count(*) cnt from t t1)) when 0 then 1 else 0 end)

or rewrite your query to move the unaggregated scalar subquery out of the aggregated query:

select (case (select count(*) cnt from t t1) when 0 then 1 else 0 end) * sum
  from (select sum(t3.num) sum from t t3) t2;

or precompute your scalar subquery so it can be used in the group by clause:

select case t1.cnt when 0 then 1 else 0 end * sum(t2.num)
  from t t2
     , (select count(*) cnt from t) t1
 group by case t1.cnt when 0 then 1 else 0 end

Best Answer

Alternative query avoiding repeating functions

Related Solutions

Mysql – Best solution for search scenario

ORA-00937 – Calculating Sub-Value in Aggregate SELECT Statement

Related Question