MySQL – How to Get Rolling Total Query to Sum Correctly

MySQLsum

I have a problem with the following query:

SELECT  t1.idcliente,       
        c.nombre,c.apellido,
        t1.fecha,
        t1.haber,
        t1.debe,       
        SUM(t2.haber -t2.debe) as SUM
FROM estado_cliente t1
INNER JOIN estado_cliente t2 ON t1.idestado_cliente >= t2.idestado_cliente 
INNER JOIN cliente c ON t1.idcliente= c.idcliente
where t1.idcliente=t2.idcliente  
GROUP BY t1.idcliente ,
         t1.fecha,
         t1.haber,
         t1.debe,
         c.nombre,c.apellido
ORDER BY t1.idcliente,t1.fecha
;

See this sqlfiddle.

Here's part of the results:

idcliente  nombre  apellido  fecha       haber  debe   SUM
   6       Matias  Blanco    2017-07-03  0       8100   -8100
   6       Matias  Blanco    2017-07-08  0      15000  -23100
   6       Matias  Blanco    2017-09-24  4312       0  -38788    <--- problem
   6       Matias  Blanco    2017-09-25  0      10000  -33100
   6       Matias  Blanco    2017-09-26  0      10000  -43100

Each result row should have the SUM column return the cumulative total of all haber values minus all the debe values, up to and including the given date. These represent payments and debits, and I want the total amount still outstanding on that date.

However in the third line, that's not what I'm getting. The results should look like:

idcliente  nombre  apellido  fecha       haber  debe   SUM
   6       Matias  Blanco    2017-07-03  0       8100   -8100
   6       Matias  Blanco    2017-07-08  0      15000  -23100
   6       Matias  Blanco    2017-09-24  4312       0  -18788  <-- correct 
   6       Matias  Blanco    2017-09-25  0      10000  -28788
   6       Matias  Blanco    2017-09-26  0      10000  -38788

Given the sqlfiddle link above, can anyone help me figure out what I'm doing wrong?

Best Answer

The problem with your query is simply that you're not actually processing the records in the order you want to.

Your query picks up all records with an idestado_cliente less than the current row. This would work fine if all records were entered in order strictly by date. however, as we see in your test data, they are not; the last three entries are for earlier dates than the previous entries for each cliente.

In order to deal with that, we need to modify the INNER JOIN to the t2 version of estado_cliente, as follows:

INNER JOIN estado_cliente t2 ON (   t1.fecha > t2.fecha
                                 OR (    t1.fecha = t2.fecha
                                     AND t1.idestado_cliente >= t2.idestado_cliente
                                    )
                                )

Here, we're grabbing the current record, any records with an earlier date, and any records with the same date, but an earlier idestado_cliente.

This will give you the results you want.

That said, as pointed out by Rick James, this isn't necessarily the best way to do it. In your query, you are selecting a second set of rows from estado_cliente for every row you want to show. You don't need to do that; instead, you can just touch each row once in order to get the totals you need, using variables.

You can store the current row's idcliente and sum in variables. Then, when you go to process the next row, you can check if the idcliente has changed (in which case you need to start over), and then calculate this row's sum (using the old sum if the idcliente didn't change, and starting from 0 again if it did).

This version of the query looks like this:

SET @idcliente = 0;
SET @sum := 0;

SELECT c.idcliente,
        c.nombre,
        c.apellido,
        summed.fecha,
        summed.haber,
        summed.debe,
        summed.`SUM`
FROM (SELECT  t1.fecha,
              t1.haber,
              t1.debe,
              t1.idestado_cliente,
              @sum := (  CASE WHEN @idcliente = t1.idcliente THEN @sum ELSE 0 END
                       + t1.haber - t1.debe )
              as `SUM`,
              @idcliente := t1.idcliente as idcliente
      FROM estado_cliente t1
      ORDER BY t1.idcliente,t1.fecha,t1.idestado_cliente
      ) summed
INNER JOIN cliente c ON summed.idcliente= c.idcliente
ORDER BY c.idcliente,summed.fecha,summed.idestado_cliente
;

This updated sqlfiddle link includes all three versions of the query:

  • the original query, as you posted it (left in primarily to ensure that all values are indeed in memory for the following two queries);
  • the fixed version of the original query (using the self-join); and
  • the final version of the query, using variables.

I've run it several times, and even considering the small amount of data involved, the version of the query using variables always seems to run at least as fast as the version that uses a self-join, and usually runs 1 millisecond faster.


Follow-up:

In another use case, I will need to run that query for a specific date range only. For best performance, should I make a temporary table filtering estado_cliente by date, and then use that temp table in the query?

First, and most important: Your rolling sum will be based on the rows used in the query only. You may already have realized that, but just in case, I thought I should cover it.

If you processed the sample data from the question, but only included rows dated September 1-25, 2017, you won't get:

idcliente  nombre  apellido  fecha       haber  debe   SUM
   6       Matias  Blanco    2017-09-24  4312       0  -18788
   6       Matias  Blanco    2017-09-25  0      10000  -28788

Instead, you'll get:

idcliente  nombre  apellido  fecha       haber  debe   SUM
   6       Matias  Blanco    2017-09-24  4312       0    4312
   6       Matias  Blanco    2017-09-25  0      10000   -5688

That's because the rows from July are excluded, so their haber and debe are not calculated into the SUM.

To get the first set of results (with an absolute SUM, where you get the same value for each row even when you exclude older rows from your output), you have to include all records prior to your date range in the full process, and filter out the ones from before the target dates in the final query.

If the second set of results are what you're after (where the SUM is relative to the set of dates being displayed), then you can fully filter out your full date range before calculating the SUM.

In either case, the upper bound of a date range should be applied before calculating the SUM - records past the dates you want have no impact on the SUM for earlier dates, regardless.

This sqlfiddle link shows the same three queries as the previous link, plus versions of the self-join and variable queries that restrict the output to rows from September 1-25, 2017 and still show the "absolute" sums. I've added an initial column to the results of each, to make it clear which query is which.

So, there are a few ways to do the filtering before processing. Let's just consider two: simply filter out the rows by adding WHERE clauses to the existing queries, or write the estado_cliente data for the only the rows we need to process to a temporary table. The 4th and 5th queries in the link just above are using WHERE clauses.

Ultimately, I can only recommend that you try both, and see which performs better for you.

I can provide some guidance as to what I would expect:

  • Writing to disk is generally the most time-consuming operation you can perform in a database (reads from disk are a close second, normally - solid state drives tend to reduce the cost of both of these operations). So, writing the data you need to a temporary table can actually make a query take longer; if the database engine was able to process everything in memory, and didn't need to write any intermediate data to disk, then I would expect writing to the temp table would be worse.

  • If you would be using the temporary table for other queries as well, then creating it could result in a net improvement overall. If there are a large number of estado_cliente columns you won't be using, the temporary table would fit more data into each page, and you'd have fewer pages to bring into memory. If estado_cliente doesn't have many columns beyond the ones you're using, then there is less benefit. The fact that every page has only the rows you want to work with might still mean fewer page to bring into memory. Note that, in the version of the query with the self-join, you are accessing estado_cliente twice, so a temporary table might benefit that version of the query as well.

  • I don't use MySQL heavily, but in MS SQL Server (where I do most of my work), I have sometimes seen queries that completed faster broken into two queries (one that populates a temp table, and one that performs the final steps using the temp table) than as a single query. I believe this is simply a matter of the complexity of the query making it difficult for the database engine to find an optimal execution plan within the time limits it has for that step of things. It does find optimal plans for the two separate queries, so they go fast; it picks the best plan it can for the more complex query in the time allotted, and that plan can turn out to be bad.

  • My "gut instinct" says that the query that uses variables is pretty unlikely to benefit from the temp table (say, 25% chance), and the query using a self-join is somewhat unlikely (say, 45% chance).

But like I said earlier - It's probably worth the time to try both options (filter the unwanted date rows in the query, and write the desired date rows to a temporary table), and see which way seems better with your data and system.