Mysql query – updating each record in table with sum, count of previous records

MySQLmysql-5.7

We have a mysql transaction table where every run registration attempt (paystatus = completed, pending, incomplete, etc) is recorded with a bunch of other data. I want to create additional columns in the same table and update them with aggregate values calculated from previous rows (e.g – same runner can come to multiple runs, get aggregate of previous run records for same person into current record).

Unique columns –

  • regid – primary key
  • combination of email+firstname (not exactly accurate but we use this
    assumption for a unique person)

Source table structure (sqlfiddle – http://sqlfiddle.com/#!9/ff191d/6) and example data

source example data - 
+---------------------------------------------------------------------------------------------------------------+
| regid | registeredon | paystatus | firstname |    email   | amountpaid | totalruns | totalamount | influencer |
+---------------------------------------------------------------------------------------------------------------+
|  a2   |  25-05-2018  | completed |    jay    |    j@x.c   |    100     |     0     |      0      |     0      |
|  z2   |  14-03-2018  | completed |    jay    |    j@x.c   |    100     |     0     |      0      |     0      |
|  y2   |  23-02-2018  | completed |  vijay    |    v@z.c   |    150     |     0     |      0      |     0      |
|  y6   |  21-02-2018  |  pending  |  vijay    |    v@z.c   |     0      |     0     |      0      |     0      |
|  x2   |  14-01-2018  | completed |  vijay    |    v@z.c   |    100     |     0     |      0      |     0      |
|  x6   |  11-01-2018  |  pending  |  vijay    |    v@z.c   |    100     |     0     |      0      |     0      |
|  a1   |  23-05-2018  | completed |   ajay    |    a@x.c   |    120     |     0     |      0      |     0      |
|  a5   |  21-05-2018  |  pending  |   ajay    |    a@x.c   |     0      |     0     |      0      |     0      |
|  z1   |  04-03-2018  | completed |   ajay    |    a@x.c   |    100     |     0     |      0      |     0      |
|  z5   |  02-03-2018  |  pending  |   ajay    |    a@x.c   |     0      |     0     |      0      |     0      |
|  y1   |  03-02-2018  | completed |   ajay    |    a@x.c   |    150     |     0     |      0      |     0      |
|  x1   |  02-01-2018  | completed |   ajay    |    a@x.c   |    100     |     0     |      0      |     0      |
|  x5   |  01-01-2018  |  pending  |   ajay    |    a@x.c   |     0      |     0     |      0      |     0      |
+---------------------------------------------------------------------------------------------------------------+

The aggregate value columns to be calculated are (count only successful registrations for all i.e paystatus = completed)

  • totalruns – Total runs runner has participated in, including current one.
  • totalamount – Sum of all amount runner has paid for current and previous runs. Total amount runner has paid till that run including current row.
  • influencer – count of successful registrations for that email (not including firstname like above)

Expected result data after calculations

+---------------------------------------------------------------------------------------------------------------+
| regid | registeredon | paystatus | firstname |    email   | amountpaid | totalruns | totalamount | influencer |
+---------------------------------------------------------------------------------------------------------------+
|  a2   |  25-05-2018  | completed |    jay    |    j@x.c   |    100     |     2     |     200     |     1      |
|  z2   |  14-03-2018  | completed |    jay    |    j@x.c   |    100     |     1     |     100     |     1      |
|  y2   |  23-02-2018  | completed |  vijay    |    v@z.c   |    150     |     2     |     250     |     2      |
|  y6   |  21-02-2018  |  pending  |  vijay    |    v@z.c   |     0      |     1     |     100     |     1      |
|  x2   |  14-01-2018  | completed |  vijay    |    v@z.c   |    100     |     1     |     100     |     1      |
|  x6   |  11-01-2018  |  pending  |  vijay    |    v@z.c   |    100     |     1     |     100     |     0      |
|  a1   |  23-05-2018  | completed |   ajay    |    a@x.c   |    120     |     4     |     470     |     4      |
|  a5   |  21-05-2018  |  pending  |   ajay    |    a@x.c   |     0      |     3     |     350     |     3      |
|  z1   |  04-03-2018  | completed |   ajay    |    a@x.c   |    100     |     3     |     350     |     3      |
|  z5   |  02-03-2018  |  pending  |   ajay    |    a@x.c   |     0      |     2     |     250     |     2      |
|  y1   |  03-02-2018  | completed |   ajay    |    a@x.c   |    150     |     2     |     250     |     2      |
|  x1   |  02-01-2018  | completed |   ajay    |    a@x.c   |    100     |     1     |     100     |     1      |
|  x5   |  01-01-2018  |  pending  |   ajay    |    a@x.c   |     0      |     0     |      0      |     0      |
+---------------------------------------------------------------------------------------------------------------+

Explanation – totalamount, totalruns, influencer for regid x5 is 0 because it was not a successful registration (paystatus = pending). x1 aggregate values are of the same record as it is the first successful registration. y1 aggregate values are sum of values for x1 and y1. z5 has aggregate values of previous successful registration, instead this can be left at 0 too (if update is run only for paystatus=completed).

I have been able to get to this query with the help of a friend, but it is not correct (sum of all records with email+firstname combination is shown instead of only older rows with that combination) –

    SELECT 
t1.regid, registeredon, paystatus, firstname, email, 
amountpaid, SumTotalruns, SumTotalamount
FROM test t1, 
(select t2.regid,
    concat(t2.email, t2.firstname) tempid,  
    COUNT(*) SumTotalruns, 
    SUM(t3.amountpaid) SumTotalamount
    from test t2 left join test t3
    on concat(t2.email , t2.firstname) = concat(t3.email , t3.firstname)
        and t2.paystatus = 'completed'
        and t2.registeredon < t3.registeredon
    group by concat(t2.email, t2.firstname)
    ) tmp
where concat(t1.email , t1.firstname) = tmp.tempid
order by id desc

Output of above query –

regid|   registeredon |   paystatus   |   firstname   |   email   |   amountpaid   |   SumTotalruns   |   SumTotalamount
a1   |   2018-05-05   |   completed   |   ajay        |   a@x.c   |   120          |   15             |   710
z2   |   2018-03-14   |   completed   |   jay         |   j@x.c   |   100          |   2              |   100
z1   |   2018-03-18   |   completed   |   ajay        |   a@x.c   |   100          |   15             |   710
y6   |   2018-02-21   |   pending     |   vijay       |   v@z.c   |   0            |   5              |   150
y2   |   2018-02-23   |   completed   |   vijay       |   v@z.c   |   150          |   5              |   150
y1   |   2018-02-03   |   completed   |   ajay        |   a@x.c   |   150          |   15             |   710
x6   |   2018-01-11   |   pending     |   vijay       |   v@z.c   |   100          |   5              |   150
x5   |   2018-01-01   |   pending     |   ajay        |   a@x.c   |   0            |   15             |   710
x2   |   2018-01-14   |   completed   |   vijay       |   v@z.c   |   100          |   5              |   150
x1   |   2018-01-11   |   completed   |   ajay        |   a@x.c   |   100          |   15             |   710
a5   |   2018-05-03   |   pending     |   ajay        |   a@x.c   |   0            |   15             |   710
a2   |   2018-05-23   |   completed   |   jay         |   j@x.c   |   100          |   2              |   100
z5   |   2018-03-15   |   pending     |   ajay        |   a@x.c   |   0            |   15             |   710

All these update queries will be added to a stored procedure, so if looping is better, that can be used too. Can you please help with finding the most efficient way to get values for these? Will appreciate your help!

Figured the query with the help of another friend

    update test a join
(SELECT
    outers.regid,
    outers.registeredon, outers.paystatus, outers.firstname, outers.email,
    outers.amountpaid,
    sum(InnerAmount) SumTotalAmount,
    count(1) SumTotalRuns
    from test as outers,
        (select id,
        test.regid,concat(test.email, test.firstname) tempid,
        test.amountpaid InnerAmount,
        test.registeredon
        from test
        where test.paystatus = 'completed'
    ) inner_table
    where concat(outers.email,outers.firstname) = inner_table.tempid
    and outers.registeredon >= inner_table.registeredon
    group by regid
) b
on a.regid=b.regid
set a.totalruns = b.SumTotalRuns,
a.totalamount = b.SumTotalAmount

But running this on the main table (total 12.5k rows and 7.6k rows with paystatus='completed') is taking 40+ seconds.
Any help in optimizing this query?

Best Answer

The query offered by Your friend

    update test a join
(SELECT
    outers.regid,
    outers.registeredon, outers.paystatus, outers.firstname, outers.email,
    outers.amountpaid,
    sum(InnerAmount) SumTotalAmount,
    count(1) SumTotalRuns
    from test as outers,
        (select id,
        test.regid,concat(test.email, test.firstname) tempid,
        test.amountpaid InnerAmount,
        test.registeredon
        from test
        where test.paystatus = 'completed'
    ) inner_table
    where concat(outers.email,outers.firstname) = inner_table.tempid
    and outers.registeredon >= inner_table.registeredon
    group by regid
) b
on a.regid=b.regid
set a.totalruns = b.SumTotalRuns,
a.totalamount = b.SumTotalAmount

is close to optimal, needed a lot of minor modifications.

  1. Fields id and regid in inner_table are not used in outer code and must be removed.

  2. Using CONCAT() function prevents using index (if exists) and must be replaced with separate fields in both places it occured.

  3. Fields registeredon, paystatus, firstname, email, amountpaid from outer SELECT statement are not used in UPDATE and must be removed.

  4. To optimize joining process in outer SELECT You may create index by (`firstname`, `email`) or maybe even by (`firstname`, `email`, `registeredon`).

So the result is:

CREATE INDEX idx_firstname_email ON test (firstname, email /* , registeredon */ );

and

UPDATE test a 
JOIN ( SELECT outers.regid,
              SUM(InnerAmount) SumTotalAmount,
              COUNT(1) SumTotalRuns
       FROM test AS outers,
            (SELECT test.email, 
                    test.firstname,
                    test.amountpaid InnerAmount,
                    test.registeredon
             FROM test
             WHERE test.paystatus = 'completed'
            ) inner_table
       WHERE outers.email = inner_table.email 
         AND outers.firstname = inner_table.firstname
         AND outers.registeredon >= inner_table.registeredon
       GROUP BY regid
     ) b ON a.regid=b.regid
SET a.totalruns = b.SumTotalRuns,
    a.totalamount = b.SumTotalAmount;