MySQL – Aggregating sum day by day from multiple joins returns unexpected results

Tags: aggregate, mysql, mysql-5.6

I'm trying to make a table where the task hours and group hours are aggregated day by day.
I'm able to get the wanted result for both tasks and groups, but when I try to get them in the same query, I get unexpected results.

Here's my test:

Sample data:

CREATE TABLE IF NOT EXISTS `groups` (
  `id` int(11) NOT NULL,
  `name` varchar(200) NOT NULL,
  `hours` float NOT NULL,
  `created` datetime NOT NULL
);


INSERT INTO `groups` (`id`, `name`, `hours`, `created`) VALUES
(1, 'Description of job 1', 11, '2014-12-02 10:09:52'),
(2, 'Description of job 2', 10, '2014-12-04 10:09:52'),
(3, 'Description of job 3', 25, '2014-12-11 10:09:52');


CREATE TABLE IF NOT EXISTS `tasks` (
  `id` int(7) NOT NULL,
  `groupid` int(11) NOT NULL,
  `hours` int(5) NOT NULL,
  `text` text NOT NULL,
  `created` datetime NOT NULL
);

INSERT INTO `tasks` (`id`, `groupid`, `hours`, `text`, `created`) VALUES
(1, 1, 1, 'Some task on job 1', '2014-12-03 10:10:00'),
(2, 1, 2, 'Some task on job 1', '2014-12-04 10:10:00'),
(3, 1, 3, 'Some task on job 1', '2014-12-10 10:10:00'),
(4, 2, 5, 'Some task on job 2', '2014-12-05 10:10:00'),
(5, 2, 5, 'Some task on job 2', '2014-12-06 10:10:00'),
(6, 2, 1, 'Some task on job 2', '2014-12-08 10:10:00');


CREATE TABLE IF NOT EXISTS `datetable` (
  `thedate` datetime NOT NULL
);

INSERT INTO `datetable` (`thedate`) VALUES
('2014-11-28 00:00:00'),
('2014-11-29 00:00:00'),
('2014-11-30 00:00:00'),
('2014-12-01 00:00:00'),
('2014-12-02 00:00:00'),
('2014-12-03 00:00:00'),
('2014-12-04 00:00:00'),
('2014-12-05 00:00:00'),
('2014-12-06 00:00:00'),
('2014-12-07 00:00:00'),
('2014-12-08 00:00:00'),
('2014-12-09 00:00:00'),
('2014-12-10 00:00:00'),
('2014-12-11 00:00:00'),
('2014-12-12 00:00:00'),
('2014-12-13 00:00:00');

Running the query below gives me, for each day, the running total of hours for all tasks created on that day or on any earlier day.

SELECT
DATE_FORMAT(dt.thedate, '%Y-%m-%d') as the_date,
SUM(tt.sum_task) as sum_t

FROM datetable dt

LEFT JOIN   (
                SELECT
                DATE(tx.created) as created_date,
                SUM(tx.hours) as sum_task

                FROM tasks tx

                 -- Some extra where clauses here

                GROUP BY tx.created
            )
            AS tt ON DATE(tt.created_date) <= DATE(dt.thedate)


GROUP BY dt.thedate

ORDER BY dt.thedate ASC

Fiddle: http://sqlfiddle.com/#!2/38fe4/2/0

Good. Now I want the same type of column with the groups, so I add it the same way:

SELECT
DATE_FORMAT(dt.thedate, '%Y-%m-%d') as the_date,
SUM(tt.sum_task) as sum_t,
SUM(tg.sum_group) as sum_g

FROM datetable dt

LEFT JOIN   (
                SELECT
                DATE(tx.created) as created_date,
                SUM(tx.hours) as sum_task

                FROM tasks tx

                 -- Some extra where clauses here

                GROUP BY tx.created
            )
            AS tt ON DATE(tt.created_date) <= DATE(dt.thedate)

LEFT JOIN   (
                SELECT
                DATE(gx.created) as created_date2,
                SUM(gx.hours) as sum_group

                FROM groups gx

                 -- Some extra where clauses here

                GROUP BY gx.created
            )
            AS tg ON DATE(tg.created_date2) <= DATE(dt.thedate)

GROUP BY dt.thedate

ORDER BY dt.thedate ASC

Fiddle: http://sqlfiddle.com/#!2/38fe4/1

But now the numbers seem to be added to themselves multiple times.

When I query with only one LEFT JOIN, I get the result I want, but when I try to join them both, I get the unexpected results.

What exactly is happening here, and how can I output the group hours and task hours without the totals being multiplied like this?

Expected result:

the_date        sum_t   sum_g
2014-11-28      NULL    NULL
2014-11-29      NULL    NULL
2014-11-30      NULL    NULL
2014-12-01      NULL    NULL
2014-12-02      NULL    11
2014-12-03      1       11
2014-12-04      3       21
2014-12-05      8       21
2014-12-06      13      21
2014-12-07      13      21
2014-12-08      14      21
2014-12-09      14      21
2014-12-10      17      21
2014-12-11      17      46
2014-12-12      17      46
2014-12-13      17      46

Actual result:

the_date        sum_t   sum_g
2014-11-28      NULL    NULL
2014-11-29      NULL    NULL
2014-11-30      NULL    NULL
2014-12-01      NULL    NULL
2014-12-02      NULL    11
2014-12-03      1       11
2014-12-04      6       42
2014-12-05      16      63
2014-12-06      26      84
2014-12-07      26      84
2014-12-08      28      105
2014-12-09      28      105
2014-12-10      34      126
2014-12-11      51      276
2014-12-12      51      276
2014-12-13      51      276

Best Answer

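I first thought that you needed to GROUP BY the datetable in a derived table as well, to avoid a cross join, but datetable already has distinct dates, so that is not the cause. The problem is the <= condition in more than one join. For a given date, every matching task row is paired with every matching group row, which is a kind of cross join: each task sum is counted once per matching group row, and each group sum once per matching task row. On 2014-12-04, for example, 2 task rows and 2 group rows match, so the query returns 2 * (1 + 2) = 6 and 2 * (11 + 10) = 42 instead of 3 and 21.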
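The multiplication can be reproduced on a cut-down copy of the sample data. The sketch below is not the original setup: it assumes an in-memory SQLite database as a stand-in for MySQL 5.6, keeps only the `hours` and `created` columns, stores dates as plain text, and joins the base tables directly (every `created` date is unique here, so the pre-aggregating derived tables would change nothing). The names `conn` and `rows` are just illustrative.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the join fan-out behaves the same way.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datetable (thedate TEXT NOT NULL);
CREATE TABLE tasks     (hours INTEGER NOT NULL, created TEXT NOT NULL);
CREATE TABLE "groups"  (hours INTEGER NOT NULL, created TEXT NOT NULL);

INSERT INTO datetable VALUES ('2014-12-03'), ('2014-12-04'), ('2014-12-05');
INSERT INTO tasks     VALUES (1, '2014-12-03'), (2, '2014-12-04'), (5, '2014-12-05');
INSERT INTO "groups"  VALUES (11, '2014-12-02'), (10, '2014-12-04');
""")

# Two independent "<=" joins against the same date row: every matching
# task row is paired with every matching group row before SUM runs.
rows = conn.execute("""
SELECT dt.thedate,
       COUNT(*)     AS joined_rows,
       SUM(t.hours) AS sum_t,
       SUM(g.hours) AS sum_g
FROM datetable AS dt
LEFT JOIN tasks    AS t ON t.created <= dt.thedate
LEFT JOIN "groups" AS g ON g.created <= dt.thedate
GROUP BY dt.thedate
ORDER BY dt.thedate
""").fetchall()

for row in rows:
    print(row)
```

For 2014-12-04 this gives 4 joined rows (2 tasks times 2 groups) with sums 6 and 42, and for 2014-12-05 it gives 6 joined rows with sums 16 and 63, the same inflated values as in the question's actual result.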

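So, the solution is to do the LEFT JOIN with the <= condition inside each derived table, and at the outer level to join the derived tables with plain equality (=) and no GROUP BY.

Another issue comes from grouping by the raw datetime column instead of using GROUP BY DATE(DateColumn). Use the latter, and then the ON conditions can be simplified as well.

I also changed the DATE(tx.created) <= DATE(dt.thedate) condition to tx.created < (dt.thedate + INTERVAL 1 DAY), so that an index on (created, hours) can be used; wrapping the column in DATE() prevents that. Note that the COALESCE calls make days with no earlier rows show 0 rather than NULL: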

SELECT
DATE_FORMAT(dt.thedate, '%Y-%m-%d') AS the_date,
COALESCE(tt.sum_task, 0) AS sum_t,
COALESCE(tg.sum_group, 0) AS sum_g

FROM datetable AS dt

LEFT JOIN   (                                  -- this could be an INNER JOIN, no difference
                SELECT
                dt.thedate,
                SUM(tx.hours) as sum_task

                FROM datetable AS dt
                  LEFT JOIN tasks AS tx
                    ON tx.created < (dt.thedate + INTERVAL 1 DAY)

                 -- Some extra where clauses here   -- move them to the ON above

                GROUP BY dt.thedate
            )
            AS tt ON tt.thedate = dt.thedate

LEFT JOIN   (                                 -- this could be an INNER JOIN, no difference
                SELECT
                dt.thedate,
                SUM(gx.hours) as sum_group

                FROM datetable AS dt
                  LEFT JOIN groups AS gx
                    ON gx.created < (dt.thedate + INTERVAL 1 DAY)

                 -- Some extra where clauses here   -- move them to the ON above

                GROUP BY dt.thedate
            )
            AS tg ON tg.thedate = dt.thedate

ORDER BY dt.thedate ;

Tested at SQLFiddle
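For a quick local check without MySQL, the same query shape can be run against the question's sample data in SQLite. This is only a sketch under assumptions: SQLite replaces MySQL, so `dt.thedate + INTERVAL 1 DAY` becomes `datetime(d.thedate, '+1 day')` and `DATE_FORMAT` becomes `date()`; only the `hours` and `created` columns are kept; the inner alias is renamed to `d` for readability; and `conn`, `days`, and `result` are illustrative names.

```python
import sqlite3
from datetime import date, timedelta

# Assumption: SQLite stands in for MySQL 5.6; datetimes are stored as text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datetable (thedate TEXT NOT NULL);
CREATE TABLE tasks     (hours INTEGER NOT NULL, created TEXT NOT NULL);
CREATE TABLE "groups"  (hours INTEGER NOT NULL, created TEXT NOT NULL);
INSERT INTO tasks VALUES
  (1, '2014-12-03 10:10:00'), (2, '2014-12-04 10:10:00'),
  (3, '2014-12-10 10:10:00'), (5, '2014-12-05 10:10:00'),
  (5, '2014-12-06 10:10:00'), (1, '2014-12-08 10:10:00');
INSERT INTO "groups" VALUES
  (11, '2014-12-02 10:09:52'), (10, '2014-12-04 10:09:52'),
  (25, '2014-12-11 10:09:52');
""")

# One row per day from 2014-11-28 to 2014-12-13, as in the sample datetable.
days = [(date(2014, 11, 28) + timedelta(days=i)).isoformat() + " 00:00:00"
        for i in range(16)]
conn.executemany("INSERT INTO datetable VALUES (?)", [(d,) for d in days])

# Each derived table computes its own running total per date; the outer
# joins use equality only, so no row is paired more than once.
query = """
SELECT date(dt.thedate)          AS the_date,
       COALESCE(tt.sum_task, 0)  AS sum_t,
       COALESCE(tg.sum_group, 0) AS sum_g
FROM datetable AS dt
LEFT JOIN (
    SELECT d.thedate, SUM(tx.hours) AS sum_task
    FROM datetable AS d
    LEFT JOIN tasks AS tx ON tx.created < datetime(d.thedate, '+1 day')
    GROUP BY d.thedate
) AS tt ON tt.thedate = dt.thedate
LEFT JOIN (
    SELECT d.thedate, SUM(gx.hours) AS sum_group
    FROM datetable AS d
    LEFT JOIN "groups" AS gx ON gx.created < datetime(d.thedate, '+1 day')
    GROUP BY d.thedate
) AS tg ON tg.thedate = dt.thedate
ORDER BY dt.thedate
"""
result = {d: (t, g) for d, t, g in conn.execute(query)}

for the_date, sums in result.items():
    print(the_date, *sums)
```

The totals match the expected result from the question (for example 3 and 21 on 2014-12-04, 17 and 46 on 2014-12-11), with 0 in place of NULL for the days before the first entry.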