Mysql – Group data by Custom Period Ranges Using a Reference Date

MySQL

If for example, I have a table that looks like this:

+----+--------+---------------------+
| id | volume |      createdAt      |
+----+--------+---------------------+
|  1 |   0.11 | 2018-01-26 13:56:01 |
|  2 |   0.34 | 2018-01-28 14:22:12 |
|  3 |   0.22 | 2018-03-11 11:01:12 |
|  4 |   0.19 | 2018-04-13 12:12:12 |
|  5 |   0.12 | 2014-04-21 19:12:11 |
+----+--------+---------------------+

I want to perform a query that can accept starting point and then loop through a given number of days, and then group by that date range.

For instance, I'd like the result to look like this:

+------------+------------+--------+
| enddate    | startdate  | volume |
+------------+------------+--------+
| 2018-04-25 | 2018-04-12 |   0.31 |
| 2018-04-11 | 2018-03-29 |   0.00 |
| 2018-03-28 | 2018-03-15 |   0.00 |
| 2018-03-14 | 2018-03-01 |   0.22 |
| 2018-02-28 | 2018-02-15 |   0.00 |
| 2018-02-14 | 2018-02-01 |   0.00 |
| 2018-01-31 | 2018-01-18 |   0.45 |
| ...        | ...        |    ... |
+------------+------------+--------+

In essence, I want to be able to input a start_date e.g 2018-04-25, a time_interval e.g. 14, like in the illustration above and then the query will sum the volumes in that time range.

I know how to use INTERVAL with the DATE_SUB() and the DATE_ADD() functions but I cannot figure out how to perform the loop I think is necessary.

Please help.

Best Answer

I hope I understand this right and you can use this for at least a start.

SELECT x.startdate + INTERVAL (x.days + 1) DAY enddate,
       x.startdate,
       (SELECT sum(t.volumes)
               FROM tab t
               WHERE t.createdAt >= x.startdate
                     AND t.createdAt < x.startdate + INTERVAL (x.days + 1) DAY) volume
       FROM (SELECT '2018-04-12' startdate,
                    12 days
             UNION
             SELECT '2018-03-29' startdate,
                    12 days
             UNION
             SELECT '2018-03-15' startdate,
                    12 days
             UNION
             SELECT '2018-03-01' startdate,
                    12 days
             UNION
             SELECT '2018-02-15' startdate,
                    12 days
             UNION
             SELECT '2018-02-01' startdate,
                    12 days
             UNION
             SELECT '2018-01-18' startdate,
                    12 days) x;

I haven't quite gotten on how you'd like to specify the dates and intervals though. Maybe you can clarify that, than I can probably edit this in.

SUGGESTED QUERY

SELECT
    CONCAT(A.ndx,' - ',A.ndx+9) "score range",
    IFNULL(B.rowcount ,0) "number of occurrences"
FROM
(
    SELECT 0 ndx UNION SELECT 10 UNION SELECT 20 UNION
    SELECT 30 UNION SELECT 40 UNION SELECT 50 UNION SELECT 60
    UNION SELECT 70 UNION SELECT 80 UNION SELECT 90
) A
LEFT JOIN
(
    SELECT ndx,COUNT(1) rowcount FROM
    (SELECT FLOOR(score/10)*10 ndx FROM yoshi_scores) AA
    GROUP BY ndx
) B USING (ndx);

SAMPLE DATA

use test
DROP TABLE IF EXISTS yoshi_scores;
CREATE TABLE yoshi_scores
(id int not null auto_increment,
score int not null,primary key (id), key (score));
INSERT INTO yoshi_scores (score) VALUES
(97),(74),(22),(98),(65),(62),(47),(64),(82),( 8),(60),(12),
(27),(14),(13),(28),(60),(12),(27),(34),(32),(89),(15),( 2);

Data would look like this

mysql> SELECT * FROM yoshi_scores ORDER BY id;
+----+-------+
| id | score |
+----+-------+
|  1 |    97 |
|  2 |    74 |
|  3 |    22 |
|  4 |    98 |
|  5 |    65 |
|  6 |    62 |
|  7 |    47 |
|  8 |    64 |
|  9 |    82 |
| 10 |     8 |
| 11 |    60 |
| 12 |    12 |
| 13 |    27 |
| 14 |    14 |
| 15 |    13 |
| 16 |    28 |
| 17 |    60 |
| 18 |    12 |
| 19 |    27 |
| 20 |    34 |
| 21 |    32 |
| 22 |    89 |
| 23 |    15 |
| 24 |     2 |
+----+-------+
24 rows in set (0.00 sec)

SUGGESTED QUERY EXECUTED

mysql> SELECT
    ->     CONCAT(A.ndx,' - ',A.ndx+9) "score range",
    ->     IFNULL(B.rowcount ,0) "number of occurrences"
    -> FROM
    -> (
    ->     SELECT 0 ndx UNION SELECT 10 UNION SELECT 20 UNION
    ->     SELECT 30 UNION SELECT 40 UNION SELECT 50 UNION SELECT 60
    ->     UNION SELECT 70 UNION SELECT 80 UNION SELECT 90
    -> ) A
    -> LEFT JOIN
    -> (
    ->     SELECT ndx,COUNT(1) rowcount FROM
    ->     (SELECT FLOOR(score/10)*10 ndx FROM yoshi_scores) AA
    ->     GROUP BY ndx
    -> ) B USING (ndx);
+-------------+-----------------------+
| score range | number of occurrences |
+-------------+-----------------------+
| 0 - 9       |                     2 |
| 10 - 19     |                     5 |
| 20 - 29     |                     4 |
| 30 - 39     |                     2 |
| 40 - 49     |                     1 |
| 50 - 59     |                     0 |
| 60 - 69     |                     5 |
| 70 - 79     |                     1 |
| 80 - 89     |                     2 |
| 90 - 99     |                     2 |
+-------------+-----------------------+
10 rows in set (0.00 sec)

mysql>

Why is the Query designed this way ??? Look at the first subquery

(
    SELECT 0 ndx UNION SELECT 10 UNION SELECT 20 UNION
    SELECT 30 UNION SELECT 40 UNION SELECT 50 UNION SELECT 60
    UNION SELECT 70 UNION SELECT 80 UNION SELECT 90
) A

I performed a LEFT JOIN of this to the counts for a reason. Notice that the dataset has nothing in the 50 - 59 range. Its count would not show up in the second query:

mysql> SELECT ndx,COUNT(1) rowcount FROM
    -> (SELECT FLOOR(score/10)*10 ndx FROM yoshi_scores) AA
    -> GROUP BY ndx;
+------+----------+
| ndx  | rowcount |
+------+----------+
|    0 |        2 |
|   10 |        5 |
|   20 |        4 |
|   30 |        2 |
|   40 |        1 |
|   60 |        5 |
|   70 |        1 |
|   80 |        2 |
|   90 |        2 |
+------+----------+
9 rows in set (0.00 sec)

mysql>

I am sure you wanted the range 50 - 59 to show up, so the query is designd to catch all ranges. Any missing range is essentially defaulted to zero.

GIVE IT A TRY !!!

CAVEAT: Unfortunately, the first subquery requires you to hardcode it. It will be the only hardcoding required to make it work.

MySQL – Using Composite Indexes with Date Ranges for Performance

It's hard to tell if this will hurt or improve performance but adding a composite index on start_date and end_date shouldn't worsen your query, the index will either be used or not used. You didn't mention your RDBMS but I don't think there's a big chance your engine's optimizer will pick a worse plan (but there is always a possibility I guess).

That being said, there might a slight impact on the plan generation time as your optimizer has an extra index to consider, and unused indexes could hurt overall performance since they need to be maintained when updating/inserting.

Whether or not the index will help your query is hard to tell, it depends on the selectivity of your query and cardinality statistics. The optimizer of your RDBMS should pick the fastest way to get your data, and even if an index exists on the columns your query selects on a full scan might still be the better option, especially since you are selecting all fields.

Why is selectivity important?

When an index is used, the index points to the actual record and the record might need to be fetched using the pointer. Most RDBMS work that way.

See for example the Oracle documentation

In general, index access paths should be used for statements that retrieve a small subset of table rows, while full scans are more efficient when accessing a large portion of the table

or the SQL Server documentation

In many cases, the optimizer forces a table scan for queries with result sets of approximately 5 percent, although the table scan becomes more efficient than index access at selectivities of 8 to 10 percent.

Why is select * important

If you fetch all fields from a table your database engine will have to fetch the actual record using the pointer stored in the index and you lose the option of using included columns (if your RDBMS supports them) or covering indexes.

See for example this article for SQL server (but the principle goes for a lot of vendors): Using Covering Indexes to Improve Query Performance

However, under some circumstances, the overhead associated with nonclustered indexes may be deemed too great by the query optimizer and SQL Server will resort to a table scan to resolve the query.

and

The observed improvement is due to the fact that the nonclustered index contained all of the required information to resolve the query. No Key Lookups were required.

Best Answer

Related Solutions

Mysql – “GROUP BY” in ranges

SUGGESTED QUERY

SAMPLE DATA

SUGGESTED QUERY EXECUTED

GIVE IT A TRY !!!

MySQL – Using Composite Indexes with Date Ranges for Performance

Related Question