Mysql – Select user activity over past months grouped by cohorts

MySQLselect

In MySQL 5.7, I have a table user that contains the columns registered and lastLogin. My goal is to group the users by calendar months of registration and then select for each group how many of these users were still active after 1 month, after 2 months and so on.

20,000 users have registered in June 2015, from these 10,000 have been active in July 2015 or later, 5,000 have been active in August 2015 or later, etc. Continue this output for all users registered in the past months until today.

I understand how I can select any of the activity months by hand, for example select all users that were active one calendar month after their registration:

SELECT year(registered), month(registered), count(id)
FROM user
WHERE lastLogin >= DATE_ADD(DATE_SUB(date(registered), INTERVAL DAYOFMONTH(registered) - 1 DAY), INTERVAL 1 MONTH)
GROUP BY year(registered), month(registered)

This would give me the number of users who were active one month after their registration or later.

But in addition I would also like to select the number of users who were active 2 months after their registration, 3 months after their registration and so on up to the current month.

The results would be something like this (not necessarily look like this, but it should get the idea across):

Columns:      Cohort |    total | 1 month | 2months | 3months | ... | month n 
             07/2015 |   20,000 | 10,100  | 5,500   | 2,600   | ... | 150
             08/2015 |   21,000 | 9,800   | 4,300   | 1,300   | ... | 180
             09/2015 |   19,700 | 11,400  | 6,200   | 3,500   | ... | 200
    ...
             08/2017 |   25,300 |  13,000 
             09/2017 |   12,500 

How do I change the query to select all months of activity instead of a specific month?

Best Answer

With the help of Rick to look for pivot table I was able to create the final query as described in this post: https://stackoverflow.com/questions/7674786/mysql-pivot-table

Query for a 6 months analysis:

SELECT 
    YEAR(registered), MONTH(registered), COUNT(*) as total,
    SUM(lastLogin >= DATE(registered) - INTERVAL DAYOFMONTH(registered) - 1 DAY + INTERVAL 1 MONTH) AS month1,
    SUM(lastLogin >= DATE(registered) - INTERVAL DAYOFMONTH(registered) - 1 DAY + INTERVAL 2 MONTH) AS month2,
    SUM(lastLogin >= DATE(registered) - INTERVAL DAYOFMONTH(registered) - 1 DAY + INTERVAL 3 MONTH) AS month3,
    SUM(lastLogin >= DATE(registered) - INTERVAL DAYOFMONTH(registered) - 1 DAY + INTERVAL 4 MONTH) AS month4,
    SUM(lastLogin >= DATE(registered) - INTERVAL DAYOFMONTH(registered) - 1 DAY + INTERVAL 5 MONTH) AS month5,
    SUM(lastLogin >= DATE(registered) - INTERVAL DAYOFMONTH(registered) - 1 DAY + INTERVAL 6 MONTH) AS month6
FROM
    user
WHERE 
    registered >= "2015-01-01"
GROUP BY YEAR(registered) , MONTH(registered)

The part DATE_SUB(DATE(registered), INTERVAL DAYOFMONTH(registered) - 1 DAY) will return the first of a month to make sure the analysis is based on calendar months. I am not sure if it could be simplified. I was looking at STR_TO_DATE(LEFT(registered, 7), "%y-%m") but I was reading working on Strings is slower and should be avoided. If someone knows a better / cleaner way of doing that, please comment and I will include it in the answer.

Here is some exemplary output of the query:

  year, month, total , month1, month2, month3, month4, month5, month6
'2015',  '1', '14776', '4302', '3225', '2827', '2547', '2330', '2163'
'2015',  '2', '12162', '3859', '2866', '2465', '2259', '2046', '1890'
'2015',  '3', '10770', '3831', '2841', '2507', '2288', '2135', '1987'
'2015',  '4', '15685', '3731', '2641', '2273', '2045', '1872', '1730'
'2015',  '5', '18130', '3686', '2403', '1993', '1744', '1528', '1363'