Postgresql – Optimizing a user retention query

optimizationpostgresql

I'm trying to automatically calculate a table for user retention, so that the end result is something like:

Week | %
0    | 100
1    | 50
2    | 35

For now I'm not bothering with the % and just trying to get a count of users, i.e. users that came back on week 1, week 2, etc.

I have an accounts table and an action table. Currently I don't have indexes in either of these. Both have a user_email that stores the associated user, and both have a timestamp field that's a TIMESTAMP WITH TIME ZONE – for accounts this is the account creation date and for action it's the timestamp of the action.

The two tables are currently super simple:

accounts Table:
timestamp  | timestamp with time zone
user_email |character varying
action Table:
timestamp  | timestamp with time zone
user_email |character varying

That's it, no keys or anything else.

I managed to piece together a query that seems to do this calculation:

WITH weeks AS (
    SELECT (n || 'week')::interval as week from generate_series(0, 3) AS n
)
SELECT 

    week,

    (SELECT count(*) FROM accounts WHERE user_email IN 
        (SELECT user_email FROM action
            WHERE age(action.timestamp, accounts.timestamp) >= week 
            AND age(action.timestamp, accounts.timestamp) < week + '1 week'::interval)) 
            AS "Returning Users"

FROM weeks;

The output here is something like:

week    | Returning Users
0 days  |  100
7 days  |  50
14 days |  35
21 days |  25

The thing is, this takes super long to calculate (with 1300 users and about 100000 actions it takes a few minutes). I'm also convinced there's a more efficient way to calculate this, or that I can at least add indexes somewhere.

Unfortunately my SQL knowledge is quite limited and after looking around for a while I'm still not sure where to start. I think I can add indexes to the timestamp columns? And I'm not sure my way of checking if timestamps fall in certain ranges is correct.

Best Answer

It's a bit hard to tell how your tables are linked together, but I think the following should be more efficient:

WITH weeks AS (
  SELECT make_interval(weeks => n) as week 
  from generate_series(0, 3) AS n
)
SELECT week, 
       count(x.user_email) as "Returning Users"
FROM weeks
  LEFT JOIN (
     SELECT act.user_email, age(act.timestamp, acc.timestamp) as diff
     FROM accounts acc
       JOIN action act on act.user_email = acc.user_email
  ) x ON x.diff >=  week and diff < week + '1 week'::interval
FROM weeks
GROUP BY week;