Calculating active users and churn in a database


Have you ever struggled with calculating your active users per day, or month? How about your new activations or churned users? It's a common problem, but often the first solution you find will not scale.

You should avoid sub queries, and use a window function instead. But that's only a part of the solution.

Assuming your active user definition is "someone who has signed in within 7 days" and you have this data in a table with columns user_id, timestamp. The first thing you want to do is calculate when a user activates for the first time, when they churn and when they resurrect.

CREATE OR REPLACE VIEW "vw_login" AS 
 SELECT *
    , LEAST (LEAD("timestamp") OVER w, "timestamp" + 7) AS "activeExpiry"
    , CASE WHEN LAG("timestamp") OVER w IS NULL THEN true ELSE false AS "activated"
    , CASE
 WHEN LEAD("timestamp") OVER w IS NULL THEN true
 WHEN LEAD("timestamp") OVER w - "timestamp" > 7 THEN true
 ELSE false
 END AS "churned"
    , CASE
 WHEN LAG("timestamp") OVER w IS NULL THEN false
 WHEN "timestamp" - LAG("timestamp") OVER w <= 7 THEN false
 WHEN row_number() OVER w > 1 THEN true
 ELSE false
 END AS "resurrected"
   FROM "login"
   WINDOW w AS (PARTITION BY "user_id" ORDER BY "timestamp")
With this, you've added three boolean columns to your login data: activated, churned and resurrected, which tell you when a user activates for the first time, after which login they churn and which login makes them resurrect. Now your data is in such format that you can calculate active users over any date.

To calculate your daily active users you can apply the following query:
SELECT
SUM(SUM(activated::int - churned::int + resurrected::int)) OVER w
, "timestamp"::date FROM vw_login
 GROUP BY "timestamp"::date
 WINDOW w AS (ORDER BY "timestamp"::date)
 ORDER BY "timestamp"::date;
Hold on. No you can't.

For churn, you also need to take into account the time-shift: your row indicates that the user churns after that login. The day your user becomes inactive is after 7 days of the login, not on the day of the login. To accomplish this efficiently it's best to first create a view that aggregates the data to a daily level and then do a join.

-- Aggregate daily
CREATE OR REPLACE VIEW "vw_activity" AS
SELECT 
    SUM("activated"::int) "activated"
  , SUM("churned"::int) "churned"
  , SUM("resurrected"::int) "resurrected"
  , "timestamp"::date "date"
  FROM "vw_login"
  GROUP BY "timestamp"::date "date"
  ;
For the running total you should generate a time series first to ensure that all the dates you are joining the churn to exist. Also, take note on wrapping your values with a COALESCE function as if your SUM includes NULL values, it will return null.
-- Calculate a running total
SELECT
 d."date"
 , SUM(COALESCE(a.activated::int,0)
   - COALESCE(a2.churned::int,0)
   + COALESCE(a.resurrected::int,0)) OVER w
 , d."date", a."activated", a2."churned", a."resurrected" FROM
 generate_series('2004-03-07'::date, CURRENT_DATE, '1 day'::interval) d
 LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
 LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '7 days')::date
 WINDOW w AS (ORDER BY d."date") ORDER BY d."date";
That's it. That will give you the number of active users per day and the number of users who activated, churned or were resurrected on that day. It's now up to you to figure out what your definition for an active user is and if there are dimensions you need to add. But the basic logic is clear enough.

Let me know if you found this useful!

Comments

Popular posts from this blog

How to access AWS S3 with pyspark locally using AWS profiles tutorial