Using survival plot to analyze churn in Power BI



I did not guess I'd be working with Kaplan-Meier survival plots so soon.

Analyzing churn and trying to figure out what kind of users churn more likely is not so easy. In order to calculate churn, you need to have a good volume of users that you can follow month over month, having a number of active users per month and what share of them are left behind every month. But what if you want to select a different set of users, another segment? Producing an analytics cube with the necessary dimensions takes time. And if you end up with a segment that doesn't have high volumes every month, interpreting the results can be quite tricky.


So needless to say, using churn to analyze segments or micro-segments in your user base is not so very easy. What if there was a tool you could use to quickly analyze churn in any arbitrarily selected group of accounts? For this, retention is a great proxy for churn. And even better, survival plot.

A survival plot will show you what share of signups (y) will survive a certain number of days (x). Compared to the most obvious calculation (out of users who have signed up more than x days ago, what share was still active at x days) the Kaplan-Meier method uses all data and can therefore be more accurate.

With Kaplan-Meier method each individual day is calculated a daily survival rate, and a "to-date" number is the product of the earlier consecutive days. This way, even if only a fraction of users are 100 days old, they still get to contribute to the day 0, 1, 2, 3, etc. survival rates.

This said, I'm sure you can already imagine how the Kaplan-Meier survival rate is calculated, but what if you have an urgent need right now to make this happen in Power BI? It's very simple. You need to have the following columns in your customer dataset:
  1. Have they churned
  2. How many days they (have) survived
You could do a pure DAX solution like described in this post, but the problem I at least was not able to solve is: What if you want to use filters? See, calculated tables are evaluated at data load and they are not affected by filters. If you find a way of doing this with a table variable, let me know. What I eventually ended up doing is using an R-script, this:


library(plyr)
library(reshape2)
library(ggplot2)

km_data <- function(this_data) {

  date_range = data.frame(seq(0, max(this_data$days), by=1))
  colnames(date_range) <- c('days')
  
  summary <- ddply(this_data, .(days), summarize,
                   count = length(id),
                   dns = sum(dns_int)
  )
  patients = sum(summary$count)
  
  merged = merge (x= date_range, y=summary, by="days", all.x=TRUE)
  merged[is.na(merged)] <- 0
  merged$started <- patients - cumsum(merged$count) + merged$count
  merged$survival_rate <- 1 - merged$dns / merged$started
  merged$km <- cumprod(merged$survival_rate)
  merged
}

my_dataset <- data.frame(dataset$days, dataset$dns_int, dataset$id)
my_dataset = my_dataset[complete.cases(my_dataset),]
names(my_dataset) <- c("days", "dns_int", "id")
km1 <- km_data(my_dataset)

km1 = km1[,c("days", "km")]

ggplot(data=km1,
       aes(x=days, y=km)) + geom_line() + scale_y_continuous(name="survival", limits=c(0,1))

That's the simple solution. Things you might still want to consider implementing additionally are:

  • Comparing to a base line
  • Confidence intervals
Let me know if you have an elegant solution to these.

Comments

Popular posts from this blog

How to access AWS S3 with pyspark locally using AWS profiles tutorial