Spot your churning users sooner with statistics

The girl (or guy) you're dating has called you on average five times a week lately. As a neurotic analytics enthusiast/stalker you happen to know this. Now it's been three days without a call. Should you be worried?



Remember Gamma-distributions? 




If you modeled the delays between calls from a number of different girls you're simultaneously courting (you stud), and you had a lot of girls calling you, the distribution of those pooled delays would likely be exponential. However, if you are modeling an individual girl's call delays, the events are no longer independent, so a Gamma distribution is the more likely fit. (Read more about the different distributions.)

To use the Gamma distribution you need two parameters: the shape α = k and one of scale θ, rate β = 1/θ, or mean μ = k/β. You need to do some work to figure out which parameter values fit your data best. Let's assume our shape is k = 2 and our mean delay is μ = 1.75 days, which means our rate is β = 2 / 1.75 ≈ 1.14 and our scale is θ = 1.75 / 2 = 0.875.
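The conversions between these parameters are simple enough to sketch in a few lines of Python. This is only an illustration: the function name gamma_params_from_mean is made up for this post, and the shape k = 2 is the same assumption as above, not something estimated from data.

```python
def gamma_params_from_mean(shape, mean):
    """Convert (shape k, mean mu) into (rate beta, scale theta).

    Uses mu = k / beta and theta = 1 / beta, as defined in the text.
    """
    if shape <= 0 or mean <= 0:
        raise ValueError("shape and mean must be positive")
    rate = shape / mean    # beta = k / mu
    scale = mean / shape   # theta = mu / k = 1 / beta
    return rate, scale


# The dating example: k = 2, mean delay of 1.75 days.
rate, scale = gamma_params_from_mean(shape=2, mean=1.75)
print(round(rate, 2), scale)  # 1.14 0.875
```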

Let's check how likely a delay of three days or more is. There's a handy online tool for that. Or if you have R, just run this:
> pgamma(3, 2, scale=0.875, lower.tail = FALSE)
[1] 0.1436329

... which is to say that with these parameters over 14% of the delays are longer than three days. Sounds reasonable. But what does this have to do with your business? And is it anybody's business but yours whether she calls you or not?

How your users are dating you


You can apply the same logic to, for example, how often a user signs in or does something else you consider valuable. And you can easily apply it at the level of an individual user, getting a quick read on whether their feelings for you are mellowing. If a user tends to sign into your service every three days (μ = 3, and if we again assume k = 2, then θ = 1.5) and you haven't seen them for a week (waiting time = 7), you can say that statistically this is getting rare. Only about 5 out of 100 delays in your user's distribution will be this long. But what does that mean?

That most likely they're on vacation and they'll still come back to you.

Until they don't. The longer you wait, the more certain you can be that they're not coming back.
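To get a feel for how quickly a long silence becomes unusual, here is a rough Python sketch of the same tail probability that pgamma(..., lower.tail = FALSE) returns in R. It uses only the standard library and relies on the closed form that holds when the shape k is a whole number, which is the case for the k = 2 assumed in this post. The function name gamma_tail is made up for illustration. Keep in mind that the result is the share of delays longer than the wait under the fitted distribution, not the probability that the user has left for good.

```python
import math


def gamma_tail(wait, shape, scale):
    """P(delay > wait) for a Gamma distribution with integer shape.

    For integer k the tail is exp(-x) * sum(x**n / n!, n = 0..k-1),
    where x = wait / scale. Non-integer shapes need a proper
    incomplete gamma function, which this sketch does not provide.
    """
    if shape < 1 or int(shape) != shape:
        raise ValueError("this closed form only holds for integer shape >= 1")
    if wait < 0 or scale <= 0:
        raise ValueError("wait must be >= 0 and scale must be > 0")
    x = wait / scale
    return math.exp(-x) * sum(x**n / math.factorial(n) for n in range(int(shape)))


# A user who signs in every three days on average (k = 2, theta = 1.5):
for days in (3, 7, 14, 21):
    print(days, gamma_tail(days, shape=2, scale=1.5))
```

With a wait of 7 days this gives roughly 0.05, the "5 out of 100" from above, and the value keeps shrinking as the silence gets longer.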

How quickly can you spot a churn?


Let's assume that you define churn as a user who doesn't come back for a while, e.g. a week. For a fixed shape k, the Gamma tail probability depends only on the ratio of the current delay to the user's average delay. If you pick a few dates at random from the past, compute that ratio for all of your users on those dates, and use it in a logistic regression model to predict whether they will churn, you will find that it has plenty of predictive power. And why shouldn't it? After all, it is just a more accurate measure of how long a particular user has been away relative to their own habits.
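The feature and the label for such a model could be computed along these lines. This is a sketch under assumptions that are not from any real system: each user's history is a plain list of sign-in dates, churn means no sign-in within seven days after the snapshot date, and the names delay_ratio and churned are made up for this post. Fitting the logistic regression itself is left to whatever statistics package you already use.

```python
from datetime import date


def delay_ratio(sign_ins, snapshot):
    """Current delay divided by the user's average past delay, as of snapshot."""
    past = sorted(d for d in sign_ins if d <= snapshot)
    if len(past) < 2:
        return None  # not enough history to estimate an average delay
    gaps = [(later - earlier).days for earlier, later in zip(past, past[1:])]
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap == 0:
        return None  # all sign-ins on the same day, ratio undefined
    current_delay = (snapshot - past[-1]).days
    return current_delay / mean_gap


def churned(sign_ins, snapshot, window_days=7):
    """Label: True if the user has no sign-in within window_days after snapshot."""
    return not any(0 < (d - snapshot).days <= window_days for d in sign_ins)


# Hypothetical user: signs in every three days, then goes quiet for a while.
history = [date(2024, 1, 1), date(2024, 1, 4), date(2024, 1, 7), date(2024, 1, 20)]
snapshot = date(2024, 1, 14)
print(delay_ratio(history, snapshot), churned(history, snapshot))
```

Each (user, snapshot date) pair yields one row of training data: the ratio as the predictor and the churn flag as the outcome.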

So what then?


It helps to understand how your users differ from each other. You might still choose to use the same definition of being active for all your users, but at least you will understand how that choice affects your numbers. Using statistics like this will most likely give you some early warning of churn. It is also quite powerful to point out a few users who were active and then quit cold turkey. This raises a lot of interesting questions: What happened? What could we have done to prevent it? Should we ask them? Were there earlier signs that we could have noticed?

Indeed, were there?
