How Bayesian Probability Models Can Make CLV Predictions 12x More Accurate (custora.com)
82 points by cpierson on Feb 27, 2012 | 14 comments

I was a fan of Gelman-style Bayesian modeling for a while, but it's a lot of work when you don't have a True Model.

Have you tried machine learning techniques? You have a ton of predictors available for each customer --- gender, age, price of the first purchase, whether they bought before Christmas, etc. None of those fit neatly into standard models. Why not throw them all into one big linear regression for CLV, then move on to more sophisticated techniques?

Lead data scientist at Custora here.

The problem with linear regression (or any machine learning technique) in CLV prediction is extrapolation. Since we are making a predictive measure, we are projecting out what customers will spend in the future.

You can't construct a valid training dataset to project a two-year CLV if your business is not two years old. You need to make some assumptions about how customers' ordering will continue. We have found that assuming that customer behavior follows the latent attrition model is a robust assumption for most of our clients.

Even if the business has been around for a long time, we have found that customer behavior tends to change over time. In particular, early adopters are often far more valuable than more recent customers, so using the early adopters as the training set leads to misleading results.

To use a machine learning framework, you need to make the assumption that customers who join recently strongly resemble customers with similar attributes who joined in the past. We have found that this is often not a valid assumption, and so we make the simplifying assumptions about customer behavior to add power to our models.

OK, that's a good point.

Have you considered hierarchical modeling, like in Bayesian Data Analysis? I would have lambda and mu drawn from per-company gamma distributions, and have the parameters of these gammas drawn from global distributions (gamma distributions themselves?)
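To make the proposal concrete, here is a minimal generative sketch of that two-level structure. It is only an illustration of the idea in this comment, not anything Custora has described: the function names, the choice of a gamma hyperprior, and its values (shape 2, rate 1) are all assumptions.

```python
import random

def draw_gamma(rng, shape, rate):
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)

def sample_company_params(rng, global_shape=2.0, global_rate=1.0):
    """Top level: draw one company's gamma parameters from global gammas.

    The global hyperprior values are hypothetical placeholders.
    """
    return {
        # (shape, rate) of the per-company gamma over purchase rates lambda
        "lambda": (draw_gamma(rng, global_shape, global_rate),
                   draw_gamma(rng, global_shape, global_rate)),
        # (shape, rate) of the per-company gamma over attrition rates mu
        "mu": (draw_gamma(rng, global_shape, global_rate),
               draw_gamma(rng, global_shape, global_rate)),
    }

def sample_customers(rng, company, n_customers):
    """Bottom level: draw each customer's (lambda, mu) from the company gammas."""
    customers = []
    for _ in range(n_customers):
        lam = draw_gamma(rng, *company["lambda"])
        mu = draw_gamma(rng, *company["mu"])
        customers.append((lam, mu))
    return customers

rng = random.Random(42)
company = sample_company_params(rng)
customers = sample_customers(rng, company, 5)
print(company)
print(customers)
```

Fitting such a model would mean inferring the global and per-company parameters jointly; the sketch only shows the direction in which the draws flow.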

Also, you're using maximum likelihood. Have you done the full MCMC computations? (I don't think that it would make much of a difference - but it's nice to have empirical validation of that)

I would enjoy reading more about the HMM.

We've considered hierarchical modeling, but concluded that there were no real gains. We have enough data from each of our clients to identify the parameters of the model. We will probably add more hierarchical modeling as we improve predictions of seasonality; it will be useful primarily for predicting the 'Christmas effect' for new clients.

The posterior mode of the Pareto/NBD obtained through full MCMC is extremely close to the MLE, and the MLE is much faster to calculate, so we use the MLE. [1]

There has been some work done on using HMM to predict CLV. It turns out that in most cases the Pareto/NBD is a robust model for CLV. [2]

[1] http://dl.acm.org/citation.cfm?id=1305575

[2] http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1904562

ced: We do some post-hoc analysis to find differences on whatever dimensions our clients give us. We have found that in general, the largest effect is the month of acquisition, and so this is the only factor that we include in the model right now.

Thank you for the info.

One last question: do you use gender, age, and other customer-specific predictors in your model? The distribution of lambdas for men and women could vary significantly.

The model you describe in your post seems to be assuming identifiability across customers regardless of when the relationship started. If you are not making inferences about new customers based on similar customers with prior attributes, how are you doing it? The ways I can think of would force great naivete on the inferences about new customers.

We use the gamma prior and assume that the shape parameter remains constant across time, while the scale parameter varies month to month. For each new month there are only two parameters that need to be estimated, one for the attrition rate and one for the purchase rate, so we rarely run into identifiability issues.

The title of the article really hooked me. My background:

  1. I have a degree in mathematics
  2. I've been writing software since high school
  3. I've had to do more CLV calculations than I can count
  4. I've co-founded a retail startup (http://everlane.com)
So, take this critique with an open heart.

Here's a summary of the points below, for those short of attention. The article is caught between writing for a technical and non-technical audience. There are too many technical terms for a non-technical person to make real sense of it, and not enough detail for a technical person to use the article as a reference.

First, and overall, I'm not sure who the target audience is. The topic is technical enough that I'd guess it's for the head of analytics or a data scientist inside a retail company.

If that's the case, the article is really light on detail. There's nothing I can apply immediately to my work, except to Google for some of the phrases in the blog to learn more. There aren't even links to external papers with more detail.

If it's more for marketing people, well, it's too technical. Bayesian? Gamma distribution?

Second, assuming I'm in your target audience, give me the math! Spell it out plainly, even if you don't tell me "why" it works. For me "plainly" means "mathematically" coupled with a plain-English explanation. You can link to external papers for that. But I'm not afraid of a summation or an argmax. In fact, it's much easier for me to understand than someone writing it out in plain English. You throw out "gamma distribution" but don't even link to the definition, explain what it is, or explain how λ and μ fit into it.

"The gamma distribution is a perfect candidate, since it characterizes most customer bases very well."

Why does it characterize most customer bases very well? I really really want to know. The fact that it's a gamma distribution is almost incidental to the deeper point about which distributions characterize which features of customer behavior, and why.

Because of my background I know some of those things and can easily figure out the rest, but you're not making it easy for me.

It's like you go into detail on the soft stuff (where I care less about detail), and skimp on detail on the hard stuff (which is exactly where I want detail).

Third, numbers! The margin of error chart is mostly irrelevant. In any case, it's not detailed enough for me to use to compare these different methods, and at best serves as a way for a non-technical marketer to say, "No, the Bayesian way is better. Look at this chart." Give me a concrete example of applying this technique first, and then give me a concrete example of comparing it to the other techniques.

Fourth, the typeface on the blog is just awful. If you're producing scientific content I'd recommend using a serif typeface. Georgia is ok, but something like Garamond or Century Schoolbook is more similar to Computer Modern (the default LaTeX typeface).

We often struggle writing for both audiences, and your feedback is well taken.

Here's a brief rundown of the math; more details can be found in the papers linked below [1,2].

We assume a latent attrition model: customers purchase with exponentially distributed interpurchase times and have a constant probability of dying. We then assume that the rate parameters of these two distributions are gamma distributed.
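For readers who prefer code, the model in the paragraph above can be sketched as a simulation. This is a minimal sketch under stated assumptions, not Custora's implementation: the function name, the parameter names (r, alpha for the purchase-rate gamma; s, beta for the attrition-rate gamma), the values used, and the choice of weeks as the time unit are all hypothetical.

```python
import random

def simulate_customer(r, alpha, s, beta, horizon, rng):
    """Simulate one customer's purchase times on [0, horizon).

    lam ~ Gamma(shape=r, rate=alpha) is the customer's purchase rate.
    mu  ~ Gamma(shape=s, rate=beta)  is the customer's attrition rate.
    """
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    lam = rng.gammavariate(r, 1.0 / alpha)
    mu = rng.gammavariate(s, 1.0 / beta)

    # A constant hazard of "dying" means an exponentially distributed lifetime.
    lifetime = rng.expovariate(mu)
    active_until = min(lifetime, horizon)

    # While alive, interpurchase times are exponential with rate lam.
    purchases = []
    t = rng.expovariate(lam)
    while t < active_until:
        purchases.append(t)
        t += rng.expovariate(lam)
    return purchases

rng = random.Random(0)
# Hypothetical parameter values; time is measured in weeks.
cohort = [simulate_customer(0.8, 4.0, 0.6, 12.0, 52.0, rng) for _ in range(1000)]
no_repeat = sum(1 for p in cohort if not p)
print("customers with no purchase in the window:", no_repeat)
```

Because every customer draws their own lam and mu, the simulated cohort mixes heavy buyers with customers who never come back, which is the heterogeneity the gamma assumption is meant to capture.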

The gamma distribution is the first choice of distribution because it is the conjugate prior for the exponential distribution. For the Pareto/NBD it means that we can write the likelihood function without having to use quadrature to solve the integral. It is possible that another distribution would work even better, though it would likely be more computationally intensive.
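One ingredient of this can be checked numerically. If a customer's attrition rate mu is gamma distributed with shape s and rate beta, then averaging the exponential survival probability exp(-mu t) over that gamma has the closed form (beta / (beta + t))^s, which is where the "Pareto" in the model's name comes from. The sketch below compares the closed form against brute-force midpoint quadrature. It covers only this one integral, not the full Pareto/NBD likelihood, and the function names and parameter values are assumptions for illustration.

```python
import math

def gamma_pdf(x, shape, rate):
    """Gamma density in the shape/rate parameterization."""
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)

def survival_closed_form(t, s, beta):
    """P(alive at time t) after integrating mu out analytically."""
    return (beta / (beta + t)) ** s

def survival_quadrature(t, s, beta, upper=10.0, steps=100_000):
    """Same quantity via midpoint-rule integration over mu in (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        mu = (i + 0.5) * h
        total += math.exp(-mu * t) * gamma_pdf(mu, s, beta)
    return total * h

# Hypothetical values: shape s = 2, rate beta = 10, time t = 5.
print(survival_closed_form(5.0, 2.0, 10.0))
print(survival_quadrature(5.0, 2.0, 10.0))
```

The closed form is a single expression, while the quadrature version needs a hundred thousand density evaluations per customer, which is the computational saving being described.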

Another nice characteristic over, say, the log-normal is that when the shape parameter is less than 1, the gamma density f(x) satisfies lim_{x -> 0} f(x) = \infty. This is a nice feature for the many customer bases that have many infrequent or one-time customers.

For the percent error numbers, we picked a representative sample of our clients who had over two years of data, and ran the three models with a holdout set of the most recent year. We then compared the performance of the Pareto/NBD against ARPU and against picking the year-old cohort. I uploaded a boxplot of the data, which you might find more informative [3].
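The contrast with the log-normal is easy to see numerically. The sketch below evaluates both densities at points approaching zero; the specific parameters (gamma shape 0.5 and rate 1, log-normal mu 0 and sigma 1) are arbitrary choices for illustration.

```python
import math

def gamma_pdf(x, shape, rate):
    """Gamma density in the shape/rate parameterization."""
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)

def lognormal_pdf(x, mu, sigma):
    """Log-normal density with log-scale mean mu and log-scale std sigma."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

for x in (1e-1, 1e-3, 1e-6):
    print(x, gamma_pdf(x, 0.5, 1.0), lognormal_pdf(x, 0.0, 1.0))
```

As x shrinks, the gamma density with shape below 1 keeps growing, so it can put a lot of mass on customers whose purchase rate is close to zero, while the log-normal density collapses toward zero there.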
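As a rough illustration of this kind of holdout comparison, the sketch below scores a naive ARPU forecast against actual holdout revenue. The exact error metric and the way the ARPU baseline was computed are not spelled out in this thread, so both functions are assumptions, and the numbers are made-up toy values rather than client data.

```python
def percent_error(predicted_total, actual_total):
    """Absolute percent error of a revenue forecast against the holdout actual."""
    return abs(predicted_total - actual_total) / actual_total * 100.0

def arpu_forecast(calibration_revenue, n_customers, calibration_periods, holdout_periods):
    """Naive baseline: assume average revenue per user per period stays flat."""
    arpu_per_period = calibration_revenue / (n_customers * calibration_periods)
    return arpu_per_period * n_customers * holdout_periods

# Toy numbers (hypothetical): 1000 customers spent 120,000 over 12 calibration
# months, then 80,000 over the 12 holdout months.
forecast = arpu_forecast(120_000.0, 1000, 12, 12)
print(forecast, percent_error(forecast, 80_000.0))
```

Any competing model's holdout forecast can be scored with the same percent_error function, which gives a like-for-like comparison across methods.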

Happy to chat more about the math here or by email (aaron@custora.com). Also would love to hear more about your retail startup and your CLV issues around that.

[1] http://www.jstor.org/pss/2631608

[2] http://marketing.wharton.upenn.edu/documents/research/Fader_...

[3] http://blog.custora.com/custora-content/uploads/2012/02/esti... (Note, the boxplot was generated a few months ago from different data, and we've updated the numbers for the blog post)

These details are great! But also too technical for me to follow immediately since I'm not a statistician. It's enough information to point me in the right direction, though.

It's hard for me to express my writerly intuition, here. The key is to (1) define your target audience as concretely as possible and (2) understand what shared vocabulary you have at your disposal.

Saying "Bayesian" or "gamma distribution" is going to put you out of reach of anyone non-technical. Saying "conjugate prior" or "shape parameter" is going to put you out of reach of anyone who isn't a practiced statistician.

I'd aim somewhere in the middle: technical people who aren't afraid of following a well-outlined, mathematical description of a problem they encounter regularly. "Why a gamma distribution?" would be a good footnote, for example, linking to a paper or another blog post of yours that explains it in more detail.

I find the articles that do the best are ones which take a somewhat-complicated topic and explain it, step-by-step, to an intelligent, technical audience. Pretend you were giving a lecture to a room full of HN members -- all technical and versed in basic mathematics, but not practiced statisticians. Write something that is not just a one-off, but could serve as reference material months and years from now.

I'm probably at the upper-end of this target audience in terms of mathematical maturity, in the sense that when you say "conjugate prior" I know what you mean but can't remember the definition off the top of my head. However, I could look it up and understand it instantly. I'd have a harder time understanding why being the conjugate prior of the exponential distribution implies that the gamma distribution is well-suited for modeling customer behavior on an e-commerce site.

(I'd really like to know, though.)

To this day I have people link to articles I wrote 3-4 years ago when talking about certain topics (A/B testing, viral marketing, etc.).

Example: http://blog.socialcam.com/mobile-ab-testing-made-easy

This isn't to say the article SocialCam linked to is something fantastic. The content is really basic stuff anyone who has taken one or two statistics classes knows. Its success is more in how it is written and explained than the content. In other words: digestibility and clarity are features.

Anyhow, I'm done yammering. Nice article! I've printed out a handful of academic papers related to the topic.


I believe the model they're using is called Pareto/NBD and is fleshed out a bit more here:


Indeed. Here is the original paper if you are interested in more details.


Thanks! Interesting articles by the way.
