Hacker News
Allstate compares SAS, Hadoop and R for Big-Data Insurance Models (r-bloggers.com)
21 points by ArchieIndian on Jan 25, 2013 | 15 comments



1) This is just ad copy

2) It is dishonest ad copy

This model could have been estimated with the biglm library. Revolution's claim that they are the only game in town for big data computing with R is bullshit.
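
For what it's worth, here's a minimal sketch of what that can look like with the biglm package; the data frame, formula, and chunk size below are made up for illustration and are not the model from the article:

    library(biglm)

    # Synthetic stand-in data; the article's actual variables aren't reproduced here.
    set.seed(1)
    n <- 1e6
    d <- data.frame(
      claims = rpois(n, 0.05),
      age    = sample(18:80, n, replace = TRUE),
      state  = factor(sample(state.abb, n, replace = TRUE))
    )

    # bigglm() works through the data in chunks, so the full design matrix is
    # never held in memory at once.
    fit <- bigglm(claims ~ age + state, data = d,
                  family = poisson(), chunksize = 1e5)
    summary(fit)

For data that genuinely won't fit in memory, bigglm() can also pull chunks from a chunk-returning function or a database connection instead of an in-memory data frame.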


I did a similar analysis several years ago. The two fastest solutions that didn't involve writing lower-level code aren't mentioned: a specialized program targeted at the insurance industry called Emblem, and Stata. If I recall correctly, they ran as fast on a single system as RevoScaleR does on a cluster. Of course, comparing different benchmarks is fraught with peril.

The SAS performance reported is surprisingly bad, and is likely a result of the data processing steps involved. A SAS procedure is much more similar to a program than to a function. While SAS wasn't the fastest solution, most of its slowness was due to building the design matrix rather than running the actual regression. There are a number of options both for tuning how this is done and for caching this preprocessing. The biggest difference would come from using SSDs on the machine, which seems unlikely given the results.


Stata won't work on data that don't fit in memory. It is fast, but it isn't an option for very large datasets.


True, but 150MM records with 70 degrees of freedom would likely be doable with 16GB of RAM, and almost certainly with 32GB, since most of the variables are likely boolean and so can be stored in a single byte, which Stata supports (unlike R and SAS).
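
Rough arithmetic, assuming one byte per predictor and an 8-byte response column (the response size is my guess):

    # Back-of-envelope memory estimate: 150M rows, ~70 one-byte predictors,
    # plus one 8-byte response column.
    rows  <- 150e6
    bytes <- rows * (70 * 1 + 8)
    bytes / 2^30   # ~10.9 GiB: tight on 16GB, comfortable on 32GB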


He forgot 5) Write it in C/C++. 150M records is not that large and using Hadoop, which is generally used for I/O bound problems, seems like overkill. A lot of these problems can be avoided by simply dropping down to a lower level language. For example, I was able to write a C implementation of a matrix factorization algorithm (100M records) that ran on my laptop in ~5 minutes. The same algorithm took over 24 hours to run on a Mahout/Hadoop cluster (it also cost about $30 to run on AWS EMR).


C/C++ shouldn't generally be needed. The computationally intensive parts of these systems are already optimized. What makes a given system slow is either poorly optimized numerical code (less common) or doing a bunch of needless repetitive work, because the system doesn't make it easy to separate the different components of the modeling.

They also won't even be seriously considered at most large companies, because the typical person in that role doesn't have the skills, and a single person using a different solution makes it difficult to transition work.


I wonder how these would have compared to Spark (http://spark-project.org/). Hadoop isn't really an ideal framework for this type of computation.


This seems like spam, but I'll leave it to someone with more domain experience to flag, if so.


This is my area of expertise and I concur that the article is a thinly-veiled advertisement, but my account is relatively new and I don't seem to have the ability to flag it.


No, r-bloggers is a well-regarded voluntary aggregator.


Go ahead, downvote me more. But r-bloggers asks permission before they aggregate; that is what I consider good, non-spammy behavior. And the article may not be to your taste, but that is what business presentations at a conference like Strata are: we tried a few things that didn't work, and a tool we like. Yes, there are other tools that would also work (they should be included, and would be insisted on in true academic work). But saying "I feel it is spam" is weak. Either say "it is spam, and here are a few of my reasons" (and flag it), or hold off on commenting.


5) Take a random sample of 50,000 records and get 90% of the business intelligence value in about 2 seconds.

If you're fitting a model with only 70 degrees of freedom then analyzing 150M records is a complete waste of time.


This is absolutely not true for insurance data, where the task is to predict expected losses per policy and (in any given year) perhaps only 1% of policies will have any losses at all. Even if your statement were true, this sort of analysis has nothing to do with business intelligence. The goal is to minimize adverse selection in a competitive marketplace. There is no such thing as "good enough". (If there were, I would be out of a job.)


Try stratified sampling. Removing records without claims only increases the variance of the denominator, which is much less variable to begin with. You can actually eliminate the majority of the data and get results that are the same to several decimal places. Note that this only works with very large datasets without extremely high-cardinality variables.

That said, 50,000 is too few. For a dataset of this size, 20 million records is likely more reasonable. The actual answer depends on the variance of the individual predictors and their correlation with each other.
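
Roughly, the downsample-and-reweight scheme being described looks like this; the synthetic data, column names, and keep rate are all invented for illustration:

    # Keep every record with a claim, subsample the no-claim records, and give
    # the kept no-claim records a weight of 1/keep_rate so they still stand in
    # for the full book.
    set.seed(1)
    n <- 1e5
    policies <- data.frame(
      claim_count = rbinom(n, 1, 0.01),   # ~1% of policies have a claim
      exposure    = runif(n, 0.5, 1),
      age         = sample(18:80, n, replace = TRUE)
    )

    keep_rate <- 0.1                      # retain 10% of no-claim records
    no_claim  <- which(policies$claim_count == 0)
    kept      <- c(which(policies$claim_count > 0),
                   sample(no_claim, round(keep_rate * length(no_claim))))

    sampled   <- policies[kept, ]
    sampled$w <- ifelse(sampled$claim_count > 0, 1, 1 / keep_rate)

    # Frequency model on the reduced data, with sampling weights and an
    # exposure offset.
    fit <- glm(claim_count ~ age, data = sampled, family = poisson(),
               weights = w, offset = log(exposure))
    summary(fit)

Refitting the same model on the full data should give coefficients that agree to several decimal places, which is the point above.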


Smells like a bullshit paid article.



