

Allstate compares SAS, Hadoop and R for Big-Data Insurance Models - ArchieIndian
http://www.r-bloggers.com/allstate-compares-sas-hadoop-and-r-for-big-data-insurance-models/

======
dbecker
1) This is just ad copy

2) It is dishonest ad copy

This model could have been estimated with the biglm library. Revolution's
claim that they are the only game in town for big data computing with R is
bullshit.
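
For context on why biglm suffices: it fits a linear model by streaming over the data in chunks and updating sufficient statistics, so the full dataset never has to be in memory at once. A toy Python sketch of that idea for simple regression, with made-up data (biglm itself uses an incremental QR decomposition, not raw sums, but the streaming principle is the same):

```python
# Sketch of the chunked-fitting idea behind R's biglm: accumulate
# running sums over chunks, then solve for the coefficients at the end.
# Only one chunk is ever in memory at a time.

def fit_streaming(chunks):
    # Accumulate sums for simple regression y = a + b*x.
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunks:              # each chunk: a list of (x, y) pairs
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Two "chunks" of points lying exactly on y = 2x + 1 (illustrative data):
chunks = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
a, b = fit_streaming(chunks)
print(a, b)   # intercept 1.0, slope 2.0
```

The same two-pass-free structure is why 150M records is not a memory problem for this class of model.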

------
absherwin
I did a similar analysis several years ago. The two fastest solutions that
didn't involve writing lower-level code aren't mentioned: Emblem, a
specialized program targeted at the insurance industry, and Stata. If I
recall correctly, they ran as fast on a single system as RevoScaleR does on a
cluster. Of course, comparing different benchmarks is fraught with peril.

The SAS performance reported is surprisingly bad and is likely a result of
the data processing steps involved. A SAS procedure is much more similar to a
program than to a function. While SAS wasn't the fastest solution, most of its
slowness was due to building the design matrix rather than running the actual
regression. There are a number of options both for tuning how this is done and
for caching the preprocessing. The biggest difference would come from using
SSDs on the machine, which seems unlikely given the results.

~~~
dbecker
Stata won't work on data that don't fit in memory. It is fast, but it isn't an
option for very large datasets.

~~~
absherwin
True, but 150MM records with 70 degrees of freedom would likely be doable
with 16GB of RAM, and almost certainly with 32GB: most of the variables are
likely boolean and so can be stored in a single byte, which Stata supports,
unlike R and SAS.
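
The arithmetic behind that claim checks out. A back-of-the-envelope sketch, assuming the figures above (150 million records, 70 columns stored as 1-byte integers, as Stata's `byte` storage type allows):

```python
# Memory needed to hold the dataset in RAM under two storage schemes.
records = 150_000_000
columns = 70

byte_gb = records * columns * 1 / 1e9    # Stata byte type: 1 byte/cell
double_gb = records * columns * 8 / 1e9  # R/SAS numeric: 8-byte doubles

print(round(byte_gb, 1))   # 10.5 GB -- fits on a 16 GB machine
print(round(double_gb))    # 84 GB -- not in-memory territory
```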

------
kyt
He forgot 5) Write it in C/C++. 150M records is not that large, and using
Hadoop, which is generally used for I/O-bound problems, seems like overkill. A
lot of these problems can be avoided by simply dropping down to a lower-level
language. For example, I was able to write a C implementation of a matrix
factorization algorithm (100M records) that ran on my laptop in ~5 minutes.
The same algorithm took over 24 hours to run on a Mahout/Hadoop cluster (and
cost about $30 to run on AWS EMR).
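
The core of such an implementation is a tight SGD loop. A minimal sketch in Python (the single-machine loop kyt describes; a C version is essentially the same loop minus interpreter overhead -- the data, rank, and learning rate here are illustrative, not from the thread):

```python
import random

# Rank-2 matrix factorization of a tiny ratings matrix by stochastic
# gradient descent: repeatedly pick an observed (user, item, rating)
# triple and nudge the user/item factor vectors toward it.
random.seed(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 1.0)]
K, lr, reg, epochs = 2, 0.05, 0.01, 500
P = [[random.uniform(0, 0.1) for _ in range(K)] for _ in range(2)]  # user factors
Q = [[random.uniform(0, 0.1) for _ in range(K)] for _ in range(2)]  # item factors

for _ in range(epochs):
    for u, i, r in ratings:
        pred = sum(P[u][k] * Q[i][k] for k in range(K))
        err = r - pred
        for k in range(K):
            pu, qi = P[u][k], Q[i][k]
            P[u][k] += lr * (err * qi - reg * pu)   # gradient step with
            Q[i][k] += lr * (err * pu - reg * qi)   # L2 regularization

# Reconstruction error should be small after training.
rmse = (sum((r - sum(P[u][k] * Q[i][k] for k in range(K))) ** 2
            for u, i, r in ratings) / len(ratings)) ** 0.5
print(rmse)
```

Each update touches only two small factor vectors, which is why a single machine with the data on local disk can stream through 100M records quickly.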

~~~
absherwin
C/C++ shouldn't generally be needed. The computationally intensive part is
optimized. What makes a given system slow is either poorly optimized numerical
code (less common) or doing a bunch of needless repetitive work because the
system doesn't make it easy to separate different components of the modeling.

They also won't even be seriously considered at most large companies, because
the typical person in that role doesn't have the skills, and a single person
using a different solution makes it difficult to transition work.

------
pinhead
I wonder how these would have compared to Spark <http://spark-project.org/>.
Hadoop isn't really an ideal framework for this type of computation.

------
nopal
This seems like spam, but I'll leave it to someone with more domain experience
to flag, if so.

~~~
jmount
No, r-bloggers is a well regarded voluntary aggregator.

~~~
jmount
Go ahead, downvote me more. But r-bloggers asks permission before they
aggregate. That is what I consider good, non-spammy behavior. And the article
may not be to your taste, but that is what business presentations at a
conference like Strata are: we tried a few things that turned out not to work
and a tool we like. Yes, there are other tools that would also work (they
should be included, and would be insisted on in true academic work). But
saying "I feel it is spam" is weak. Either say "it is spam, and here are a few
of my reasons" (and flag) or wait on commenting.

------
EvanMiller
5) Take a random sample of 50,000 records and get 90% of the business
intelligence value in about 2 seconds.

If you're fitting a model with only 70 degrees of freedom then analyzing 150M
records is a complete waste of time.

~~~
actuary
This is absolutely not true for insurance data, where the task is to predict
expected losses per policy and (in any given year) perhaps only 1% of policies
will have any losses at all. Even if your statement were true, this sort of
analysis has nothing to do with business intelligence. The goal is to minimize
adverse selection in a competitive marketplace. There is no such thing as
"good enough". (If there were, I would be out of a job.)

~~~
absherwin
Try stratified sampling. Removing records without claims only increases the
variance of the denominator, which is much less variable. You actually can
eliminate the majority of the data and find results that are the same to
several decimal places. Note that this only works with very large datasets
without extremely high-cardinality variables.

That said, 50,000 is too few. For a dataset of this size, 20 million records is
likely more reasonable. The actual answer depends on the variance of the
individual predictors and their correlation with each other.
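
A toy Python illustration of the keep-the-claims, down-sample-the-rest idea (the 1% claim rate matches the figure quoted above; the 5% keep fraction and dataset size are made up for the example):

```python
import random

# Stratified sampling for a rare outcome: keep every claim record,
# keep only a fraction of the no-claim majority, and reweight so the
# estimated claim rate is unchanged.
random.seed(1)
N = 200_000
policies = [1 if random.random() < 0.01 else 0 for _ in range(N)]  # 1 = claim

full_rate = sum(policies) / N

keep = 0.05  # retain only 5% of no-claim records
sample = [(y, 1.0 if y == 1 else 1.0 / keep)        # (outcome, weight)
          for y in policies
          if y == 1 or random.random() < keep]

weighted_rate = (sum(y * w for y, w in sample) /
                 sum(w for _, w in sample))

print(full_rate, weighted_rate)  # nearly identical, from ~6% of the data
```

The weighted estimate matches the full-data rate closely because the down-sampled stratum (no-claim policies) contributes almost no variance to begin with.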

------
rorrr
Smells like a bullshit paid article.

