
Using R to detect fraud at 1M transactions per second [video] - sndean
http://blog.revolutionanalytics.com/2016/09/fraud-detection.html
======
gearhart
That's half of the population of earth buying something on a credit card.

I'm assuming this wasn't real-time, real-world data (although I didn't watch
the whole 1.5hr video to confirm), but the implication is that this system
could process the peak load of global credit card transactions as they
happened. That's pretty impressive.

------
dhd415
The presentation is a little light on the technical details of how the demo
was run. What I could get from it was 1M fraud predictions/sec via R stored
procedures, run against data streaming into SQL Server 2016 and stored in
in-memory columnstore tables on a 4-socket "commodity" server.

------
baldfat
> PROS has been using R for a while in development, but found running R within
> SQL Server 2016 to be 100 times (not 100%, 100x!) faster for price
> optimization. "This really woke us up that we can use R in a production
> setting ... it's truly amazing," he says.

WOW, if this is even half true we have a whole new area of use for R.

~~~
nerdponx
What does it mean to run R "within" SQL Server here?

~~~
larrydag
SQL Server R Services: [https://msdn.microsoft.com/en-us/library/mt604845.aspx](https://msdn.microsoft.com/en-us/library/mt604845.aspx)

New for SQL Server 2016
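
In short, SQL Server 2016 can execute R scripts on the database server itself: a
stored procedure hands a result set to an R script and reads a data frame back.
A minimal sketch of the R side as R Services would run it (the model and column
names here are made up for illustration):

    # R script body as it would sit inside a SQL Server R Services stored procedure.
    # 'InputDataSet' is the data frame SQL Server passes in; whatever you assign to
    # 'OutputDataSet' is returned to the caller as a result set.
    model  <- readRDS("fraud_model.rds")    # hypothetical pre-trained model
    scores <- predict(model, newdata = InputDataSet, type = "response")
    OutputDataSet <- data.frame(
      transaction_id = InputDataSet$transaction_id,
      fraud_score    = scores
    )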

------
contingencies
"Using <technology of the day> to <contribute to some exciting high level
business sounding goal> with <impressive statistic>". Video, reportedly
without technical specifics.

They say if you can't communicate something succinctly, then you don't truly
understand it. They are right.

------
vasaulys
Does anybody use R in production services or just for exploratory work?

It seems that once you figure out a good model in R, it's almost always
rewritten in either Scala or Java for real production work.

~~~
apohn
I used to work in the consulting arm of a software firm and we wrote and
deployed R code in production at many Fortune 500 companies. We worked in
almost every industry.

I spent quite a bit of time refactoring bad R code so it could run reliably in
a production environment. There is a ton of bad R code out there that barely
works for exploratory analysis, let alone a production environment.

So yes, R is used in production environments in a lot of places.

~~~
vijucat
Did you guys separate out the R process (or multiple processes?) from the rest
of the transaction-processing / other server infrastructure or embed the
REngine (which sounds like a bad idea to me; incorrect data serialization can
easily crash the whole process)?

What is a stable way to connect (and reconnect!) to R, assuming it was a
separate process? I would think that an indirect communication path, such as
Server <--> Database <--> R, would work best, but I'd love to hear your
battle-hardened take on it.

~~~
apohn
We used separate workflows depending on whether the data was streaming or
batch-oriented (e.g. on-demand or triggered by a user). First I'll talk about
batch-oriented jobs.

The company I worked for had a Tomcat-based product that exposed R via a
RESTful API. It was similar to what you get from AzureML now, except it was
on-premise. So basically we would call out to this and configure it to restart
R sessions if they crashed or timed out.
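
To give a rough idea of the shape of that, here's a minimal sketch using the
open-source plumber package - not the product we actually used, and the names
are made up, but it's the same request-in, score-out pattern:

    # score_api.R -- toy R-behind-REST sketch (plumber; illustrative only)
    library(plumber)

    model <- readRDS("fraud_model.rds")   # hypothetical pre-trained model

    #* Score a batch of transactions posted as JSON
    #* @post /score
    function(req, res) {
      txns <- jsonlite::fromJSON(req$postBody)
      data.frame(fraud_score = predict(model, newdata = as.data.frame(txns),
                                       type = "response"))
    }

    # launched separately, e.g.:  plumber::plumb("score_api.R")$run(port = 8000)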

In an ideal situation we would isolate this server from the rest of the
processing as much as possible. To be honest our server was pretty basic - it
mostly served to queue jobs (if needed) and manage R sessions if it was
configured to run multiple sessions. For serious failover we had a second
server.

We did try to do as much as possible outside of R, such as data pipelining and
ETL. That was done for the obvious reasons, but also because many customers
had SQL and data people, but not R people. So if one of their data people
understood the ETL, they could fix it without calling us.

Many customers would never let R connect to a database directly, so they'd
have a separate process pull data and write it to disk. Then an R script would
be triggered and would pick that data up.
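
The R side of that file-drop pattern is deliberately dumb - roughly something
like this (paths and column names are illustrative):

    # batch_score.R -- triggered after the extract lands on disk; R never touches the DB
    model    <- readRDS("/models/fraud_model.rds")
    drop_dir <- "/data/drops"
    done_dir <- "/data/scored"
    for (f in list.files(drop_dir, pattern = "\\.csv$", full.names = TRUE)) {
      txns <- read.csv(f)
      txns$fraud_score <- predict(model, newdata = txns, type = "response")
      write.csv(txns, file.path(done_dir, basename(f)), row.names = FALSE)
      file.remove(f)   # so the next trigger doesn't re-score the same file
    }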

I never saw major crashing issues with R in production with batch-oriented
jobs unless there was something unexpected about the size or type of the data.
Typically, as long as there was time between jobs, R's garbage collector would
sort things out and be ready for the next job. Also, by the time something
made it into production we'd hardened the script, frozen the CRAN package
versions, etc., so a small issue wouldn't snowball into a major one.
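
Pinning package versions is easier now than it was; for example the checkpoint
package (also from Revolution Analytics) locks everything to a dated CRAN
snapshot:

    # one way to freeze package versions: point the project at an MRAN snapshot
    library(checkpoint)
    checkpoint("2016-09-01")   # scans the project, installs the packages it uses from
                               # the CRAN snapshot of that date, and switches .libPaths()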

Streaming data presented its own adventure. To get data into/out of R as
quickly as possible, you need to embed the REngine and talk to it via rJava.
If we streamed data through R very quickly it would do fine for a while - then
you'd see memory usage go up and the time for each transaction would start to
vary greatly. Then it would crash.

The solution to this was multiple R sessions and a lot of telemetry. We would
track how long each transaction took to go through R. As soon as we started
seeing a lot of variance in the timings we'd restart the engine. By running
multiple R sessions in round-robin we'd delay the onset of this instability,
and it didn't matter when an individual session needed to be restarted.
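
The restart heuristic itself was nothing fancy - in spirit it was something
like this sketch (the window size and threshold are made up):

    # toy version of the telemetry check: watch per-transaction latency and
    # flag the session for recycling once the variance blows up
    recent_ms <- numeric(0)

    record_latency <- function(ms) {
      recent_ms <<- tail(c(recent_ms, ms), 200)   # keep a sliding window of timings
    }

    session_needs_restart <- function(max_sd_ms = 25) {
      length(recent_ms) >= 50 && sd(recent_ms) > max_sd_ms
    }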

Another trick we used was to cache data in an in-memory database so if
something crashed the whole service would restart and pull from the in-memory
database instead of trying to fetch old data from the server.
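
Same idea as a write-through cache. With something like Redis from R (the
redux package here - ours wasn't actually Redis) it's just:

    # cache the latest batch so a restarted service can resume without
    # re-fetching from the upstream server (illustrative only;
    # 'current_batch' is whatever data frame is currently in flight)
    r <- redux::hiredis()
    r$SET("last_batch", jsonlite::toJSON(current_batch))

    # ...and after a crash/restart:
    current_batch <- jsonlite::fromJSON(r$GET("last_batch"))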

~~~
vijucat
Thanks, this is all quite useful! I faced crashes with REngine + rJava too,
and thought of a DB as an intermediary, but your in-memory DB idea is an
interesting twist that adds performance as well.

