
Architecting a Machine Learning System for Risk - lennysan
http://nerds.airbnb.com/architecting-machine-learning-system-risk/
======
sytelus
I'm actually a bit surprised at people using PMML and this architecture.
Clearly, the attempt here is to isolate model generation from runtime
prediction, but doing this also confines you to the least common denominator.
This means you can't generate any model that Openscoring can't handle. If you
think about it, there is no real need for Openscoring. You can whip up a REST
service very easily that wraps a scikit-learn predictor, and I would bet it's
actually much easier to do than writing PMML exporters. Then you can use all
the goodness of top-of-the-line models while your service interface still
remains the same. An architecture that forces you to use the lowest common
denominator just for abstraction purposes is a poor design, IMO.
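
To make the idea concrete, here is a minimal sketch of such a wrapper using only the Python standard library. Everything named here (`StubModel`, `score`, `make_server`, the JSON shape) is hypothetical and for illustration only; `StubModel` stands in for a fitted scikit-learn classifier, which would expose the same `predict_proba` method.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubModel:
    """Hypothetical stand-in for a fitted classifier with predict_proba."""

    def predict_proba(self, rows):
        # Made-up rule: risk grows with the first feature, capped at 1.0.
        risks = [min(row[0] / 100.0, 1.0) for row in rows]
        return [[1.0 - r, r] for r in risks]


MODEL = StubModel()


def score(body: bytes) -> bytes:
    """Map a JSON request {"features": [[...], ...]} to JSON scores."""
    rows = json.loads(body)["features"]
    probs = MODEL.predict_proba(rows)
    # Column 1 is the probability of the positive (fraud) class.
    return json.dumps({"scores": [p[1] for p in probs]}).encode()


class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = score(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int) -> HTTPServer:
    """Build (but do not start) the HTTP server on localhost."""
    return HTTPServer(("127.0.0.1", port), ScoreHandler)
```

Calling `make_server(8080).serve_forever()` would start it. Swapping in a different model type only changes `MODEL`; the interface stays the same.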

~~~
mwexler
PMML allows you to use models generated by a variety of tools and systems,
including external vendors. While you are right that making it core does
create some constraints, it also allows them to easily replace portions of the
model building pipeline with relative ease.

Yes, PMML isn't perfect (being kind), but it continues to be extended and is
the one shared lingua franca we have across model creation systems, short of
(sigh) SAS code and "recode the model in generic C", both of which I see too
often.

I suspect in the future we'll see "standard" architecture with pipelines with
multiple parallel feeds and runtime engines into ensembles, each of which
allows various model types in "native" format (sklearn and other pythonics, R,
java, etc.) which would be interesting, instead of having to cram all into
PMML. Just a thought.

~~~
zephyrnh
Agreed, this describes our reasoning pretty well. Additionally, as we
mentioned towards the end of the post, the incremental benefit of a model not
supported by PMML is unlikely to be more significant than the incremental
improvements we see from investing in improving our features and ground truth.
As such, our lowest common denominator isn't actually the model, but rather
the features and data pipelines. The suggestion sytelus makes is a perfectly
good way of doing it though, and we will likely change our approach when we
find that our models do become the lowest common denominator, or have a higher
relative ROI for the time we invest in improving them.

------
jacquesm
I've just built a system very much like this for a large customer. Extremely
interesting and I learned a lot while doing it. Funny to see companies
operating at a similar scale running into similar problems and solving them in
roughly similar ways.

------
shoyer
Looks like a cool project... but I hope they plan to open source their library
to export Scikit-Learn classifiers to PMML! This would be a great way for them
to give back to the open source community.

~~~
zephyrnh
We'd love to do this. As you can imagine, however, it's a very time consuming
task, and there are a lot of competing priorities (including other projects
we've open sourced) and we therefore can't make any guarantees about if/when
we'll be able to do it.

------
xtacy
Nice writeup. It seems like a supervised learning approach to fraud detection.
I have a question: where does the is_fraud variable come from? Is it set by
humans?

~~~
msherry
(I don't work for Airbnb, but work in a similar space)

Yes, this variable is usually set after the fact. For instance, a given
transaction may have led to a chargeback, or may be done by a known fraudster.
These models are usually trained on historical data, so we can know with some
certainty which transactions are fraud.

It could be that a transaction was fraudulent but has not yet led to a
chargeback (maybe the real cardholder hasn't yet seen their statement?), so
there's still some uncertainty, but hopefully that approaches a minimum after
some time passes.
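
A rough sketch of that labeling step, assuming a hypothetical 90-day maturity window and made-up field names (`id`, `date`): transactions with a chargeback are labeled fraud, older transactions without one are labeled legitimate, and anything too recent is left out of the training set because its label is still uncertain.

```python
from datetime import date, timedelta

# Assumption: after 90 days, most chargebacks that will happen have happened.
MATURITY = timedelta(days=90)


def label_transactions(transactions, chargeback_ids, today):
    """Return training rows with an is_fraud label, skipping immature ones."""
    labeled = []
    for txn in transactions:
        if txn["id"] in chargeback_ids:
            # A chargeback is treated as confirmed fraud regardless of age.
            labeled.append({**txn, "is_fraud": True})
        elif today - txn["date"] >= MATURITY:
            # Old enough that a missing chargeback is reasonably trustworthy.
            labeled.append({**txn, "is_fraud": False})
        # Otherwise the transaction is too recent to label; leave it out.
    return labeled
```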

~~~
segmondy
What software are you using?

~~~
msherry
I don't want to hijack this thread too much from the original post, but we use
some of the same software as Airbnb (scikit-learn, randomforest models, etc.)
as well as some stuff developed in-house. Credit card fraud has been one of
our biggest issues, and we've developed some pretty robust systems to fight
it. Contact me privately and I'd be happy to talk about it -- this stuff is
what I do for a living.

------
Hortinstein
Don't mean to hijack the comment thread, but can anyone recommend any good
videos that introduce machine learning or courses? I studied computer science,
but was not able to take any classes on the subject. I found the Stanford one,
anyone have experience with it?

[http://online.stanford.edu/course/machine-learning](http://online.stanford.edu/course/machine-learning)

~~~
Derander
I've heard nothing but good reviews of the online version of the class. I took
the class at Stanford (and actually worked on the system mentioned in the
article) and I found its content to be useful. I believe that the online
version contains less theory but this is not necessarily a bad thing if all
you want is an introduction.

I've also found reading papers to be illuminating: often the first article
about a given classifier is fairly well written and accessible if you have a
strong background in math.

This is also a useful thing to keep in mind: [http://scikit-
learn.org/stable/tutorial/machine_learning_map...](http://scikit-
learn.org/stable/tutorial/machine_learning_map/)

~~~
hoprocker
This looks clipped in my browser, but you can see the original diagram here:
[http://peekaboo-vision.blogspot.com/2013/01/machine-
learning...](http://peekaboo-vision.blogspot.com/2013/01/machine-learning-
cheat-sheet-for-scikit.html)

------
ShabbyDoo
I was thinking about the sorts of fraud categories AirBnB likely experiences.
Most fraudsters want cash or cash equivalents, and the use of lodging on a
particular night is nearly as illiquid as stolen fine art. So, those seeking
stuff to resell will choose to defraud one of the zillion online marketers who
ship stuff to doorsteps. A buyer who actually used the space he reserved could
initiate a chargeback later claiming that the service promised via AirBnB
wasn't provided -- couldn't access apartment, wasn't as described, etc.
However, space providers likely will cooperate with AirBnB and provide
evidence in their defense. Better to attempt a chargeback elsewhere if one is
short on money. It seems that using AirBnB as a platform for crimes between
buyer and space provider is possible, and there certainly has been at least
one heavily publicized case, but we would hear a lot more about these events
if they were happening much.

So, what's left? Collusion between buyer and space provider -- in all
likelihood, they are one and the same, or identities have been stolen. For
example, I list my condo on AirBnB for $100/night. Someone books it for the
weekend, and then doesn't show up. AirBnB owes me $200 -- after all, I gave up
other options to profit from its use. An honest buyer pays up. But, maybe the
buyer is dishonest -- he used a stolen credit card, etc. In this case, AirBnB
eats the loss and pays me as the space provider. Now, wouldn't it be
convenient if I was also the buyer? Cash from stolen credit cards, funneled
through AirBnB (much akin to the way online poker sites were used to transfer
stolen money via bad heads-up play). This would work until AirBnB noticed that
my listing seems to have a suspicious propensity to attract fraudulent buyers.
Then, they'll shut me down. So, I'll pop-up elsewhere. After all, no need to
actually have a space because no one I accept will ever show up!

I bet the usage patterns of the party/parties involved in this fraud are
drastically different than those of legitimate market participants. Someone
with a fraudulent listing could out himself by rejecting a bunch of legitimate
AirBnB buyers, and this behavior would stand-out as it's the opposite of the
behavior expected of an honest seller. So, he must protect against this risk
by making his listing unappealing (high price, bad photos/description,
unpopular location, etc.). The behavior of users browsing AirBnB when viewing
this property could identify its relative undesirability (few clicks, etc.),
and price outliers could be identified by comparing similar offerings by
date/location/type. The click stream of the "buyer" likely is most revealing.
Someone selecting an unappealing property without doing much comparison
shopping likely isn't a legit buyer.
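
As a sketch of the price-outlier idea only (this is conjecture too, not anything AirBnB has described), one could flag listings whose price is far from the median of comparable listings, using the median absolute deviation so that the outlier itself doesn't distort the baseline. The function name, input shape, and threshold are all assumptions.

```python
from statistics import median


def price_outliers(listings, threshold=3.5):
    """listings: (listing_id, price) pairs for comparable date/location/type.

    Returns the ids whose modified z-score exceeds the threshold.
    """
    prices = [price for _, price in listings]
    med = median(prices)
    # Median absolute deviation: a spread estimate robust to extreme values.
    mad = median(abs(price - med) for price in prices)
    if mad == 0:
        return []  # All prices (nearly) identical; nothing to compare against.
    return [
        listing_id
        for listing_id, price in listings
        if 0.6745 * abs(price - med) / mad > threshold
    ]
```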

What other stuff might predict fraud? Vague descriptions might indicate a
fraudulent listing. Most space providers love to tell buyers what's special
about their offering. Could some scoring of a listing's prose prove a strong
predictor? I've never listed with AirBnB. What do they do to verify listings?
As a buyer, they verified my identity. Could this serve multiple purposes?
Certainly, I'd feel better listing my guest room if I know that AirBnB will
know the identity of the guy who rented the room and then stabbed me at 3AM.
But, in addition, does identifying market participants in strong ways help
keep fraudsters from repeating their crimes by setting up multiple accounts?
Obviously, newer market participants are more risky than established ones,
especially those who have interacted with known legit, long-time users. The
social graph comes to the rescue here. Even astroturfing ought to show up as a
small, disconnected graph unless legit users' identities are stolen.

Of course, this comment is all just conjecture. Obviously, AirBnB can't tell
the public about specific fraud methods or how they identify suspicious
activity. However, I like the concreteness of considering actual fraud
scenarios, so I decided to put forth some ideas for discussion.

------
bayesianhorse
I didn't quite understand the need for openscoring and pmml. If it's just a
question of using a sklearn model to predict an outcome, why not just build it
into a simple json-rpc with Tornado, Gevent or whatever the rage is,
currently?

~~~
gallamine
I'm working on a very similar problem right now, and the difficulty is that to
save the fitted sklearn model you have to pickle it (a pickled, decent-sized
random forest is several megabytes). Then, at classification time, you have
to import pickle, sklearn (and numpy), unpickle the object, run the example
through the classifier and extract the output. Perhaps the Openscoring model
is more efficient?

~~~
ogrisel
You can use `all_model_filenames = joblib.dump(model, filename)` after fit on
your dev environment. joblib will store each numpy array in the model
data structure as an independent file and `all_model_filenames[0] == filename`
refers to the file holding the main pickle structure.

Then on your prediction servers, ensure that you have a copy of every file
listed in `all_model_filenames` in the same folder. You can then load the model
with `model = joblib.load(all_model_filenames[0], mmap_mode='r')`. This will make it possible
to use shared memory (memory mapping) for the model parameters of a large
random forest so that all the Gunicorn, Celery or Storm worker processes
running on the same server will use the same memory pages, making it a very
efficient way to deploy large models on RAM constrained servers.

You can even use docker to ship the model as part of a container and treat the
model as binary software configuration.
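
Put together, the dump/load round trip might look like the sketch below. The `model` dict is only a stand-in for a fitted estimator (any picklable object holding numpy arrays behaves the same way), and the temporary directory and file name are assumptions. Depending on the joblib version, the returned list may contain a single file rather than one file per array.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for a fitted model: a picklable object holding a numpy array.
model = {"n_trees": 3, "weights": np.arange(1000, dtype=np.float64)}

model_dir = tempfile.mkdtemp()
filename = os.path.join(model_dir, "model.joblib")

# Dev environment: dump returns the list of files that were written.
all_model_filenames = joblib.dump(model, filename)

# Prediction server: memory-map the arrays read-only instead of copying them,
# so worker processes on the same machine can share the same pages.
loaded = joblib.load(all_model_filenames[0], mmap_mode="r")
print(type(loaded["weights"]).__name__, loaded["weights"][:3])
```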

------
czbond
I was wondering if their solution is a home grown version of SiftScience?

~~~
zminjie
Actually SiftScience lists AirBnb as one of their customers on their site, so
I'd say it is very likely built on top of SiftScience.

~~~
czbond
Great catch. I had thought about doing the same.

------
elliott34
Does anyone know if a gradient boosting model can be ported from R to Java?

------
Mangalor
OMG I love that cartoon. "Machine learning" is such a funny phrase if you
think about it.

------
contingencies
Pet peeve: the verb is _designing_.

Besides that, let me rephrase here.

 _At <west coast startup, essentially a copy of earlier successful European
businesses such as HouseTrip, but with access to stupid amounts of US capital
and therefore more profitable>, we <make superfluous, keyword-laden,
unverifiable claim about ourselves in the future>. We <continue to integrate
feel-good community pronouns>. We <here discuss something only tangential to
our core business and assert that we have allocated at least two people to
this area>. We <have nothing better to do than write it up, because quite
frankly, there's nothing more pressing for us to work on in an already
automated business of relative simplicity>_.

OK, so that's a bit harsh, but there's some points toward reality in there.
Sorry, as someone who used to run a complex travel industry business (3200+
hotel contracts... all of them in Chinese, all business by digital fax (no
convenience here!), constant rate changes, in 6 human languages and multiple
currencies with a real time call center) and who co-pitched for VC with
HouseTrip's management in London in 2009, I just have very little respect for
AirBNB.

~~~
voronoff
Care to do a write up of what you're doing as far as fraud prevention goes?
However you feel about Airbnb, this was an interesting post. It's not ground
breaking or earth shattering, but it shows the tech stack that a large company
is using to solve a real problem, and that's useful. They even did a fairly
good job of explaining the why.

~~~
contingencies
_Care to do a write up of what you're doing as far as fraud prevention goes?_

Sure, but only high level. I would hazard a guess that fraud prevention is a
lot more complex for us at [https://www.kraken.com/](https://www.kraken.com/)
... dealing with many cryptographic currencies and conventional currencies
spread across probably over a hundred legal jurisdictions is not easy. We
likely have to consider far more factors than these guys. We have recently
added two more quants from programs highly regarded in the conventional
finance industry to our team, plus we have over seven figures of investment in
legal and training programs in the area. We also use _R_.

Basically, it's inputs (behavior), processing (metric extraction, risk model),
output (boolean choices, statistical cluster membership, etc.)... where a
series of such outputs may feed into a hierarchy of scores for different
elements within a system. Some applications may be real time, others after-
the-fact.

At a high level, which is mostly where my involvement is in hiring people,
fraud prevention is not dissimilar to spam or intrusion detection: you can
basically use a combined, constantly tweaked set of inputs to a Bayesian-style
scoring algorithm. Inputs include both static rules and statistical anomaly
detection.
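
For illustration only, here is a naive-Bayes-style sketch of that kind of scoring, in Python rather than R. It is not our actual system: the function name, the signal names, and all probabilities in the usage note are made up, and it assumes the signals are conditionally independent, which real inputs rarely are.

```python
import math


def fraud_probability(prior, signals, observed):
    """Combine binary signals into a posterior fraud probability.

    prior:    P(fraud) before looking at any signal.
    signals:  name -> (P(signal fires | fraud), P(signal fires | legit)).
    observed: set of signal names that fired for this event.
    """
    log_odds = math.log(prior / (1 - prior))
    for name, (p_fraud, p_legit) in signals.items():
        if name in observed:
            # Signal fired: multiply the odds by its likelihood ratio.
            log_odds += math.log(p_fraud / p_legit)
        else:
            # Signal absent: that is evidence too, usually toward legit.
            log_odds += math.log((1 - p_fraud) / (1 - p_legit))
    return 1 / (1 + math.exp(-log_odds))
```

A static rule and an anomaly detector would both enter as entries in `signals`, e.g. `{"new_account": (0.8, 0.2), "amount_anomaly": (0.5, 0.05)}`, and tweaking the system means re-estimating those pairs.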

[http://en.wikipedia.org/wiki/Bayesian_probability](http://en.wikipedia.org/wiki/Bayesian_probability)

