
Successful machine learning models: lessons learned at Booking.com - joker3
https://blog.acolyer.org/2019/10/07/150-successful-machine-learning-models/
======
Odenwaelder
"Content Overload: Accommodations have very rich content, e.g.descriptions,
pictures, reviews and ratings."

Laughed at that one. Booking.com is so full of dark patterns that I dread
using it.

~~~
Bootwizard
What do you mean by "dark patterns"? I'm not familiar with that term.

~~~
tgsovlerkhgsel
Deceiving, tricking and pressuring users into taking actions.

For example, LinkedIn has a flow with an e-mail box and a password box that
gets a less attentive user to just re-enter their LinkedIn credentials. But
it's actually a phishing form for your e-mail account, so if your LinkedIn and
e-mail passwords are the same, you have now "consented" to having your address
book scraped and your contacts spammed.

Or, in the case of Booking.com:

* Every step has items designed to pressure you to book NOW because it'll be too late otherwise:

  - "booked x times in the last x hours" on the listing

  - "Only 1 room left!" (they now add "on our site" after they lost a consumer protection lawsuit)

  - Showing booked-out listings: "You missed it"

  - Various notifications like "last booked X minutes ago" and "limited supply" popping in while you're scrolling to raise the pressure

* Misleading or deceptive claims:

  - "Jackpot, this is the cheapest price you've seen" (the emphasis should be on "you've seen": this is shown even if you've only been looking at overpriced properties)

  - They seem to have stopped the "one person looking at this property" thing (meant to make you think the room may be gone if you don't book now; that one person is you), probably after being forced to by a court

  - A misleading rating system: the lowest possible rating is 2.5/10, and you rate category by category, which means that if the staff is friendly and the hotel is in a good location but rats and cockroaches ate your luggage while you slept, that's still an 8/10 property. In practice, you should assume that anything below 8 is not good, below 7.5 is bad, below 7 is catastrophic, and below 6 you may not survive.

  - I'd also assume that they mess with the reviews in various ways, like showing mostly positive ones, but I haven't verified that.

Overall, I like to compare the booking experience to a drill sergeant yelling
in your ear to convert (book) right now, NOW, DO IT, NOW, YOU MAGGOT! It seems
to have improved significantly over my previous experiences with them,
probably due to some combination of me getting used to ignoring the yelling,
them realizing that such a bad experience pushes customers away, and their
practices getting banned one by one.

It's a shame, because other than the drill sergeant, their site is great.

~~~
smueller1234
For a while, somebody (not me) in the infrastructure department was
maintaining a Greasemonkey (I know) script that would remove the urgency
messaging elements from the site. They used it both for themselves and to make
a point about how much more pleasant the site was without them.

------
anthony_doan
Experimental design is just a t-test? At least according to that picture, it
seems that way. Is there no ANOVA or interaction testing?

Do websites usually use only t-tests, adding one feature at a time?

~~~
amirmasoudabdol
It's even worse than that. Most of the time, the validity of the t-test they
are running is questionable: they are effectively running the t-test online,
checking for significance as the data comes in and stopping as soon as they
find it. That is fundamentally wrong, and not conclusive at all.
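
To make this concrete, here is a minimal simulation (my own sketch, with
made-up batch sizes; nothing from Booking.com). Both arms are drawn from the
same distribution, so a correctly run fixed-horizon t-test should reach
significance only about 5% of the time; peeking after every batch inflates
that badly:

```python
# A/A test with "peeking": there is no true difference between the arms,
# but we re-run the t-test after every batch and stop at the first p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_runs, batch, n_batches, alpha = 2000, 100, 20, 0.05

false_positives = 0
for _ in range(n_runs):
    a = rng.normal(size=batch * n_batches)  # arm A: pure noise
    b = rng.normal(size=batch * n_batches)  # arm B: identical distribution
    for i in range(1, n_batches + 1):       # peek after every batch
        _, p = stats.ttest_ind(a[: i * batch], b[: i * batch])
        if p < alpha:                       # "significant, stop the test!"
            false_positives += 1
            break

print(f"false positive rate with peeking: {false_positives / n_runs:.1%}")
# Prints a rate several times the nominal 5%.
```

The standard fixes are fixing the sample size in advance, or using sequential
testing methods that explicitly account for the repeated looks.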

~~~
smueller1234
A few years ago when I was still working there, involved in the
experimentation tooling among other things, we largely excised that particular
behavior. What are you basing your assertion on?

------
carapace
As an aside,

> developing an organisational capability to design, build, and deploy
> successful machine learned models in user-facing contexts is, in my opinion,
> as fundamental to an organisation’s competitiveness

You hear that, right? Already in 2019 you have to have AI, and do it well, to
be competitive. I just wanted to point out how _cyberpunk_ that is.

~~~
contingencies
I believe booking.com ran on Perl for a very long time. Maybe it still does.
Relevant quotes from
[https://github.com/globalcitizen/taoup](https://github.com/globalcitizen/taoup)
...

 _In #devops is turtle all way down but at bottom is perl script._ -
@devops_borat

 _Comedy: You, trying to launch a startup from scratch using Java. Tragedy:
Me, trying to debug 27k lines of legacy Perl that brings $113MM/yr_ -
@NeckbeardHacker

~~~
mmcclimon
It still does. Booking and ZipRecruiter are easily the two largest employers
of Perl programmers.

~~~
fludlight
Booking has been around forever, but I thought the latter was a newish
company. Did they consciously pick Perl in 2010 when they launched? I guess
that makes sense if that's the founders' background and their business model
is practically (crawling and) extracting and reporting job postings from all
over the web.

------
woliveirajr
> Model performance is not the same as business performance

This is interesting. Some people on the business side consider AI the solution
to all problems (as if there were one catch-all AI solution), and some
academics think the top-performing model for some classification task is the
only way to go; they all forget that the goal is to earn money.

~~~
ghaff
That was an interesting result from the original Netflix challenge.

First of all, it turned out that the winning model wasn't actually all that
useful, for various reasons such as its computational intensity.

But, more interestingly, it also turned out that the goal of the model
("best" recommendations) isn't actually Netflix's goal at all: they are much
more interested in customer retention and similar metrics. The two things may
be correlated, but they're certainly not the same thing.

I don't remember all the details but I thought it was a really good insight at
the time.

~~~
joker3
The big problem for Netflix was that their data was all from DVD rentals, but
by the end of the contest their business model was very streaming-oriented. As
you might imagine, people have beliefs about what they'll want to watch in a
few days that don't exactly match up with what they want to watch right now.
That difference killed the model.

~~~
mv4
another problem was that their "user" was actually a household, with a range
of (sometimes conflicting) likes and interests - which could result in strange
recommendations. They finally added profiles to fix that data problem.

------
beefield
Nice to see that I am not the only one who hates booking.com with a passion.

But what really amazes me is the market failure: hotels and other
accommodation providers can't come up with a co-op booking site. I am sure
there are issues that are difficult to solve from a competition point of view,
but are they _really_ so difficult that the rent-seeking fees of the current
booking sites are justified?

~~~
kfk
Yes, they really are. You are maybe not considering the challenge of putting
all those accommodations across the globe in one place. Accommodation
providers don't care how their rooms get filled as long as they get filled.
Booking, contrary to many other tech companies, is very successful
financially, with $14.5B in revenue and something like 25+% EBIT margins. I
find it hard to believe that a co-op would be able to build such a service
worldwide.

------
sandGorgon
>> _Booking.com go to some lengths to minimise the latency introduced by
models, including horizontally scaled distributed copies of models, a in-house
developed custom linear prediction engine, favouring models with fewer
parameters, batching requests, and pre-computation and /or caching._

Any idea what these are? Especially the pre-computation/caching and batching.
I'm not able to see what advantage batching brings, or how you can really
cache a prediction request.

~~~
Falcorian
Here is an overly simple example:

Pre-compute the recommended hotels for your top users every night. Now when
one of those users comes back, they see a slightly stale recommendation, but
it's lightning fast.

You can also pre-compute and cache some of the inputs to your model, like a
vector representation of a hotel's description.
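
A toy sketch of the idea (all names and sizes are made up; this is not
Booking.com's actual setup): a nightly batch job scores everything in one
vectorized call, and the request path becomes a dictionary lookup, with a
fallback for users missing from the cache.

```python
# Hypothetical precompute-and-cache serving sketch (invented names and sizes).
import numpy as np

rng = np.random.default_rng(0)
user_vecs = {f"user{i}": rng.random(16) for i in range(1_000)}  # toy features
hotel_vecs = {f"hotel{j}": rng.random(16) for j in range(200)}
FALLBACK = ["hotel0", "hotel1", "hotel2"]  # e.g. globally popular hotels

def nightly_precompute(top_k=3):
    """Run offline (e.g. from a cron job): one big batched model call."""
    hotels = list(hotel_vecs)
    H = np.stack([hotel_vecs[h] for h in hotels])  # (n_hotels, dim)
    U = np.stack(list(user_vecs.values()))         # (n_users, dim)
    scores = U @ H.T                               # a single vectorized matmul
    top = np.argsort(-scores, axis=1)[:, :top_k]   # best hotels per user
    return {user: [hotels[j] for j in row]
            for user, row in zip(user_vecs, top)}

cache = nightly_precompute()  # refreshed nightly, so results are up to a day stale

def recommend(user_id):
    # Request path: no model evaluation at all, just a lookup.
    return cache.get(user_id, FALLBACK)

print(recommend("user42"))    # precomputed recommendation
print(recommend("user9999"))  # cache miss: cheap fallback
```

Batching brings the same kind of win at serving time: one vectorized call over
many rows amortizes per-call overhead (dispatch, serialization, network) that
you would otherwise pay once per request.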

------
djp518
> Once deployed, beyond the immediate business benefit they often go on to
> become a foundation for further product development.

This is one of the reasons I am a big believer in having a system to track
model research and deployment lineage. (I personally use Domino Data Lab for
this. I also work for Domino, but use it in my own modeling work and that of
others I mentor.) No matter which system you use to track lineage, I've found
it important to have a strict history of retraining, versioning, and
experimentation. When models are used in systems downstream from the one they
were originally intended for, it becomes even more critical to be able to
explain and reproduce the 'research' that led up to deployment.

------
orasis
I’m glad they highlighted inference latency. This is a big issue that I’ve
started running into at scale.

~~~
romaaeterna
Coming from the math side, I don't really get this. Isn't all of the latency
introduced by the learning side of things? Shouldn't the answer side be
entirely decoupled from the learning, and simply be plugging data into an
equation with a bunch of constant parameters (with values discovered by your
learning system, updated at a less than realtime frequency)?

~~~
alexhutcheson
The "equation with a bunch of constant parameters" generated by an ML model
can be huge, with thousands of inputs or more. Evaluating that equation for a
specific observation can require a huge number of computations, which is why
there's a boom in ML inference hardware right now.
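
For a rough sense of scale (layer sizes are hypothetical, not from the paper),
count the multiply-adds in a small dense network and multiply by the number of
model calls a single page might trigger:

```python
# Back-of-the-envelope inference cost for a hypothetical MLP.
# A dense layer mapping m inputs to n outputs costs about 2*m*n FLOPs
# (one multiply and one add per weight).
layers = [(2000, 512), (512, 512), (512, 1)]       # made-up layer shapes
flops_per_call = sum(2 * m * n for m, n in layers)
print(f"~{flops_per_call / 1e6:.1f} MFLOPs per inference")  # ~2.6 MFLOPs

calls_per_page = 200  # e.g. one score per listing on a results page
print(f"~{flops_per_call * calls_per_page / 1e9:.2f} GFLOPs per page load")
```

Cheap for one call on modern hardware, but multiplied across every request,
plus feature assembly and serialization, it stops being free.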

~~~
romaaeterna
Yeah, I don't buy that. It's still just a matrix multiplication (for the
linear bits), which is incredibly fast. Besides, the old physics rule of thumb
is that any real-world equation with a bunch of parameters has only 5-7 that
actually matter, and only 3 that matter a lot. Everything else can be set to
zero without a noticeable change in the result.

~~~
orasis
If you’re making decisions that involve multiple variables you may be doing
hundreds to thousands of inferences for a single page load. Keeping latency
under 50ms becomes a real challenge.

~~~
romaaeterna
But it comes down to this, doesn't it:

x1 * a + x2 * b + x3 * c + ... + x1000 * zzz + ...

If a, b, c, ..., zzz are all fixed constants already discovered by your
learning algorithm, that's a _very_ fast calculation, and it doesn't take
anything like 50ms.

Also, in the real world, you can establish a significance cutoff for a lot of
these constants and get something like this as your final equation:

x13 * m + x523 * cdf + x777 * wdc + x893 * ydz
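
And the raw arithmetic really is that fast; a quick timing sketch
(illustrative sizes, and the exact number is machine-dependent):

```python
# Timing the bare 1000-feature linear score: it's microseconds, not
# milliseconds, so a 50ms budget is being spent somewhere else.
import time
import numpy as np

rng = np.random.default_rng(0)
w = rng.random(1000)  # the fixed constants a, b, c, ..., zzz
x = rng.random(1000)  # one observation's feature values

n = 100_000
start = time.perf_counter()
for _ in range(n):
    x @ w             # x1*a + x2*b + ... + x1000*zzz
elapsed = time.perf_counter() - start
print(f"{elapsed / n * 1e6:.2f} microseconds per linear score")
```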

~~~
jiofih
The inputs to those functions might be coming from external data sources, or
might need to be aggregated; those have a cost too. But mostly, it just adds
up: at a _thousand_ features, you have a 0.05ms budget for each. And that's
without taking network latency into account, since you won't be running those
models inside the application server.

~~~
romaaeterna
So why not load the calculated constants to the application server to reduce
network latency?

And the learning side of things should have culled that list of a thousand
features down to the 5-10 that mattered.

It really sounds like the off-the-shelf stuff isn't built for efficiency.

~~~
Breza
Nobody's saying this is an impossible problem. The paper shows how much
additional work is required beyond a traditional data science workflow.

The team behind the paper built a model that had good performance on training
data. They're a smart lot, so they knew they needed to cross-validate. The
results held up in cross-validation! Hooray, the model works! ...right?

That's as far as a lot of data scientists go. This paper points out that you
need a model that does (at least) three things:

1. Generates good scores on training and testing data

2. Outperforms existing models in the real world

3. Runs really, really quickly

There are a lot of data scientists who have no idea how to do #2 and #3. This
paper says "These parts are really important!!!"

------
villux
Any other good resources on production-level deep learning practices?

