Successful machine learning models: lessons learned at Booking.com (acolyer.org)
321 points by joker3 on Oct 7, 2019 | hide | past | favorite | 87 comments

"Content Overload: Accommodations have very rich content, e.g. descriptions, pictures, reviews and ratings."

Laughed at that one. Booking.com is so full of dark patterns that I dread using it.

This. I still (grudgingly) use them because they also seem to have figured out that it's important that your booking process works and is friction-free, and they often do have the best price.

But if I find an alternative that has the same width of offers and a booking process that doesn't feel like a drill sergeant constantly yelling "BOOK NOW YOU WORTHLESS SCUM, BOOK, BOOK, WHAT ARE YOU WAITING FOR YOU IMBECILE, CLICK IT, BOOK, NOW, NOW" - what do they think will happen?

> what do they think will happen?

Booking.com A/B tests everything: the drill-sergeant-like funnel probably has higher conversion rates than any gentler variation. So to answer your question - they think you might not book through them without the shoutiness.

Conversion rate, yes. But the parent asked about customer retention, which is also quite important but much harder to optimise for...

But does it have higher customer retention over the long term? Conversions are only one piece of running an online business

> But does it have higher customer retention over the long term?

I'm going to guess the answer is no - which is why organizations have to be careful which metrics they measure and incentivize on. Granted, this failing is industry-wide, as the longest view on most orgs' dashboards is YoY. When that metric starts freefalling, it will most likely be too late to do anything about it, but most of the staff (up to the CEO) will have padded their résumés with amazing numbers for improved conversion/revenue, which will get them to the next job.

Booking is obsessed with maximizing conversions, which just leads to dark patterns.

One lesson from all these things is that they maximize for what's easiest to measure, not what's most important. Conversions aren't the be-all and end-all; nobody wants to come back to a store with the pushy salesperson.

I'll get the phrasing wrong here, but: "What is easy to measure will be deemed important; what is difficult to measure will be deemed unimportant."

Oh, absolutely! How can you measure "I'm sick of being tricked into buying", "I wish this site treated me with respect", "I'm the client, and I feel like the product"?

They do/did have a customer satisfaction survey somewhere at the end or after the booking process.

I made sure to provide feedback.

When people book with you but give you 1/10 stars, that's probably a pretty strong warning sign that the customer isn't happy with the site and the first usable competitor that they find will get their business.

Lifetime value (LTV). It’s tricky to measure in the travel industry because users only transact once every few months or even years.

If the assumption is that the user is looking for the cheapest option regardless of how he's been treated, this strategy makes sense.

But that leaves you at the mercy of the competition, which in an open-data business (I mean, airlines are more than happy to tell you which flights they have available) implies that your product is undifferentiated. So eventually you resort to these tactics: as soon as a user gets in, do anything that's humanly possible to convert that sucker.

That's in essence what's wrong with Booking. This trickles down, unfortunately: Ryanair hides most of its costs in effectively forcing you to upgrade to Premium in order to be treated as something just a bit more than livestock, because the assumption that travelers' main concern is price pervades the industry, even if the price they are shown isn't the price they pay in the end.

> maximize for what’s easiest to measure

Just like the unholy abomination that professional project managers have turned “agile methodologies” into.

If everything in this post and comment threads are true, I'm not sure what good it does to post on HN.

I bet a "How to dox and stalk people with Python" post would be flagged down, so maybe I'm just complaining about the prevailing ethics on the site.

That's putting it lightly. It's an abomination of a website. Truly an assault on the senses. Just booked through them yesterday, to save a few bucks-- never again.

Now I understand why it's so bad-- "user interface optimization models"

I refuse to believe this brings real value. The more plausible reality is, they have fantastic SEO and a tightening stranglehold on marketshare, and some AI to squeeze a few more pennies out along the way. Whatever metrics they are seeing, it won't be worth it in the long run. This kind of UX and product won't last.

Apart from the SEO, they use a lot of the money from the cut they get from the hotel to outbid the hotel on paid ads. They are, more than anything, a marketing agency.

I've learned to ignore the dark patterns. Still use them because their free cancellation booking process takes a lot of the pain out of picking a hotel.

What do you mean by "dark patterns"? I'm not familiar with that term

Deceiving, tricking and pressuring users into taking actions.

For example, LinkedIn having a flow that has an e-mail and password box, which will get a less attentive user to just re-enter their LinkedIn credentials. But it's actually a phishing form for your e-mail, so if your LinkedIn and e-mail password is the same, you have now "consented" to have your address book scraped and your contacts spammed.

Or, in the case of Booking.com:

* Every step has items designed to pressure you to book NOW because it'll be too late otherwise:

- "booked x times in the last x hours" on the listing, or

- "Only 1 room left!" (they now add "on our site" after they lost a consumer protection lawsuit)

- Showing booked-out listings "You missed it"

- Various notifications like "last booked X minutes ago" and "limited supply" popping in while you're scrolling to raise the pressure

* Misleading or deceptive claims

- "Jackpot, this is the cheapest price you've seen" (emphasis should be on "you've seen", this will be shown even if you look at overpriced properties)

- They seem to have stopped the "one person looking at this property" thing (to make you think that it may be gone if you don't book now - that one person is you), probably after being forced to do so by court

- a misleading rating system (the lowest possible rating is 2.5/10, and you rate category-by-category, which means that if the staff is friendly and the hotel is in a good location etc. but the rats and cockroaches ate your luggage while you slept, that's an 8/10 property - in practice, you should assume that anything below 8 is not good, below 7.5 is bad, below 7 is catastrophic, below 6 you may not survive)

- I'd also assume that they mess with the reviews in various ways, like showing mostly positive ones etc., but I haven't verified that.

Overall, I like to compare the booking experience to a drill sergeant yelling into your ear to convert (book) right now, NOW, DO IT, NOW, YOU MAGGOT! They seem to have improved significantly over my previous experiences with them, probably due to some combination of me having learned to ignore the yelling, them realizing that such a bad experience pushes customers away, and their practices getting banned one by one.

It's a shame, because other than the drill sergeant, their site is great.

For a while, somebody (not me) in the infrastructure department was maintaining a greasemonkey (I know) script that would remove the urgency messaging elements from the site. They used it both for themselves and to make a point about how much more pleasant the site was without them.

Reviews have changed now and you can leave an overall rating (at least somewhere, they may be A/B testing). The reaction from Hosts has been negative. They should read your comment as they fail to understand the logic behind the change.

Dark Patterns are tricks used in websites and apps that make you do things that you didn't mean to, like buying or signing up for something.

See more examples here: https://www.darkpatterns.org/

It's amazing how once you put a label on something, you start noticing it _everywhere_.

Someone pointed out "confirmshaming" to me a few years ago...and since then I feel like it shows up on > 50% of the sites I visit.


How is confirm shaming a dark pattern? It puts more factual detail near the yes or no options so people know what they're agreeing to or declining.

> How is confirm shaming a dark pattern? It puts more factual detail near the yes or no options so people know what they're agreeing to or declining.

In its best form, perhaps. But more likely than not it looks something like this:

[ ] YES, I want to fight racism by subscribing to CrappyPublisher.net's twice-daily newsletter! [ ] NO, I am a racist (and also a pedophile)!

It's emotional manipulation. Confirm shaming uses fear.

If I'm cancelling amazon prime because it "costs too much" but you say "are you sure you want to miss out on all the fast shipping" someone who is easily manipulated may continue to subscribe because they are weak willed.

You could instead ask: "Why are you cancelling?":

- Cost

- Don't use it enough

- ...

- Other (please specify:)

Would you like to learn more about dark patterns? YES or LATER?


I would like you to remind me every day until I submit to your annoyance, I mean tomorrow please

Ben Edelman (Harvard, Microsoft) published a study [0] about how dark patterns in the online travel industry help them reach margins up to 25%. He also mentions the consolidation where most well-known booking sites are owned by just two large groups.

[0] http://www.benedelman.org/impact-of-ota-bias-and-consolidati...

Sounds like travel agents are becoming more and more cost effective every day. I mean, what does the Hyatt website do that a phone call 30 years ago couldn't?

Google's right over there.

I just want to note that "dark patterns" appear because each could be just one more A/B test that Booking.com obsessively implements again and again. If the tested pattern does not help increase the conversion rate, it will be shut down. What's wrong with that?

I think both of those statements are true. They do try to manipulate their users, and the listings you see on Booking.com have a lot of content too!

Experimental design is just a t-test? At least according to that picture it seems that way. Is there no ANOVA or interaction test?

Do websites usually just use t-test only? Like adding one feature at a time?

It's even worse than that. Most of the time the validity of the t-test they are running is questionable. They are effectively running an online t-test, and as soon as they find significance they stop. This is fundamentally wrong, and not conclusive at all.
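To see why stopping at the first significant result is invalid, you can simulate an A/A test with peeking: the nominal 5% error rate applies per look, and repeated looks compound it. A self-contained sketch (not Booking.com's actual tooling; the peeking schedule is made up):

```python
import math
import random

def z_test_p(xs, ys):
    # Two-sample z-test p-value (normal approximation; fine for n >= ~30).
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeking_experiment(rng, peek_every=20, max_n=200):
    # A/A test: both arms draw from the SAME distribution, so any
    # "significant" result is a false positive. We peek every
    # `peek_every` observations and stop as soon as p < 0.05.
    xs, ys = [], []
    for _ in range(max_n):
        xs.append(rng.gauss(0, 1))
        ys.append(rng.gauss(0, 1))
        if len(xs) % peek_every == 0 and z_test_p(xs, ys) < 0.05:
            return True  # stopped early, declared a (spurious) winner
    return False

rng = random.Random(42)
trials = 500
rate = sum(peeking_experiment(rng) for _ in range(trials)) / trials
print(f"false positive rate with peeking: {rate:.1%}")
```

With ten looks per experiment, the realized false-positive rate lands far above the 5% a single fixed-horizon test would give; sequential designs (e.g. alpha spending) exist precisely to make repeated looks legitimate.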

A few years ago when I was still working there, involved in the experimentation tooling among other things, we largely excised that particular behavior. What are you basing your assertion on?

Cheating on a hypothesis test... that's terrible.

People should be using more than one design, but that's not as commonly taught as it should be. I'm going to give a talk about that to my company's ML group in about a month, and hopefully that'll improve things where I am.

Not sure if you'd be willing, but I'd love a quick rundown of the high-level takeaways if you could drop them here.

Are you talking about more than one experimental design in terms of comparing the exp/control distributions or something else?

Not the OP, but I work on similar problems, albeit in a different setting (healthcare, millions+ of patients). The gist is that you have to bake experimental design into the deployment of your ML model, but in many cases a simple RCT or A/B test just won't cut it. This is largely because when you deploy a model, no matter how sophisticated or accurate, there's no guarantee that it'll actually move the needle in terms of the outcomes you care about—hence you need to run some kind of trial. At the same time, you want to maximize overall utility by not having to allocate more subjects to your control arm (or harmful, or resource-intensive and ineffective treatment arms) than you need to. This latter point is much more of a problem in medicine than it is in other settings, as you can imagine. These considerations point to adaptive designs that balance exploration/exploitation, e.g. those based on multi-armed bandits. Currently working on some cool (in my opinion) variations of MABs that incorporate domain-specific knowledge, so I could talk about this all day!
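For a concrete flavor of the bandit approach mentioned above, here is a minimal Beta-Bernoulli Thompson sampling sketch. The conversion rates and round counts are made up for illustration; this is not the commenter's actual system:

```python
import random

def thompson_trial(true_rates, n_rounds=5000, rng=random):
    # Beta-Bernoulli Thompson sampling: each arm keeps a
    # Beta(successes + 1, failures + 1) posterior; each round we draw one
    # sample per posterior and pull the arm with the highest draw, so
    # traffic shifts toward better arms as evidence accumulates.
    k = len(true_rates)
    succ, fail, pulls = [0] * k, [0] * k, [0] * k
    for _ in range(n_rounds):
        draws = [rng.betavariate(succ[i] + 1, fail[i] + 1) for i in range(k)]
        arm = max(range(k), key=lambda i: draws[i])
        reward = rng.random() < true_rates[arm]  # simulated conversion
        pulls[arm] += 1
        if reward:
            succ[arm] += 1
        else:
            fail[arm] += 1
    return pulls

rng = random.Random(0)
pulls = thompson_trial([0.04, 0.05, 0.08], rng=rng)  # hypothetical rates
print(pulls)
```

The exploration/exploitation balance is automatic: weak arms keep getting occasional traffic (so you retain statistical power), but most subjects end up on the apparent winner, which is the point in settings where a 50/50 control split is costly.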

Would be really keen to hear you speak about the subject in greater detail actually. Love having to balance the practicalities of implementing a model in production and validating the outcome while not missing out on utility.

Do you know a good introduction to adaptive/sequential designs? I'm looking for something along the lines of a textbook aimed at a graduate level seminar.


I'm just going to present the absolute basics of the topic. It'll be a high-level overview of something along the lines of chapters 1-4 of https://www.amazon.com/Analysis-Experiments-Chapman-Statisti....

Thanks for the reference!

Say you want to try adding two features which you don't think interact with each other, e.g. a change to the "pick this room" button and a change to the checkout flow. Then you can randomly assign users to the two experiments independently. Your t-test results should then be valid if the two features are independent.
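A common way to get that independent assignment is to hash the user id salted with the experiment name, so the same user lands in uncorrelated buckets across experiments. A minimal sketch (the `assign` helper and experiment names are hypothetical):

```python
import hashlib

def assign(user_id: str, experiment: str) -> str:
    # Salting the hash with the experiment name makes bucket assignments
    # across different experiments effectively independent, while staying
    # deterministic per user (the same user always sees the same variant).
    h = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(h, 16) % 2 == 0 else "control"

# The same user can land in different arms of the two experiments:
print(assign("user-123", "room-button"), assign("user-123", "checkout-flow"))
```

Determinism matters here: you don't want a user flipping between variants on every page load, and you want to be able to reconstruct assignments offline when analyzing the results.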

> which you don't think interact with each other,

> your t-test results should then be valid if the two features are independent.

Assuming that your assumptions about the interaction effect are correct.

You can do a hypothesis test on that assumption by including both factors (the two features) in the model, which will clear away any doubt with 95% confidence. Or hire a statistician =).

As an aside,

> developing an organisational capability to design, build, and deploy successful machine learned models in user-facing contexts is, in my opinion, as fundamental to an organisation’s competitiveness

You hear that, right? In 2019 already you have to have AI and do it well to be competitive. I just wanted to point out how cyberpunk that is.

>> developing an organisational capability to design, build, and deploy successful machine learned models in user-facing contexts is, in my opinion, as fundamental to an organisation’s competitiveness

>... I just wanted to point out how cyberpunk that is.

Nah, that is corporate flavor of the month/year/etc. It's not the 90s, so they're not "synergizing" any more but otherwise, whatever.

I believe booking.com ran on Perl for a very long time. Maybe it still does. ~relevant quotes from https://github.com/globalcitizen/taoup ...

In #devops is turtle all way down but at bottom is perl script. - @devops_borat

Comedy: You, trying to launch a startup from scratch using Java. Tragedy: Me, trying to debug 27k lines of legacy Perl that brings $113MM/yr - @NeckbeardHacker

I read a lengthy blog post on how Booking.com basically has people code live in production (slight exaggeration) and they're fine with it, due to some monster of a monitoring test suite.

It still does. Booking and ZipRecruiter are easily the two largest employers of Perl programmers.

Booking has been around forever, but I thought the latter was a newish company. Did they consciously pick Perl in 2010 when they launched? I guess that makes sense if that's the founders' background and their business model is practically (crawling and) extracting and reporting job postings from all over the web.

That gave me a flashback to the early 2000s AIML chatbot craze. "What, you have an online store and no customer chatbot? What are you doing?!"

The flipside is that this ML (whether you consider it AI or not) really is delivering huge value to the businesses that deploy it.

Not really.

Today's ML is really good at speech and image recognition, which makes for some very eye-popping layman demos.

Whereas what businesses really want is time series prediction, and modern ML really sucks balls at solving this problem.

Deep learning or machine learning?

I agree on the former, and quite strongly disagree on the latter, even if it means redefining ML to be dressed-up statistics.

Forecasting in 2019 is still using techniques from 1950. You can call that "machine learning", but only if you really want to make people confused by marketing speak.

> Model performance is not the same as business performance

This is interesting. Sometimes people from the business side consider AI to be the solution to all problems (as if there were just one catch-all AI solution), and some academic people think that the top-performing model for some classification task is the way to go; both forget that the goal is to earn money.

That was an interesting result from the original Netflix challenge.

First of all it turned out that the winner wasn't actually all that useful for various reasons such as computational intensity.

But, more interestingly, it also turned out that the goal of the model--"best" recommendations--isn't actually the goal of Netflix at all, which is much more interested in customer retention and similar metrics. The two things may be correlated, but they're certainly not the same thing.

I don't remember all the details but I thought it was a really good insight at the time.

The big problem for Netflix was that their data was all from DVD rentals, but by the end of the contest, their business model was very streaming oriented. As you might imagine, people have beliefs about what they'll want to watch in a few days that don't exactly match up with what they want to watch now. That difference killed the model.

Another problem was that their "user" was actually a household with a range of (sometimes conflicting) likes and interests, which could result in strange recommendations. They finally added profiles to fix that data problem.

Thanks, this makes more sense than any other explanation I've heard for why they didn't use the Netflix prize model in production.

> ... they forget that the goal is to earn money.

Relevant news.yc discussion from a month ago: https://news.ycombinator.com/item?id=20876158

> the goal is to earn money

Yep, from my experience with booking.com it seems that instead of using highly trained AIs the decision was made to simply slap every dark pattern known to man onto the site and auto-subscribe every customer to a dozen newsletters.

Nice to see that I am not the only one hating booking.com with a passion.

But what really amazes me is the market failure that hotels and other accommodation providers can't come up with a co-op booking site. I am sure there are issues that are difficult to solve from a competition point of view, but are they really so difficult that the rent-seeking fees of current booking sites are justified?

Yes, they are. You are maybe not considering the challenge of putting all those accommodations across the globe in one place. Accommodation providers don't care about how they fill up their rooms, as long as they get filled. Booking, contrary to many other tech companies, is very successful financially, with $14.5B in revenue and something like 25+% EBIT. I find it hard to believe that a co-op would be able to build such a service worldwide.

Sometimes you see initiatives where the tourist board of a city or a region provides a convenient site for booking a stay. But of course they don't come near the SEO power of Booking.com, so the accommodations also put their rooms there. For the traveler, it is easier to just always use Booking.com.

And people like me and you, who really don't want to book at Booking.com and make an effort to book elsewhere, are often in bad luck, because they own a bunch of other booking sites too.

>>Booking.com go to some lengths to minimise the latency introduced by models, including horizontally scaled distributed copies of models, an in-house developed custom linear prediction engine, favouring models with fewer parameters, batching requests, and pre-computation and/or caching.

Any idea what these are? Especially the pre-computation/caching and batching. I'm not able to see what advantage batching brings, or how you can really cache a prediction request.

Here is an overly simple example:

Pre-compute the recommended hotels for my top users every night. Now when that user comes back, they see a slightly stale recommendation, but it's lightning fast.

You can also pre-compute and cache some of the inputs to your model, like maybe a vector representation of the description of a hotel.

Batching just uses hardware capabilities efficiently. Either vector instructions in CPUs or batch capabilities in GPUs.

For the same hardware load, you can process several samples instead of just one.

Pre-computation means running your model on samples in advance, before the model result is needed, so it's ready to use instantly when needed.

Caching works probably because there are model results that are reused again and again, so it makes sense to cache them. For example, there are deep models that process room pictures, room and customer characteristics. Only customer characteristics change between customers, so it makes sense to cache the features output by the deep CNN that processes the room pictures.

Once you start doing prediction at scale, there are lots of these optimizations to pick up.
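A toy sketch of the caching idea described above, assuming a model that factors into an expensive per-hotel part and a cheap per-user part. The function names and the stand-in feature pipeline are hypothetical, not Booking.com's actual model:

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def hotel_embedding(hotel_id: int) -> tuple:
    # Stand-in for an expensive pipeline (e.g. a CNN over the hotel's
    # pictures and description); the result depends only on the hotel,
    # so it can be cached or pre-computed offline.
    return tuple(((hotel_id * (i + 1)) % 7) / 7.0 for i in range(4))

def score(user_features: tuple, hotel_id: int) -> float:
    # Only this cheap dot product runs fresh on every request.
    emb = hotel_embedding(hotel_id)  # cache hit after the first request
    return sum(u * h for u, h in zip(user_features, emb))

print(score((1.0, 0.5, 0.0, 0.2), 42))
```

The same split motivates nightly pre-computation: if the expensive part depends only on slowly changing inputs, you can materialize it for your whole catalog offline and serve lookups at request time.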

In my company we use this approach: instead of inferring online, we just run all our models overnight and save the results in a database that we serve through an API. That gives you constant-time latency. It's a shotgun approach, as many of the recommendations are never served (especially if they are user-facing rather than item-facing), but it works really well.

All our models are balanced using multi-armed bandits, so for our recommendation engine we run lots of arms that depend on the incoming channel, where in the app the recommendation is being shown, etc., and just combine the outputs of the models.
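One simple way to structure per-context arms like that is a bandit instance per channel; here is a generic epsilon-greedy sketch, not the commenter's actual system (class and channel names are made up):

```python
import random

class EpsilonGreedy:
    # Minimal epsilon-greedy bandit: with probability epsilon pick a
    # random arm (explore), otherwise pick the arm with the best
    # observed mean reward so far (exploit).
    def __init__(self, n_arms: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select(self, rng=random) -> int:
        if rng.random() < self.epsilon:
            return rng.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)), key=lambda i: self.values[i])

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental mean of the rewards observed for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# One bandit per context, e.g. per incoming channel:
bandits = {"app": EpsilonGreedy(3), "email": EpsilonGreedy(3)}
arm = bandits["app"].select()
bandits["app"].update(arm, reward=1.0)
```

Keeping a separate bandit per context is the crudest form of contextual bandit; it works well when contexts are few and traffic per context is high.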

> Once deployed, beyond the immediate business benefit they often go on to become a foundation for further product development.

This is one of the reasons I am a big believer in having a system to track model research and deployment lineage. (I personally use Domino Data Lab for this. I also work for Domino, but use it in my own modeling work and that of others I mentor.) No matter which system you use to track lineage, I've found it important to have a strict history of retraining, versioning, and experimentation. When models are used in downstream systems from the one they were originally intended, it becomes even more critical to able to explain and reproduce the 'research' that led up to deployment.

I’m glad they highlighted inference latency. This is a big issue that I’ve started running into at scale.

Coming from the math side, I don't really get this. Isn't all of the latency introduced by the learning side of things? Shouldn't the answer side be entirely decoupled from the learning, and simply be plugging data into an equation with a bunch of constant parameters (with values discovered by your learning system, updated at a less than realtime frequency)?

The "equation with a bunch of constant parameters" generated by an ML model can be huge, with thousands of inputs or more. Evaluating that equation for a specific observation can require a huge number of computations, which is why there's a boom in ML inference hardware right now.

Yeah, I don't buy that. It's still just a matrix multiplication (for the linear bits), which is incredibly fast. Besides, the old physics rule of thumb is that any real-world equation with a bunch of parameters has only 5-7 that actually matter, and only 3 that matter a lot. Everything else can be set to zero without noticeable change in the result.

If you’re making decisions that involve multiple variables you may be doing hundreds to thousands of inferences for a single page load. Keeping latency under 50ms becomes a real challenge.

But it comes down to this doesn't it:

x1 * a + x2 * b + x3 * c + ... + x1000 * zzz + ...

If a, b, c ... zzz, are all fixed constants already discovered by your learning algorithm. That's a very fast calculation, and doesn't take anything like 50ms.

Also, in the real world, you can establish a significance cutoff for a lot of these constants and get something like this as your final equation:

x13 * m + x523 * cdf + x777 * wdc + x893 * ydz
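The point about a significance cutoff can be sketched as a sparse lookup; the weights and feature indices below are made up for illustration:

```python
# After a significance cutoff, the dense dot product over thousands of
# features collapses to a short sparse sum over the surviving weights.
weights = {13: 0.8, 523: -0.4, 777: 1.1, 893: 0.2}  # hypothetical survivors

def predict(features: dict) -> float:
    # Only touch the features that survived the cutoff; everything else
    # is implicitly multiplied by a zero weight.
    return sum(w * features.get(i, 0.0) for i, w in weights.items())

print(predict({13: 1.0, 523: 2.0, 42: 9.9}))  # prints 0.0
```

Feature 42 contributes nothing because its weight was culled, and unobserved surviving features default to 0.0; evaluating the pruned model is a handful of multiply-adds.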

The inputs to those functions might be coming from external data sources, or aggregated. These have a cost too. But mostly, it just adds up. At a thousand features, you have a 0.05ms budget for each. Without taking into account network latency since you won’t be running those models inside the application server.

So why not load the calculated constants to the application server to reduce network latency?

And the learning side of things should have culled that list of thousand features down to a list of 5 - 10 that mattered.

It really sounds like the off-the-shelf stuff isn't built for efficiency.

Nobody's saying this is an impossible problem. The paper shows how much additional work is required beyond a traditional data science workflow.

The team behind the paper built a model that had good performance on training data. They're a smart lot so they knew they needed to cross-validate. The results held up in cross-validation! Hooray, the model works! ...right?

That's as far as a lot of data scientists go. This paper points out that you need to have a model that does (at least) three things:

1. Generates good scores with training and testing data

2. Outperforms existing models in the real world

3. Runs really, really quickly

There are a lot of data scientists who have no idea how to do #2 and #3. This paper says "These parts are really important!!!"

The feature engineering is probably where most of the performance comes from so there is likely a lot of code that turns the raw data into features.

Features likely include more than just a single user's history, so they need to be updated often enough for the model to make fast predictions. E.g. you want your model to capture that many people are booking from the same area at once because of a sports game, but you don't want to run an expensive query for every user of the page.

Definitely not the first thing to worry about in a startup, but better performance at Booking.com's scale is serious $$$.

Could you expand on that a little? Are you mostly fighting with high latency for deep learning models for imaging/audio or for traditional ML models on tabular data too? What sorts of latency SLAs do you aim for? (I do some work in this area and am always interested to hear war stories of inference issues.) Thanks!

Any other good resources on production level deep learning practices?
