McDonalds Event Driven Architecture

andy800 · on Dec 7, 2022

I am, for better or worse, a "heavy user" of the McDonald's app. And unfortunately, all this architecture somehow still misses so many usability issues. The most annoying one, is that the app always defaults to the closest location "as the bird flies." Despite the fact that I have "favorited" a different store, and also my last 5 or 10 orders were placed at this favorite store, the app ALWAYS defaults to the first store, which may be a tenth of a mile closer but has three additional traffic lights and twice the traffic.

The app is also wildly inaccurate with store open/close times (in both directions -- sometimes it tells you a 24-hour store is "currently closed"), and the in-store employees are often confused how to ring up or serve a mobile order.

But the deals are good. BOGO QPCheese just about every day.

wincy · on Dec 7, 2022

I also use it a lot (I have a disabled kid who went from feeding tube to chicken nuggets and LOST weight, whatever she wants she gets).

After I order and drive away, I’ll find the app hours later still tracking me and using GPS. I almost always have to force close the app.

Then the next time I use it the previous order won’t have cleared out despite me picking it up, and I’ll have to go in and “cancel” the order.

The coupons are good though. For a week or two it was giving 30% off, and almost always has a 20% off coupon.

andy800 · on Dec 7, 2022

Very similar experience, I often open the app and it is still hung up on my previous (completed) order and I have to clear out the cart. And like you said, the "offers" are good (actually McD has raised prices quite a bit, now with the offers I'm paying about the same as face value was only a few years ago).

I will also say, despite the issues, it's better than most competitors. BK and Wendy's apps are both full of problems as well. I have had multiple accepted and paid orders on the Burger King app, when I got to the store they didn't have the item, weren't accepting mobile orders, or were totally closed.

_jxdz · on Dec 7, 2022

> Then the next time I use it the previous order won’t have cleared out despite me picking it up, and I’ll have to go in and “cancel” the order.

Yes, I have noticed that here in the UK. I have to keep cancelling the previous order even though I picked it up and paid for it.

toast0 · on Dec 7, 2022

> I’ll have to go in and “cancel” the order.

You can do that?? I got an order stuck and did delete and reinstall (clear local data probably would have worked too).

Since then, I try to avoid it getting stuck by making sure the app is frontmost with the screen on while I drive through and away.

Seriously, are you iOS or Android, and where do you cancel an order either way?

tzs · on Dec 7, 2022

> The most annoying one, is that the app always defaults to the closest location "as the bird flies."

I've got a similar annoyance with basically every app or website that offers to find things within N miles of me. I live in Western Washington on the other side of Puget Sound from Seattle.

Say the first two options they give me for N are 10 miles and 25 miles.

10 miles is too small. I need more than that to include Bremerton, Silverdale, and Poulsbo, which are the three most likely places that I'll find whatever I'm looking for if it exists over here.

But 25 miles means the circle reaches past Puget Sound and includes Seattle, Bellevue, and even some of Redmond.

For example if I'm searching for Target stores and enter my zip code, their search lists my nearest store first, then 13 stores on the Seattle side (5 in Seattle, 2 in Bellevue, and 1 each in Lynnwood, Everett, Redmond, Renton, Tukwila, and Woodinville) before the store that is actually second closest to me as far as actually driving goes (Gig Harbor).

withinboredom · on Dec 7, 2022

I used to sell store locator software and this is always a problem. We worked on allowing “masks” over water that basically added a penalty for any vector that passed over the mask. It’s a hard problem at any kind of scale though and even this solution wasn’t perfect (imagine a river with no bridges or ferries only)
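A minimal sketch of the mask-penalty idea, not the actual product code. It assumes rectangular masks given as `(min_lat, min_lon, max_lat, max_lon)`, a flat penalty in miles per mask crossed, and a crude sampled-line test. The names `crosses_mask`, `rank_stores`, and `penalty_miles` are all made up for illustration.

```python
import math

def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))

def crosses_mask(a, b, mask, samples=50):
    """Approximate test: does the straight line a -> b pass through the mask?

    The line is sampled at evenly spaced points, so a very thin mask
    could be missed; a real implementation would intersect geometries.
    """
    min_lat, min_lon, max_lat, max_lon = mask
    for i in range(samples + 1):
        t = i / samples
        lat = a[0] + t * (b[0] - a[0])
        lon = a[1] + t * (b[1] - a[1])
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return True
    return False

def rank_stores(user, stores, masks, penalty_miles=30.0):
    """Sort stores by straight-line distance plus a penalty per mask crossed."""
    def score(store):
        distance = haversine_miles(user, store["pos"])
        crossings = sum(crosses_mask(user, store["pos"], m) for m in masks)
        return distance + penalty_miles * crossings
    return sorted(stores, key=score)
```

With no masks this is a plain radius sort. Adding a mask between the user and a store pushes that store down the list without needing any routing API calls, which is the trade-off described above.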

initplus · on Dec 7, 2022

The correct solution is to use an isochrone rather than a basic radius search. Or to simply sort all the stores within the max search radius using a vehicle routing library/api.

withinboredom · on Dec 7, 2022

That doesn’t work at scale unless you have an api with unlimited request throughput or knowledge of where water is (again needing access to the raw map or unlimited api access). In this case, the stores had to mark their own masks on the map.

inkyoto · on Dec 7, 2022

Even the best event / data streaming architecture (or, more generally, a backend architecture) in the universe can be negated by a lousy frontend app / frontend layer architecture. I have never ordered anything from a Macca's… but I surmise those are the consequences that McDonald's app users are facing.

The article lacks any details material to the event streaming architecture instantiation, and defers the details to a follow-up blog post, so it is hard to draw meaningful conclusions from it.

tstrimple · on Dec 7, 2022

The one and only time I've used the app was when I thought it would help them during the pandemic when drive thru's were actually slammed with demand while they didn't have enough workers. After waiting in the parking lot for 30 minutes while literally dozens of cars went through the drive through I went in to check on the order and the employees literally couldn't look up the status of my order without an order number which the app didn't give me at the time. An incredibly frustrating experience which turned me off to the concept altogether.

kaashif · on Dec 7, 2022

> BOGO

I always find this phrase funny. Isn't buying one and getting one the normal state of affairs?

toyg · on Dec 7, 2022

In UK we actually use BOGOF, which makes sense. No idea why one would drop the all-important F.

rat9988 · on Dec 7, 2022

Not an english speaker, but I can't imagine "get" used for anything non free, especially after a "buy".

hennell · on Dec 7, 2022

"Can you get me a pint?", "sigh - I've got to get a new phone", "buy a canoe get a paddle half price".

Buy implies buying, get implies buying/receiving, possibly at a cost, possibly an offer, all we know is you have it. Buy one get one free is standard here, BOGO would cause customer service staff to go mad from "do you mean 'Buy one, get two?'" inquiries.

boomboomsubban · on Dec 7, 2022

"Buy one get another half off" or "I'm going to buy it online then go get it." The second may be using a different "get," grammar lessons were a long time ago.

tinus_hn · on Dec 7, 2022

No, if you get something you aren’t buying it. Either you say you pay for an item and get it, or you say you buy it. Or you get it for free.

e-clinton · on Dec 7, 2022

BOGT doesn’t sound as good

throwaway292939 · on Dec 7, 2022

should probably be BOGOF - buy one get one _free_

smegsicle · on Dec 7, 2022

but BOGO sounds fun, while BOGOF sounds like an insult, or maybe an alien civilization with terrible poetry

toyg · on Dec 7, 2022

> while BOGOF sounds like an insult

That's a feature, not a bug.

jackmoore · on Dec 7, 2022

How about BOGA for buy one, get another

wkjagt · on Dec 7, 2022

> The most annoying one, is that the app always defaults to the closest location "as the bird flies."

So many apps do this and it's really annoying. There is quite a lot of commercial activity on the other side of the river from where I live, and every store's site I visit, or app I use, always recommends a couple of locations that are closest "as the bird flies", but require me to take a ferry, or drive an hour and a half, whereas the store that's actually the quickest to get to for me is just 15 minutes away from me in the next town over on my side of the river.

It's understandable though, because it's so much easier to calculate straight line distance, and I guess in most cases it gives the right, or close to the right answer. I've just accepted that I live in an edge case area for how the majority of businesses calculate these things.

user3939382 · on Dec 7, 2022

MCD drive thru:

> Hi do you have a mobile order today?

< Yes.

> Inexplicable silence for 10-15 seconds

> (New voice) How can I help you?

< Hi I’m picking up a mobile order (??)

> What’s the code?

Their process of using the app at the drive through is missing a step and it’s annoying.

spiffytech · on Dec 7, 2022

When I worked in fast food forever ago, our drive thru had an automatic greeting that played when a car drove up, regardless of whether a cashier was ready to talk to you.

Sometimes the cashier was too busy to catch anything you said to the auto greeting.

xattt · on Dec 7, 2022

No concerns about privacy issues? The Canadian iOS version of the app insists on running in background and complains if you disable precise location. This is slimy.

quickthrowman · on Dec 8, 2022

You give up location data in exchange for discounts, that’s the bargain you agree to by using the app, whether you realize it or not.

I refuse to play this game. When I go to McDonalds I pay in cash and don’t use the app, it’s insulting to be offered 20% off to allow McDonalds to track my precise location, no thanks.

rbosinger · on Dec 7, 2022

Both my partner and I always have trouble with the McDonald's app (for ordering). I use Android, she uses iOS. As a developer, I've said to myself "This feels like a React Native app that's calling into a mess of microservices" (having worked on that type of project more than once myself).

Anyway, I only skimmed the article, but I had a chuckle seeing the title of this article pop up on HN at all.

russelg · on Dec 7, 2022

I can't speak for other region's McDonald's apps, but the mymacca's app (Australia) is entirely native Java on Android, and still runs like absolute crap.

tharkun__ · on Dec 7, 2022

Completely unrelated probably but who knows but this reminded me of the time when my McDonald's receipt printout from that self-ordering terminal started with a bunch of XML, then the regular receipt part and then ended in a bunch of XML. That was fun.

halfer53 · on Dec 7, 2022

backend of McDonald's app is handled by a third party New Zealand company, Plexure

https://www.plexure.com/

not sure what their architecture is

rsstack · on Dec 7, 2022

Plexure are only used in select McDonald's markets. Almost all Hacker News readers are in markets where the McDonald's app is developed in-house by McDonald's.

geysersam · on Dec 7, 2022

I wished more people used a regular website for these kinds of applications.

It's a form. What's the problem?

quickthrowman · on Dec 8, 2022

A web form cannot track your precise location 24 hours a day, but an app can.

tom_walters · on Dec 7, 2022

Did AWS just pay them to write this? I was hoping for an interesting article exploring the complexities of a massively distributed and high-throughput system, but I just read “we connected these managed AWS services together and it’s cool”

Hopefully further instalments might actually talk about the problems they faced building this out, and their unique challenges.

toyg · on Dec 7, 2022

> Did AWS just pay them to write this?

This is part of their technical blog ( https://medium.com/mcdonalds-technical-blog ), which looks like an outreach attempt to help with recruitment (hackathons etc). Probably someone went to a PM saying "we need to talk about our stack, how do we make it sound sexy and cool?" and this is what they got back.

foolfoolz · on Dec 7, 2022

> Standby Event Store: To avoid loss of messages in the event the MSK is unavailable, the platform is wired with a standby data store, where it writes events onto a database . The architecture provides tools and utilities to read messages and publish them back onto MSK, once it’s available.

on one hand aws msk is good enough for an enormous application like mcdonald’s. on the other they need a backup database just to get around it not being available? what’s the real story here. interested to see where this goes
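For what it's worth, here is a rough sketch of what a standby store like the one in that quote could look like. This is a guess at the shape, not what the article describes in detail: the broker is a hypothetical callable that raises `ConnectionError` when it is down, and a plain list stands in for the standby database. `FakeBroker` and `ResilientPublisher` are made-up names.

```python
class FakeBroker:
    """Stand-in for the message broker; flip `up` to simulate an outage."""
    def __init__(self):
        self.up = True
        self.log = []

    def send(self, event):
        if not self.up:
            raise ConnectionError("broker unavailable")
        self.log.append(event)

class ResilientPublisher:
    def __init__(self, send_to_broker):
        self.send_to_broker = send_to_broker  # hypothetical broker client call
        self.standby = []                     # stand-in for the standby database

    def publish(self, event):
        """Try the broker first; park the event in standby if it is down."""
        try:
            self.send_to_broker(event)
            return "broker"
        except ConnectionError:
            self.standby.append(event)
            return "standby"

    def replay(self):
        """Push parked events back in order, stopping at the first failure."""
        replayed = 0
        while self.standby:
            try:
                self.send_to_broker(self.standby[0])
            except ConnectionError:
                break
            self.standby.pop(0)
            replayed += 1
        return replayed
```

One caveat with this shape: if new events go straight to the broker before `replay()` has drained the standby store, consumers see them out of order, so they would need to tolerate that.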

chiph · on Dec 7, 2022

Probably nothing more than a requirement for resiliency. With thousands of restaurants, chances of a couple of them losing connectivity each day are going to be pretty good, through traditional interruptions like backhoes digging up cables and drunk drivers taking out telephone poles.

If those messages are discarded because the store can't talk to MSK (or MSK is unavailable), then things like automatic replenishment based on order volume couldn't happen. The store manager would have to do a daily physical inventory count to know how many bags of fries, boxes of drink straws, etc. to reorder.

KptMarchewa · on Dec 7, 2022

But it's more likely that they'll lose connectivity to AWS altogether rather than MSK going down itself. I'd not try to write from each store network directly, but through some gateway that would take care of writing to Kafka and that separate store if we still have availability issues.

nick0garvey · on Dec 7, 2022

Cloud services do go down and it's out of your control when they do. If something must work, you need redundancy.

Kafka is often used for financial applications that must not miss events, so having a backup buffer is a reasonable strategy for those use cases. Things like tracking data is likely not worth backing up due to high data volume and low external visibility when data is dropped.

hamandcheese · on Dec 7, 2022

We use kinesis a lot at work, and some services are architected to write events to a Postgres table which eventually gets dumped to kinesis, while other services write directly to kinesis.

Guess which services fared better during the last kinesis outage?

cgio · on Dec 7, 2022

What's the point of using kinesis in this case? Why can you not consume from postgres? My understanding would be that you use kinesis for impedance matching on writes. If you can already write everything on Postgres reliably I cannot see any immediate use case for Kinesis. I could see the other way around, i.e. having kinesis as a resilience backup for postgres being a more valid concern.

doctor_eval · on Dec 7, 2022

I'm not the person you're replying to, but I've used a similar pattern, and there are a couple of reasons.

First, if you have lots of databases and other applications, then you are talking about a mesh of event busses - which defeats the purpose. Pushing the messages out of the various databases and into a central message bus makes the messages more easily consumable without having to know where they come from.

Second, by writing messages to a PG table first, those messages become part of the update transaction. This means you can post messages at any time during your business logic processing, but if you hit an error and roll back, those messages (which would, presumably, no longer be valid) are also rolled back.

Combine this with message idempotence and you get a very reliable messaging environment.
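A minimal sketch of that second point, assuming an `outbox` table next to the business tables. SQLite from the Python standard library stands in for Postgres here so the example runs anywhere, and the table names, `place_order`, and `relay` are all hypothetical.

```python
import json
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, sent INTEGER DEFAULT 0)"
    )
    return conn

def place_order(conn, order_id, total, fail=False):
    """Write the order and its message in the same transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        payload = json.dumps({"type": "order_placed", "order_id": order_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
        if fail:
            raise RuntimeError("business logic failed after posting the message")

def relay(conn, publish):
    """Forward unsent outbox rows to the bus, then mark them as sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        # A crash between publish and UPDATE causes a resend on the next run,
        # which is why consumers need to be idempotent.
        publish(payload)
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If `place_order` fails, the order row and the message disappear together, so the relay never publishes an event for something that did not happen.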

manv1 · on Dec 7, 2022

Really, their app is a mess, but from a delivery point of view their stuff is pretty good.

Just yesterday I was going to redeem a deal and the store was closed to walk-ins, so I had to use the drive thru. I couldn't use the deal anymore because I had used the code already, and you can't apparently switch between walk in/drive thru mode. Luckily the drive-thru was so slow I was able to use the code (there's a timer on the code).

That said, the backend worked great; I ordered on the mobile in-line and the order was in-store once I got to the drive thru order speaker.

People forget how hard and expensive it was 10 years ago to do a realtime architecture. Today, McD gets information from your phone to wherever and down to the stores with maybe a few seconds of delay, so you can use your code at the in-store kiosk. And it has to integrate with their in-store order and payment systems.

I'm disappointed at the article, because it doesn't talk about any of this stuff; it's just a laundry list of AWS services. That isn't the important stuff; the important stuff is really how they got all this legacy (in-store) and new stuff to work together.

achrono · on Dec 6, 2022

While this satisfies some idle curiosity, I don't really get what justifies spending the resources in posting such stuff.

Is this a recruitment tactic somehow for such companies?

bee_rider · on Dec 7, 2022

Probably, yeah.

But I’m always shocked for some inexplicable reason to hear that places like McDonalds and Walmart Labs are interested in solving tech problems. But I mean obviously, of course they are.

Good to be reminded now and then that there are alternatives to contributing to the ad/social media panopticon, right?

rjh29 · on Dec 7, 2022

Yeah, you can work on making people fat or destroying labour unions! /s

dbetteridge · on Dec 7, 2022

It is often an OKR for senior/staff developers to write blogs or tech articles to promote the company's engineering culture, especially for organisations that are not 'traditional' tech firms.

selcuka · on Dec 7, 2022

TIL that OKR means "Objectives and Key Results".

bavell · on Dec 7, 2022

You're one of today's lucky 10k :)

selcuka · on Dec 7, 2022

I understood that reference :)

joezydeco · on Dec 6, 2022

I have a hunch that's part of it. I also think they're trying to show that they're as technically competent as Chick-fil-A, who has been doing this for a while now.

https://medium.com/chick-fil-atech

mbg721 · on Dec 7, 2022

Thinking about it, that makes sense, given how gargantuan Chick-Fil-A is, but the two chains leave very different impressions to me. Chick-Fil-A feels like they're going out of their way to make it seem like there are friendly humans running everything, to the point of having employees with tablets stand outside at the drive-thru to take your order; lately, they've even been laying off the robotic "my pleasure" thing. McDonald's, on the other hand, feels like a clumsy extraterrestrial sidling up and saying "What is up, my fellow human, amirite?" The one by me is obsessed with pushing the little how-are-we-doing survey on the receipt; I don't think I've ever been there and not been given one. When I see their TV commercials, I keep expecting a flurry of tentacles to burst out of the smiley-person suits. But under the hood, they're both furiously optimizing everything they can.

filmgirlcw · on Dec 7, 2022

For better or worse, Chick-fil-A is very obsessive about its standards and its quality. I grew up in Atlanta (where CFA was founded/based) and have had lots of friends who have worked for CFA over the years in various positions. Even in high school, CFA was considered the best fast-food job you could get (and better than some retail jobs) because it paid better than almost anyone else. The downside was there was a strict dress code (no visible piercings or tattoos, short hair for men, etc.) and an understanding that everyone had to be very well-behaved, but like, that comes through in the service.

Beyond that, with a handful of exceptions (I think mostly airports and college campuses), CFA owns all their restaurants, unlike the typical franchise arrangement that McDonald's popularized. You still have independent operators, but they don’t own the restaurant and have to go through a rigorous selection process to even get one. And then, the franchisee only pays like $10,000 upfront, and CFA pays the rest of the costs to build the restaurant. Individual restaurants control their own hiring and marketing and whatnot, but this is all a very top-down approach that needs to align with corporate.

McDonald's outright owns some of its restaurants, but most are owned and run by franchisees, and usually a franchisee owns more than one location. They license the brand, food, and process, but a lot of the day to day standards are decided by the franchise groups. Whereas CFA operators have less leeway for such things.

joezydeco · on Dec 7, 2022

It's my understanding that CFA franchisees are only allowed to own one store (unlike MCD), and must work in their restaurant full-time.

That's a significant difference from how MCD stores are operated and it shows.

bitshiftfaced · on Dec 7, 2022

I would never have guessed that Chick-fil-A was considered a contender to McDonald's in technical competency. I drive up and a person standing outside takes my order and asks my name. Then I drive up some more and another person asks me my name and punches something into a tablet, and then I drive up some more and get to the window.

This is 2022. Businesses already have queueing software that automatically takes images of the vehicle to match with your order. I get that some customers like having the "human factor," but there's definitely room for improvement.

filmgirlcw · on Dec 7, 2022

They have humans do it because it is faster. Significantly faster, in fact, to go car to car, rather than have each car drive up to a window to order. It also allows them to have multiple lines running at once. And it is probably just as fast to have the person handing you food just ask your name so they can make sure to pass it off efficiently.

Chick-fil-A is the only chain where you will reliably see a massive line (often backing up to the freeway) on any given lunch hour (or a massive in-person line in New York City) that will still get you out and get you served more efficiently than another drive-thru (or walk up shop) that is 1/4 as busy.

bitshiftfaced · on Dec 8, 2022

But you can have multiple lines running at once. McDonald's already does this at busy locations. I find it dubious to say that they get your order out more efficiently. For one, having a massive line is a potential sign of inefficiency. Also, there's no technical reason I can see why asking your name from a second person to order your queue would be faster than handling this through an automated system. People like to say they get you through fast, but I personally have experienced long "absolute" wait times, even if it seems like the line is going quickly, to the point of I rarely like to go there unless I have time to kill. This almost never happens to the same extent in a McDonald's line, which is shorter, maybe slower moving at times, but lower total wait times.

joezydeco · on Dec 7, 2022

CFA has found lots of small incremental improvements that speed up the drive thru process significantly.

MCD did the double drive thru ordering first, but my local CFA now has two pickup lines as well, with a heated covered roof. The window has been replaced with a wide open door and employees walk the orders to each car. Simple and it works great.

hamandcheese · on Dec 7, 2022

Chick-fil-A optimizes the process, not the tech, and it very much shows. Which is what any good business should be doing, IMO.

bavell · on Dec 7, 2022

Mine has never asked for the name twice but they recently started slipping numbered cards under your wipers after placing your order.

I remember CFA posting some articles a few years back on their k8s setup they use for store and inventory management.

annoyingnoob · on Dec 7, 2022

Can I get promoted to Manager if I get a degree in Hamburger-data-ology?

From what I can tell the career path is into Corporate hell.

https://www.uopeople.edu/blog/hamburger-education-inside-mcd...

greedo · on Dec 7, 2022

If you want to be a franchisee, the easiest way (other than having $3-$10m lying around) is working for McDonalds, taking AOC at Hamburger U, and then joining their franchise program where they "lease" you everything for a larger take of the annual sales than if you pony up the franchise fee and buy a franchise location.

ripley12 · on Dec 7, 2022

What's AOC in this context?

bckygldstn · on Dec 7, 2022

Advanced Operations Course

(Found by googling "AOC Hamburger University -cortez")

1123581321 · on Dec 7, 2022

Recruitment but also retention. Some engineers want to publicly communicate their work at a recognizable company, or aspire to become the kind of company good enough to have a well-liked engineering blog. Providing that opportunity doesn’t cost much and means a lot.

cco · on Dec 7, 2022

From the company's perspective, the hoped for result is mostly recruiting oriented, a little bit is related to offering career advancement (public visibility of your work) to individuals that write or are featured in the blog.

itisit · on Dec 6, 2022

That's all it is. These EDA transformation journey blog posts are de rigueur for any company looking to have its engineering culture perceived in a modern light, at least in the eyes of a junior-to-intermediate developer.

spacehunt · on Dec 7, 2022

Is it really scalable? Every time the local McDonald's has some cross-promotion going on with a certain boy band (for example this week's collectable cards), their app's backend always crashes right at 11am when the promotion starts.

mark_sz · on Dec 7, 2022

Unfortunately McD app doesn't make a good impression. It's quite buggy (iOS and Android) and has some obvious usability issues.

So it doesn't matter what architecture is behind McD systems if customer facing software doesn't work correctly.

j-bos · on Dec 7, 2022

It feels so dissonant to see a giant company posting its technical blog on Medium.

keepquestioning · on Dec 7, 2022

You mean discordant

weeniehutjrdev · on Dec 7, 2022

I think both work, but it depends on what they mean?

bcjordan · on Dec 7, 2022

Wanted to learn more about the difference so asked ChatGPT for a rundown with the context, pretty neat in-depth analysis (assuming it's accurate):

> It's possible that either word could be used in the context you've provided. Both "discordant" and "dissonant" can refer to things that are unpleasant or conflicting, so either word could be used to describe the feeling of seeing a large company using Medium for its technical blog.

> However, there is a subtle difference between the two words. "Dissonant" typically refers to things that are in conflict because of their individual qualities, while "discordant" typically refers to things that are in conflict because of their relationship to each other. In the context of your sentence, "dissonant" might be a slightly better fit because it emphasizes the individual qualities of the company (i.e. its size) and the platform (i.e. Medium) that are in conflict.

keepquestioning · on Dec 7, 2022

Ugh holy fuck

j-bos · on Dec 7, 2022

Thanks :)

nycdotnet · on Dec 7, 2022

I find this blog post to be a funny snapshot of how event driven architecture initiatives can be sold internally. “Oh it’s going to give us all these great advantages and be super robust and performant. Here’s my diagram of producers and consumers with our new event gateway project in the middle. How does it actually work? Um… I’ll tell you later”

georgeburdell · on Dec 7, 2022

Call me a luddite but I still find it easier to tell my order to a cashier (I usually remove a few items like the godawful special sauces)

switch007 · on Dec 7, 2022

They penalise you with enforced waiting time here in the UK (if it’s not per training to ignore you for a set amount of time, I’d be very surprised)

Still worth it though. Their machines are awful - way too big and bad at input

lleb97a · on Dec 7, 2022

I'm with you. I'll use the self service in shop but I never use the app.

Rebelgecko · on Dec 7, 2022

McDonalds by you still have human cashiers?

yurishimo · on Dec 7, 2022

Most of the ones in the US have humans. They also have kiosks, but if you really want to talk to a person, you can.

barbarbar · on Dec 8, 2022

I have not experienced any improvements with event driven architecture, even though it was advertised with the same wording as this.

guyzero · on Dec 7, 2022

Serious question, isn't this just rebuilding the same thing that commercial cloud pubsub offerings already provide?

BoorishBears · on Dec 7, 2022

I mean I'd expect if they're already all-in on AWS they'd just use Firehose with some deduplication instead of whatever home-brew fallback solution they described, but other than that it doesn't seem like they built much?

What's impressive to me is that they need all that architecture. McDonald's sells under 100 burgers a second from what I can find, their order load is probably bursty, so assume maybe all the orders come in the same third of the day, so 300 per second, and every burger is its own order... that's still not that much.

One order is more than one operation when you're dealing with everything a place at McDonald's scale does, but even if you multiply by a factor of 10 to account for analytics, compliance, etc., that's 3,000 operations per second? Does that really require an entire Kafka-driven event architecture?
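The back-of-envelope math, spelled out so the knobs are easy to turn. The inputs are the rough guesses from this comment, not measured figures, and `peak_ops_per_second` is just a name for the sketch.

```python
def peak_ops_per_second(burgers_per_second, burst_factor, ops_per_order):
    """Worst-case estimate that treats every burger as its own order.

    burgers_per_second: average sales rate across the whole day
    burst_factor: 3 means all orders land in one third of the day
    ops_per_order: backend operations triggered by a single order
    """
    peak_orders = burgers_per_second * burst_factor
    return peak_orders * ops_per_order

# 100 burgers/s average, squeezed into a third of the day, 10 ops per order
estimate = peak_ops_per_second(100, 3, 10)
print(estimate)  # 3000
```

Swapping in a bigger multiplier, say 100 operations per order, gives 30,000 per second with the same function.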

andy800 · on Dec 7, 2022

I'd guess that 100x database interactions per order may be more like it. Upon startup, there's a whole login, check app version, payment cards still valid, geo query sequence. Check user's point total, rewards, custom offers. Load menu based on chosen store availability and prices. Every menu item has multiple options (mcnugget sauce, burgers without tomato, type of soda). Add to cart. Remove from cart. Add something different to cart. Repeat. Ok, order ready? Check taxes in local jurisdiction. Adjust total based on offers/rewards. Delivery or pickup? Pickup in-store, curbside, drive-thru? Communicate order to store. Update order status based on customer arrival. Send code to customer. Update status upon pickup/delivery. Lots more in-between I skipped, and that's just user-facing, nothing about analytics, accounting, loyalty club, etc.

I'm not defending McDonald's or its architecture -- as I stated elsewhere, the app is far from perfect, and an entirely different architecture could very easily work much better. But I do think you are severely downplaying the number of interactions or transactions required to run an app like theirs.

BoorishBears · on Dec 7, 2022

I mean you wrote all this justification, but 30k per second is still practically nothing compared to the complexity described in the article?

Taking something like Postgres and sprinkling in some strategic use of Redis would handle their usecase with horizontal scaling pretty reliably...

What it wouldn't do is let you add Kafka to your resume.

andy800 · on Dec 7, 2022

To be clear, I was not justifying the current architecture, I specifically wrote "an entirely different architecture could very easily work much better."

I was pointing out, however, that, as is often the case, the initial estimates in a typical "why do they need all this stuff" post, likely underestimated the transaction volume by possibly 10x. Perhaps 3000 or 30000 transactions per second could run on the same system -- I'm not an expert at that scale. But I doubt you'd find any Fortune 100 company relying solely on Postgres and Redis.

BoorishBears · on Dec 7, 2022

I mean if you got nitpicky 3000 or 30000 transactions doesn't tell you anything... but in this kind of evaluation you need to think dimensionally. That's why I intentionally assumed all of their traffic shows up in one contiguous block of 8 hrs across all locations every day: that added a massive fudge factor even bigger than the number you're focusing on...

> But I doubt you'd find any Fortune 100 company relying solely on Postgres and Redis.

I mean, yeah?

Across every system they use of course that wouldn't be it: what would be generating the data that goes into them? Where would the data that goes in be going out?

I'm simply referring to their "glue" for day to day operations, which here is a pubsub system built on Kafka. Most organizations of a certain size start to pick up some set of technology that new efforts default to being built on top of if only to have access to what everyone else is doing... that's essentially what AWS started off as before it was spun out from internal usage

-

But more importantly, Fortune 100 is a very random pairing of problem spaces. I mean you won't find any built solely on Postgres and Redis for the very obvious reason I mentioned above... but you will find billions of dollars in revenue on even more boring stuff than that. The number of Oracle shops using repackaged technology that makes Postgres look like Cloud Spanner is staggering.

I find the opposite of what you do, that people tend to overestimate what it takes to handle large amounts of data reliably. And I think it's because you need some experience with this stuff to understand why you can't just think in terms of "underestimated the transaction volume by possibly 10x" (hint: 10x can mean 3 million => 30 million).

What happens is people hear that system A is going to need to go from 3,000 to 30,000, then start to architect it the way someone going from 3 million to 30 million should have, and suddenly you're building out a system that's less reliable, more expensive, and just generally worse except for what shows up on resumes.

ehnto · on Dec 7, 2022

I think you are underestimating how liberal some applications are, especially as analytics is one of their requirements. It's probably multiple events per thing you do. I wouldn't be surprised by 100+ events before even placing an order.

BoorishBears · on Dec 7, 2022

Forest for the trees, I already practically doubled my numbers and assumed McDonalds gets all their orders in the same 8 hour period! And even then multiplying them by 100 doesn't get you into the realm of "we couldn't build this on a monolithic horizontally scaling application".
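As a rough sketch of that arithmetic: the helpers below spread a day's events over a single 8-hour window. The 30k/sec figure and the 8-hour window are from this thread; the function names are made up for illustration, and nothing here is McDonald's actual traffic.

```python
SECONDS_PER_HOUR = 3600


def peak_rate(events_per_day: float, window_hours: float = 8.0) -> float:
    """Events per second if the whole day's traffic lands in one window."""
    return events_per_day / (window_hours * SECONDS_PER_HOUR)


def daily_events(rate_per_sec: float, window_hours: float = 8.0) -> float:
    """Events per day implied by a sustained rate over that window."""
    return rate_per_sec * window_hours * SECONDS_PER_HOUR


# 30k/sec sustained for 8 hours works out to 864 million events per day.
print(daily_events(30_000))
```

Plug your own guess for events per day into `peak_rate` to see how far the per-second figure moves.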

If anything, if you're at McDonalds scale and still can't find the engineering skill to build a monolith that can handle 30k operations per second, you're playing with fire building a distributed system.

(if you're a nascent startup, then by all means stand on the shoulders of giants and don't sweat that you don't have a full blown cloud engineering org, but that's definitely not where McDonalds should be...)

ehnto · on Dec 7, 2022

Architecture reflects organization structure more often than not. What I see is not an attempt to handle request rates, but an attempt to service a widely distributed set of applications and handle the inevitable churn of client applications' requirements.

I am on team monolith, but I also don't see any issue with this approach if you are happy accepting its caveats and vendor lock-in, which it seems they were.

alanhaha · on Dec 7, 2022

I still remember my teacher using McDonald's to explain instruction pipelining...

lokar · on Dec 7, 2022

I was expecting a parody

marcosdumay · on Dec 7, 2022

I'm still not sure you were wrong.

b20000 · on Dec 7, 2022

why does everything have to be microservice oriented and what's wrong with a simple monolithic application that runs the whole thing?

also, when are those ice cream machines going to get fixed?

toomuchtodo · on Dec 7, 2022

To your point, Shopify's monolith was handling 1.27M requests/sec on Black Friday: https://twitter.com/ShopifyEng/status/1597983918900510720

Thread: https://threadreaderapp.com/thread/1597983918900510720.html

aeyes · on Dec 7, 2022

Is running a separate instance of the monolith with a separate database for every store really "a monolith" withstanding 1.27M rps?

It's a bit like saying Magento served 1M rps on Black Friday leaving out the small but important detail that each individual store has separate infrastructure and manageable load.

Divide and conquer works, congrats to the Shopify team that their design decisions worked out for their use case. And obviously some parts of the system are still shared but my guess is that they are not part of the monolith.

tiffanyh · on Dec 7, 2022

> averaging 3 Terabytes per minute of egress traffic across our infrastructure. That’s 4.3 Petabytes per day!

I’d hate to get that AWS bandwidth bill.
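For what it's worth, the two quoted figures agree with each other if you assume decimal units (1 PB = 1000 TB). The function name is just for illustration:

```python
MINUTES_PER_DAY = 24 * 60  # 1440
TB_PER_PB = 1000           # decimal units, an assumption about the quote


def tb_per_minute_to_pb_per_day(tb_per_minute: float) -> float:
    """Convert a sustained TB/minute rate into PB/day."""
    return tb_per_minute * MINUTES_PER_DAY / TB_PER_PB


# 3 TB/min * 1440 min = 4320 TB, i.e. about 4.3 PB per day.
print(tb_per_minute_to_pb_per_day(3))
```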

ip26 · on Dec 7, 2022

On the other hand, 4.3 Petabytes of VISA/MasterCard traffic is a bill you'd probably be happy to pay. (It's Shopify after all, not Flickr)

spoils19 · on Dec 7, 2022

Exactly. I've written and maintained monoliths that handle close to ~20M requests per second, and it only went down once (when somebody tripped over the power cable!)

gscho · on Dec 7, 2022

But ruby is too slow!

pxue · on Dec 7, 2022

Lots of their code is in Go

ecshafer · on Dec 7, 2022

Way way way more of our code is in Ruby.

Go (and now Rust) is really only used for very low level services with a high SLA (Like Infrastructure). Almost all business logic is Ruby + Rails.

vips7L · on Dec 7, 2022

True but you’re still putting significant investment into ruby to speed it up. Ruby is slow.

I don’t think any other company could do that.

ramenmeal · on Dec 7, 2022

In your monolith if you offload a request to a queue and have a background process processing the queue messages, what do you call that module processing messages? Is it a component of your monolith? I think a lot of teams would call it a microservice, I could see some people considering it a component of a service.

Unless you're confounding microservices w/ async architectures and saying to drop asynchronous patterns like this altogether.

b20000 · on Dec 7, 2022

I don't need another process. I can use a thread.

If the teams you are talking about never heard of threads and only know about microservices, then there is something seriously wrong with their CS education. Maybe they all were hired via leetcode. That could explain it.

I'm not confounding anything. Distributed programming has its applications and uses, but if you don't have a good reason to use it, then don't, and use a thread in a single process for background processing.
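A minimal sketch of what I mean, in Python for brevity. The names `start_worker` and `process_all` are made up; the point is only that the request path enqueues and a thread in the same process drains the queue. Note this queue lives in memory, so pending jobs are lost if the process dies.

```python
import queue
import threading

_STOP = None  # sentinel telling the worker to exit


def start_worker(jobs: queue.Queue, handler) -> threading.Thread:
    """Start one background thread that drains `jobs` until it sees _STOP."""
    def run():
        while True:
            job = jobs.get()
            try:
                if job is _STOP:
                    return
                handler(job)
            finally:
                jobs.task_done()  # runs for real jobs and for the sentinel

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def process_all(items, handler) -> None:
    """Enqueue every item, then wait for the worker to finish them all."""
    jobs = queue.Queue()
    worker = start_worker(jobs, handler)
    for item in items:
        jobs.put(item)
    jobs.put(_STOP)
    jobs.join()    # every queued item has been marked done
    worker.join()  # the worker thread has exited
```

In a real server you would keep the worker alive for the life of the process instead of stopping it after one batch.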

oflor · on Dec 7, 2022

And how does it recover incomplete tasks in case of sudden power outage? Microservices use persistent message brokers for that, which are not there in threads. Or are these monoliths all treated as pets with redundant power supply and network lines?

sebazzz · on Dec 7, 2022

Thread does not imply no queue or no persistent storage. In fact, if you use something like Hangfire you already have that.
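Hangfire is a .NET library, so this is not its API. It is a rough Python sketch of the same idea using the standard `sqlite3` module: jobs live in a table, and on startup anything left in the `running` state by a crashed process is put back to `pending`. The table layout and function names are assumptions for illustration. A background thread would call `run_next` in a loop on its own connection.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
)
"""


def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    with conn:  # commits on success
        cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def recover(conn: sqlite3.Connection) -> int:
    """Call at startup: jobs interrupted mid-run become pending again."""
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status = 'pending' WHERE status = 'running'"
        )
    return cur.rowcount


def run_next(conn: sqlite3.Connection, handler) -> bool:
    """Run the oldest pending job; return False if there is none."""
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    job_id, payload = row
    with conn:
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (job_id,))
    handler(payload)  # if this dies, the row stays 'running' until recover()
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    return True
```

This gives at-least-once execution, so handlers need to tolerate being run twice for the same job.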

strokirk · on Dec 7, 2022

My CS education (Stockholm, KTH) didn’t include anything about web service architecture, and information about threads was about how they are implemented in the OS at a low level, not how to use them effectively. I think this stuff is normally picked up after working in the industry.

esprehn · on Dec 7, 2022

The background process is just threads in the monolith, that's not a microservice. That a different pod running the same code might pick up the async task doesn't make it less of a monolith either.

ecshafer · on Dec 7, 2022

I have never seen what you are describing called a microservice. Microservice always means some kind of independent application in its own runtime.

marcosdumay · on Dec 7, 2022

What are you talking about?

What makes it a service soup is the soup of services on the other side of the queue processor. If those components weren't services, the application would be a monolith.

Anyway, there are many reasons to organize your code into services, and McDonalds is large enough for them to be perfectly valid. But if you take a closer look, those components in the article aren't the ones that do anything, they are just new queue processors that may or may not finally deliver your messages to the destination. That's an irksome architecture.

nijave · on Dec 7, 2022

>with globally distributed teams of developers with diverse skill levels

The last place I worked with a monolith (~100 developers) put quite a bit of work into making sure everyone didn't step on everyone else's toes. This mostly propagated as optimizing CI and improving test quality (since a single flakey test could derail everyone's build)

As to why "microservices" versus a few "normal" sized services

I'm not sure why it's always "monolith" or "microservices"

thiht · on Dec 7, 2022

At my previous job we used "microservices" for lack of a better term, but really they were "business services". We tried calling them "business services", "macroservices" or just "services", but it was confusing so in the end we just stuck with "microservices".

nijave · on Dec 10, 2022

I think we (as an industry) are basically just revisiting SOA https://en.m.wikipedia.org/wiki/Service-oriented_architectur... (but maybe with a little less Java this time)

kemiller2002 · on Dec 7, 2022

Well, I mean really most of the micro service architectures are just monoliths with network calls between the components.

411111111111111 · on Dec 7, 2022

That's called distributed monolith. It's pretty much guaranteed to happen if the developers don't introduce a lot of redundancy while splitting up the services.

adam_arthur · on Dec 7, 2022

If your services need to be chatty with each other then the boundaries of the services are wrong.

Hopefully by redundancies, you don't mean sharing data access logic?

b20000 · on Dec 7, 2022

yeah, so what about locking? it must be a nightmare to make sure that the components all work together as intended...

xwolfi · on Dec 7, 2022

So, normally, you do not need to lock. If all your services are single threaded, and you have a good transactional model, you only need to duplicate services to create parallel routes.

You need to lock when you write a shared area from multiple sources with no opinion on write ordering.

But say your pipeline is client-> decorator -> processor -> observer with client publication -> external partner, each input will go into a set of instances different from the previous and next one, and rejoin at the output who will queue and order them. You have parallel heavy work and sequential light result publication. Your simple output must be as fast as the sum of your parallel routes to minimize queueing.
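A toy sketch of that shape, with one big substitution: it uses a thread pool inside a single process where our system uses separate service instances, so treat it as an illustration of the ordering idea only. `process_in_parallel`, `heavy_work` and `publish` are made-up names. Each input gets a sequence number, the heavy step runs in parallel, and the output stage holds results back until it can publish them in order.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_parallel(inputs, heavy_work, publish, workers: int = 4) -> None:
    """Run heavy_work in parallel, then publish results in input order."""
    held = []     # min-heap of (sequence number, result) waiting their turn
    next_seq = 0  # sequence number the output stage must publish next
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(heavy_work, item): seq
            for seq, item in enumerate(inputs)
        }
        for done in as_completed(futures):  # completion order is arbitrary
            heapq.heappush(held, (futures[done], done.result()))
            # Light, sequential step: flush everything that is now in order.
            while held and held[0][0] == next_seq:
                _, result = heapq.heappop(held)
                publish(result)
                next_seq += 1
```

No locks appear because only the output loop touches `held` and `next_seq`; the parallel workers never write to shared state.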

Ofc it's more complex, and I prefer 0 network hops myself, but I work on a large investment bank microservice system and we do not lock, and the components are both simple and complex enough that when one disappears, everything else waits or rebalances, and when it reappears it can catch up automatically and go on. It consumes a large amount of memory to keep a duplicated state in each component, and persistence is not guaranteed to be on time (in fact, our persistence layer was 30 minutes behind by midday, for years, until we dug into the 30yo sql)

b20000 · on Dec 7, 2022

so why do you need this to be distributed and why can you not use a monolithic server application, perhaps with background threads?

codebje · on Dec 7, 2022

One reason comes to mind: Conway's law. If you have a monolithic team a monolithic server is usually a pretty good solution to a problem. If you have multiple teams you're likely to wind up with an architecture that lets them work more independently, leading to separate release schedules.

... which perhaps just shifts the question to "why do you need multiple independent teams and not just use a monolithic development team?"

Kamq · on Dec 7, 2022

> yeah, so what about locking?

POST /lock

This is only a slight exaggeration over some of the stuff I've seen people try to pull.

nathanaldensr · on Dec 7, 2022

This is a very informative YouTube video on that very subject (ice cream machines): https://www.youtube.com/watch?v=SrDEtSlqJC4

dboreham · on Dec 7, 2022

Because Conway.

itake · on Dec 7, 2022

Their event driven infra has race condition exploits.

campbel · on Dec 7, 2022

ITT microservices vs monoliths dogma.