McDonalds Event Driven Architecture (medium.com/mcdonalds-technical-blog)
119 points by rammy1234 on Dec 6, 2022 | 128 comments



I am, for better or worse, a "heavy user" of the McDonald's app. And unfortunately, all this architecture somehow still misses so many usability issues. The most annoying one, is that the app always defaults to the closest location "as the bird flies." Despite the fact that I have "favorited" a different store, and also my last 5 or 10 orders were placed at this favorite store, the app ALWAYS defaults to the first store, which may be a tenth of a mile closer but has three additional traffic lights and twice the traffic.

The app is also wildly inaccurate with store open/close times (in both directions -- sometimes it tells you a 24-hour store is "currently closed"), and the in-store employees are often confused how to ring up or serve a mobile order.

But the deals are good. BOGO QPCheese just about every day.


I also use it a lot (I have a disabled kid who went from feeding tube to chicken nuggets and LOST weight, whatever she wants she gets).

After I order and drive away, I’ll find the app hours later still tracking me and using GPS. I almost always have to force close the app.

Then the next time I use it the previous order won’t have cleared out despite me picking it up, and I’ll have to go in and “cancel” the order.

The coupons are good though. For a week or two it was giving 30% off, and almost always has a 20% off coupon.


Very similar experience, I often open the app and it is still hung up on my previous (completed) order and I have to clear out the cart. And like you said, the "offers" are good (actually McD has raised prices quite a bit, now with the offers I'm paying about the same as face value was only a few years ago).

I will also say, despite the issues, it's better than most competitors. BK and Wendy's apps are both full of problems as well. I have had multiple accepted and paid orders on the Burger King app, when I got to the store they didn't have the item, weren't accepting mobile orders, or were totally closed.


> Then the next time I use it the previous order won’t have cleared out despite me picking it up, and I’ll have to go in and “cancel” the order.

Yes, I have noticed that here in the UK. I have to keep cancelling the previous order even though I picked it up and paid for it.


> I’ll have to go in and “cancel” the order.

You can do that?? I got an order stuck and did delete and reinstall (clear local data probably would have worked too).

Since then, I try to avoid it getting stuck by making sure the app is frontmost with the screen on while I drive through and away.

Seriously, are you iOS or Android, and where do you cancel an order either way?


> The most annoying one, is that the app always defaults to the closest location "as the bird flies."

I've got a similar annoyance with basically every app or website that offers to find things within N miles of me. I live in Western Washington on the other side of Puget Sound from Seattle.

Say the first two options they give me for N are 10 miles and 25 miles.

10 miles is too small. I need more than that to include Bremerton, Silverdale, and Poulsbo, which are the three most likely places that I'll find whatever I'm looking for if it exists over here.

But 25 miles means the circle reaches past Puget Sound and includes Seattle, Bellevue, and even some of Redmond.

For example if I'm searching for Target stores and enter my zip code, their search lists my nearest store first, then 13 stores on the Seattle side (5 in Seattle, 2 in Bellevue, and 1 each in Lynnwood, Everett, Redmond, Renton, Tukwila, and Woodinville) before the store that is actually second closest to me as far as actually driving goes (Gig Harbor).


I used to sell store locator software and this is always a problem. We worked on allowing “masks” over water that basically added a penalty for any vector that passed over the mask. It’s a hard problem at any kind of scale though and even this solution wasn’t perfect (imagine a river with no bridges or ferries only)


The correct solution is to use an isochrone rather than a basic radius search. Or to simply sort all the stores within the max search radius using a vehicle routing library/API.
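
As a rough sketch of the second option: prefilter by straight-line radius (cheap), then sort only the survivors by travel time (expensive). Everything here is an assumption for illustration: the coordinates are made up, and `fake_drive_minutes` is a hypothetical stand-in for whatever routing library or API you would actually call.

```python
import math

def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))  # 3958.8 = mean Earth radius in miles

def rank_stores(origin, stores, max_radius_miles, drive_minutes):
    """Cheap radius prefilter first, then sort the survivors by travel time."""
    nearby = [s for s in stores
              if haversine_miles(origin, s["pos"]) <= max_radius_miles]
    return sorted(nearby, key=lambda s: drive_minutes(origin, s["pos"]))

# Hypothetical stand-in for a routing API: 2 minutes per straight-line mile,
# plus an hour if the trip crosses a made-up north-south body of water.
WATER_LON = -122.5

def fake_drive_minutes(a, b):
    crosses_water = (a[1] < WATER_LON) != (b[1] < WATER_LON)
    return 2 * haversine_miles(a, b) + (60 if crosses_water else 0)

home = (47.60, -122.70)  # made-up coordinates
stores = [
    {"name": "across_water", "pos": (47.60, -122.35)},
    {"name": "same_side", "pos": (47.33, -122.58)},
]

by_distance = sorted(stores, key=lambda s: haversine_miles(home, s["pos"]))
by_drive = rank_stores(home, stores, 25, fake_drive_minutes)
print([s["name"] for s in by_distance])
print([s["name"] for s in by_drive])
```

With this toy data the straight-line sort puts the store across the water first, while the travel-time sort puts the same-side store first. The radius prefilter keeps the number of routing calls bounded by the number of stores in range, not the whole catalog.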


That doesn’t work at scale unless you have an api with unlimited request throughput or knowledge of where water is (again needing access to the raw map or unlimited api access). In this case, the stores had to mark their own masks on the map.


Even the best event / data streaming architecture (or, more generally, a backend architecture) in the universe can be negated by a lousy frontend app / frontend layer architecture. I have never ordered anything from a Macca's… but I surmise those are the consequences that McDonald's app users are facing.

The article lacks any details material to the event streaming architecture instantiation, and defers the details to a follow-up blog post, so it is hard to draw meaningful conclusions from it.


The one and only time I've used the app was when I thought it would help them during the pandemic when drive thru's were actually slammed with demand while they didn't have enough workers. After waiting in the parking lot for 30 minutes while literally dozens of cars went through the drive through I went in to check on the order and the employees literally couldn't look up the status of my order without an order number which the app didn't give me at the time. An incredibly frustrating experience which turned me off to the concept altogether.


> BOGO

I always find this phrase funny. Isn't buying one and getting one the normal state of affairs?


In UK we actually use BOGOF, which makes sense. No idea why one would drop the all-important F.


Not an english speaker, but I can't imagine "get" used for anything non free, especially after a "buy".


"Can you get me a pint?", "sigh - I've got to get a new phone", "buy a canoe get a paddle half price".

Buy implies buying, get implies buying/receiving, possibly at a cost, possibly an offer, all we know is you have it. Buy one get one free is standard here, BOGO would cause customer service staff to go mad from "do you mean 'Buy one, get two?'" inquiries.


"Buy one get another half off" or "I'm going to buy it online then go get it." The second may be using a different "get," grammar lessons were a long time ago.


No, if you get something you aren’t buying it. Either you say you pay for an item and get it, or you say you buy it. Or you get it for free.


BOGT doesn’t sound as good


should probably be BOGOF - buy one get one _free_


but BOGO sounds fun, while BOGOF sounds like an insult, or maybe an alien civilization with terrible poetry


> while BOGOF sounds like an insult

That's a feature, not a bug.


How about BOGA for buy one, get another


> The most annoying one, is that the app always defaults to the closest location "as the bird flies."

So many apps do this and it's really annoying. There is quite a lot of commercial activity on the other side of the river from where I live, and every store's site I visit, or app I use, always recommends a couple of locations that are closest "as the bird flies", but require me to take a ferry, or drive an hour and a half, whereas the store that's actually the quickest to get to for me is just 15 minutes away from me in the next town over on my side of the river.

It's understandable though, because it's so much easier to calculate straight line distance, and I guess in most cases it gives the right, or close to the right answer. I've just accepted that I live in an edge case area for how the majority of businesses calculate these things.


MCD drive thru:

> Hi do you have a mobile order today?

< Yes.

> Inexplicable silence for 10-15 seconds

> (New voice) How can I help you?

< Hi I’m picking up a mobile order (??)

> What’s the code?

Their process of using the app at the drive through is missing a step and it’s annoying.


When I worked in fast food forever ago, our drive thru had an automatic greeting that played when a car drove up, regardless of whether a cashier was ready to talk to you.

Sometimes the cashier was too busy to catch anything you said to the auto greeting.


No concerns about privacy issues? The Canadian iOS version of the app insists on running in the background and complains if you disable precise location. This is slimy.


You give up location data in exchange for discounts, that’s the bargain you agree to by using the app, whether you realize it or not.

I refuse to play this game. When I go to McDonalds I pay in cash and don’t use the app, it’s insulting to be offered 20% off to allow McDonalds to track my precise location, no thanks.


Both my partner and I always have trouble with the McDonald's app (for ordering). I use Android, she uses iOS. As a developer, I've said to myself "This feels like a React Native app that's calling into a mess of microservices" (having worked on that type of project more than once myself).

Anyway, I only skimmed the article, but I had a chuckle seeing the title of this article pop up on HN at all.


I can't speak for other regions' McDonald's apps, but the mymacca's app (Australia) is entirely native Java on Android, and still runs like absolute crap.


Probably completely unrelated, but who knows: this reminded me of the time when my McDonald's receipt printout from that self-ordering terminal started with a bunch of XML, then the regular receipt part, and then ended in a bunch of XML. That was fun.


The backend of the McDonald's app is handled by a third-party New Zealand company, Plexure

https://www.plexure.com/

not sure what their architecture is


Plexure are only used in select McDonald's markets. Almost all Hacker News readers are in markets where the McDonald's app is developed in-house by McDonald's.


I wished more people used a regular website for these kinds of applications.

It's a form. What's the problem?


A web form cannot track your precise location 24 hours a day, but an app can.


Did AWS just pay them to write this? I was hoping for an interesting article exploring the complexities of a massively distributed and high-throughput system, but I just read “we connected these managed AWS services together and it’s cool”

Hopefully further instalments might actually talk about the problems they faced building this out, and their unique challenges.


> Did AWS just pay them to write this?

This is part of their technical blog ( https://medium.com/mcdonalds-technical-blog ), which looks like an outreach attempt to help with recruitment (hackathons etc). Probably someone went to a PM saying "we need to talk about our stack, how do we make it sound sexy and cool?" and this is what they got back.


> Standby Event Store: To avoid loss of messages in the event the MSK is unavailable, the platform is wired with a standby data store, where it writes events onto a database. The architecture provides tools and utilities to read messages and publish them back onto MSK, once it’s available.

on one hand aws msk is good enough for an enormous application like mcdonald’s. on the other they need a backup database just to get around it not being available? what’s the real story here. interested to see where this goes


Probably nothing more than a requirement for resiliency. With thousands of restaurants, chances of a couple of them losing connectivity each day are going to be pretty good, through traditional interruptions like backhoes digging up cables and drunk drivers taking out telephone poles.

If those messages are discarded because the store can't talk to MSK (or MSK is unavailable), then things like automatic replenishment based on order volume couldn't happen. The store manager would have to do a daily physical inventory count to know how many bags of fries, boxes of drink straws, etc. to reorder.


But it's more likely that they'll lose connectivity to AWS altogether rather than MSK going down itself. I'd not try to write from each store network directly, but through some gateway that would take care of writing to Kafka and that separate store if we still have availability issues.


Cloud services do go down and it's out of your control when they do. If something must work, you need redundancy.

Kafka is often used for financial applications that must not miss events, so having a backup buffer is a reasonable strategy for those use cases. Tracking data, by contrast, is likely not worth backing up, given the high data volume and the low external visibility when data is dropped.
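
A minimal sketch of the backup-buffer idea, under stated assumptions: `FakeBroker` and `BrokerUnavailable` are hypothetical stand-ins for a real Kafka producer and its failure mode, and SQLite stands in for whatever standby database is actually used. This is not how the article's platform is implemented, just the general shape.

```python
import json
import sqlite3

class BrokerUnavailable(Exception):
    """Hypothetical error raised when the broker cannot be reached."""

class FakeBroker:
    """Stand-in for a real producer; flip `up` to simulate an outage."""
    def __init__(self):
        self.up = True
        self.received = []

    def send(self, event):
        if not self.up:
            raise BrokerUnavailable()
        self.received.append(event)

class ResilientPublisher:
    def __init__(self, send, db_path=":memory:"):
        self.send = send  # callable that publishes one event to the broker
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS standby (id INTEGER PRIMARY KEY, payload TEXT)")

    def publish(self, event):
        """Try the broker first; on failure, park the event in the standby table."""
        try:
            self.send(event)
        except BrokerUnavailable:
            with self.db:  # commits the insert
                self.db.execute("INSERT INTO standby (payload) VALUES (?)",
                                (json.dumps(event),))

    def replay(self):
        """Re-publish parked events in order; stop at the first failure."""
        rows = self.db.execute(
            "SELECT id, payload FROM standby ORDER BY id").fetchall()
        replayed = 0
        for row_id, payload in rows:
            try:
                self.send(json.loads(payload))
            except BrokerUnavailable:
                break
            with self.db:
                self.db.execute("DELETE FROM standby WHERE id = ?", (row_id,))
            replayed += 1
        return replayed
```

Note that a crash between the send and the delete in `replay` would publish the same event twice, so this gives at-least-once delivery and consumers need to tolerate duplicates.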


We use kinesis a lot at work, and some services are architected to write events to a Postgres table which eventually gets dumped to kinesis, while other services write directly to kinesis.

Guess which services fared better during the last kinesis outage?


What's the point of using kinesis in this case? Why can you not consume from postgres? My understanding would be that you use kinesis for impedance matching on writes. If you can already write everything to Postgres reliably, I cannot see any immediate use case for Kinesis. I could see the other way around, i.e. having kinesis as a resilience backup for postgres, being a more valid concern.


I'm not the person you're replying to, but I've used a similar pattern, and there are a couple of reasons.

First, if you have lots of databases and other applications, then you are talking about a mesh of event busses - which defeats the purpose. Pushing the messages out of the various databases and into a central message bus makes the messages more easily consumable without having to know where they come from.

Second, by writing messages to a PG table first, those messages become part of the update transaction. This means you can post messages at any time during your business logic processing, but if you hit an error and roll back, those messages (which would, presumably, no longer be valid) are also rolled back.

Combine this with message idempotence and you get a very reliable messaging environment.
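
Here is a small sketch of that pattern (often called a transactional outbox), using SQLite in place of Postgres so it runs anywhere. The table names, `place_order`, `relay`, and `IdempotentConsumer` are all made up for illustration; a real relay would publish to the message bus rather than call a Python object.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, body TEXT)")

def place_order(db, item, fail=False):
    """Write the order and its event in one transaction: both commit or neither does."""
    try:
        with db:  # commits on success, rolls back if an exception escapes
            cur = db.execute("INSERT INTO orders (item) VALUES (?)", (item,))
            db.execute("INSERT INTO outbox (topic, body) VALUES (?, ?)",
                       ("order.created", f"{cur.lastrowid}:{item}"))
            if fail:
                raise RuntimeError("simulated business-logic error")
        return True
    except RuntimeError:
        return False

def relay(db, deliver):
    """Forward every outbox row to deliver(msg_id, topic, body), then delete it."""
    rows = db.execute("SELECT id, topic, body FROM outbox ORDER BY id").fetchall()
    for msg_id, topic, body in rows:
        deliver(msg_id, topic, body)
        with db:
            db.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))

class IdempotentConsumer:
    """Ignores message ids it has already handled, so redelivery is harmless."""
    def __init__(self):
        self.seen = set()
        self.handled = []

    def __call__(self, msg_id, topic, body):
        if msg_id in self.seen:
            return
        self.seen.add(msg_id)
        self.handled.append((topic, body))
```

The rolled-back order never leaves an event behind, and because the consumer deduplicates on message id, the relay is free to retry after a crash without causing double processing.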


Really, their app is a mess, but from a delivery point of view their stuff is pretty good.

Just yesterday I was going to redeem a deal and the store was closed to walk-ins, so I had to use the drive thru. I couldn't use the deal anymore because I had used the code already, and you can't apparently switch between walk in/drive thru mode. Luckily the drive-thru was so slow I was able to use the code (there's a timer on the code).

That said, the backend worked great; I ordered on the mobile in-line and the order was in-store once I got to the drive thru order speaker.

People forget how hard and expensive it was 10 years ago to do a realtime architecture. Today, McD gets information from your phone to wherever and down to the stores with maybe a few seconds of delay, so you can use your code at the in-store kiosk. And it has to integrate with their in-store order and payment systems.

I'm disappointed at the article, because it doesn't talk about any of this stuff; it's just a laundry list of AWS services. That isn't the important stuff; the important stuff is really how they got all this legacy (in-store) and new stuff to work together.


While this satisfies some idle curiosity, I don't really get what justifies spending the resources in posting such stuff.

Is this a recruitment tactic somehow for such companies?


Probably, yeah.

But I’m always shocked for some inexplicable reason to hear that places like McDonalds and Walmart Labs are interested in solving tech problems. But I mean obviously, of course they are.

Good to be reminded now and then that there are alternatives to contributing to the ad/social media panopticon, right?


Yeah, you can work on making people fat or destroying labour unions! /s


It is often an OKR for senior/staff developers to write blogs or tech articles to promote the company's engineering culture, especially for organisations that are not 'traditional' tech firms.


TIL that OKR means "Objectives and Key Results".


You're one of today's lucky 10k :)


I understood that reference :)


I have a hunch that's part of it. I also think they're trying to show that they're as technically competent as Chick-fil-A, who has been doing this for a while now.

https://medium.com/chick-fil-atech


Thinking about it, that makes sense, given how gargantuan Chick-Fil-A is, but the two chains leave very different impressions to me. Chick-Fil-A feels like they're going out of their way to make it seem like there are friendly humans running everything, to the point of having employees with tablets stand outside at the drive-thru to take your order; lately, they've even been laying off the robotic "my pleasure" thing. McDonald's, on the other hand, feels like a clumsy extraterrestrial sidling up and saying "What is up, my fellow human, amirite?" The one by me is obsessed with pushing the little how-are-we-doing survey on the receipt; I don't think I've ever been there and not been given one. When I see their TV commercials, I keep expecting a flurry of tentacles to burst out of the smiley-person suits. But under the hood, they're both furiously optimizing everything they can.


For better or worse, Chick-fil-A is very obsessive about its standards and its quality. I grew up in Atlanta (where CFA was founded/based) and have had lots of friends who have worked for CFA over the years in various positions. Even in high school, CFA was considered the best fast-food job you could get (and better than some retail jobs) because it paid better than almost anyone else. The downside was there was a strict dress code (no visible piercings or tattoos, short hair for men, etc.) and an understanding that everyone had to be very well-behaved, but like, that comes through in the service.

Beyond that, with a handful of exceptions (I think mostly airports and college campuses), CFA owns all their restaurants, unlike the typical franchise arrangement that McDonald's popularized. You still have independent operators, but they don’t own the restaurant and have to go through a rigorous selection process to even get one. And then, the franchisee only pays like $10,000 upfront, and CFA pays the rest of the costs to build the restaurant. Individual restaurants control their own hiring and marketing and whatnot, but this is all a very top-down approach that needs to align with corporate.

McDonald's outright owns some of its restaurants, but most are owned and run by franchisees, and usually a franchisee owns more than one location. They license the brand, food, and process, but a lot of the day to day standards are decided by the franchise groups. Whereas CFA operators have less leeway for such things.


It's my understanding that CFA franchisees are only allowed to own one store (unlike MCD), and must work in their restaurant full-time.

That's a significant difference from how MCD stores are operated and it shows.


I would never have guessed that Chick-fil-A was considered a contender to McDonald's in technical competency. I drive up and a person standing outside takes my order and asks my name. Then I drive up some more and another person asks me my name and punches something into a tablet, and then I drive up some more and get to the window.

This is 2022. Businesses already have queueing software that automatically takes images of the vehicle to match with your order. I get that some customers like having the "human factor," but there's definitely room for improvement.


They have humans do it because it is faster. Significantly faster, in fact, to go car to car, rather than have each car drive up to a window to order. It also allows them to have multiple lines running at once. And it is probably just as fast to have the person handing you food just ask your name so they can make sure to pass it off efficiently.

Chick-fil-A is the only chain where you will reliably see a massive line (often backing up to the freeway) on any given lunch hour (or a massive in-person line in New York City) that will still get you out and get you served more efficiently than another drive-thru (or walk up shop) that is 1/4 as busy.


But you can have multiple lines running at once. McDonald's already does this at busy locations. I find it dubious to say that they get your order out more efficiently. For one, having a massive line is a potential sign of inefficiency. Also, there's no technical reason I can see why having a second person ask your name to sequence the orders would be faster than handling this through an automated system. People like to say they get you through fast, but I personally have experienced long "absolute" wait times, even if it seems like the line is going quickly, to the point that I rarely go there unless I have time to kill. This almost never happens to the same extent in a McDonald's line, which is shorter and maybe slower moving at times, but has lower total wait times.


CFA has found lots of small incremental improvements that speed up the drive thru process significantly.

MCD did the double drive thru ordering first, but my local CFA now has two pickup lines as well, with a heated covered roof. The window has been replaced with a wide open door and employees walk the orders to each car. Simple and it works great.


Chick-fil-A optimizes the process, not the tech, and it very much shows. Which is what any good business should be doing, IMO.


Mine has never asked for the name twice but they recently started slipping numbered cards under your wipers after placing your order.

I remember CFA posting some articles a few years back on their k8s setup they use for store and inventory management.


Can I get promoted to Manager if I get a degree in Hamburger-data-ology?

From what I can tell the career path is into Corporate hell.

https://www.uopeople.edu/blog/hamburger-education-inside-mcd...


If you want to be a franchisee, the easiest way (other than having $3-$10m lying around) is working for McDonalds, taking AOC at Hamburger U, and then joining their franchise program where they "lease" you everything for a larger take of the annual sales than if you pony up the franchise fee and buy a franchise location.


What's AOC in this context?


Advanced Operations Course

(Found by googling "AOC Hamburger University -cortez")


Recruitment but also retention. Some engineers want to publicly communicate their work at a recognizable company, or aspire to become the kind of company good enough to have a well-liked engineering blog. Providing that opportunity doesn’t cost much and means a lot.


From the company's perspective, the hoped for result is mostly recruiting oriented, a little bit is related to offering career advancement (public visibility of your work) to individuals that write or are featured in the blog.


That's all it is. These EDA transformation journey blog posts are de rigueur for any company looking to have its engineering culture perceived in a modern light, at least in the eyes of a junior-to-intermediate developer.


Is it really scalable? Every time the local McDonald's has some cross-promotion going on with a certain boy band (for example this week's collectable cards), their app's backend always crashes right at 11am when the promotion starts.


Unfortunately McD app doesn't make a good impression. It's quite buggy (iOS and Android) and has some obvious usability issues.

So it doesn't matter what architecture is behind McD systems if customer facing software doesn't work correctly.


It feels so dissonant to see a giant company posting its technical blog on Medium.


You mean discordant


I think both work, but it depends on what they mean?


Wanted to learn more about the difference so asked ChatGPT for a rundown with the context, pretty neat in-depth analysis (assuming it's accurate):

> It's possible that either word could be used in the context you've provided. Both "discordant" and "dissonant" can refer to things that are unpleasant or conflicting, so either word could be used to describe the feeling of seeing a large company using Medium for its technical blog.

> However, there is a subtle difference between the two words. "Dissonant" typically refers to things that are in conflict because of their individual qualities, while "discordant" typically refers to things that are in conflict because of their relationship to each other. In the context of your sentence, "dissonant" might be a slightly better fit because it emphasizes the individual qualities of the company (i.e. its size) and the platform (i.e. Medium) that are in conflict.


Ugh holy fuck


Thanks :)


I find this blog post to be a funny snapshot of how event driven architecture initiatives can be sold internally. “Oh it’s going to give us all these great advantages and be super robust and performant. Here’s my diagram of producers and consumers with our new event gateway project in the middle. How does it actually work? Um… I’ll tell you later”


Call me a luddite but I still find it easier to tell my order to a cashier (I usually remove a few items like the godawful special sauces)


They penalise you with enforced waiting time here in the UK (if it’s not per training to ignore you for a set amount of time, I’d be very surprised)

Still worth it though. Their machines are awful - way too big and bad at input


I'm with you. I'll use the self service in shop but I never use the app.


McDonalds by you still have human cashiers?


Most of the ones in the US have humans. They also have kiosks, but if you really want to talk to a person, you can.


I have not experienced any improvements with event driven architecture, even though it was advertised with the same wording as this.


Serious question, isn't this just rebuilding the same thing that commercial cloud pubsub offerings already provide?


I mean I'd expect if they're already all-in on AWS they'd just use Firehose with some deduplication instead of whatever home-brew fallback solution they described, but other than that it doesn't seem like they built much?

What's impressive to me is that they need all that architecture. McDonald's sells under 100 burgers a second from what I can find, their order load is probably bursty, so assume maybe all the orders come in the same third of the day, so 300 per second, and every burger is its own order... that's still not that much.

One order is more than one operation when you're dealing with everything at a place of McDonald's scale, but even if you multiply by a factor of 10 to account for analytics, compliance, etc., that's 3,000 operations per second. Does that really require an entire Kafka-driven event architecture?
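
The back-of-envelope math, spelled out so the assumptions are easy to swap. The function and variable names are just mine for illustration; the inputs are the rough figures above, not measured data.

```python
def peak_ops_per_second(items_per_second, burst_factor, ops_per_order):
    """Back-of-envelope peak load: average rate x burstiness x operations per order."""
    return items_per_second * burst_factor * ops_per_order

burgers_per_second = 100  # "under 100 burgers a second", used as an upper bound
burst_factor = 3          # all orders squeezed into one third of the day
peak_orders = burgers_per_second * burst_factor
peak_ops = peak_ops_per_second(burgers_per_second, burst_factor, ops_per_order=10)
print(peak_orders, peak_ops)  # 300 3000
```

Changing `ops_per_order` is a one-line way to test how sensitive the conclusion is to that fudge factor.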


I'd guess that 100x database interactions per order may be more like it. Upon startup, there's a whole login, check app version, payment cards still valid, geo query sequence. Check user's point total, rewards, custom offers. Load menu based on chosen store availability and prices. Every menu item has multiple options (mcnugget sauce, burgers without tomato, type of soda). Add to cart. Remove from cart. Add something different to cart. Repeat. Ok, order ready? Check taxes in local jurisdiction. Adjust total based on offers/rewards. Delivery or pickup? Pickup in-store, curbside, drive-thru? Communicate order to store. Update order status based on customer arrival. Send code to customer. Update status upon pickup/delivery. Lots more in-between I skipped, and that's just user-facing, nothing about analytics, accounting, loyalty club, etc.

I'm not defending McDonald's or its architecture -- as I stated elsewhere, the app is far from perfect, and an entirely different architecture could very easily work much better. But I do think you are severely downplaying the number of interactions or transactions required to run an app like theirs.


I mean you wrote all this justification, but 30k per second is still practically nothing compared to the complexity described in the article?

Taking something like Postgres and sprinkling in some strategic use of Redis would handle their usecase with horizontal scaling pretty reliably...

What it wouldn't do is let you add Kafka to your resume.


To be clear, I was not justifying the current architecture, I specifically wrote "an entirely different architecture could very easily work much better."

I was pointing out, however, that, as is often the case, the initial estimates in a typical "why do they need all this stuff" post, likely underestimated the transaction volume by possibly 10x. Perhaps 3000 or 30000 transactions per second could run on the same system -- I'm not an expert at that scale. But I doubt you'd find any Fortune 100 company relying solely on Postgres and Redis.


I mean if you got nitpicky 3000 or 30000 transactions doesn't tell you anything... but in this kind of evaluation you need to think dimensionally. That's why I intentionally assumed all of their traffic shows up in one contiguous block of 8 hrs across all locations every day: that added a massive fudge factor even bigger than the number you're focusing on...

> But I doubt you'd find any Fortune 100 company relying solely on Postgres and Redis.

I mean, yeah?

Across every system they use of course that wouldn't be it: what would be generating the data that goes into them? Where would the data that goes in be going out?

I'm simply referring to their "glue" for day to day operations, which here is a pubsub system built on Kafka. Most organizations of a certain size start to pick up some set of technology that new efforts default to being built on top of if only to have access to what everyone else is doing... that's essentially what AWS started off as before it was spun out from internal usage

-

But more importantly, Fortune 100 is a very random pairing of problem spaces. I mean you won't find any built solely on Postgres and Redis for the very obvious reason I mentioned above... but you will find billions of dollars in revenue on even more boring stuff than that. The number of Oracle shops using repackaged technology that makes Postgres look like Cloud Spanner is staggering.

I find the opposite of what you do, that people tend to overestimate what it takes to handle large amounts of data reliably. And I think it's because you need some experience with this stuff to understand why you can't just think in terms of "underestimated the transaction volume by possibly 10x"(hint: 10x can mean 3 million => 30 million).

What happens is people hear that system A is going to need to go from 3,000 to 30,000, then start to architect it the way someone going from 3 million to 30 million should have, and suddenly you're building out a system that's less reliable, more expensive, and just generally worse except for what shows up on resumes.


I think you are underestimating how liberal some applications are, especially as analytics is one of their requirements. It's probably multiple events per thing you do. I wouldn't be surprised by 100+ events before even placing an order.


Forest for the trees, I already practically doubled my numbers and assumed McDonalds gets all their orders in the same 8 hour period! And even then multiplying them by 100 doesn't get you into the realm of "we couldn't build this on a monolithic horizontally scaling application".

If anything, if you're at McDonalds scale and still can't find the engineering skill to build a monolith that can handle 30k operations per second, you're playing with fire building a distributed system.

(if you're a nascent startup, then by all means stand on the shoulder of giants and don't sweat that you don't have a full blown cloud engineering org, but that's definitely not where McDonalds should be...)


Architecture reflects organization structure more often than not. What I see is not an attempt to handle requests rates, but an attempt to service a widely distributed set of applications and handle the inevitable churn of client applications requirements.

I am on team monolith, but I also don't see any issue with this approach if you are happy accepting its caveats and vendor lock-in, which it seems they were.


I still remember my teacher using McDonald's to explain instruction pipelining...


I was expecting a parody


I'm still not sure you were wrong.


why does everything have to be microservice oriented and what's wrong with a simple monolithic application that runs the whole thing?

also, when are those ice cream machines going to get fixed?


To your point, Shopify's monolith was handling 1.27M requests/sec on Black Friday: https://twitter.com/ShopifyEng/status/1597983918900510720

Thread: https://threadreaderapp.com/thread/1597983918900510720.html


Is running a separate instance of the monolith with a separate database for every store really "a monolith" withstanding 1.27M rps?

It's a bit like saying Magento served 1M rps on Black Friday leaving out the small but important detail that each individual store has separate infrastructure and manageable load.

Divide and conquer works, congrats to the Shopify team that their design decisions worked out for their use case. And obviously some parts of the system are still shared but my guess is that they are not part of the monolith.


> averaging 3 Terabytes per minute of egress traffic across our infrastructure. That’s 4.3 Petabytes per day!

I’d hate to get that AWS bandwidth bill.


On the other hand, 4.3 Petabytes of VISA/MasterCard traffic is a bill you'd probably be happy to pay. (It's Shopify after all, not Flickr)


Exactly. I've written and maintained monoliths that handle close to ~20M requests per second, and it only went down once (when somebody tripped over the power cable!)


But ruby is too slow!


Lots of their code is in Go


Way way way more of our code in Ruby.

Go (and now Rust) is really only used for very low level services with a high SLA (Like Infrastructure). Almost all business logic is Ruby + Rails.


True but you’re still putting significant investment into ruby to speed it up. Ruby is slow.

I don’t think any other company could do that.


In your monolith if you offload a request to a queue and have a background process processing the queue messages, what do you call that module processing messages? Is it a component of your monolith? I think a lot of teams would call it a microservice, I could see some people considering it a component of a service.

Unless you're confounding microservices w/ async architectures and saying to drop asynchronous patterns like this altogether.


I don't need another process. I can use a thread.

If the teams you are talking about never heard of threads and only know about microservices, then there is something seriously wrong with their CS education. Maybe they all were hired via leetcode. That could explain it.

I'm not confounding anything. Distributed programming has its applications and uses, but if you don't have a good reason to use it, then don't, and use a thread in a single process for background processing.
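
For what it's worth, a minimal sketch of that "thread in a single process" approach could look like the following. It assumes plain Python with the standard `queue` and `threading` modules; `start_worker`, `handler`, and the order ids are hypothetical names for illustration, not anything from the article.

```python
import queue
import threading

def start_worker(handler):
    """Start one daemon thread that drains an in-memory queue (hypothetical helper)."""
    tasks = queue.Queue()

    def run():
        while True:
            item = tasks.get()
            if item is None:          # sentinel value: stop the worker
                tasks.task_done()
                break
            try:
                handler(item)         # the background work itself
            finally:
                tasks.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return tasks, thread

processed = []
tasks, worker = start_worker(processed.append)
for order_id in (1, 2, 3):
    tasks.put(order_id)   # a request handler would return right after this call
tasks.join()              # wait until the backlog is drained
tasks.put(None)           # ask the worker to exit
worker.join()
```

Note that this queue lives only in memory, so anything still queued is lost if the process dies.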


And how does it recover incomplete tasks in case of sudden power outage? Microservices use persistent message brokers for that, which are not there in threads. Or are these monoliths all treated as pets with redundant power supply and network lines?


Thread does not imply no queue or no persistent storage. In fact, if you use something like Hangfire you already have that.
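
As a rough illustration (not Hangfire itself, which is a .NET library), here is a sketch of a job table that outlives the process, using Python's standard `sqlite3` module as a stand-in for persistent storage. The `jobs` table, `open_queue`, `enqueue`, and `drain` are all made-up names, and the scheme is at-least-once: a job whose handler ran but was not yet marked done will run again after a restart, so handlers would need to be idempotent.

```python
import os
import sqlite3
import tempfile

def open_queue(path):
    """Open (or create) the on-disk job table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, "
        "done INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db

def enqueue(db, payload):
    with db:  # commits on success, so the job is written before we return
        db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))

def drain(db, handler):
    """Run every job not yet marked done; mark it done only after handler returns."""
    rows = db.execute(
        "SELECT id, payload FROM jobs WHERE done = 0 ORDER BY id"
    ).fetchall()
    for job_id, payload in rows:
        handler(payload)
        with db:
            db.execute("UPDATE jobs SET done = 1 WHERE id = ?", (job_id,))
    return len(rows)

path = os.path.join(tempfile.mkdtemp(), "jobs.db")
db = open_queue(path)
enqueue(db, "order-1")
enqueue(db, "order-2")
db.close()                 # stand-in for the process dying before any job ran

db = open_queue(path)      # "restart": unfinished jobs are still on disk
seen = []
recovered = drain(db, seen.append)
```

A background thread in the monolith could call something like `drain` on startup and then periodically, using its own connection, since `sqlite3` connections are tied to the thread that created them by default.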


My CS education (Stockholm, KTH) didn’t include anything about web service architecture, and information about threads was about how they are implemented in the OS at a low level, not how to use them effectively. I think this stuff is normally picked up after working in the industry.


The background process is just threads in the monolith, that's not a microservice. That a different pod running the same code might pick up the async task doesn't make it less of a monolith either.


I have never seen what you are describing called a microservice. Microservice always means some kind of independent application in its own runtime.


What are you talking about?

What makes it a service soup is the soup of services on the other side of the queue processor. If those components weren't services, the application would be a monolith.

Anyway, there are many reasons to organize your code into services, and McDonalds is large enough for them to be perfectly valid. But if you take a closer look, those components in the article aren't the ones that do anything, they are just new queue processors that may or may not finally deliver your messages to the destination. That's an irksome architecture.


>with globally distributed teams of developers with diverse skill levels

The last place I worked with a monolith (~100 developers) put quite a bit of work into making sure everyone didn't step on everyone else's toes. This mostly propagated as optimizing CI and improving test quality (since a single flakey test could derail everyone's build)

As to why "microservices" versus a few "normal" sized services: I'm not sure why it's always "monolith" or "microservices".


At my previous job we used "microservices" for lack of a better term, but really they were "business services". We tried calling them "business services", "macroservices" or just "services", but it was confusing so in the end we just stuck with "microservices".


I think we (as an industry) are basically just revisiting SOA https://en.m.wikipedia.org/wiki/Service-oriented_architectur... (but maybe with a little less Java this time)


Well, I mean really most of the microservice architectures are just monoliths with network calls between the components.


That's called a distributed monolith. It's pretty much guaranteed to happen if the developers don't introduce a lot of redundancy while splitting up the services.


If your services need to be chatty with each other then the boundaries of the services are wrong.

Hopefully by redundancies, you don't mean sharing data access logic.


yeah, so what about locking? it must be a nightmare to make sure that the components all work together as intended...


So, normally, you do not need to lock. If all your services are single threaded, and you have a good transactional model, you only need to duplicate services to create parallel routes.

You need to lock when you write a shared area from multiple sources with no opinion on write ordering.

But say your pipeline is client -> decorator -> processor -> observer with client publication -> external partner. Each input will go into a set of instances different from the previous and next one, and rejoin at the output, which will queue and order them. You have parallel heavy work and sequential light result publication. Your simple output must be as fast as the sum of your parallel routes to minimize queueing.

Ofc it's more complex, and I prefer 0 network hops myself, but I work on a large investment bank microservice system and we do not lock, and the components are both simple and complex enough that when one disappears, everything else waits or rebalances, and when it reappears it can catch up automatically and go on. It consumes a large amount of memory to keep a duplicated state in each component, and persistence is not guaranteed to be on time (in fact, our persistence layer was 30 minutes behind by midday, for years, until we dug into the 30-year-old SQL).
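
A toy version of the "parallel heavy work, sequential light publication" idea, shrunk to threads in one Python process rather than separate services, might look like this. The sequence numbers, `process`, and `run_pipeline` are assumptions for illustration and not how our system is actually built.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process(seq, payload):
    # Stand-in for the heavy "processor" stage; carries the sequence number along.
    return seq, payload.upper()

def run_pipeline(inputs, workers=4):
    published = []   # what the single output stage has emitted, in order
    pending = {}     # results that arrived ahead of their turn
    next_seq = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process, seq, p) for seq, p in enumerate(inputs)]
        for fut in as_completed(futures):
            seq, result = fut.result()
            pending[seq] = result
            # Only this one thread touches `pending` and `published`, so no lock is needed.
            while next_seq in pending:
                published.append(pending.pop(next_seq))
                next_seq += 1
    return published

ordered = run_pipeline(["a", "b", "c", "d"])
```

The parallel workers never write to shared state; only the single output loop does, which is why ordering can be restored without locking.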


so why do you need this to be distributed and why can you not use a monolithic server application, perhaps with background threads?


One reason comes to mind: Conway's law. If you have a monolithic team a monolithic server is usually a pretty good solution to a problem. If you have multiple teams you're likely to wind up with an architecture that lets them work more independently, leading to separate release schedules.

... which perhaps just shifts the question to "why do you need multiple independent teams and not just use a monolithic development team?"


> yeah, so what about locking?

POST /lock

This is only a slight exaggeration over some of the stuff I've seen people try to pull.


This is a very informative YouTube video on that very subject (ice cream machines): https://www.youtube.com/watch?v=SrDEtSlqJC4


Because Conway.


Their event driven infra has race condition exploits.


ITT microservices vs monoliths dogma.



