Edit: but yeah, 72+ hrs must have been tough. I hope their version of a reliability team can use this to get the support they need out of management, and not get scapegoated instead.
I remember one large company I consulted for in the mid-aughts. Before I got there, they had an outage so severe it was on CNN and caused a short but notable dip in their stock price. By the time I got there, the ops people were absolutely dominant. All changes had to go through their review board, and woe be unto any project that they raised an eyebrow at.
Of course, the real problem was that developers were scheduled out 18 months in advance to work on a list of projects they hadn't been involved in estimating and where often they hadn't seen the code before being swapped onto the project. Dates were absolutely not allowed to slip, so it came out in frantic overwork, which meant a lot of bad/confusing code. Meaning more likelihood of failure and no time to think about systemic issues and impacts.
"Tough on bugs, and tough on the causes of bugs!"
As they add even more time-sapping process to every release.
There was another incident that took down everything that was caused by a Spring cloud config being changed that no one in the company had access to.
Companies I’ve worked at that sucked had awful internal communication. It was all very friendly, but it was all platitudes and euphemisms covering for a dumpster-fire technology implementation.
Phone and web apps are basically librarian work these days. If a business's tech stack is having issues, it's human communication that's the real problem.
Honestly, I think part of the problem is that people talk so much about soft skills and how people don't like to be treated harshly that they've forgotten you need hard skills too. In nearly every profession with real leadership, you get told where you screwed up and how to fix it. Literally, at the company I work for, people screw up and all you hear about is positive things.
Right. But the person deciding on those investments is a product owner who's looking to release new features because that's what management wants. At least that's my experience more often than not.
My guess is that the bias might be the opposite. Maybe these companies become more popular because they are reliable and have the mindset to focus on that.
In your experience, when something like this happens, what can leaders do to help in the moment? I'm an engineering manager, and watching people getting over-stressed is much easier to do when you're physically in the office. Working remotely, it's easy to slip in the trap of either assuming no news is good news, or the opposite approach of asking "are you OK?" every 2 minutes. The only solution I've found is to all sit on a call, and pay attention to people's tone of voice, but I don't know whether that actually contributes to people's stress levels.
Otherwise, it's mostly things you can do before and after an incident like this.
Depending on your role and place in the hierarchy, yes, DO sit on calls, even if you are quiet: to understand what's going on, to help prioritize if needed, to see if your experienced "spidey sense" catches some risk others may not even though they're closer to the problem (or precisely because of it), and to encourage, guide, and focus; but all of that without being overbearing or making them feel watched. You need to feel like an ally, part of the team, a friend as opposed to a political officer. This, again, is best achieved ahead of time, by demonstrating your value to the team by always being supportive and effective.
The main thing you can do, after it's all over, is make people feel awesome. Give them recognition in all the ways possible: bonuses, extra vacation days to compensate for the crunch time, etc.
But if you still have your technical chops - jump in the trenches and debug!
Remote teams are usually concentrated in the same timezone too. In-office global companies make a point of having at least a few teams owning a service for this very reason.
I also think it's time people took another look at Chaos Engineering from Netflix. It is sometimes ironic that the best technology often comes from companies that aren't technology companies at all.
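The core idea is small enough to sketch. Below is a hypothetical chaos-monkey-style round, not Netflix's actual Chaos Monkey implementation: it randomly terminates a fraction of a fleet, and surviving the fallout is the service's job. The `Instance` class and `chaos_round` helper are made up for illustration.

```python
import random

class Instance:
    """Stand-in for a real compute instance (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.running = True

    def terminate(self):
        self.running = False

def chaos_round(instances, kill_probability=0.05, rng=random):
    """Randomly terminate a small fraction of running instances and return
    the victims. The point is to force the system to prove, continuously,
    that it tolerates instance loss."""
    victims = [i for i in instances if i.running and rng.random() < kill_probability]
    for victim in victims:
        victim.terminate()
    return victims

fleet = [Instance(f"web-{n}") for n in range(100)]
killed = chaos_round(fleet, kill_probability=0.1)
```

Running this in a real environment obviously requires guardrails (opt-in services, business-hours-only schedules, a kill switch), which is most of what the real tooling provides.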
How Netflix got lumped into "FAANG" aka Big Tech is still a mystery to me.
In some sense, I agree - Netflix, Disney, and Pixar are media companies that use technology to distribute, to recommend, and to produce their content.
But in the same sense, Google is an advertising company that uses technology to sell and to show their ads (and they have a few loss-leading tech ventures like search, mail, a phone OS, and a browser - all of which they give away to more effectively show those ads). Facebook is another advertising company, they happen to show and profile their ads using their social network tech. Amazon is a retail sales, warehousing, and product distribution company that happens to use a lot of technology. They also have a "tech company" division that sells web services, but that's only 13% of their revenue.
Similarly, Walmart is a retail sales and warehousing company (as well as product distribution now, but most recognizable by their brick-and-mortar stores) that uses technology in every aspect of buying product, shipping it to stores, deciding where and how much of it to stock, how to price it, and when to restock it, just like Amazon. There was a time when Sam Walton walked around, squinted at the shelves, and made notes about what and how much to order on a clipboard, but that's long past - it's all tech now. Automotive manufacturers are other very big tech companies, they build and sell automobiles but every step of that design, fabricate, market, and sell process uses technology.
Apple, at least, sells technology hardware. Microsoft also sells technology hardware and software, but they did so before FAANG were big so they're not in the list either.
The list is arbitrary. Software and technology is eating - has eaten - the world, and management of it is critical to every business.
A critical question is, how and how much do they pay their developers (and other tech folks)? I believe Netflix compensates their tech workers more richly than the average media company. I am certain of it if you take an average over the past 10 years.
The other argument is, when we say tech, we really mean qualitative innovation. Did Netflix disrupt an existing 800-pound gorilla using technology? The answer is yes: they killed Blockbuster by embracing tech.
Before their original content I seriously considered shorting their stock. Their business model without original content was structurally flawed. They effectively had a maximum profit they could earn. Literally anything they did to increase their profit would be taken by the content producers they were forced to buy the licenses from.
Creating original content was a necessary step to survive just like pivoting from their original DVD mailing business to streaming. They're still a technology company though.
Jim Cramer announced this on Friday.
RxJava, the de facto standard for reactive programming in Java, was open-sourced by Netflix. They also open-sourced a lot of their work on AVIF files, which has found its way into Chrome, among other projects.
Among other things.
Otherwise it would be a slur.
I'd argue keeping basic city services and roads operational is more important than a gaming platform, but city and state governments in the US often seem not to agree, and are more than happy to save money by skimping on maintenance and resilience.
People are treating the Texas ice storm like it was some comeuppance event, but the truth of the matter is that Texas rarely experiences this kind of winter weather, and that's the reason it didn't handle the big freeze the way a midwestern or northeastern state would.
The reason why cities like DFW do not maintain a fleet of snowplows isn't because of anti-government propaganda, but because it is completely uneconomical to do so. Even during the big freeze there wasn’t a foot of snow on the ground. The major roadways were salted, though, and you could drive on them.
Snow and icing events tend to only happen for a day or two each year in my part of Texas. And it is often the case that when temperatures do dip below zero there isn’t preceding rain or snow. But when that does happen, most people just ride it out for a day or two and wait for things to melt.
In the 2021 storm, hundreds of people died. Comeuppance is not the right word. I don't know the right one, but I believe it's closer to willful blindness or predatory delay.
Perhaps it would be possible to roll everything back in just a few minutes, but some users might lose a lot of work. If such special cases are to be resolved, it may take a lot of time. You just cannot infer the quality of an architecture from downtime alone.
Roblox has to start taking infrastructure a lot more seriously than they have so far.
Was "only" two days, but Fortnite just shut down for 2 days as part of an end-of-season event back in 2019: https://www.polygon.com/fortnite/2019/10/13/20911691/fortnit...
For now I prefer ZooKeeper because at least I have experienced most of its failure modes. I.e., they are probably all prone to blowing up, but I have lost my eyebrows enough times in ZK explosions that at least I know what I'm up against.
Consul isn't widely used enough for me to have the same confidence for now; same goes for Vault and Nomad, tbh. I really like the design of Vault and its dynamic secrets system, but I am probably just going to implement something similar on k8s secrets so that I don't have to carry around something that might spontaneously combust on me.
Consul is insanely easy to fix if you keep backups. It is as simple as deploying a brand new cluster and restoring a known good snapshot.
Sadly, a surprising number of practitioners don't keep backups.
But yeah, you should go with the one you know best, especially at any sort of scale.
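For reference, the backup/restore flow described above uses Consul's built-in snapshot subcommands (the filename here is a placeholder, and the commands assume a reachable agent with appropriate ACLs):

```shell
# Take a point-in-time snapshot of the Raft state (KV, service catalog, ACLs, ...)
consul snapshot save backup.snap

# Sanity-check the snapshot before you ever need it
consul snapshot inspect backup.snap

# Restore into a freshly bootstrapped cluster
consul snapshot restore backup.snap
```

Automating the `save` step on a schedule, and periodically test-restoring into a scratch cluster, is what makes the "deploy a new cluster and restore" story actually work under pressure.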
Vault is the most popular secrets manager out there, and has a lot of advantages over something like Kubernetes secrets (which aren't even encrypted). Nomad is a bit obscure, but IMHO it seems to be gaining momentum.
Consul on the other hand isn't used widely outside of some startups and a few OSS systems like Grafana Cortex/Loki.
When it comes to which one has the most production hours under demanding workloads I think ZK comes out miles ahead of everything, then etcd (because of k8s) then in a very distant third is Consul.
etcd doesn't compare directly to ZK and Consul, because it's just a distributed KV store. You can build service discovery on top of it with custom tooling, but it doesn't come out of the box. Consul does that + service discovery + service mesh + (since last week) API gateway. On one hand it's lots of stuff packed together; on the other hand it's useful. I started with it for SD, but when KV was also needed it was easy to add.
I'm not too familiar with ZK, but isn't it also mostly a distributed KV store? Does it have things like Consul's health checks, prepared queries, DNS interface, etc?
Consul is more widely used than that, like Cloudflare and Criteo at huge scale.
In every place I worked in the last 5 years, we did not even consider ZK because of the setup and all the components it requires.
In most other cases it's just people not turning on auto-purge and having disks fill up with snapshots.
So it varies depending on your team's experience with ZK and/or whether you are using some sort of community-managed deployment primitive that has sensible defaults baked in.
If you only need service discovery, you can probably get away with whatever your platform provides (the EC2 API, the k8s API, hell, DNS works, etc.); similarly, if you only need slow master election, there are alternatives there too (DynamoDB with conditional writes, a k8s ConfigMap).
However, for what these systems do best, i.e. very high performance distributed locking and consistent metadata with fast reads, watches, and relatively fast writes, there is no replacement. You need one of ZK/etcd/Consul, or something else based on Raft/Paxos/ZAB.
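The primitive that makes these systems hard to replace is an atomically versioned compare-and-set. Here is a toy, single-process model of that primitive, not any real client API; real systems replicate it across nodes via ZAB/Raft, for which a mutex stands in here:

```python
import threading

class ToyKV:
    """In-memory sketch of the versioned CAS primitive ZK/etcd/Consul expose."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (value, version)

    def get(self, key):
        with self._lock:
            return self._data.get(key, (None, 0))

    def compare_and_set(self, key, value, expected_version):
        """Write only if nobody else has written since the caller last read."""
        with self._lock:
            _, version = self._data.get(key, (None, 0))
            if version != expected_version:
                return False  # lost the race; caller must re-read and retry
            self._data[key] = (value, version + 1)
            return True

def acquire_leader(kv, me):
    """Naive leader election: succeed only if the key is unclaimed (version 0)."""
    return kv.compare_and_set("leader", me, expected_version=0)
```

Locks, leases, and leader election are all thin recipes over this one operation plus watches; the hard part, which this sketch entirely omits, is keeping the version counter consistent across machines and failures.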
EDIT: I realize now you might be talking about Nomad/Vault also.
So in this case I think they are running their own DCs on bare metal, and they opted for Nomad over k8s. Vault is somewhat special because it has capabilities no other secrets system does out of the box; specifically, it has integrations to create short-lived credentials on the fly for clients. So-called dynamic secrets have many advantages, and they're why services like AWS STS are so popular (the magic sauce behind AssumeRole).
Nomad is a great orchestrator, with lots of integrations (e.g. it can just run JARs, or Firecracker microVMs). IMHO, feature-wise it's better than AWS ECS (the only cloud orchestrator bar Kubernetes I've used; can't talk about the others), and gives Kubernetes a run for its money on many fronts (native templating, more flexible networking, no YAML, not restricted to containers, etc.).
I wrote about it some time ago, you can take a look if interested:
That's definitely a problem you only get at scale.
Furthermore, the services were down long enough that the asset caches went fully cold, so spinning back up to 100% capacity would put far more load than usual on the servers, meaning recovery needed to be a very slow, incremental rollout to warm them.
Messy situation all around, but if you want the real root cause, pay attention to the Consul changelog over the next few weeks, I'd say.
> We didn’t want to choose any technology that requires the company to drive deep expertise, almost to the point where you have to be a code contributor back into the project to get what you want. Nomad is just very easy to adopt.
Better be damn sure you have your 24/7 vendor support contracts in order if and when shit does hit the fan.
That's a beautifully concise quote which neatly summarizes what contemporary IT values.
Wrong abstractions and too much abstraction can definitely both also be very bad.
You don't really need to understand the citric acid cycle just to operate on someone. Yet we keep educating surgeons. Is that a bad thing?
Their mistake here is thinking that because you can understand the code of a higher-level service, you somehow don't need deep knowledge of its dependencies.
That is a deeply naive view and they paid the price.
Vault is secrets management. Put secret strings in, get secret strings out (if permitted). Most apps need secrets of some type, and Vault is normally discovered via Consul.
Nomad is an application/workload scheduler. You tell it what you need to run and what their memory/cpu requirements are and it finds a space in your physical infrastructure to run that. The apps it runs normally needs secrets from vault and communicate with services discovered via consul.
They are all well integrated and build off of each other, kind of like the layers of an onion, where your app services are the outer layer. Consul failing this badly is like the core of the onion going rotten. There’s not much saving it at that point; you need to grow a new onion from the inside out, but that takes time.
How does this differ from DNS?
Where I think companies are going about this wrong sometimes is thinking having their networks be software-defined and handled by some all-in-one product suite means they no longer need network engineers and can they just rely entirely on application developers who specialize in general programming and fall back to vendor support contracts if anything gets too confusing.
Admittedly, there's some self interest speaking there, because I work as a consulting engineer for one of these vendors where we go way beyond "support" to embed full time in external product teams with a dependency on our products (though we also offer basic support for companies that think they can get away with it).
But to my mind, no software suite can let you get away with not needing any kind of IT ops at all. Smaller companies may assess that the risk is worth it to focus solely on product, but as far as I can tell, Roblox is not a small company (or shouldn't be, given the traffic scale of their platform).
Even if Consul provides some sort of liveness (due to latency and concurrency it means little), I can't think of a usable case where the clients won't have retries, the ability to maintain multiple open sockets, etc.
For those unaware, the feature film VFX/Animation industry has been global, and performing large-scale, technologically ambitious projects requiring multi-company tech collaboration, for decades. And not just with code, but with assets too: visual assets of all kinds, audio in every possible form, and in some cases legal documents with new approaches to industry issues. These media studios also tend to have proprietary workflows, so there are very sophisticated file formats to contain all this information in agnostic ways. All this deep collaboration across creative technical organizations that do not trust one another has produced scalable solutions of which the web and formal software development are completely unaware.
For example, a film compositor I knew, now dead, that was revolutionary, called Shake: it pioneered both offloading heavy compute tasks to the GPU (not just graphics, but what is called scientific computing now), and it hid the GCC compiler inside itself and used a macro-transformed version of C as its "scripting language" that was actually hot-loaded C++ DLLs compiled on the fly. It was also the first "node-based programming environment", with graphical nodes the end-users connect with splines to define the I/O between the nodes.
If you are seriously interested, find someone working in the industry today and ask them. That industry has been changing a lot. Since I left, the VFX/Animation fields have been moving towards more framework-like production environments, similar to how the web uses frameworks. The issue with these frameworks is that they define the tasks to be performed, and those tasks are simplistic and rigid, meaning the actual production work has eliminated, in as many places as possible, the requirement for an art degree. The VFX/Animation industries are driving production towards something approaching, more and more, the work of a burger flipper. The processes are being standardized and reduced in complexity so the studios can hire non-artists; they can hire anyone and work them like an automobile assembly line. Twenty years ago this was starting, and that is when I left.
Consul: It's used to help manage cloud networking so that your application doesn't have to worry about IP addresses, datacenter locations, punching through firewalls, etc. Think of the situation where you have a zillion microservices talking to each other on different machines- it makes it easier for them to find each other. It also includes a distributed key-value store
Vault: If you've used a password manager in your browser, imagine that but distributed and on steroids. You can use it to share credentials with groups. It also has some APIs to help with encryption/decryption and includes a key-value store
Nomad: Roughly similar to Kubernetes (although with good support for non-containerized software). It's used to orchestrate software. For example when you have a bunch of programs you want to run 24/7 , you can specify what machines they should run on (for example at the datacenter/region level), what resources to give the programs, how to handle hardware failures, etc.
https://www.consul.io/ https://www.vaultproject.io/ https://www.nomadproject.io/
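To make the Consul part concrete, the key difference from a static DNS record is that the catalog is health-aware: lookups only return instances whose health checks are passing. A minimal in-memory sketch of that idea (hypothetical, not Consul's actual API):

```python
class Registry:
    """Toy health-aware service catalog. Real systems (Consul) additionally
    run the health checks themselves and replicate the catalog."""
    def __init__(self):
        self._services = {}  # service name -> {address: healthy?}

    def register(self, name, address):
        self._services.setdefault(name, {})[address] = True

    def set_health(self, name, address, healthy):
        self._services[name][address] = healthy

    def resolve(self, name):
        """Unlike a plain DNS lookup, only healthy instances are returned."""
        return sorted(addr for addr, ok in self._services.get(name, {}).items() if ok)

registry = Registry()
registry.register("web", "10.0.0.1:8080")
registry.register("web", "10.0.0.2:8080")
registry.set_health("web", "10.0.0.1:8080", False)  # failed its health check
```

Consul also exposes this catalog over a DNS interface, so the answer to "how does this differ from DNS?" is mostly: the records update themselves as instances come, go, and fail.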
Not super familiar with Nomad, but it seems like that would not necessarily follow. For example, if etcd goes fully dark in a Kubernetes cluster, things will mostly continue running for a while, unless something also crashes the servers.
Also, as long-running services crash or reboot naturally, they need to be rescheduled or else your cluster slowly dies; and as your cluster size increases, the MTBF decreases and the need to reschedule workloads continually increases.
Vault does support other backends, like postgres, etcd, zookeeper, and others. Though if Consul is the backend, Vault is also registered as a service within the Consul mesh.
$ curl -i -L https://blog.roblox.com/2021/10/update-on-our-outage/
HTTP/1.1 301 Moved Permanently
Content-Type: text/html; charset=UTF-8
X-Powered-By: WP Engine
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Cache-Control: must-revalidate, max-age=7108
Expires: Mon, 01 Nov 2021 05:39:59 GMT
Date: Mon, 01 Nov 2021 03:41:31 GMT
> The page isn’t redirecting properly
> An error occurred during a connection to blog.roblox.com.
Most likely your edge node and your client side have cached something that's not ready yet.
Excellent. That should be a very interesting read.
> In addition, we will implement a policy to make our creator community economically whole as a result of this outage.
Good to hear.
That must have been quite a year in the White household!
Mad props for this. Who else would do such a thing for an outage? I wonder how they will do this of course. The devil will be in the details :)
And the postmortem is basically like "yep, can't really blame anyone for not knowing this weird-ass thing or unfound bug".
Because the business had been around for a while, there were two separate ways to deploy and manage apps: old and new.
We reconfigured how code was deployed within and between racks for increased resistance to racks or even whole colos dying. This change was made in the new deployment system. And all was well for most of a year. Software and services gradually migrated from old to new as people had the time and inclination.
The new deployment also packed services more tightly, so most computers kept copies of most binaries on them.
Then we crossed one of those tipping points. On a code update, several thousand servers across the world demanded a small set of binaries from a distributed data node via the newer deployment method. Which would be fine, except we had degraded service from a couple TOR routers due to a separate bug. And so the distributed data nodes hosting this particular set of binaries were less "distributed" and more "just one poor about to be overloaded machine and network."
So a few thousand machines demanded a couple hundred megs nearly simultaneously. And when they didn't get it in a timely fashion, all triggered a rollback. Simultaneously. Which demanded another couple hundred mb of code. Simultaneously. And then the freakouts started.
And because other deploys were in flight, there was more than one service deploy ongoing. And they all started fighting each other for bandwidth.
1) There are a lot of dimensions
I think it is fairly common to focus on a throughput-related dimension like requests per second and test that very thoroughly, while not paying enough attention to other aspects. We have services where, if we were told requests per second were going to go up 100x tomorrow, I wouldn't be too concerned, but if we were told request size were going up 2x, I'd be freaking out.
Even when you think you've identified all of the dimensions, there are usually ones you missed and/or they interact in weird and unexpected ways.
2) Performance can appear to be linear when it really isn't
So many outages I've seen have resulted from everything being fine until some threshold is hit, at which point it doesn't slowly degrade but instead immediately explodes. Often this is due to feedback loops (GC activity in garbage-collected languages can behave this way) or because some cache overflowed (data queried at a high rate no longer fitting in memory on a DBMS comes to mind).
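The cache-overflow cliff is easy to demonstrate: under a cyclic access pattern, an LRU cache goes from a near-perfect hit rate to a 0% hit rate when the working set grows by a single item, because each access evicts exactly the entry that will be requested next. A small self-contained illustration using `functools.lru_cache`:

```python
from functools import lru_cache

def hit_rate(cache_size, working_set, rounds=100):
    """Cycle through `working_set` keys `rounds` times against an LRU cache
    and return the cache hit rate."""
    calls = 0

    @lru_cache(maxsize=cache_size)
    def fetch(key):
        nonlocal calls
        calls += 1  # counts misses, i.e. actual backend fetches
        return key

    total = 0
    for _ in range(rounds):
        for key in range(working_set):
            fetch(key)
            total += 1
    return 1 - calls / total

fits = hit_rate(cache_size=100, working_set=100)    # ~0.99: only warmup misses
thrash = hit_rate(cache_size=100, working_set=101)  # 0.0: every access misses
```

That step from 99% to 0% is the nonlinearity: backend load doesn't grow 1% when the working set grows 1%; it grows 100x.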
And the remediation procedure is quite simple too: scale down. If the fleet can't handle it, increase your throttling. If even with increased throttling you can't handle the load and everybody's timing out (but you still can't scale out), you start spinning up new load balancers and applying weights in DNS, so that at least some percentage of your callers get through while others see full failure.
The key point is in anything related to such an outage, even if it takes a long time to recover, your customers shouldn't see a full outage.
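The DNS-weighting step above amounts to weighted random selection of endpoints: most of the weight goes to healthy load balancers, so only a controlled fraction of callers lands on the degraded path. A hypothetical sketch (endpoint names and weights are made up):

```python
import random
from collections import Counter

def pick_endpoint(endpoints, weights, rng=random):
    """Weighted selection, approximating weighted DNS records at the resolver."""
    return rng.choices(endpoints, weights=weights, k=1)[0]

# ~90% of clients land on the new LBs, ~10% still hit the draining one.
rng = random.Random(0)
picks = Counter(
    pick_endpoint(["lb-new-1", "lb-new-2", "lb-old"], [45, 45, 10], rng)
    for _ in range(10_000)
)
```

Real DNS gives much coarser control than this (resolver caching, TTLs, clients that ignore them), but the principle of shedding a fixed fraction of traffic is the same.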
So it was either an entirely different class of problems, or Roblox has a huge ops skills gap. I'm betting the former.
edit: never mind, maybe it's the latter after all: https://twitter.com/NIDeveloper/status/1454773313792880640
Anybody optimizing for competence is exchanging time for food and shelter, and just gets reminded that they are playing a much harder game, one that produces much less money.
$50B is hella hard tho. It's hard work to just get to $10M (w/o outside money) and even with outside money 100M is still very hard.
Didn't realize I struck a chord here.
I’m literally making fun of the people that would try to draw this distinction and … you showed up?
Sometimes I forget that the market reality changes faster than cultural conditioning
I'm unable to open the link and seems Wayback Machine cannot archive it either.
Friday and Saturday are critical days for getting revenue from any Halloween-specific events we may do, such as a Halloween update or discount.
I'm happy to see Roblox is refunding adverts, especially after seeing how much support this has.
Which piece was overwhelmed and in which piece of backend did the subtle bug exist?
A deeper post-mortem would be nice. An actual description of the problem seems like SRE table stakes for outages.
Now, a header image that takes up my entire 15" laptop screen before I can even see the title of the blog post. Impressive
That seems to be a great thing if you can sit at the edge and scale the number of connections coming in. I just wish our system could do that.
So most likely their most senior SREs/infrastructure folk were the ones sweating bullets.
> A core system in our infrastructure became overwhelmed, prompted by a subtle bug in our backend service communications while under heavy load.
Translation: Something important went down; we couldn't trace the bug.
>This was not due to any peak in external traffic or any particular experience.
Translation: We're not blaming Chipotle for this.
> most services at Roblox were unable to effectively communicate and deploy.
Translation: We couldn't throw more compute at the problem to manage the traffic, or put up a failover copy of anything.
>To the best of our knowledge, there has been no loss of player persistence data
Translation: We literally restored our entire infrastructure from scratch, probably from our last set of backups. You might have lost some stuff, but we don't know what you lost, since we couldn't get the last copy of our databases.
What a wild extrapolation. The line in the blog post is not them writing to hint at what might have happened, it's so that the customers reading don't think this means they lost any of their data/items/etc. The Roblox marketplace is vast and high-volume; if anyone lost <any> data from them restoring a backup, we'd already know about it.
> Rather the failure was caused by the growth in the number of servers in our datacenters.
Their config server ran out of sockets and screwed up everything. Or something like that. Too many servers = nothing works has many ways to unfold and isn't obvious the first time.
The best advice about scaling I once read (might have been right here on HN) is that many devs know their primary bottleneck, be it the "master DB", a third-party API, the login service, etc. But few devs know the second-slowest bottleneck in their service stack. Not saying this was the case here, but it's always good to think (at least as a mental exercise) about what comes after your primary bottleneck.
If it's a long-standing bottleneck, one day it gets fixed (SW update, better architecture) and suddenly everything goes down, since now you're hitting a new bottleneck you never even thought about scaling.
Yet they still take a 30% cut on all transactions that go to content creators / developers
Saying this as someone who doesn’t know much about Roblox but does give 15-30% to Apple and Google. Google definitely doesn’t respect their developers. Apple kind of does.
Though I was mostly talking about customer service part and development relations in terms of that.
Google regularly bans developers for absurd reasons or no reasons at all, doesn't let them talk to any humans until the developer causes enough twitter storm and gets some tech sites to write about it to get some attention from Google. Apple at least lets the dev talk to human beings.
> Just wait for these zoomers to reach voting age.
Ah yes, 50.
Retail stores maintain a physical presence and have actual overhead, as do online retailers that have to maintain warehouses, subsidize free shipping, etc. What is the overhead on an app submission, or an app purchase? Charging 30% overhead for $0.0012 cents of bandwidth, an algorithmic review of your app that only might be supplemented by an actual human review, and a literally free file copy operation comes nowhere close to that. This is a lot closer to payment processor fee territory in terms of the actual work being done by the provider.
You would think they would do more for their app developers with such a hefty fee, but they don't, so they need to be regulated. These fees are just theft.
In the case of Roblox, you can argue "but what about the platform?" but 5% is still a metric-crap-ton of revenue and is more than enough to keep their platform afloat (especially if they don't have to give 30% upstream to Apple/Google). We've just become accustomed to ridiculous % cuts because monopolies have normalized this practice. In the early 00s, the idea of an app store even taking a cut was viewed as laughable. Then it became normal.
Afloat? Maybe. Seems doubtful. Anyway, the justification for app stores charging developers is always providing the Platform. Android is a free platform. You can buy an Android phone from a manufacturer of your choice without Google seeing a cent of that money. Their way of making money is to use rents on their application store.
What I'm getting at is that they produce a product. A valuable product, given that many people enjoy it. Under standard American assumptions about free markets, this means they're entitled to a reasonable profit, not just enough to "keep their platform afloat". Profit is not theft.
Think about a company that sells you a printer at-cost in the hope that it will pay for itself in profits when you buy ink. There's usually third-party ink that will work in the printer, but the business model of the company is that there will be enough people who buy their ink to make cheap printers profitable. There's a, I don't know, 150% markup on the ink.
In other words, they have an overall profitable business, where one part of it is a loss leader and another is where they extract rents. Nothing particularly surprising about this. The app store model is exactly the same: most parts of the platform are free (despite very real development costs), and store developers pay a fee of 20-30% which keeps things running. In fact, if the developers reduced the fee but then charged users to use the platform, the overall profits would probably be reduced, as fewer users would use the platform.
I'm skeptical, in the end, that the profits earned by companies like Roblox are terribly out of keeping with those earned by retail stores or computer manufacturers. You cannot separate out the parts of the business that are pure profit centers from those that are loss leaders. The fact that the profit on the profit centers seems exorbitant has to be taken in context.
There is nothing Apple (or Google, or Robolox) does to justify more than a 5% cut. 5% is generous. In many cases they don't even take the time to have a human review every submission. You can say they built the platform, but aren't we already paying for that with our 5%?
And to your point, yes, when Zoomers reach 50, we'll see awesome things like companies losing the rights to their IP if they submit 3 bogus DMCA claims, extreme regulation of things like content-ID that threaten the existence of fair use in practice, outright banning of all lobbying and corporate political donations, things like this, etc etc.
15 or 20% is probably a good middle ground.