An Update on Our Outage (roblox.com)
358 points by maxilevi 6 months ago | 226 comments



My heart goes out to the people who likely had to work crazy hours to fix this, but it really is wild that it was down for so long. What was the last service of this size that was down for 4 days? That is a failure in architecture that goes way beyond whatever the specific cause was here. That post-mortem is going to be a doozy.


Not necessarily directed at Roblox, but honestly I'm surprised this doesn't happen more often, given how many teams I see running software they don't understand or whose source code they don't even have access to.

Edit: but yeah, must've been a tough 72+ hours. I hope their version of a reliability team can use this to get the support they need out of management, and not get scapegoated instead.


So true. And also at the systems level. In most places I've seen, the #1 priority is hitting arbitrary executive feature/date goals, not maintaining robust systems. At some point, the shit will hit the fan, causing a "Why didn't you do perfectly the thing that wasn't a real priority?!?" reaction and a temporary lurch toward robustness. Although often the lurch will be less about actual robustness and more toward performative addition of control mechanisms like more layers of review.

I remember one large company I consulted for in the mid-aughts. Before I got there, they had an outage so severe it was on CNN and caused a short but notable dip in their stock price. By the time I got there, the ops people were absolutely dominant. All changes had to go through their review board, and woe be unto any project that they raised an eyebrow at.

Of course, the real problem was that developers were scheduled out 18 months in advance to work on a list of projects they hadn't been involved in estimating and where often they hadn't seen the code before being swapped onto the project. Dates were absolutely not allowed to slip, so it came out in frantic overwork, which meant a lot of bad/confusing code. Meaning more likelihood of failure and no time to think about systemic issues and impacts.


I worked at a big fintech company about to go public in a SPAC worth $4 billion. They were a unicorn when I was there. Literally everything was in a GCP MySQL database... and they refused to pay for a hot spare. They had backups but nothing for redundancy. We had downtime almost every week.


Guessing it's SoFi.


SoFi's devops function is extremely robust, if overcomplicated, and they've got redundancy for all their application servers and production databases. They're also an AWS shop, not GCP.


Lol, I can just hear the politicians now.

"Tough on bugs, and tough on the causes of bugs!"

As they add even more time-sapping process to every release.


At my company there was a service outage that lasted 2-3 weeks for a specific feature we have. This was caused by everyone quitting and no one having any experience with this service. The rest of the application remained working so it wasn't so noticeable to the outside world. But internally and for customers it was massive since it was the billing system that went down.

There was another incident, which took down everything, caused by a Spring Cloud config being changed that no one in the company had access to.


It’s been known for a while that human communication is the real impediment to technical development.

Companies I’ve worked at that sucked had awful internal communication. It was all very friendly, but it was all platitudes and euphemisms, dumpster fire technology implementation.

Phone and web apps are basically librarian work these days. If a business's tech stack is having issues, it's human communication that's the real problem.


> Companies I’ve worked at that sucked had awful internal communication. It was all very friendly, but it was all platitudes and euphemisms, dumpster fire technology implementation.

Honestly, I think part of the problem is people talking so much about soft skills and how people don't like to be treated harshly that they've forgotten you also need hard skills; in nearly every profession that needs real leadership, you get told where you screwed up and how to fix it. Literally, at the company I work for, people screw up and all you hear about is positive things.


I think it happens all the time. There just aren't that many services, relatively speaking, that you'd hear about when this happens. The more-popular services (that you'd hear about) probably already have a bias toward having their stuff better in order, since it's often also really just a matter of making the right investments.


> really just a matter of making the right investments

Right. But the person deciding on those investments is a product owner who's looking to release new features because that's what management wants. At least that's my experience more often than not.

My guess is that the bias might be the opposite. Maybe these companies become more popular because they are reliable and have the mindset to focus on that.


By the time a company gets large enough to be a household name, they have usually been burned a few times and have someone technical who can push back against short-termism from POs.


Either you listen to and trust your people or you don't.


This is a benefit of working remote: I work at Automattic and have worked on teams where the members are literally around the world. When shit hits the fan, people are able to hand off this kind of work to people who are fresh after a good night's sleep, and no one needs (or should need) to work crazy hours.


Automattic always sounds awesome to work for - I know a couple of people who work there as well.

In your experience, when something like this happens, what can leaders do to help in the moment? I'm an engineering manager, and spotting when people are getting over-stressed is much easier when you're physically in the office. Working remotely, it's easy to slip into the trap of either assuming no news is good news, or the opposite approach of asking "are you OK?" every 2 minutes. The only solution I've found is to all sit on a call and pay attention to people's tone of voice, but I don't know whether that actually contributes to people's stress levels.


One of the greatest values of a good technical manager is to run interference and shield your team. Establish a chain of communication, or if there isn't one in place, BE the active reporter, to ensure your hands-on-keyboard people aren't interrupted every 5 minutes (and this is a realistic timeline) for updates from various managers, execs, C-levels, and so on.

Otherwise, it's mostly things you can do before and after an incident like this.

Depending on your role and place in the hierarchy, yes, DO sit on calls, even if you are quiet: to understand what's going on, to help prioritize if needed, to see if your experienced "spidey sense" catches some risk others may not even though they're closer to the problem (or precisely because of it), and to encourage, guide, and focus. But do all of that without being overbearing or making them feel watched. You need to feel like an ally, part of the team, a friend, as opposed to a political officer. This, again, is best achieved ahead of time by always demonstrating your value to the team by being supportive and effective.


The honest truth: if you're an engineering manager and you're no longer anywhere near the code, then in the heat of the moment there really isn't much you can do, aside from what you've been doing: listening on the call and doing a minimal check-in.

The main thing you can do, after it's all over, is to make people feel awesome and give them recognition in all the ways possible: a bonus, extra vacation days to compensate for the crunch time, etc.

But if you still have your technical chops - jump in the trenches and debug!


Best thing managers can do in these situations is deal with internal politics - be the gateway for information going higher up the chain. The last thing the actual engineering team needs is direct calls from C-level execs.


There, all communication and coordination is done via text (Slack/IRC). As long as you're following along, I think you can spot it pretty well. People are also more used to managing their own stress and time. They will let people know if they need to step out for a few minutes so the workload can be shifted/paused to give them time to step away.


This is a benefit of a global company, not working remote. You could do all that stuff in multiple offices around the world.


In a global but not-remote company the team responsible for the particular service that failed is probably still concentrated in one office. Bringing in people from other offices who aren't familiar with the problem service probably isn't that helpful.


This has nothing to do with remote.

Remote teams are usually concentrated in the same timezone too. In-office global companies make a point of having at least a few teams owning a service for this very reason.


You can have knowledge silos in either kind of company. We didn't magically get better at spreading work around or writing documentation just because we went remote.


From my experience there are a few engineers that are usually key to fixing such issues (not always the same people, depending on the failure). Handing off work is more for day-to-day work like handling a support ticket. So the key engineers need to work crazy hours when things like that happen. But that's OK; ideally these things are rare and people need to work crazy hours only in those rare situations. I'm super protective about my evening/weekend time, but I wouldn't have any problem working the entire weekend like crazy when something like this happens. If it happens too often I will consider finding a better place to work.


At Automattic, we handle this by offering a 3-month sabbatical every 5 years and a "minimum" 20-day-a-year vacation policy. The idea being that no one person should hold the knowledge of key systems for too long.


I wonder if Engineers from Roblox are now worth a little more just because they have this experience.

I also think it is time that people should take a look again at Chaos Engineering [1] from Netflix. It is sometimes ironic that the best technology often comes from companies that aren't a technology company at all.

[1] https://principlesofchaos.org


Are you implying Netflix is not a tech company?


I definitely see Netflix as a Media Company, much like Disney. Disney and Pixar have great tech too, but they are not tech companies in any shape or form.

How Netflix got lumped into "FAANG" aka Big Tech is still a mystery to me.


Now, probably, but at the time the FAANG acronym was conceived, Netflix Originals and other productions run internally were a drop in the ocean compared to their licensed content, similar to how Apple controlled music (before streaming), so it made sense.


It's hard to define any company as a 'tech company' if you're willing to define Netflix as a Media Company.

In some sense, I agree - Netflix, Disney, and Pixar are media companies that use technology to distribute, to recommend, and to produce their content.

But in the same sense, Google is an advertising company that uses technology to sell and to show their ads (and they have a few loss-leading tech ventures like search, mail, a phone OS, and a browser - all of which they give away to more effectively show those ads). Facebook is another advertising company, they happen to show and profile their ads using their social network tech. Amazon is a retail sales, warehousing, and product distribution company that happens to use a lot of technology. They also have a "tech company" division that sells web services, but that's only 13% of their revenue.

Similarly, Walmart is a retail sales and warehousing company (as well as product distribution now, but most recognizable by their brick-and-mortar stores) that uses technology in every aspect of buying product, shipping it to stores, deciding where and how much of it to stock, how to price it, and when to restock it, just like Amazon. There was a time when Sam Walton walked around, squinted at the shelves, and made notes about what and how much to order on a clipboard, but that's long past - it's all tech now. Automotive manufacturers are other very big tech companies, they build and sell automobiles but every step of that design, fabricate, market, and sell process uses technology.

Apple, at least, sells technology hardware. Microsoft also sells technology hardware and software, but they did so before FAANG were big so they're not in the list either.

The list is arbitrary. Software and technology is eating - has eaten - the world, and management of it is critical to every business.


Because FAANG was coined as a term for super high growth stocks, not for anything to do with technology.


Two ways I would argue for Netflix being a tech company.

A critical question is, how and how much do they pay their developers (and other tech folks)? I believe Netflix compensates their tech workers more richly than the average media company. I am certain of it if you take an average over the past 10 years.

The other argument is, when we say tech, we really mean qualitative innovation. Did Netflix disrupt an existing 800 pound gorilla using technology? The answer is yes, they killed Blockbuster by embracing tech.


NFLX was founded in 1997. They didn't have original content until 2013.

Before their original content I seriously considered shorting their stock. Their business model without original content was structurally flawed. They effectively had a maximum profit they could earn. Literally anything they did to increase their profit would be taken by the content producers they were forced to buy the licenses from.

Creating original content was a necessary step to survive just like pivoting from their original DVD mailing business to streaming. They're still a technology company though.


By that logic Facebook certainly wouldn't be a tech company either.


I wonder how long it will take until people adopt their new name and switch to MAANG.


The new official acronym on the street is MAMAA, by the way: Microsoft, Apple, Meta, Amazon and Alphabet.

Jim Cramer announced this on Friday.


might as well go for MAMAG at that point, swapping Netflix out and Microsoft in. Or MAMANG. Or even AMMANG. Possibilities are endless. :)


MAGMA. You heard it here first


Why not? Didn't Facebook create tools like GraphQL and React?


Netflix has created tools and open source libraries too.

RxJava, the de facto standard for reactive programming in Java, was open-sourced by Netflix. They also open-sourced a lot of their work on AVIF files, which has managed to find its way into Chrome, among other projects.

Among other things.


They do occasionally use BSD-licensed software so they can choose what/when/if to open source.


Facebook doesn't consider itself a tech company. But there isn't any other clearly defined category they fit into. Social Media isn't one.


> How Netflix got lumped into

Otherwise it would be a slur.


Since the idea of a "tech company" comes up so often, I'll just leave this here:

https://news.ycombinator.com/item?id=27693634


And I wonder if their tech management is worth less for this clusterf** to happen during their oversight?


This is how I felt reading his comment. Were Volkswagen emission engineers and the captain of the Exxon Valdez worth more after they screwed up?


Not a software outage, but the Texas ice storm of February resulted in me not receiving mail for two weeks, not being able to use the streets for at least a week (they never plowed and it just eventually melted), rolling blackout where I had power for about four hours a day, and no Internet service at all for the first two days. No water for most of the first day.

I'd argue keeping basic city services and roads operational is more important than a gaming platform, but city and state governments in the US often seem to not agree and are more than happy to save money skimping on maintenance and resilience.


That behavior is a reflection of their citizenry who have been convinced by propaganda that government can never be effective at anything and therefore should not be funded properly.


>”citizenry who have been convinced by propaganda that government can never be effective at anything and therefore should not be funded properly.”

People are treating the Texas ice storm like it was some comeuppance event but the truth of the matter is that Texas rarely experiences this kind of winter weather and that’s the reason why it didn’t handle the big freeze the same way as a midwestern or northeastern state would.

The reason why cities like DFW do not maintain a fleet of snowplows isn't because of anti-government propaganda, but because it is completely uneconomical to do so. Even during the big freeze there wasn’t a foot of snow on the ground. The major roadways were salted, though, and you could drive on them.

Snow and icing events tend to only happen for a day or two each year in my part of Texas. And it is often the case that when temperatures do dip below zero there isn’t preceding rain or snow. But when that does happen, most people just ride it out for a day or two and wait for things to melt.


NERC created a report after the Texas 2011 ice storm power outage with recommendations for ERCOT generators to improve their weatherization standards [0]. In that report they referenced not only the 2011 incident but previous incidents where the grid generators failed to adequately prepare for potential problems due to weather related issues.

In the 2021 storm, hundreds of people died. Comeuppance is not the right word. I don't know the right one, but I believe it's closer to willful blindness or predatory delay.

[0] https://www.ferc.gov/sites/default/files/2020-04/08-16-11-re...



So. Do you think Texas should winterize or not?


Skimping on resilience has been done for decades as part of "cost cutting," in large part because even when we had staffing for resilience, we found that people would collect their check and not actually do the maintenance.


In the game industry, working yourself to death for 4 days, that's called a hot fix ;)


And that is after you've been working 6 days a week for the last 9 months. As an EA spouse survivor - fuck that industry.


“Spouse survivor”, I like that! I worked on the tech/distribution/data reporting side of the Covid vaccine and consider my wife, and all the sacrifices she made and extra things she took on while I was on the keyboard and phone, a critical component of my team's success.


"EA Spouse" is the name is a specific period during the early 90's when Electronic Arts burned out an entire generation of game developers.


Here is the source of the term:

https://ea-spouse.livejournal.com/274.html


People working on the cloud infra are not the people working on games.


My favourite was the Skype outage in 2007 where a peer to peer algorithm didn’t scale - they had to work this out, and change the algorithm to fix it. It took ages!

https://www.wired.com/2007/08/update-skype-ou/


The last outage of this kind I can even remember is the summer PSN was down for like, a month?


Although that PSN downtime probably falls into a different category, as it was a response to being hacked. Sony intentionally took the service down and kept it down to investigate and fix their security holes. Basically the equivalent of temporarily closing a brick-and-mortar store after a robbery, compared to being involuntarily closed because no one could find the right set of keys to unlock the store's front door.


Looks familiar to me also. Was Roblox also hacked?


From the blog post it sounds like no. They say a service got overloaded due to an increase in the number of datacenters and triggered a bug.


We have hospitals being knocked out for weeks after ransomware attacks. If we complain about roblox I think we should reflect on our priorities.


Why not complain about both?


You can; that wasn't the point I tried to make. I tried to point out the arbitrary expectations here. Depending on the fault, such downtime can last a few hours to a few weeks. I don't think you can draw conclusions about architecture in this case.

Perhaps it would be possible to roll back everything in just a few minutes but some users might lose a lot of work. If such special cases are to be resolved, it may take a lot of time. You just cannot really infer quality of architecture by downtime alone.


Not of this size or scale, but ransomware attacks typically take days to resolve. The pipeline attack and the large multinational vet company come to mind.


It won't be a doozy: it will be a few paragraphs of general statements without saying anything specific, just like this communication (the first of consequence in days, another big failure on their part).

Roblox has to start taking infrastructure a lot more seriously than they have so far.


> What was the last service of this size that was down for 4 days

Was "only" two days, but Fortnite just shutdown for 2 days as a part of an end-of-season event back in 2019 https://www.polygon.com/fortnite/2019/10/13/20911691/fortnit...



The immediate technical root cause and the true root cause (often down to poor decision making among the senior technical staff and management) are two different things, and I suspect the public and internal post-mortems both will focus on the former rather than the latter, sadly.


I am sure they got a lot of stock.


I mean I'd gladly pull an all-nighter if I were a millionaire thanks to the thing I'm fixing. I think. Then I'd quit and do something leisurely.


Over what time frame would you tolerate poor working conditions to make a million bucks? Four year equity vesting? A million dollars in 2025 too, not a million dollars in 2010.


Just one million? About 6 months, as I currently earn over $2MM per year from my options alone at the company I work for thanks to their IPO. Once those dry up it's going to be a hard sell for me to stay. Maybe I'll coast for a year on cash + RSU salary (which is much less than salary plus pre-IPO options) to top off my $5MM investment portfolio and ensure enough of it is invested to produce an income, but after that? No thanks.


My guess is Consul shit itself. It works until it doesn't.

For now I prefer Zookeeper because at least I have experienced most of its failure modes. I.e., they are probably all prone to blowing up, but I have lost my eyebrows enough times in ZK explosions that at least I know what I'm up against.

Consul isn't widely used enough for me to have the same confidence for now, and the same goes for Vault and Nomad tbh. I really like the design of Vault and its dynamic secrets system, but I am probably just going to implement something similar on k8s secrets so that I don't have to carry around something that might spontaneously combust on me.


> For now I prefer Zookeeper because at least I have experienced most of its failure modes. I.e., they are probably all prone to blowing up, but I have lost my eyebrows enough times in ZK explosions that at least I know what I'm up against.

Consul is insanely easy to fix if you keep backups. It is as simple as deploying a brand new cluster and restoring a known good snapshot.

Sadly, a surprising number of practitioners don't keep backups.
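
If it helps anyone, the whole workflow is roughly two commands. A minimal sketch, assuming the default agent address and no ACL token (adjust for your setup):

    # take a point-in-time snapshot of the cluster's Raft state
    consul snapshot save backup.snap

    # later, against a freshly bootstrapped cluster, restore that known-good state
    consul snapshot restore backup.snap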


Or backups failing silently in some manner, which isn't a "known" problem until you find out the hard way.


I don't know, anecdotally it seems to me that Consul is more popular than Zookeeper.

But yeah, you should go with the one you know best, especially at any sort of scale.

Vault is the most popular secrets manager out there, and has a lot of advantages over something like Kubernetes secrets (which aren't even encrypted). Nomad is a bit obscure but IMHO it seems to be gaining momentum.
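
To be concrete about "aren't even encrypted": by default a Kubernetes Secret is only base64-encoded at the API level (encryption at rest in etcd is a separate, opt-in feature), so anyone with read access can recover the plaintext. Rough sketch, with a made-up secret name and key:

    # prints the plaintext value of the "password" key in the "db-creds" secret
    kubectl get secret db-creds -o jsonpath='{.data.password}' | base64 -d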


ZK is vastly more popular, it's just much less sexy so people don't write about it. Every production Kafka, Spark, Hadoop, Pulsar, etc cluster is using ZK. Just the Kafka and Hadoop clusters alone probably allows it to rival etcd in footprint.

Consul on the other hand isn't used widely outside of some startups and a few OSS systems like Grafana Cortex/Loki.

When it comes to which one has the most production hours under demanding workloads I think ZK comes out miles ahead of everything, then etcd (because of k8s) then in a very distant third is Consul.


ZK isn't sexy, we agree. IIRC Kafka no longer requires it? And IIRC it was one of the main downsides of Kafka, that it required a whole ZK cluster.

etcd doesn't compare directly to ZK and Consul, because it's just a distributed KV store. You can build service discovery on top of it with custom tooling, but it doesn't come out of the box. Consul does that + service discovery + service mesh + (since last week) API gateway. On one hand, it's lots of stuff packed together; on the other hand, it's useful. I started with it for SD, but when KV was also needed it was easy to add.

I'm not too familiar with ZK, but isn't it also mostly a distributed KV store? Does it have things like Consul's health checks, prepared queries, DNS interface, etc?

Consul is more widely used than that; Cloudflare and Criteo, for example, run it at huge scale.


Kafka still requires it, at least for all production clusters. In a few years the recommendation might be the new internal Raft based consensus though.


Consul is at least as widely used as ZK now seeing as anyone who wants to run a distributed system on Hashi tools needs to use it. It's much better designed, and easier to operate, than ZK. ZK, Solr, and other apache projects are pretty terrible in comparison.


A lot of people moved away from ZK because it used to be a nightmare to set up and maintain; ZK is not vastly more popular than Consul.

At every place I worked in the last 5 years, we did not even consider ZK because of the setup and all the components it requires.


Just curious, what are the common ZK failure modes?


In high-scale situations, they're somewhat common if you aren't on top of metrics, etc.

In most other cases it's just people not turning on auto-purge and having disks fill up with snapshots.

So it varies depending on your team's experience with ZK and/or whether you are using some sort of community-managed deployment primitive that has sensible defaults baked in.


This is hilarious. Please add a link to what you implement through k8s secrets; I'd be highly interested in its creation.


You can simply disable parts of the functionality if it's truly Consul that's responsible.


My stance is: why even have any of those? Are you running in a local datacenter?


Generally you need them even when running in a "Cloud" of some description. Namely because they provide distributed locking primitives at a speed that can't be matched by any other mechanism.

If you only need service discovery you can probably get away with whatever your platform provides (EC2 API, k8s API, hell, DNS works, etc); similarly, if you only need slow master election there are alternatives there too (DynamoDB w/ conditional writes, k8s configmap).

However, for what these systems do best, i.e. very high-performance distributed locking and consistent metadata with fast reads w/ watches and relatively fast writes, there is no replacement. You either need one of ZK/etcd/Consul or something else based on Raft/Paxos/ZAB.
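
As a rough sketch of what that locking primitive looks like in practice, Consul even ships a CLI wrapper around its session-based locks (the KV prefix and script name below are made up for illustration):

    # blocks until the lock under the "locks/reindex" KV prefix is acquired,
    # runs the command, and releases the lock when the command exits
    consul lock locks/reindex ./run-reindex.sh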

EDIT: I realize now you might be talking about Nomad/Vault also.

So in this case I think they are running their own DCs on bare metal and they opted for Nomad over k8s. Vault is somewhat special because it has capabilities no other secrets system does out of the box; specifically, it has integrations to create short-lived credentials on the fly for clients. So-called dynamic secrets have many advantages and are why services like AWS STS are so popular (the magic sauce behind AssumeRole).
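
A minimal sketch of that dynamic-secrets flow, assuming a database secrets engine is already mounted at "database/" and a role named "app-ro" has been configured (both are assumptions for illustration, not Roblox's actual setup):

    # Vault mints a brand-new, short-lived database user on every read;
    # the credentials are revoked automatically when the lease expires
    vault read database/creds/app-ro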


Having built many distributed applications I would hesitate to use ZK/Consul directly, mostly since your cloud provider is already providing most types of primitives directly or indirectly in the services they offer.


But most people use them because they're required by something else, like Vault, or Solr, or some other tool built to require a particular distributed key value store.


There's nothing intrinsically wrong with, say, ZK - it's very powerful - but my point is that you should avoid these systems if you can. Distributed consensus is not an easy problem, it's very error-prone, and if you do this more than you absolutely have to, you're doing it wrong.


You are 100% correct. Avoid at all costs but when you need it, well you need it. Usually if you find yourself in that boat the best thing you can do is just understand exactly why you need consensus and ensure scope of data managed in ZK never expands beyond that point.


Even on a public cloud, Vault is great for secrets and much better (has more integrations and is more widely supported) than the cloud vendors' equivalents (not to mention lock-in).

Nomad is a great orchestrator, with lots of integrations (e.g. it can just run JARs or Firecracker microVMs). IMHO feature-wise it's better than AWS ECS (the only cloud orchestrator bar Kubernetes I've used; can't talk about the others), and gives Kubernetes a run for its money on many fronts (native templating, more flexible networking, no YAML, not restricted to containers, etc.).

I wrote about it some time ago, you can take a look if interested:

https://atodorov.me/2021/02/27/why-you-should-take-a-look-at...


Someone on here posted the other day that someone they knew at Roblox said it was their secret store that became overloaded. Presumably a Hashicorp Vault type service (or something similar.) This update appears to support that claim.

That's definitely a problem you only get at scale.


Not far off. Having chatted with friends from the company, Consul failed due to an issue with streaming which was introduced in 1.9 (https://www.hashicorp.com/blog/announcing-hashicorp-consul-1...) which caused it to have massively reduced TPS in certain circumstances. No functional Consul meant no Vault, no Consul/Vault meant no Nomad, no Nomad meant no application servers.

Furthermore, the services were down for long enough that the asset caches went fully cold, so spinning back up to 100% capacity would put far more load than usual on the servers meaning recovery needed to be a very slow incremental rollout to warm them.

Messy situation all around, but if you want the real root cause pay attention to the consul change log over the next few weeks I’d say.


Choice quotes from their PR piece: https://www.hashicorp.com/case-studies/roblox

> We didn’t want to choose any technology that requires the company to drive deep expertise, almost to the point where you have to be a code contributor back into the project to get what you want. Nomad is just very easy to adopt.

Better be damn sure you have your 24/7 vendor support contracts in order if and when shit does hit the fan.


> > We didn’t want to choose any technology that requires the company to drive deep expertise,

That's a beautifully concise quote which neatly summarizes what contemporary IT values.


Abstracting things away is literally the entire point of programming.


Not understanding what the abstraction is for or what it is abstracting can be problematic when the abstraction breaks.

Wrong abstractions and too much abstraction can definitely both also be very bad.


You don’t need deep etcd experience to run k8s. Why is that bad?


You've never had to work a 3 day weekend to recover an etcd failure? Guess I'm just lucky


I’m on the vendor side now, so every day is helping customers recover from etcd failures caused by slow storage due to them not listening to documented requirements. ;)


You don't need more than high school science just to build a bridge. Yet we keep educating engineers. Completely unnecessary, as long as nothing breaks.

You don't really need to understand the citric acid cycle just to operate on someone. Yet we keep educating surgeons. Is that a bad thing?


To be fair Nomad failed because it's reliant on Consul. This would be the equivalent of having a k8s outage because etcd is down.

Their mistake here is thinking that because you can understand the code of a higher-level service, you somehow don't need deep knowledge of its dependencies.

That is a deeply naive view and they paid the price.


As someone not familiar with HashiCorp's products, could you give a super high level overview of what consul, vault and nomad are?


Consul is a service mesh. It’s your dynamic service discovery and routing layer. You have systems dynamically allocated in a cluster, they need to reach other systems, you ask consul where they are.

Vault is secrets management. Put secret strings in, get secret strings out (if permitted). Most apps need secrets of some type, and Vault is normally discovered via Consul.

Nomad is an application/workload scheduler. You tell it what you need to run and what the memory/CPU requirements are, and it finds a space in your physical infrastructure to run that. The apps it runs normally need secrets from Vault and communicate with services discovered via Consul.

They are all well integrated and build off of each other, kind of like the layers of an onion where your app services are the outer layer. Consul failing this badly is like the core of the onion going rotten. There's not much saving it at that point; you need to grow a new onion from the inside out, but that takes time.
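
To make the "you ask consul where they are" part concrete, here's a sketch against the local agent's HTTP API (the service name is hypothetical):

    # list healthy instances (address + port) of the "payments" service
    curl -s http://127.0.0.1:8500/v1/health/service/payments?passing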


> Consul is a service mesh. It’s your dynamic service discovery and routing layer.

How does this differ from DNS?


I'm not familiar with Consul specifically, but service meshes I am familiar with give you a lot more than just mapping of service names to dynamic IPs. They bundle things like TLS termination, reverse proxying, network policy enforcement, automatic certificate provisioning and key rotation, firewalls potentially aware of anything from layer 3 to layer 7.

Where I think companies are going about this wrong sometimes is thinking that having their networks be software-defined and handled by some all-in-one product suite means they no longer need network engineers and can just rely entirely on application developers who specialize in general programming, falling back to vendor support contracts if anything gets too confusing.

Admittedly, there's some self interest speaking there, because I work as a consulting engineer for one of these vendors where we go way beyond "support" to embed full time in external product teams with a dependency on our products (though we also offer basic support for companies that think they can get away with it).

But to my mind, no software suite can let you get away with not needing any kind of IT ops at all. Smaller companies may assess that the risk is worth it to focus solely on product, but as far as I can tell, Roblox is not a small company (or shouldn't be, given the traffic scale of their platform).


DNS gets cached all up and down the stack. Your OS, routers, application libraries and probably even your application code might cache it. Consul allows you to have it handle the caching, and be notified when a service goes away so it won’t return that address to your application.


I think service discovery is typically related to e.g., request routing in a way that DNS isn't. DNS could probably replace a lot of the ways service discovery is used today, but it would be a very non-typical setup, which is arguably worse than the custom things people use for it today.


Dynamic-first, better support for failover etc.. It's probably not doing anything that you couldn't build on top of DNS, but it's designed from the ground up for clustered deployments.


The thing can even provide DNS... but it has fancy interfaces and an HTTP endpoint. Technically you could do everything with DNS just fine.

Even if Consul provides some sort of liveness (due to latency and concurrency it means little), I can't think of a usable case where the clients won't have retries and the ability to maintain multiple open sockets, etc.


Consul's notion of a service includes port numbers, and while I'm sure you could hack something up using DNS txt records and default port numbers, it's an important distinction because it means multiple instances running on one box can easily be discovered.
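
For what it's worth, Consul's own DNS interface handles the port problem with SRV records rather than TXT hacks; each healthy instance comes back with its own port. A sketch, with a made-up service name (8600 is the agent's default DNS port):

    # one SRV record per healthy instance, port included
    dig @127.0.0.1 -p 8600 redis.service.consul SRV +short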


Really high level, it doesn't. It's an internal dynamic DNS, essentially.


One of their primary query interfaces actually even is DNS.


This dynamic service terminology sounds very familiar to someone with a few years in the feature film VFX/Animation industries - however, it would not surprise me if Consul/Vault/Nomad were created without a review of similar situations and solutions in other industries.

For those unaware, the feature film VFX/Animation industry has been global and performing large-scale, technologically ambitious projects requiring multi-company tech collaboration for decades. And not just with code, with assets too: visual assets of all kinds, audio in every possible form, and in some cases legal documents with new approaches to industry issues. Plus these media studios tend to have proprietary workflows, so there are very sophisticated file formats to contain all this information in agnostic ways. All this deep collaboration across creative technical organizations that do not trust one another has produced scalable solutions of which the web and formal software development are completely unaware.


My mind is continually blown when I read about the sheer volume of innovation from the animation industry over the last 3 decades. I would love to learn more about these scalable solutions you're talking about though, since I'm not familiar with them. Do you have any pointers?


I've not worked in VFX since '02, but I was both at Rhythm & Hues Studios through multiple VFX Oscars as well as an early 3D graphics researcher during the 80's. I have no idea which tools are still around.

For example, a film compositor I know that is now dead but was revolutionary, called Shake: it pioneered offloading heavy compute tasks to the GPU (not just graphics, but what is now called scientific computing), and it hid the GCC compiler inside itself and used a macro-transformed version of C as its "scripting language" that was actually hot-loaded C++ DLLs compiled on the fly. It was also the first "node-based programming environment," with graphical nodes the end users connect with splines to define the I/O between the nodes.

If you are seriously interested, find someone working in the industry today and ask them. That industry has been changing a lot. Since I left, the VFX/Animation fields have been moving toward more framework-like production environments, similar to how the web uses frameworks. The issue with these frameworks is they define the tasks to be performed, and those tasks are simplistic and rigid - meaning the actual production work has, in as many places as possible, eliminated the requirement for an art degree. The VFX/Animation industries are driving production toward something approaching more and more the work of being a burger flipper. The processes are being standardized and reduced in complexity so the studios can hire non-artists - they can hire anyone and work them like an automobile assembly line. Twenty years ago this was starting, and that is when I left.


Thanks for the explanation. I see how that would cause a cascading failure.


I haven't used most of these but here's my possibly-flawed understanding:

Consul: It's used to help manage cloud networking so that your application doesn't have to worry about IP addresses, datacenter locations, punching through firewalls, etc. Think of the situation where you have a zillion microservices talking to each other on different machines - it makes it easier for them to find each other. It also includes a distributed key-value store.

Vault: If you've used a password manager in your browser, imagine that but distributed and on steroids. You can use it to share credentials with groups. It also has some APIs to help with encryption/decryption and includes a key-value store
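
To make that concrete, basic KV usage looks roughly like this (the "secret/" mount is Vault's dev-mode default; the path and field are made up):

    # write and then read back an arbitrary key-value pair (KV v2 engine)
    vault kv put secret/myapp db_password=hunter2
    vault kv get -field=db_password secret/myapp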

Nomad: Roughly similar to Kubernetes (although with good support for non-containerized software). It's used to orchestrate software. For example, when you have a bunch of programs you want to run 24/7, you can specify what machines they should run on (for example at the datacenter/region level), what resources to give the programs, how to handle hardware failures, etc.


I'd recommend reading the official Hashicorp website for that, but the gist is that nomad is another way of deploying apps in containers (similar to kubernetes), consul is how apps find other apps to talk to (kinda like DNS but with health checking built-in so they don't get stale info), and vault is how apps retrieve the credentials they need to connect to other services.




Consul is a service mesh, vault is secret storage and nomad is for workload orchestration.


So relating to things I do know about, consul is like cloud infra, vault is a service for secret storage and nomad is like Kubernetes?


Consul is like etcd but has some extra features built in, like service discovery and an L7 proxy, so they market it as a full-blown service mesh à la Istio; the other two are spot on.


I think this is more correct. Like etcd in that it is a distributed KV store with consensus. It has a DNS interface that facilitates service discovery, but it can be used for so much more: infrastructure like global locks, coordination, or even just a fault-tolerant KV database, as it is for Vault itself.


> no consul/vault meant no nomad, no nomad meant no application servers.

Not super familiar with Nomad, but it seems like that would not necessarily follow. For example, if etcd goes fully dark in a Kubernetes cluster, things will mostly continue running for a while unless something also crashes the servers.


True for long running services, but not necessarily ephemeral workloads. Those need a system to schedule them or else once they complete they just stop running.

Also, as long-running services crash or reboot naturally, they need to be rescheduled or else your cluster slowly dies; and as your cluster size increases, the MTBF decreases and the need to reschedule workloads continually increases.


>No functional consul meant no vault

Vault does support other backends, like postgres, etcd, zookeeper, and others. Though if Consul is the backend, Vault is also registered as a service within the Consul mesh.


There's a hashicorp case study on Roblox, so likely Vault specifically: https://www.hashicorp.com/case-studies/roblox


Huh, strangely the link doesn't open for me. Seems like an infinite redirect loop.

    $ curl -i -L https://blog.roblox.com/2021/10/update-on-our-outage/
    
    HTTP/1.1 301 Moved Permanently
    Server: nginx
    Content-Type: text/html; charset=UTF-8
    Content-Length: 0
    X-Redirect-By: WordPress
    Location: https://blog.roblox.com/2021/10/update-on-our-outage/
    X-Powered-By: WP Engine
    X-Cacheable: bot
    X-Cache-Group: bot
    Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
    Cache-Control: must-revalidate, max-age=7108
    Expires: Mon, 01 Nov 2021 05:39:59 GMT
    Date: Mon, 01 Nov 2021 03:41:31 GMT
    Connection: keep-alive


Looking forward to an incident analysis of why their incident analysis post wasn't posted correctly.


You can read the blog post via the... RSS feed!

https://blog.roblox.com/feed/


Same with Firefox:

> The page isn’t redirecting properly

> An error occurred during a connection to blog.roblox.com.


Works for me, possibly a cached 301?


Same, I can open https://blog.roblox.com/2021/10/ and other linked blog posts, but not this specific one.


Pretty much same here


I just refreshed and it worked the second time.


VPN from somewhere else and use a new browser

Most likely your edge node and your client side have cached something not ready yet


> We will publish a post-mortem with more details once we’ve completed our analysis, along with the actions we’ll be taking to avoid such issues in the future.

Excellent. That should be a very interesting read.

> In addition, we will implement a policy to make our creator community economically whole as a result of this outage.

Good to hear.


That latter statement wins a lot of respect in my book. I'm not sure I could name any other platform or game company that I would expect to see go so far.


My 3 youngest kids, aged 9, 9, and 8, told me Roblox was having an extended outage, worldwide. Good job to the team getting everything back up. Been there, done that. #hugops


My 8 y.o. daughter ran through the house with an excitement I typically only see on Christmas morning, exclaiming "Roblox is back!"


Feels like that for me, suddenly I can actually reopen Roblox Studio and do some work.


> My 3 youngest kids, aged 9, 9, and 8 ...

That must have been quite a year in the White household!


Quite a decade more like.


Twins, and then probably ~18 months later a younger child.


Twins + adoption actually. The current picture is 8, 9, 9, 16, 16. Figure that one out. :)


> we will implement a policy to make our creator community economically whole as a result of this outage.

Mad props for this. Who else would do such a thing for an outage? I wonder how they will do this of course. The devil will be in the details :)


Did they run out of internal ipv4 address space somehow? I’ll be curious for the post-mortem, this is definitely the longest outage of a global web service I can remember in… forever? This incident puts them under 2 9’s.


My guess is something related to service mesh/discovery. If that goes down, deployment infra also depends on it, and if you don't have a solid black-start protocol, then you're in for a very bad week.


That or DNS. ;-)


Service mesh is basically abstracted DNS, so…


Or maybe BGP >:)



Not a lot of details here but I'd suspect it's yet another example of autoscaling policies gone wrong.


What does that tend to look like? Do you have any examples?


In my experience, it tends to be that you find out one of your components has some tipping point as you scale out horizontally where everything goes to hell. Random example: you scale a service horizontally and suddenly postgres disk space usage goes from "this line looks horizontal" to "we just used 2 years' worth of disk use growth in 40 seconds".

And the postmortem is basically like "yep can't really blame anyone for not knowing this weird ass thing or unfound bug".

Software, man.


I worked at a place with a few thousand servers.

Because the business had been around for a while, there were two separate ways to deploy and manage apps: old and new.

We reconfigured how code was deployed within and between racks for increased resistance to racks or even whole colos dying. This change was made in the new deployment system. And all was well for most of a year. Software and services gradually migrated from old to new as people had the time and inclination.

The new deployment also packed services more tightly, so most computers kept copies of most binaries on them.

Then we crossed one of those tipping points. On a code update, several thousand servers across the world demanded a small set of binaries from a distributed data node via the newer deployment method. Which would be fine, except we had degraded service from a couple TOR routers due to a separate bug. And so the distributed data nodes hosting this particular set of binaries were less "distributed" and more "just one poor about to be overloaded machine and network."

So a few thousand machines demanded a couple hundred megs nearly simultaneously. And when they didn't get it in a timely fashion, all triggered a rollback. Simultaneously. Which demanded another couple hundred mb of code. Simultaneously. And then the freakouts started.

And because other deploys were in flight, there was more than one service deploy ongoing. And they all started fighting each other for bandwidth.


I think there are two big things that can be really hard with scalability -

1) There are a lot of dimensions

I think it is fairly common to focus on a throughput-related dimension like requests per second and test that very thoroughly, while not paying enough attention to other aspects. We have services where, if we were told requests per second were going to go up 100x tomorrow, I wouldn't be too concerned, but if we were told request size were going up 2x, I'd be freaking out.

Even when you think you've identified all of the dimensions, there are usually ones you missed and/or they interact in weird and unexpected ways.

2) Performance can appear to be linear when it really isn't

So many outages I've seen have resulted from everything being fine until some threshold is hit, at which point it doesn't start to slowly degrade, but instead immediately explodes. Often times due to feedback loops (GC activity in garbage-collected languages often can behave this way) or because some cache overflowed (data that is queried at a high rate no longer fitting in memory on a DBMS comes to mind).


This totally happens, and is a thing, but tracing an "unexpected change in metrics on our DB server" to an "increase in the metric showing call volume to the DB server" to an "increase in the metric showing the fleet size of app servers that call the DB server" takes minutes to identify, not 24 hours.

And the remediation procedure is quite simple too - scale down. If the fleet can't handle it, increase your throttling. If even with increased throttling you can't handle the load and everybody's timing out (but you still can't scale out), you start spinning up new load balancers, and applying weights in DNS where at least some percentage of your callers are getting through while others see full failure.

The key point is in anything related to such an outage, even if it takes a long time to recover, your customers shouldn't see a full outage.

So it was either an entirely different class of problems or Roblox has a huge ops skills gap. I'm betting the former.

edit: never mind, maybe it's the latter after all: https://twitter.com/NIDeveloper/status/1454773313792880640


yep.


It’s great to hear that it wasn’t the result of something malicious (e.g. a hack). But this has to be one of the longest outages by a company this big ($50B market cap), at least in the past decade it seems?


What I love most, then, is that it's the largest and it doesn't matter! In this market they will become an even larger company, because competency has no correlation to how much money subscribers/advertisers will pay, and none of that has any correlation to what investors will pay!

Anybody optimizing for competence is exchanging time for food and shelter and just gets to be reminded of the much harder game they are playing that produces much less optimal results of money.


Maybe your definition of competence is wrong? A lot of people can build systems with some number of 9s of availability, but how many people can build a 50B company?


One is luck and the other is engineering. From the looks of it, this is a high luck, low competency endeavor.


And Roblox is already well beyond the luck threshold


I can do the 9s. So, that's common I'd wager.

$50B is hella hard tho. It's hard work to just get to $10M (w/o outside money) and even with outside money 100M is still very hard.


Its not related at all, its a discussion about the trading market conditions

Didn't realize I struck a chord here

I’m literally making fun of the people that would try to draw this distinction and … you showed up?

Huh

Sometimes I forget that the market reality changes faster than cultural conditioning

Change faster


Let's reserve this level of vitriol for medical systems or powerplants, shall we?


Some level of intensity is warranted. Things got real pretty fast here this past week when my eight year old was home on a school PD day.


If it's down for a couple days every month kids will start playing other games and get hooked on them. Roblox would lose a fair amount of users. Losing users means losing investor money.


There's a redirect loop between https://blog.roblox.com/2021/10/update-recent-service-outage... and https://blog.roblox.com/2021/10/update-on-our-outage/

I'm unable to open the link and seems Wayback Machine cannot archive it either.


Putting aside the content of the page, can we talk about the design of the page? It has this HUGE hero image that pushes the content down below the fold on my 4k monitor.


The link seems to be in an infinite redirect loop.

https://webcache.googleusercontent.com/search?q=cache:d4DE5B...


Bad timing too, I bet holiday events like Halloween are a big deal in the roblox culture


Roblox dev here (I make the experiences, not an employee). Halloween is the second most important weekend for us developers in terms of income.

The Friday and Saturday are critical days for getting revenue from any Halloween-specific events we may do, such as a Halloween update or discount.

I'm happy to see Roblox is refunding adverts, especially after seeing how much support this has. https://devforum.roblox.com/t/reimbursement-on-sponsors-and-...


I am disappointed by the lack of technical transparency in this update. If Roblox knew enough about the problem to address it, they could have included enough technical words to describe it.

Which piece was overwhelmed and in which piece of backend did the subtle bug exist?

A deeper post-mortem will be nice. An actual description of the problem seems like SRE table-stakes for outages.


They do say in the post that they're going to explain more, later.


This reminds me of the Target TAP outage some years ago:

https://danveloper.medium.com/on-infrastructure-at-scale-a-c...


First a redirect loop

Now, a header image that takes up my entire 15" laptop screen before I can even see the title of the blog post. Impressive

https://imgur.com/256LkYb


> performance tuning, re-configuration, and scaling back of some load.

That seems to be a great thing if you can sit at the edge and scale the number of connections coming in. I just wish our system could do that.


I'm very interested to hear more! It's hard to speculate without more details, but based on the bits here I'm suspecting that their load balancing/service discovery system hit an issue that resulted in a death spiral. Oftentimes a lot of these systems are built with subtle nonlinear scaling bottlenecks that also don't allow the system to fail into a stable state.


Can anyone cache this page and link that? Or paste the text here in a comment? This link shows a redirect loop for me.



Looks like everyone got the cause of this outage wrong on previous speculative posts.


You mean it wasn't an arbitrary edge case in a database the company doesn't even use?


This comment from 2 days ago (the top comment on the post) seems likely accurate to me:

https://news.ycombinator.com/item?id=29044500


This was probably some junior dev's worst first day


The junior devs were probably chucked in a corner with some colored wooden blocks while this was being resolved.


Given the scope and the description it's highly unlikely to be caused (or investigated by) juniors. It definitely sounds like the service-discovery/configuration store (which is Consul from what I understand) or their orchestration layer (Nomad) shat itself.

So most likely their most senior SREs/infrastructure folk were the ones sweating bullets.


[flagged]


Get a grip.


Damn, you convinced me.


[flagged]


Thank you for posting the text of the blog post. However, I don't understand your TLDR. They clearly were able to identify the root cause, and that is what allowed them to restore service.


People use "root cause" to mean the true underlying driver. You can often restore service without knowing the true root cause. In this example, let's pretend true root cause is a memory leak that takes X days to crash Consul. It's possible they don't know that yet, and just shut down, cleaned up logs and temporary storage, added some capacity, and started back up in a region-by-region ramp up.


Fine, but from their blog post they pretty well did describe the root cause. I don't expect them to post code line fixes in a public blog post.


Perhaps, though "prompted by a subtle bug in our backend service communications while under heavy load" is pretty opaque.


I mean this is honestly a PR/lawyer blog post.

> A core system in our infrastructure became overwhelmed, prompted by a subtle bug in our backend service communications while under heavy load.

Translation: Something important went down; we couldn't trace the bug.

>This was not due to any peak in external traffic or any particular experience.

Translation: We're not blaming Chipotle for this.

> most services at Roblox were unable to effectively communicate and deploy.

Translation: We couldn't throw more compute at the problem to manage the traffic or to put up a failover copy of anything.

>To the best of our knowledge, there has been no loss of player persistence data

Translation: We literally restored our entire infrastructure from scratch and probably our last set of backups. You might have lost some stuff but we don't know what you lost since we couldn't get the last copy of our databases.


> We literally restored our entire infrastructure from scratch and probably our last set of backups. You might have lost some stuff but we don't know what you lost since we couldn't get the last copy of our databases.

What a wild extrapolation. The line in the blog post is not them hinting at what might have happened; it's there so that the customers reading don't think they lost any of their data/items/etc. The Roblox marketplace is vast and high-volume; if anyone had lost <any> data from them restoring a backup, we'd already know about it.


You missed the part where they described the root issue:

> Rather the failure was caused by the growth in the number of servers in our datacenters.

Their config server ran out of sockets and screwed up everything, or something like that. "Too many servers = nothing works" can unfold in many ways, and it isn't obvious the first time.
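
As a toy illustration of one way "too many servers" bites (hypothetical, not Roblox's actual setup): an agent that holds one persistent connection per server in the fleet eventually exhausts its file-descriptor limit, at which point it can't open or accept anything at all.

    # Toy demo of file-descriptor exhaustion (Unix-only: uses the resource
    # module). socketpair() stands in for one connection per server.
    import resource
    import socket

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("fd limit (soft/hard):", soft, hard)

    conns = []
    try:
        while True:
            a, b = socket.socketpair()
            conns.extend([a, b])
    except OSError as exc:
        # Typically EMFILE ("Too many open files") once the soft limit is hit.
        print("ran out of sockets after", len(conns), "fds:", exc)
    finally:
        for s in conns:
            s.close()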


>too many servers...

My best advice about scaling, which I once read (it might have been right here on HN), is that many devs know their primary bottleneck, be it the "master-db", a 3rd-party API, the login service, etc. But few devs know the second bottleneck in their service stack. Not saying this was the case here, but it's always good to think (at least as a mental exercise) about what comes after your primary bottleneck.

If it's a long-standing bottleneck, one day it gets fixed (software update, better architecture) and suddenly everything goes down, since now you're hitting a new bottleneck you never even thought about scaling.
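
As a minimal sketch of that mental exercise (fake workload and made-up numbers, just to show the shape of it): sweep concurrency and watch where requests/sec stops growing, because that plateau is where your next bottleneck lives.

    # Toy concurrency sweep against a stand-in workload. In practice you'd
    # point this at a staging copy of your real stack.
    import time
    from concurrent.futures import ThreadPoolExecutor

    def hit_service():
        time.sleep(0.05)  # pretend each request takes ~50 ms

    def throughput_at(concurrency, requests=200):
        start = time.time()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            list(pool.map(lambda _: hit_service(), range(requests)))
        return requests / (time.time() - start)

    for c in (1, 2, 4, 8, 16, 32, 64):
        print(f"{c:>3} workers -> {throughput_at(c):6.1f} req/s")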


This kind of thing will probably become more common as more organizations adopt overly complex cloud deployment automation.


> A key value at Roblox is “Respect the Community,”

Yet they still take a 30% cut on all transactions that go to content creators / developers


Isn't that basically the status quo for Apple and Google as well? Not to mention Steam. 30% feels high, but at least Roblox built the game; that feels much more justified than Google and Apple, who are just collecting a huge chunk because they have a captive market. With Roblox's ecosystem it doesn't feel quite as usurious as Google or Apple doing it on their app stores.


It's not justified for any of them, and if they all don't cut it out they're going to get slapped with a law putting it at 5%, and we will all laugh at them.


That high percentage doesn’t mean they do or do not respect the community.

Saying this as someone who doesn’t know much about Roblox but does give 15-30% to Apple and Google. Google definitely doesn’t respect their developers. Apple kind of does.


Apple and Google also disrespect their developer communities, probably more than any companies out there. Anything over 5% is ridiculous and was seen as such originally before Google and Apple made this the norm.


I don't disagree with you. I was just stating that Google is absolutely horrible, whereas Apple at least lets us developers talk to a human being.


Apple is the worst of them all. They are much more restrictive and arbitrary with their app store policies (see previous HN articles over the last 5 years), they have done much more to lock down their devices to prevent circumvention of policies like their ban on emulators, and they don't license the Apple device model to other vendors, so they are the true definition of a walled garden. Google has plenty of problems too, but at least random device companies can make android devices, and third-party app stores are (sort of) allowed to exist.


You aren't wrong.

Though I was mostly talking about the customer service side and developer relations in that regard.

Google regularly bans developers for absurd reasons or no reason at all, and doesn't let them talk to any humans until the developer kicks up a big enough Twitter storm and gets some tech sites to write about it. Apple at least lets the dev talk to human beings.


More like 75%. Roblox is a terrible company that pays company scrip to exploit kids.

https://www.youtube.com/watch?v=_gXlauRB1EQ


Why are these things not mutually exclusive?


Anything over 5% is highway robbery and the kind of thing that only exists in concerningly unregulated capitalism. I promise you someday legislation will put an actual legal limit there. Just wait for these zoomers to reach voting age.


Why is making 6% from your one funding source on an otherwise free-to-play product unreasonable? Do you think any markup of more than 5% is unreasonable? What about Apple's physical products? What about ... every retail store in existence?

> Just wait for these zoomers to reach voting age.

Ah yes, 50.


> Do you think any markup of more than 5% is unreasonable? What about Apple's physical products? What about ... every retail store in existence?

Retail stores maintain a physical presence and have actual overhead, as do online retailers that have to maintain warehouses, subsidize free shipping, etc. What is the overhead on an app submission, or an app purchase? Charging 30% overhead for $0.0012 cents of bandwidth, an algorithmic review of your app that only might be supplemented by an actual human review, and a literally free file copy operation comes nowhere close to that. This is a lot closer to payment processor fee territory in terms of the actual work being done by the provider.

You would think they would do more for their app developers with such a hefty fee, but they don't, so they need to be regulated. These fees are just theft.

In the case of Roblox, you can argue "but what about the platform?" but 5% is still a metric-crap-ton of revenue and is more than enough to keep their platform afloat (especially if they don't have to give 30% upstream to Apple/Google). We've just become accustomed to ridiculous % cuts because monopolies have normalized this practice. In the early 00s, the idea of an app store even taking a cut was viewed as laughable. Then it became normal.


> In the case of Roblox, you can argue "but what about the platform?" but 5% is still a metric-crap-ton of revenue and is more than enough to keep their platform afloat (especially if they don't have to give 30% upstream to Apple/Google)

Afloat? Maybe. Seems doubtful. Anyway, the justification for app stores charging developers is always that they provide the platform. Android is a free platform. You can buy an Android phone from a manufacturer of your choice without Google seeing a cent of that money. Their way of making money is to collect rents through their application store.

What I'm getting at is that they produce a product. A valuable product, given that many people enjoy it. Under standard American assumptions about free markets, this means they're entitled to a reasonable profit, not just enough to "keep their platform afloat". Profit is not theft.

Think about a company that sells you a printer at-cost in the hope that it will pay for itself in profits when you buy ink. There's usually third-party ink that will work in the printer, but the business model of the company is that there will be enough people who buy their ink to make cheap printers profitable. There's a, I don't know, 150% markup on the ink.

In other words, they have an overall profitable business, where one part of it is a cost leader and another is where they extract rents. Nothing particularly surprising about this. The app store model is exactly the same: most parts of the platform are free (despite very real development costs), and store developers pay a fee of 20-30% which keeps things running. In fact, if the platform owner reduced the fee but then charged users to use the platform, overall profits would probably be lower, as fewer users would use the platform.

I'm skeptical, in the end, that the profits earned by companies like Roblox are terribly out of keeping with those earned by retail stores or computer manufacturers. You cannot separate out the parts of the business that are pure profit centers from those that are cost leaders. The fact that the profit on the profit centers seems exorbitant has to be taken in context.


In any scenario where a single company controls an entire marketplace, I do believe that company's markup should be limited to 5% to prevent abuse, yes. If there are multiple marketplaces serving the exact same devices, with equal status, then that's another story, and price-fixing is already illegal, so that situation is covered. Vendor-specific app stores on platforms where third-party app stores are not allowed, or are centrally discouraged in any way, are monopolies and should be treated as such.

It isn't enough to say that android apps compete with ios apps -- we aren't talking about selling iphones and android phones, we're talking about selling vendor-specific apps, and these are different ecosystems. Apple is sitting there gatekeeping everything allowed in their ecosystem and charging an extremely hefty toll, in exactly the way a monopoly prefers to, because they have made sure competition is not allowed. On that note, I also think this level of gatekeeping should not be allowed, but that's a whole other rant.

There is nothing Apple (or Google, or Roblox) does to justify more than a 5% cut. 5% is generous. In many cases they don't even take the time to have a human review every submission. You can say they built the platform, but aren't we already paying for that with our 5%?

And to your point, yes, when Zoomers reach 50, we'll see awesome things like companies losing the rights to their IP if they submit 3 bogus DMCA claims, extreme regulation of things like content-ID that threaten the existence of fair use in practice, outright banning of all lobbying and corporate political donations, things like this, etc etc.


Payment processors charge around 3% plus fees (chargeback fees and transaction fees). I think the value Roblox adds makes them deserve much more than a measly 2 points above what a generic payment processor would get.

15 or 20% is probably a good middle ground


I assume they have a deal with Apple and Google for less than 30% on in-app transactions. That probably means Roblox is getting between 10-20% of the transaction.


They also have PC and Mac versions of it. I have some friends who have played Roblox on PC, and my younger cousin plays on iOS. I wonder what the market share looks like between them.


That's interesting. I didn't know about the PC version! I'd also be curious about the market share there.


Right, that's why they get an extra 2%.



