Someone on here posted the other day that someone they knew at Roblox said it was their secrets store that became overloaded. Presumably a Hashicorp Vault type service (or something similar). This update appears to support that claim.
That's definitely a problem you only get at scale.
Not far off. Having chatted with friends from the company, consul failed due to an issue with streaming, which was introduced in 1.9 (https://www.hashicorp.com/blog/announcing-hashicorp-consul-1...) and which caused it to have massively reduced tps in certain circumstances. No functional consul meant no vault, no consul/vault meant no nomad, no nomad meant no application servers.
Furthermore, the services were down for long enough that the asset caches went fully cold, so spinning back up to 100% capacity would put far more load than usual on the servers meaning recovery needed to be a very slow incremental rollout to warm them.
Messy situation all around, but if you want the real root cause pay attention to the consul change log over the next few weeks I’d say.
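To put rough numbers on the cold-cache point above, here is a back-of-the-envelope sketch. The 95% hit ratio and the request rate are made-up illustrative figures, not anything reported by Roblox, and the function names are mine.

```python
def origin_load(requests_per_sec, hit_ratio):
    """Requests per second that miss the cache and reach the backing servers."""
    return requests_per_sec * (1.0 - hit_ratio)


def safe_traffic_fraction(warm_hit_ratio, current_hit_ratio):
    """Largest share of normal traffic that keeps origin load at or below
    what the servers see when the caches are warm."""
    if current_hit_ratio >= 1.0:
        return 1.0
    fraction = (1.0 - warm_hit_ratio) / (1.0 - current_hit_ratio)
    return min(1.0, fraction)


# Hypothetical figures: 100k req/s at a 95% warm hit ratio.
normal = origin_load(100_000, 0.95)   # load the origin is sized for
cold = origin_load(100_000, 0.0)      # same traffic, caches empty
print(f"cold start multiplies origin load by about {cold / normal:.0f}x")
print(f"safe share of traffic with empty caches: {safe_traffic_fraction(0.95, 0.0):.0%}")
```

Under those assumptions you can only admit a small slice of users at first, then raise it as the hit ratio climbs, which is roughly the slow incremental rollout described.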
> We didn’t want to choose any technology that requires the company to drive deep expertise, almost to the point where you have to be a code contributor back into the project to get what you want. Nomad is just very easy to adopt.
Better be damn sure you have your 24/7 vendor support contracts in order if and when shit does hit the fan.
I’m on the vendor side now, so every day is helping customers recover from etcd failures caused by slow storage due to them not listening to documented requirements. ;)
You don't need more than high school science just to build a bridge. Yet we keep educating engineers. Completely unnecessary, as long as nothing breaks.
You don't really need to understand the citric acid cycle just to operate on someone. Yet we keep educating surgeons. Is that a bad thing?
To be fair Nomad failed because it's reliant on Consul. This would be the equivalent of having a k8s outage because etcd is down.
Their mistake here is thinking that because you can understand the code of a higher level service that somehow you don't need deep knowledge of its dependencies.
That is a deeply naive view and they paid the price.
Consul is a service mesh. It’s your dynamic service discovery and routing layer. You have systems dynamically allocated in a cluster, they need to reach other systems, you ask consul where they are.
Vault is secrets and management. Put secret strings in, get secret strings out (if permitted). Most apps need secrets of some type, and vault is normally discovered via consul.
Nomad is an application/workload scheduler. You tell it what you need to run and what the memory/cpu requirements are and it finds a space in your physical infrastructure to run that. The apps it runs normally need secrets from vault and communicate with services discovered via consul.
They all are well integrated and build off of each other, kind of like layers of an onion where your app services are the outer layer of the onion. Consul failing this badly is like the core of the onion going rotten. There’s not much saving it at that point, you need to grow a new onion from the inside out, but that takes time.
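The onion picture can be sketched as a small dependency graph. This is a simplified model of the chain described in this thread, not a statement about Roblox's actual topology; the service names and helper functions are just for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each key depends on the services in its set (simplified from the thread).
DEPENDS_ON = {
    "consul": set(),
    "vault": {"consul"},
    "nomad": {"consul", "vault"},
    "app-servers": {"nomad"},
}


def startup_order(deps):
    """Order in which services can be brought up, innermost layer first."""
    return list(TopologicalSorter(deps).static_order())


def affected_by(failed, deps):
    """Every service that directly or transitively depends on `failed`."""
    hit = set()
    changed = True
    while changed:
        changed = False
        for svc, needs in deps.items():
            if svc not in hit and (failed in needs or needs & hit):
                hit.add(svc)
                changed = True
    return hit


print(startup_order(DEPENDS_ON))
print(sorted(affected_by("consul", DEPENDS_ON)))
```

In this model, losing the innermost node takes out everything else, and recovery has to follow the startup order from the inside out.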
I'm not familiar with Consul specifically, but service meshes I am familiar with give you a lot more than just mapping of service names to dynamic IPs. They bundle things like TLS termination, reverse proxying, network policy enforcement, automatic certificate provisioning and key rotation, firewalls potentially aware of anything from layer 3 to layer 7.
Where I think companies are going about this wrong sometimes is thinking that having their networks be software-defined and handled by some all-in-one product suite means they no longer need network engineers and they can just rely entirely on application developers who specialize in general programming and fall back to vendor support contracts if anything gets too confusing.
Admittedly, there's some self interest speaking there, because I work as a consulting engineer for one of these vendors where we go way beyond "support" to embed full time in external product teams with a dependency on our products (though we also offer basic support for companies that think they can get away with it).
But to my mind, no software suite can let you get away with not needing any kind of IT ops at all. Smaller companies may assess that the risk is worth it to focus solely on product, but as far as I can tell, Roblox is not a small company (or shouldn't be, given the traffic scale of their platform).
DNS gets cached all up and down the stack. Your OS, routers, application libraries and probably even your application code might cache it. Consul allows you to have it handle the caching, and be notified when a service goes away so it won’t return that address to your application.
I think service discovery is typically related to e.g., request routing in a way that DNS isn't. DNS could probably replace a lot of the ways service discovery is used today, but it would be a very non-typical setup, which is arguably worse than the custom things people use for it today.
Dynamic-first, better support for failover, etc. It's probably not doing anything that you couldn't build on top of DNS, but it's designed from the ground up for clustered deployments.
the thing can even provide dns... but it has fancy interfaces and an http endpoint. Technically you could do everything with dns just fine.
Even if consul provides some sort of liveness (which, due to latency and concurrency, means little), I can't think of a usable case where the clients won't have retries and the ability to maintain multiple open sockets, etc.
Consul's notion of a service includes port numbers, and while I'm sure you could hack something up using DNS txt records and default port numbers, it's an important distinction because it means multiple instances running on one box can easily be discovered.
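To make the port point concrete, here is a minimal sketch of picking healthy endpoints out of a response shaped like what Consul's HTTP health endpoint (/v1/health/service/<name>) returns, as I understand it. The sample payload, addresses and the helper name are invented for illustration, and no real Consul agent is contacted.

```python
import json

# Invented sample data: two "web" instances on the same box (different
# ports) plus one instance on another box whose health check is failing.
SAMPLE = json.loads("""
[
  {"Node": {"Node": "box-1", "Address": "10.0.0.5"},
   "Service": {"ID": "web-1", "Service": "web", "Address": "", "Port": 8080},
   "Checks": [{"Status": "passing"}]},
  {"Node": {"Node": "box-1", "Address": "10.0.0.5"},
   "Service": {"ID": "web-2", "Service": "web", "Address": "", "Port": 8081},
   "Checks": [{"Status": "passing"}]},
  {"Node": {"Node": "box-2", "Address": "10.0.0.6"},
   "Service": {"ID": "web-3", "Service": "web", "Address": "", "Port": 8080},
   "Checks": [{"Status": "critical"}]}
]
""")


def healthy_endpoints(entries):
    """Return (host, port) pairs for instances whose checks all pass."""
    endpoints = []
    for entry in entries:
        if all(check["Status"] == "passing" for check in entry["Checks"]):
            # An empty service address means "use the node's address".
            host = entry["Service"]["Address"] or entry["Node"]["Address"]
            endpoints.append((host, entry["Service"]["Port"]))
    return endpoints


print(healthy_endpoints(SAMPLE))
```

Both instances on box-1 come back as separate endpoints, and the failing one on box-2 is filtered out. For what it's worth, plain DNS can also carry ports via SRV records, so TXT hacks aren't the only option there.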
This dynamic service terminology sounds very familiar to someone with a few years in the feature film VFX/Animation industries - however, it would not surprise me if Consul/Vault/Nomad were created without a review of similar situations and solutions in other industries.
For those unaware, the feature film VFX/Animation industry has been global and performing large scale technologically ambitious projects requiring multi-company tech collaboration for decades. And not just with code, with assets too: visual assets of all kinds, audio in every possible form, and in some cases legal documents with new approaches to industry issues. Plus these media studios tend to have proprietary workflows, so there are very sophisticated file formats to contain all this information in an agnostic manner. All this deep collaboration across creative technical organizations that do not trust one another has developed scalable solutions of which the web and formal software development are completely unaware.
My mind is continually blown when I read about the sheer volume of innovation from the animation industry over the last 3 decades. I would love to learn more about these scalable solutions you're talking about though, since I'm not familiar with them. Do you have any pointers?
I've not worked in VFX since '02, but I was both at Rhythm & Hues Studios through multiple VFX Oscars as well as an early 3D graphics researcher during the 80's. I have no idea which tools are still around.
For example, Shake, a film compositor I know that is now dead, was revolutionary: it pioneered off loading heavy compute tasks to the GPU (not just graphics, but what is called scientific computing now), and it hid the GCC Compiler inside itself and used a macro-transformed version of C as its "scripting language" that was actually hot-loaded C++ dlls compiled on the fly. It was also the first "node based programming environment" with graphical nodes the end-users connect with splines to define the I/O between the nodes.
If you are seriously interested, find someone working in the industry today and ask them. That industry has been changing a lot. Since I left, the VFX/Animation fields have been moving towards more framework-like production environments, similar to how the web uses frameworks. The issue with these frameworks is they define the tasks to be performed, and those tasks are simplistic and rigid - meaning the actual production work has eliminated in as many places as possible the requirement for an art degree. The VFX/Animation industries are driving production towards something approaching more and more the work of being a burger flipper. The processes are being standardized and reduced in complexity so the studios can hire non-artists, they can hire anyone and work them like an automobile assembly line. Twenty years ago this was starting, and that is when I left.
I haven't used most of these but here's my possibly-flawed understanding:
Consul: It's used to help manage cloud networking so that your application doesn't have to worry about IP addresses, datacenter locations, punching through firewalls, etc. Think of the situation where you have a zillion microservices talking to each other on different machines - it makes it easier for them to find each other. It also includes a distributed key-value store.
Vault: If you've used a password manager in your browser, imagine that but distributed and on steroids. You can use it to share credentials with groups. It also has some APIs to help with encryption/decryption and includes a key-value store
Nomad: Roughly similar to Kubernetes (although with good support for non-containerized software). It's used to orchestrate software. For example when you have a bunch of programs you want to run 24/7, you can specify what machines they should run on (for example at the datacenter/region level), what resources to give the programs, how to handle hardware failures, etc.
I'd recommend reading the official Hashicorp website for that, but the gist is that nomad is another way of deploying apps in containers (similar to kubernetes), consul is how apps find other apps to talk to (kinda like DNS but with health checking built-in so they don't get stale info), and vault is how apps retrieve the credentials they need to connect to other services.
Consul is like etcd but has some extra features built in, like service discovery and an l7 proxy, so they market it as a full blown service mesh a la istio. The other two are spot on.
I think this is more correct. Like Etcd in that it is a distributed kv store with consensus. It has a dns interface that facilitates service discovery but it can be used for so much more. Infrastructure like global locks, coordination or even just a fault tolerant kv database, as it is for vault itself.
> no consul/vault meant no nomad, no nomad meant no application servers.
Not super familiar with nomad but it seems like that would not necessarily follow. For example if etcd goes fully dark in a kubernetes cluster, things will mostly continue running for a while unless something also crashes the servers.
True for long running services, but not necessarily ephemeral workloads. Those need a system to schedule them or else once they complete they just stop running.
Also, as long running services crash or reboot naturally they need to be rescheduled or else your cluster slowly dies, and as your cluster size increases the mtbf decreases and the need to reschedule workloads continually increases.
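A rough worked example of the MTBF point, assuming node failures are independent and happen at a constant rate (a textbook simplification, not a measurement of any real cluster). The 10,000-hour per-node figure is made up.

```python
def cluster_mtbf_hours(node_mtbf_hours, node_count):
    """Mean time between failures somewhere in the cluster,
    assuming independent nodes with a constant failure rate."""
    return node_mtbf_hours / node_count


def expected_failures_per_day(node_mtbf_hours, node_count):
    """Average number of node failures per day under the same assumption."""
    return 24 * node_count / node_mtbf_hours


# Hypothetical per-node MTBF of 10,000 hours (a bit over a year).
for nodes in (10, 100, 1000):
    mtbf = cluster_mtbf_hours(10_000, nodes)
    per_day = expected_failures_per_day(10_000, nodes)
    print(f"{nodes:>5} nodes: a failure every {mtbf:g} h, about {per_day:g} per day")
```

With those numbers a 1000-node cluster loses a couple of nodes a day on average, so with no working scheduler the unscheduled workloads pile up quickly.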
Vault does support other backends, like postgres, etcd, zookeeper, and others. Though if Consul is the backend, Vault is also registered as a service within the Consul mesh.