It's always so enlightening to have articles like this one shed light on how companies operate at scale. It goes without saying that many of the problems Stripe faced with their monorepo aren't applicable to smaller businesses, but there are still bits and pieces that are applicable to many of us.
I've been working on an ephemeral/preview environment operator for Kubernetes (https://github.com/pier-oliviert/sequencer), and I can agree with a lot of the things OP said.
I think dev boxes are really the way to go, especially with all the components that make up an application nowadays. But the latency/synchronization issue is a hard topic and it's full of tradeoffs.
A developer's laptop always ends up being a bespoke environment (yes, Nix/Docker can help with that), and so, there's always a confidence boost when you get your changes up on a standalone environment. It gives you the proof that "hey things are working like I expected them to".
My main gripe with the dev box approach is that a cloud instance with similar compute resources to a developer's MacBook is hella expensive. Even ignoring compute, a 1TB EBS volume with equivalent performance to a MacBook will probably cost more than the MacBook every month.
Wouldn't this be a reasonable alternative? Asking because I don't have experience with this.
1. New shared builds update container images for applications that comprise the environment
2. Rather than a "devbox", devs use something like Docker Compose to utilize the images locally. Presumably this would be configured identically to the proposed devbox, except with something like a volume pointing to local code.
I'm interested in learning more about this. It seems like a way to get things done locally without involving too many cloud services. Is this how most people do it?
We do this, but with k3s instead of docker compose (it’s a wonderful single-box substitute for full k8s), and a developer starts by building the relevant container images locally. If everything works, it takes about 3 minutes to get to a working environment with about a dozen services.
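For anyone curious, the per-developer flow is roughly the following (image names and manifest paths here are illustrative, not our real ones):

    # build the image locally, load it into k3s's containerd, deploy dev manifests
    docker build -t acme/api:dev ./api
    docker save acme/api:dev -o /tmp/acme-api.tar
    sudo k3s ctr images import /tmp/acme-api.tar
    kubectl apply -k deploy/dev/    # kustomize overlay that pins the :dev tags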
We steer clear of proprietary cloud services. In place of S3, we use minio.
The one sore spot I haven’t been able to solve yet is interactive debugging. k8s pushes you toward a model where all services need a pile of environment variables to run, so setting these up is pretty painful. In practice, though, all rapid iteration happens inside unit tests, so not having interactive debugging isn’t much of a productivity drag.
At least on a MacBook, docker is still a compromise in many ways since it has to run on a Linux VM (I live in the SF tech bubble where I've only ever been issued a MacBook).
Even on my personal Linux desktop, I don't love developing in containers. It is very tedious to context switch between my local environment and the in-container environment, and I don't even consider myself the type with a super personalized setup.
So I don't consider local docker that much of an improvement over a remote devbox.
> It is very tedious to context switch between my local environment and the in-container environment
I think, with my proposed setup, you'd still do development on your local machine. The containers would only be there for the dependencies, and as a shell to execute your code. The container hosting the application under development would use a volume to point to the files on your local machine, but the container itself (with all its permissions and configuration) would match or nearly match what you plan for the production environment.
I manage a dev environment for a small, inexperienced (but eager) team and I have a similar setup. I’ll do a write up at some point if I have time. It can work, and does for me, but there are some funny consequences: the tooling can end up mediating the relationship between a developer’s computer and their code, which is a terrible place to be.
It’s about $250/month for a c6g.2xlarge with 1TB of EBS at on-demand pricing; it's less with reserved instances. Given they use AWS and are a major customer, you can expect excellent pricing well below that public quote.
Considering the cost of a developer's time, and that you can do shenanigans to drive that price even lower, this all feels totally reasonable.
If you truly need that kind of perf (and at Amazon, we had plenty of dev desktops running on ebs without that kind of performance) then you should really opt for an instance type with local storage.
I've been deep into implementing macOS CI workers on AWS where that isn't an option (or rather, it is an option, but it is unsupported and Amazon only buys Macs with the smallest possible SSD for the given configuration). So your options are to pay an arm and a leg for fast EBS, or pay an arm and a leg for a pro or max instance with a larger internal SSD.
The article didn't actually say what "Stripe's cloud environment" was, besides "outside of the production environment". I assumed the company had their own hardware but your assumption is more probable.
I find the devbox approach very frustrating because the JetBrains IDEs are leaps and bounds ahead of everything else in terms of code intelligence, but only work well locally. VSCode is very slightly more capable than plain text editor + sync or terminal-based editor over SSH, but only slightly.
It's darkly amusing how we have all these black-magic LLM coding assistants but we can't be reasonably assured of even 2000s level type-aware autocomplete.
I work in Go. My company keeps trying to push us onto their VSCode-based remote environment. “Find all references” doesn’t, “go to definition” works maybe 30% of the time, and the Go LSP daemon needs to be force-killed dozens of times in a working session, taking several minutes to recover each time. The autocomplete suggestions are about 3x as likely to be from the VSCode fuzzy matching thing or Copilot slop as actually existing symbols that type check in context.
JetBrains Projector and Gateway meanwhile lock up or outright crash several times an hour; text input and scrolling are not smooth.
Right, dev boxes do not need to do double duty as a personal computer plus development target, which allows them to more closely resemble the machine your code will actually run on. They also can be replaced easily, which can be helpful if you ever suspect something is wrong with the box itself - if the new one acts the same way, it wasn't the dev box.
I don't recall latency being a big problem in practice. In an organization like this, it's best to keep branches up to date with respect to master anyway, so the diffs from switching between branches should be small. There was a lot of work done to make all this quite performant and nice to use. The slowest part was always CI.
I feel like we're not getting the right lessons from this. It feels like we're focusing on HOW we can do something versus pausing for a brief moment to consider if we SHOULD in the first place.
To me the root issue is that the complexity of production environments has expanded to the point of driving up complexity in developer environments just to deploy or test. This comes on top of the expanding complexity of developer environments just to develop, e.g. webpack.
For very large well resourced organizations like Stripe that actually operate at scale that complexity may very well be unavoidable. But most organizations are not Stripe. They should consider decreasing complexity instead of investing in complex tooling to wrangle it.
I'd go as far as to suggest both monorepos and dev-boxes are complex toolchains that many organizations should consider avoiding.
> I'd go as far as to suggest both monorepos and dev-boxes are complex toolchains that many organizations should consider avoiding.
I'm not sure "monorepo" means the same thing to you as it does to me? To me, it just means "keep all the code in one repo, instead of trying to split things up into different repos."
To me, it's the thing that is the simple solution, it just means "a repo" -- the reason it gets a name is because it's unusual for large orgs with enormous codebases to have everything in one repo, it's unusual for them to do the simple thing that works fine for a small org with a normal codebase.
What is it you're suggesting a simple organization should do instead of a "monorepo"?
> To me, it just means "keep all the code in one repo, instead of trying to split things up into different repos."
To me, and perhaps more from a DevOps-like perspective, monorepo means "one repo, many diverse deployment environments and artifacts, often across multiple programming languages".
I'm advocating against the Google/Stripe situation of a singular massive repo with complex build tools to make it function, like Bazel. I think sometimes small organizations get lured by ego and bad cost/benefit analysis into implementing such an architecture, and it can tank entire product orgs in my experience (obviously not for Stripe, Google, etc.).
A monorepo doesn't require multiple programming languages or Bazel. But once multiple programming languages are involved, the complexity exists regardless of the chosen tooling. With multiple repos, that complexity is just pushed elsewhere like the CI system.
The argument would be that for simple organization, dividing things into independently releasable components is less simple than just having one app. I think that's what most simple organizations do, no? Why do you need the complexity of independently releasable components for your simple organization? Now you have to track compatibility between things, ensure what version of what independently releasable thing works with what version of what independently releasable other thing, isn't that added complexity? Why not just have one application, isn't that simpler? You don't need to worry about incompatibilities between your separately releasable things -- every commit that passes CI on your single repo means all the parts are compatible (sans untested bugs).
Usually it stops being "simpler" at a level of organizational complexity or code size where it becomes a mess. The "monorepo" is the attempt to do what everyone was just doing anyway for simple orgs with simple codebases, but keep doing it at huge sizes.
The monorepo vs many-repos discussion often hits upon so many implied factors, but it's only really about how source code is stored.
It doesn't necessarily indicate much about the deployment model. You can have many separately releaseable things in one repo, and you can have one independently releaseable thing based on the sources of many repos.
Monorepos enable, but don't require, source-level co-evolution. Or maybe a better way to put it would be: many projects can have a shared history. Many-repos require independent source-level evolution. In the open source world there is no real choice: every project wants to be independent. The authors of a given project can do what they wish with it.
One weird thing to think about is that monorepos can accommodate many-repo style workflows. You can still develop projects completely independently within a single repo. Of course you can store separate projects on separate revisions, which would be weird. An even weirder approach would be having all projects in a given revision, but have totally independent builds, no single-version policy, no requirement for atomic compatibility, et cetera. These are all things that are often imposed for monorepos, but that are also not requirements. Basically, you can treat each project as independent even if their sources are stored together. I don't think there are any reasons to actually do this, of course.
If you're living in the same dysfunctional world I am, then maybe your organization split things into repos that are separately releasable, but are conceptually so strongly coupled that you now need to create changes on 3 repos to make a change.
I think this is the key phrase in what you've written. Quite often I've seen teams insist on separating things that cannot be released independently due to some form of coupling.
You end up with people talking about a particular "release" but not really knowing 100% what's in it and then discovering later that something is missing or included by mistake.
IMHO it's much easier to keep it all in a single repo and use the SHA value as a single source of truth when discussing what's in it. I don't really work on huge codebases though so your mileage may vary.
> You end up with people talking about a particular "release" but not really knowing 100% what's in it and then discovering later that something is missing or included by mistake.
If your devs couldn’t be bothered to pin versions that was never a tooling problem. You don’t need a 500GB Git repository with every vendored component to know what’s in your code.
Equally, if your team is going to store 500GB of vendored components, it doesn't matter whether that's all in one place or smeared across many repos. You still have the same issue.
Absolutely, I worked on tech behemoths and smaller companies. The dev experience was significantly better when all development was local. I even worked on initiatives to move development away from the cloud, and although other devs were skeptical, they ended up loving it.
I think we don't have good solutions for scaling down prod.
Our relatively simple prod architecture has 5 containers & a hosted database (so 6 containers when run locally), and any less would impact our product goals.
I still find running prod locally valuable, and it's the most common way anyone does development here, but containers are fairly heavyweight when you want to run everything on one machine. It's also impossible if you have parts that need special accelerators to get good latency, etc.
If you're willing to build everything from scratch, you can have a framework that seamlessly lets you build conceptual services and then separate the physical deployment concerns, like Google has and sometimes even uses. But for the rest of us, cobbling together a bunch of different technologies, that's a luxury we can't really afford.
Look into dev containers — if you set one up for your repo, you get pretty much the same experience as GitHub Codespaces, but with the choice of running it locally.
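A minimal devcontainer.json looks something like this (the image, ports, and commands are placeholders):

    // .devcontainer/devcontainer.json
    {
      "name": "myapp",
      "image": "mcr.microsoft.com/devcontainers/base:ubuntu",
      "forwardPorts": [8080],
      "postCreateCommand": "make deps",
      "customizations": {
        "vscode": {
          "extensions": ["golang.go"]
        }
      }
    }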
Maybe a silly question, but why all this engineering effort when you could host the dev environment locally?
By running a Linux VM on your local machine you get a consistent environment that you can ssh to, remove the latency issues but you remove all the complexity of syncing that they’ve created.
That’s a setup that’s worked well for me for 15 years but maybe I’m missing some other benefit?
I work on this at Stripe. There's a lot of reasons:
* Local dev has laptop-based state that is hard to keep in sync for everyone. Broken laptops are _really hard_ to debug as opposed to cloud servers I can deploy dev management software to. I can safely say the oldest version of software that's in my cloud; the laptops skew across literally years of versions of dev tools despite a talented corpeng team managing them.
* Our cloud servers have a lot more horsepower than a laptop, which is important if a dev's current task involves multiple services.
* With a server, I can get detailed telemetry on how devs work and what they actually wait on, which helps me understand what to work on next; I'd have to have pretty invasive spyware on laptops to do the same.
* Servers in our QA environment can interact with QA services in a way that is hard for a laptop to do. Some of these are "real services", others are incredibly important to dev itself, such as bazel caches.
There's other things; this is an abbreviated list.
If a linux VM works for you, keep working! But we have not been able to scale a thousands-of-devs experience on laptops.
I want to double check we’re talking about the same thing here. I’m referring to running everything inside a single VM that you would have total access to. It could have telemetry, you’d know versions etc. I wonder if there’s some confusion around what I’m suggesting given your points above.
I’m sure there are a bunch of things that make it the right choice for Stripe. Obviously if you just have too many things to run at a time and a dev laptop can’t handle it then it’s a dealbreaker. What’s the size of the cloud instances you have to run on?
> I’m referring to running everything inside a single VM that you would have total access to. It could have telemetry, you’d know versions etc. I wonder if there’s some confusion around what I’m suggesting given your points above.
I don't think there's confusion. I only have total access when the VM is provisioned, but I need to update the dev machine constantly.
Part of what makes a VM work well is that you can make changes and they're sticky. Folks will edit stuff in /etc, add dotfiles, add little cron jobs, build weird little SSH tunnels, whatever. You say "I can know versions", but with a VM, I can't! Devs will update stuff locally.
As the person who "deploys" the VM, I'm left in a weird spot after you've made those changes. If I want to update everyone's VM, I blow away your changes (and potentially even the branches you're working on!). I can't update anything on it without destroying it.
In contrast, the dev servers update constantly. There's a dozen moving parts on them and most of them deploy several times a day without downtime. There's a maximum host lifetime and well-documented hooks for how to customize a server when it's created, so it's clear how devs need to work with them for their customizations and what the expectations are.
I guess it's possible you could have a policy about when the dev VM is reset and get developers used to it? But I think that would be taking away a lot of the good parts of a VM when looking at the tradeoffs.
> What’s the size of the cloud instances you have to run on?
We have a range of options devs can choose, but I don't think any of them are smaller than a high-end laptop.
So the devs don’t have the ability to ssh to your cloud instances and change config? Other than the size issue, I’m still not seeing the difference. Take your point on it needing to start before you have control, but other than that a VM on a dev machine is functionally the same as one in a cloud environment.
In terms of needing to reset, it’s just a matter of git branch, push, reset, merge. In your world that sync complexity happens all the time, in mine just on reset.
Just to be clear, I think it’s interesting to have a healthy discussion about this to see where the tradeoffs are. Feels like the sort of thing where people try to emulate you and buy themselves a bunch of complexity where other options are reasonable.
I have no doubt Stripe does what makes sense for Stripe. I’d also wager that on balance it’s not the best option for most other teams.
PS thanks for chiming in. I appreciate the extra insights and context.
> So the devs don’t have the ability to ssh to your cloud instances and change config?
They do, but I can see those changes if I'm helping debug, and more importantly, we can set up the most important parts of the dev processes as services that we can update. We can't ssh into a VM on your laptop to do that.
For example, if you start a service on a stripe machine, you're sending an RPC to a dev-runner program that allocates as many ports as are necessary, updates a local envoy to make it routable, sets up a systemd unit to keep it running, and so forth. If I need to update that component, I just deploy it like anything else. If someone configures their host until that dev runner breaks, it fails a healthcheck and that's obvious to me in a support role.
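To make that concrete in generic terms (this is an invented example, not our actual unit files or tool names), the kind of unit such a runner might write looks like:

    [Unit]
    Description=dev service: payments-api (branch my-feature)

    [Service]
    ExecStart=/usr/local/bin/dev-run payments-api --port 31337
    Environment=SERVICE_ENV=qa
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target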
> Just to be clear, I think it’s interesting to have a healthy discussion about this to see where the tradeoffs are. Feels like the sort of thing where people try to emulate you and buy themselves a bunch of complexity where other options are reasonable.
100% Agree! I think we've got something pretty cool, but this stuff is coming from a well-resourced team; the team keeping the infra for it all running is larger than many startups. There are tradeoffs involved: cost, user support, flexibility on the dev side (i.e. it's harder to add something to our servers than to test out a new kind of database on your local VM) come immediately to mind, but there are others.
There are startups doing lighter-weight, legacy-free versions of what we're doing that are worth exploring for organizations of any size. But remote dev isn't the right call for every company!
Ah! So that’s a spot where we’re talking past each other.
I’d anticipate you would be equally as able to ssh to VMs on dev laptops. That’s definitely a prerequisite for making this work in the same way as you’re currently doing.
The only difference between what you do and what I’m suggesting is the location of the VM. That itself creates some tradeoffs but I would expect absolutely everything inside the machine to be the same.
> I’d anticipate you would be equally as able to ssh to VMs on dev laptops. That’s definitely a prerequisite for making this work in the same way as you’re currently doing.
Our laptops don't receive connections, but even if they could, folks go on leave and turn them off for 9 months at a time, or they don't get updated for whatever reason, or other nutty stuff.
It's surprisingly common with a few thousand of them out there that laptop management code that removes old versions of a tool is itself removed after months, but laptops still pop up with the old version as folks turn them back on after a very long time, and the old tool lingers. The services the tools interact with have long since stopped working with the old version, and the laptop behaves in unpredictable ways.
This doesn't just apply to hypothetical VMs, but to various CLI tools that we deploy to laptops, and we still have trouble there. The VMs are just one example, but a guiding principle for us has been that the less that's on the laptop, the more control we have, and thus the better we can support users with issues.
Maybe I'm missing something here but couldn't you just track the whole VM setup (dependencies, dev tools, telemetry and everything) in your monorepo? That is, the VM config would get pulled from master just like everything else, and then the developer would use something like nixos-shell[0] to quickly fire up a VM based on that config that they pulled.
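For example, a vm.nix along these lines (the services listed are just examples; I'm assuming nixos-shell's convention of reading vm.nix from the project root):

    # vm.nix - NixOS module describing the dev VM
    { pkgs, ... }: {
      virtualisation.memorySize = 4096;        # MiB for the VM
      services.postgresql.enable = true;
      environment.systemPackages = with pkgs; [ git ripgrep ];
    }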
Yes, but this still "freezes" the VM when the user creates it, and I've got no tools to force the software running in it to be updated. It's important that boxes can be updated, not just reliably created.
As just one reason why, many developers need to set up complex test data. We have tools to help with that, but they take time to run and each team has their own needs, so some of them still have manual steps when creating a new dev server. These devs tend to re-use their servers until our company-wide max age. Others, to be fair, spin up a new machine for every branch, multiple times per day, and spinning up a new VM might not be burdensome for them.
Isn't this a matter of not reusing old VMs after a `git pull/checkout`, though? (So not really different from updating any other project dependencies?) Moreover, shouldn't something like nixos-shell take care of this automatically if it detects the VM configuration (Nix config) has changed?
> Isn't this a matter of not reusing old VMs after a `git pull/checkout`, though?
Yes, but forcing people to rebase is disruptive. Master moves several times per minute for us, so we don't want people needing to upgrade at the speed of git. Some things you have to rebase for: the code you're working on. Other things are the dev environment around your code, and you want that to be outside the checkout as much as possible. And as per my earlier comment, setting up a fresh VM can be quite expensive in terms of developer time if test data needs to be configured.
You seem to assume you would have to rebuild the entire VM whenever any code in git changes in any way. I don't think you do: You could simply mount application code (and test data) inside the VM. In my book, the VM would merely serve to pin the most basic dependencies for running your integration / e2e tests and I don't think those would change often, so triggering a VM rebuild should produce a cache hit in 99% of the cases.
I think this is where our contexts may differ, and so we end up with different tradeoffs and choices :) The services running on our dev servers are updated dozens of times per day, and they roughly correspond to the non-code parts of a VM.
Or maybe we just used terminology differently. :) Why wouldn't those services be part of the code? After all, I thought we were talking about a monorepo here.
I see in another comment thread you mentioned downloading the VM iso, presumably from a central source. Your comment in this thread didn't mention that so perhaps this answer (incorrectly) assumes the VM you are talking about was locally maintained/created?
To provide historical context, 10 years ago there was a local dev infrastructure, but it was already so creaky as to be unreliable. Just getting the ruby dependencies updated was a problem.
The local dev was also already cheating: All the asynchronous work that was triggered via RabbitMQ/Kafka was getting hacked together, because trying to run everything that Infra/Queues did locally would have been very wasteful. So magic occurred in the calls to the message queue that instead triggered the crucial ruby code that would be hit in the end.
So if this was a problem back then, when the company had fewer than 1000 employees, I can't even imagine how hard it would be to get local dev working now.
Sounds like you made a massive tradeoff in code coupling if you can't easily swap out remote queues for local ones, etc. But I get it: when you're thinking cloud-first, understanding where your abstractions start or end can be a complex topic that creates flow-on effects and often stops the whizz-bang cloud demo code from copy/paste working in your solution. Depending on the stage of your company, this could be a feature or a bug. Maybe you have so much complexity in your solution from spreading business logic across services that it only makes sense when you're developing against prod-like infra, and in that scenario I'm seeing a benefit of having cloud-first dev infra, because keeping that beast tamed otherwise would be a monumental challenge given the penchant for cloud-first to be auto-update-everything.
The way these problems are stated might make it seem like they're unsolvable without a lot of effort. I just want to point out that I've worked at places that do use a local, supported environment, and it works well.
Not saying it's the wrong choice for you, but it's a choice, not a natural conclusion.
In my opinion the single most important feature of any development environment is a reliable “reset” button.
The amount of time companies lose to broken development environments is incredible. A developer can easily lose half a day (or more) of productive time.
With cloud environments it’s much easier to offer a “just give me a brand new environment that works” button somewhere. That’s incredibly valuable.
For sure, but a VM has that feature too. They have to run some services directly on the laptop to handle the code syncing. So if you accept a certain amount of “need to do some dev machine setup” as a cost, installing Parallels and running a script to download an ISO is a pretty small surface area that allows for a full reset.
I don’t doubt that Stripe has a setup that works well for them, but I also bet they could have gone down a different path that also worked well, and I suspect that other path (local VMs) is a better fit for most other smaller teams.
From what I remember (left Stripe in late 2022), much of Stripe's codebase was/is a tangled Ruby "big ball of mud" monorepo due to a lack of proper modules. Basically a lot of the core modules all imported code from each other with little layering, so you couldn't deploy a lean service without pulling in almost all of the monorepo code. And due to the way imports worked, it would load a ton of this code at runtime. This meant that even a simple service would have extremely high memory usage and be unsuitable for a local dev environment where you have N of these bloated services running at the same time. There was a big refactoring effort to get "strict modules" in place to cut down on this bloat, which had some promising results. I'm not an expert in this area but I believe this was the gist of it.
You're limited by the resources available to you on your local laptop and when you close that laptop the dev environment stops running. Remote dev environments are more costly and complicated to maintain but they can be shared, can scale vertically (or horizontally) on demand, can persist when you exit them, and managing access to various internal services from dev environments can in some cases be simpler.
It also centralizes dev environment management to the platform team that owns them and provides them as a service which cuts down on support tickets related to broken dev environments. There are certainly some trade offs though and for most companies a local VM or docker compose file will be a better choice.
There also tend to be security advantages that help mitigate/manage dev risks. Typically hosts will have security tooling installed (AV, EDR, etc.) that may not be installed on local VMs, hosts are ephemeral so quickly created and destroyed, there are network restrictions, etc.
Not even once did I want to share my dev. environment, nor did anyone want to share mine. We are talking about 25-odd years of being a developer.
Never in my life did I want to scale my dev. environment vertically or horizontally or in any other direction. Unless you work on a calculator, I don't know why you would need that.
I have no problems with my environment stopping when I close my laptop. Why is this a problem for anyone?
For the overwhelming majority of programming projects out there, they fit on a programmer's laptop just fine. The rare exceptions are the projects which require very specialized equipment not available to the developers. In any case, a simulator would usually be a preferable way of dealing with this, and the actual equipment would only be accessed for testing, not for development. Definitely not as part of the routine development process.
Never in my life did I want the development process to be centralized. All developers have different habits, tastes and preferences. The last thing I want is centralized management of all environments, which would create unwanted uniformity. I've only been at one company that tried to institute a centrally-managed development environment in the way you describe, and I just couldn't cope with it. I quit after a few months of misery. The most upsetting aspect of these efforts is the stupidity: they solve no problems, but add a lot of pain that is felt continuously, all the time you have to do anything work-related.
I get a serious feeling that interpreted languages, monorepos, environment orchestration, snapshot ecosystem aggregators, and per-function execution environments are all pushing software development in the wrong direction.
Those things are not bad by themselves. But people tend to do bad things with them, and those bad things spread remarkably well, disrupting every place they infect.
I'm not sure why monorepos are in the list. Care to elaborate?
I've worked on projects that used a single repository for all the code written by different departments, and projects where the same department could have multiple repositories. The latter added an insane amount of busy work, an inordinate number of errors, difficulty investigating failures, and excessive use of resources to house the various permutations of systems created at different times with different combinations of components. Day-to-day life in such projects could be described as developers waiting for the infra people to sort out the morning problems which mysteriously broke everything all at once so that no progress could be made.
This was in stark contrast to companies working on a single repository, where days when nothing worked would happen maybe once or twice a year.
I also lived through transitions from multiple repositories to a single repository and the other way around. In operational terms, I've never seen any beneficial effects of splitting a repository. Not in the short, nor in the long term. Complexity always went up, productivity went down, and general satisfaction with project infrastructure would also go down with such a change. Departments would start attacking and blaming the infra people for creating obstacles to their progress (while never explicitly mentioning the split repository because, usually, that was a decision made by the same people complaining).
Oh, yeah, all of those issues of enforcing transitive dependencies that need busy work to update, fluid APIs that make all the code around it break, lack of semantic boundaries that make it hard to decide if a problem is local, inter-component interference so that you have to select them perfectly well...
All of those are enabled by monorepos. And once people learn to do them, they seem to want to apply everywhere.
The absurd lengths people will go to avoid learning how computers actually work because they fell for the buy now, pay later promise of 'easy' development.
Talking from the perspective of someone who worked at Google and one other similar company that shall remain nameless... as well as simply looking at places like GitHub where people tend to post projects they are working on: I don't know of any GitHub project that would even be in the size range to cause any discomfort for a laptop user.
Even when it comes to the larger projects: I have multiple checkouts of GCC and the Linux kernel on my laptop, and when I run du their existence doesn't even register in the first dozen results... Of course, proprietary projects tend to be on the bigger side due to putting a lot of not-strictly-code-related stuff in the repository, but still... it would have to be billions of LoC to be prohibitively big for a typical laptop.
If you have 100 services in your org, you don't have to have all 100 running at the same time on your local dev machine. I only run the 5 I need for the feature I'm working on.
We have 100 Go services (with redpanda) and a few databases in docker-compose on dev laptops. It works well, and we buy the biggest-memory MacBooks available.
Your success with this strategy correlates more strongly with ‘Go’ than with ‘100 services’, so it’s more anecdote than a generally applicable claim that you can run 100 services locally without issues. Of course you can.
Buying the biggest MacBook available as a baseline criteria for being able to run a stack locally with Docker Compose does not exactly inspire confidence.
At my last company we switched our dev environment from Docker Compose to Nix on those same MacBooks and CPU usage went from 300% to <10% overnight.
Have any details on how you've implemented Nix? For my personal projects I use nix without docker and the results are great. However I was always fearful that nix alone wouldn't quite scale as well as nix + docker for complicated environments.
Hi Jason! Like many others here I'm looking forward to that blog post! :-)
For now, could you elaborate on what exactly you mean by transitioning from docker-compose to Nix? Did you start using systemd to orchestrate services? Were you still using Docker containers? If so, did you build the images with Nix? Etc.
When we used docker-compose we had a CLI tool which developers put in their PATH which was able to start/stop/restart services using the regular compose commands. This didn’t accomplish much at the time other than being easy to remember and not requiring folks to know where their docker-compose files were located. It also took care of layering in other compose files for overriding variables or service definitions.
Short version of the Nix transition: the CLI tool would instead start services using nix-shell invocations behind pm2. So devs still had a way to start services from anywhere, get logs or process status with a command… but every app was running 100% natively.
At the time I was there, containers weren’t used in production (they were doing “App” deploys still) so there was no Docker target that was necessary/useful outside of the development environment.
Besides the performance benefit, microservices owning their development environment in-repo (instead of in another repo where the compose configs were defined) was a huge win.
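To sketch the shape of that pm2-over-nix-shell idea (service names and commands here are invented, not the actual tool's config):

    // ecosystem.config.js - each app's start command runs inside nix-shell
    module.exports = {
      apps: [
        {
          name: "api",
          script: "nix-shell",
          args: ["--run", "bundle exec puma -C config/puma.rb"],
          cwd: "./services/api",
          interpreter: "none",   // treat nix-shell as a plain binary
        },
      ],
    };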
several nixy devtools do some process management now
something we're trying in Flox is per-project services run with process-compose.
they automatically shut down when all your activated shells exit, and it feels really cool
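a tiny example of what a per-project process-compose.yaml can look like (process names and commands are just placeholders):

    # process-compose.yaml
    version: "0.5"
    processes:
      db:
        command: "postgres -D .devdata/pg"
      web:
        command: "go run ./cmd/web"
        depends_on:
          db:
            condition: process_started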
I've been on this path and as soon as you work on a couple of concurrent branches you end up having 20 containers in your machine and setting these up to run successfully ends up being its own special PITA.
What exactly are the problems created by having a larger number of containers? Since you’re mentioning branches, these presumably don’t have to all run concurrently, i.e, you’re not talking about resource limitations.
Large features can require changing protocols or altering schemas in multiple services. Different workflows can require different services, etc. Keep track of different service versions in a couple of branches (not unusual IMO) and it just becomes messy.
You could still run the proxy they have that lazy boots services - that’s a nice optimisation.
I don’t think that many places are in a position where the machines would struggle. They didn’t mention that in the article as a concern - just that they struggled to keep environments consistent (brew install implies some are running on osx etc).
I think it’s safe to assume that for something with the scale and complexity of Stripe, it would be a tall order to run all the necessary services on your laptop, even stubs of them. They may not even do that on the dev boxes, I’d be a little surprised if they didn’t actually use prod services in some cases, or a canary at any rate, to avoid the hassles of having to maintain on-call for what is essentially a test environment.
I don’t know that’s safe to assume. Maybe it is an issue but it was not one of the issues they talk about in the article and not one of the design goals of the system. They have the proxy / lazy start system exactly so they can limit the services running. That suggests to me that they don’t end up needing them all the time to get things done.
Working in a configuration where your development environment isn't on your computer is always a huge downgrade. Work with VM? -- sooner or later you'll have problems with forwarding your keyboard input to the VM. Work with containers? -- no good way to save state, no good way to guarantee all containers are in sync etc. God forbid any sort of Web browser-based solution. The number of times I accidentally closed the tab or did something else unintentionally because of key mapping that's impossible to modify...
However, in some situations you must endure the pain of doing this. For example, regulatory reasons. Some organizations will not allow you to access their data anywhere but on some cloud VM they give you very botched and very limited control over. While, technically, these are usually easy to side-step, you are legally required to not move the data outside of the boundaries defined for you by the IT. And so you are stuck in this miserable situation, trying to engineer some semblance of a decent utility set in a hostile environment.
Another example is when the infrastructure of your project is too vast to be meaningfully reduced to your laptop, and a lot of your work is exploratory in nature. I.e. instead of typical write-compile-upload-test you are mostly modifying stuff on the system you are working on to see how it responds. This is kind of how my day-to-day goes: someone reported they fail to install or use one of the utilities we provide in a particular AWS region with some specific network settings etc. They'd give me a tunnel to the affected cluster, and I'd have some hours to spend there investigating the problem and looking for possible immediate and long-term solutions. So, you are essentially working in a tech-support role, but you also have to write code, debug it, sometimes compile it etc.
What you describe isn't a development process in a remote environment. You are testing on some remote compute resource. Testing is a non-essential part of development, so, in a sense it "doesn't count" that you test somewhere else -- you cannot call it "developing in a remote environment".
Otherwise, you could say that, for example, reading documentation on a Web page you are doing "development in a remote environment" because, well... most likely that Web page isn't hosted on your laptop.
The essential and mandatory part of development is that a program is written. If you write the program on your laptop, you aren't doing "remote development", no matter where other tools you use for development are running.
The year of Linux on the laptop has yet to arrive for most of us. Windows and MacOS both offer better battery life, if for no other reason (and there are usually other reasons, like suspend/wake issues, graphics driver woes, etc.)
Agreed. It's so much simpler when people run Linux locally too. Most of our dev environment problems are from people who don't. When you run it locally you also get good at using it which, unsurprisingly, helps a lot when you have to figure out a problem with the deployed version. Learning MacOS/Windows is kinda pointless knowledge in the long run.
At my workplace, Guix's lack of macOS support takes away some of the benefit of using something like Nix or Guix as opposed to HVM solutions like Docker Desktop or Vagrant. I imagine this situation is unfortunately common.
For teams where GNU/Linux is the primary development OS, Guix seems like a great choice.
I set this up for my last company where we had all sorts of “works on my machine issues” and a needlessly painful onboarding experience. Local development became streamlined with this tooling BUT pre-apple silicon macs couldn’t handle running Docker like this. Glacially slow. We had a python monorepo with a bunch of services within it.
I am curious whether nix is an alternative / improvement for this. Was going down the nix road at first but an infrastructure team member steered me toward devcontainers instead and I’ve been pretty happy since!
FYI, I've helped set up StableBuild (https://www.stablebuild.com) to help pin stuff in Docker that's normally virtually impossible to pin (e.g. OS package repos, Docker base images, random files from the internet, etc.)
Different kind of rot. With Nix and flakes, I can come back to a project 5 years later and, as long as external dependencies (i.e. package sources) are still available, it will bring me straight back to that environment like it was yesterday.
If you have a Dockerfile from 5 years ago...well good luck building it today.
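For anyone who hasn't seen one, a minimal flake-based dev shell looks like this (the package list is just an example):

    {
      inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";

      outputs = { self, nixpkgs }:
        let pkgs = nixpkgs.legacyPackages.x86_64-linux; in {
          devShells.x86_64-linux.default = pkgs.mkShell {
            packages = [ pkgs.go pkgs.postgresql pkgs.nodejs ];
          };
        };
    }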
my point stands, it's still trying to lie about the meaning of the word rot. it's just as delusional as docker's original "rotting software will run in 5yrs" argument... nothing there goes against rot
Containers are a great deployment target, but they're not really a great development environment for a few reasons (e.g., they're Linux-specific, so they require extra virtualization on non-Linux operating systems, the kind of isolation they provide is more of a hindrance than a help when it comes to working on your local filesystem, and for them to be useful you have to set up infrastructure to push and pull your private containers to and from).
Nix is a better fit for this, and when you're using Nix you can also have Nix-generated containers for deployment. I think you can also use a container with Nix in it to provide the devcontainers interface to devs who don't have Nix installed locally, and have it in turn use Nix against your project's flake to set up its environment.
IIRC, it uses what is defined for the shell environment. Just instead of activating it on your machine, it produces an OCI image with that environment.
I have nixOS definitions that I can use to make a SD card image, overtake a running linux system via ssh, deploy to nixos via ssh, or deploy to a local system - all from one definition.
It (obviously) leverages Nix, which in turn means the environment is declarative and fully reproducible (not "reproducible" as in docker). Now, you can use just Nix's devShells, but with devenv you have a middleground between just Nix package manager and a full fledged NixOS module system. Basically, write out one line of code - and you've got your Postgres, another one - full linter set up for whatever language you're using, etc.
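Roughly what that looks like in devenv.nix (option names as I remember them from the devenv docs, so double-check against the current version):

    { pkgs, ... }: {
      languages.ruby.enable = true;            # toolchain for the project
      services.postgres.enable = true;         # one line: a local Postgres
      pre-commit.hooks.rubocop.enable = true;  # one line: linter wired up
    }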
Can I also get the security/isolation benefits that a duly configured docker/podman can provide (container can only act on mounted volume, non-root user, other seccomp settings?).
I feel better doing my "npm install"s in such an environment (of course it's still not a VM – but that's another topic).
When I read about nix, reproducibility is a goal, but security/isolation is a non-goal.
So you can use them in conjunction (or alternation, if for some projects you're okay running without a container) without having to specify your development environments twice.
> I feel better doing my "npm install"s in such an environment (of course it's still not a VM – but that's another topic).
There are basically two kinds of integration you can do for a project with Nix, which I'll call deep and shallow. In shallow integration, you just have Nix provide the toolchain and then you build the project (manually, with a script, with a Makefile, whatever). This is pretty common and pretty easy, and gives you no protection from malicious NPM build scripts.
For deep integration, you can actually have Nix build your whole project. This has some downsides, like that it can't really handle incremental builds. It also imposes restrictions, like no network access by anything but Nix at build time, all packages are built by special build users with no homedirs and no perms to access anything, etc. When you do that kind of build/install, you do get some protection from crypto miners lurking in the NPM registry or PyPI or whatever.
My small team uses devenv for all our development environments and we really like it. Local DX is really important to me and to our team, which is a big part of why we've chosen Nix and devenv.
As we've started to use it more extensively, we've also found that we want to add some enhancements, work out some bugs, and experiment with our own customizations out-of-tree, etc. I'm happy to report here on HN that devenv is well-documented and easy to extend for Nix users who have some experience with Nix module systems, and that Domen is really responsive to PRs. :)
I think for smaller companies, you can get a long way towards a lot of this with judicious use of docker-compose, and convenience scripts in a Makefile. As long as you don't do anything stupid like try and spin up 100 services when you're a team of 8, most laptops these days are sufficiently capable of handling a database, Redis, your codebase, and something like LocalStack.
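The convenience scripts don't need to be anything fancy; a sketch of the kind of Makefile I mean (service names are placeholders, recipe lines indented with tabs):

    .PHONY: up down logs test

    up:
    	docker compose up -d db redis localstack

    down:
    	docker compose down -v

    logs:
    	docker compose logs -f

    test: up
    	go test ./...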
I would say you can even go a looong way without any Docker at all.
And for the large majority of companies/projects, if your project is so complex and heavy on resources that it doesn't fit on a modern laptop, the problem is not the laptop; it's the whole project and the culture and cargo-cult around "modern" software development.
Containers/VMs are a nice way to isolate away machine configuration discrepancies. Conversely, they do encourage the use of non-hermetic, non-deterministic build systems, which come with other issues too (e.g. speed differences surfacing race conditions in the build).
>Some caveats: It’s been nearly five years, and I have no doubt that I have misremembered some of the specific details, even though I’m confident in the overall picture. I’m also certain that Stripe has continued evolving and I make no claim this document represents the developer experience at Stripe as of today.
Are there any more recently ex-Stripe folks here willing and able to comment on how Stripe's developer environment might have evolved since the OP left in 2019?
The biggest difference not mentioned is the article is that code is no longer kept on developer machines. The sync process described in the article was well-designed, but also was a fairly constant source of headaches. (For example, sometimes the file watcher would miss an update and the code on your remote machine would be broken in strange ways, and you'd have to recognize that it was a sync issue instead of an actual problem with your code.) As a result, the old devbox system was superseded by "remote devboxes", which also host the code. Engineers use VSCode remote development via SSH. It works shockingly well for a codebase the size of Stripe's.
There are actually several different monorepos at Stripe, which is a constant source of frustration. There have been lots of efforts to try to unify the codebase into a single git repo, but it was difficult for a lot of reasons, not the least of which was the "main" monorepo was already testing the limits of the solution used for git hosting.
Overall, maintaining good developer productivity is an extremely challenging problem. This is especially true for a company like Stripe, which is both too large to operate as a "small" company and too small to operate as a "big" company. Even with a well-funded team of lots of super talented people putting forth their best efforts, it's tough to keep all of the wheels fully greased.
Glad to see that they moved to code living with the execution environment. The code living separate from the execution environment seemed like too much overhead and complexity for not enough benefit.
Especially given VSCode, or Cursor ;), work so well via ssh.
To the engineers that don't want to use those IDEs it might suck temporarily, but that's it.
IntelliJ is also supported. If you want to use something else, like VIM, then you need to ssh into the remote devbox machine. They have support for custom dotfiles, so you can set up your cool VIM environment for all your remote devboxes.
If you don't want remote devboxes, the regular devboxes still work. You just need to deal with the additional pain for syncing the files.
* Code is off of laptops and lives entirely on the dev server in many (but not all) cases. This has opened up a lot of use cases where devs can have multiple branches in flight at once.
* Big investments into bazel.
* Heavier investment into editor experiences. We find most developers are not as idiosyncratic in their editor choices as is commonly believed, and most want a pre-configured setup where jump-to-def and such all "just work".
That last point has long been a red flag when interviewing. A developer who doesn't care about their tooling also tends to not care about the quality of their work.
I'd rather work with developers who are flexible and open minded about the conditions they can work in than those who get notoriously pissy if things aren't set up exactly the way they like it. Especially when that way is ridiculously elaborate and non-standard.
I'm glad to see that first bullet point. The code living separate from the execution environment seemed like too much overhead and complexity for not enough benefit.
Not ex-Stripe but in "close relationship" with them since its inception and there's a clear mark in my calendar circa end of 2018 when their decisions and output started to become... weird, or ill-designed.
I don't think it has to do with the dev environment itself, but I'd blame such thing for allowing to deliver "too fast" without thinking twice. Combine that with new blood in management and that's an accident waiting to happen *
They're the best in business still, but far from the well-designed easy-to-use API-first developer-friendly initial offering.
Though I am under the impression that things have gotten more sensical internally over the last year or so.
Note also that the devprod team has largely been shielded from the craziness, and may still be making good decisions (but I don't know what they are in this realm personally).
I was only there in 2022, but at that point there were in fact three or more monorepos (forked roughly based on toolchain: Go and Scala in one, primarily Ruby in the one detailed here, and one for the client Stripe API libs that was JS only). There may have been more.
I use syncthing to manage the synchronization of files between local laptop and remote development server. The software code base is upwards of 20 years and has dependencies on Windows for runtime. I can run unit tests locally on very fast MacBook Pro or run it much slower on Windows VM. With syncthing I can easily edit files locally or remotely and they are available locally for source control.
The worst problem is refining the ignore settings to ensure only code is synced, preventing conflicts on derived files, and making sure no rule accidentally matches code file names.
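For reference, my .stignore ends up looking something like this (the patterns are illustrative; the real list is longer):

    // .stignore - keep derived/build output out of the sync
    (?d).DS_Store
    bin
    obj
    *.pdb
    *.user
    // everything not matched above (i.e. the code) stays in sync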
I love this. I believe I might have even interfaced with your team around that time. I was leading Facebook's (now Meta) Developer Products team and we were building against super similar areas internally.
We ran back then a similar project that I coined "Developer On-Demand" to tackle that same problem space. It's also what eventually led me to discover the magic of Nix and then build Flox.
I also agree with a lot of what was shared in other comments: while the problems we tackled were at large orgs such as Facebook, Shopify, Uber, Google (to name a few teams I remember working with) and obviously also Stripe, certain areas of the pain are 100% universal regardless of team size.
On the Flox side, we're trying to help with a few of them today and hopefully many more in the near future; very open to thoughts! Things like making Nix simple to use for each of your projects, keeping deps and config up to date across everyone's MacBooks and Linux boxes, etc. -- even if you don't have a full AWS team and language-server team ready to support you.
We use similar practices in our 3.5 person team; we work via code-server and Aider with our own tooling on VPSs and this gets synced to execution VPSs which run dev versions, a lot of sentry logging and tests (mostly playwright these days). There is also a vps which does builds all day and logs to Sentry too. We can almost instantly get on our own test versions and see what we did, and, over the space of some seconds to minutes we see test and build data coming in. It works incredibly well for many years already. Onboarding people is easy and no one ever has 'it doesn't build on my system' as that's not something we do (you can of course, all scripts are there but why waste the time?).
I grew up with mainframes, minis and unix batch and/or multiuser machines; for me this is the best way for business applications. I didn't particularly like the move to local all that much.
I'd suggest you revise your competitor analysis. Bazel definitely has a test command that with remote execution and caching absolutely allows you to run entire test suites in seconds* both locally and in CI eg. https://blog.aspect.build/typescript-with-rbe
> This blog post says 2 and a half minutes not seconds.
It's meaningless to say "we can run tests in seconds". You can't run my tests in seconds because they're single threaded and take 10 minutes. The important thing is the speedup, and they got a pretty good speedup. Arguably the nop build/test time is important too but it doesn't look like they measured that.
> Bazel does not solve this problem out of the box.
Yes it does.
> I wonder why stripe didn’t “just use Bazel”.
In my experience it's because setting up Bazel is a) more work than setting up some ad-hoc build system (Make or CMake or whatever) and b) difficult to switch to retrospectively. So it only gets used where you have people who are experienced enough to know that you will wish you had started with it, and can convince the inexperienced people that it's worth the effort.
Usually you get too many inexperienced people saying "it's too difficult; we'll be fine with Make".
Stripe does use Bazel. It just didn't exist before Stripe built some of its own internal systems, but it's gradually replacing ~everything from a build standpoint.
The one thing to know about Bazel is that it's both incredibly impressive, and also one of the least ergonomic pieces of software ever created. It's very clearly an internal project which was cleaned up and open sourced without any attempt to make it more usable outside of Google.
Bazel's kind of like Kubernetes in a way -- you don't actually get enough benefits to adopt it until you're at a certain point in the company lifecycle, and to get to that point you usually have to build other systems first. Then you have to gradually replace those systems with Bazel.
The first release of Bazel was in 2015, when Stripe was already 5 years old and the progenitor of this tooling was already running with several dozen users.
To be clear, the sync step is used for test-suite execution, not only for one-off command running; it's just something we can also easily do because we have a hot environment in the cloud.
> They don’t work from your local development env and also work in your CI env.
This is one of the biggest selling points of Bazel-like build systems. Like, to the extent that, for some changes, Bazel can say "even though you changed this source file, I can be 100% certain that that change didn't affect any tests, and so I will not run them".
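As a rough sketch of how that selection can work (the package and file labels here are invented for illustration, not from the article), you can ask Bazel which test targets transitively depend on a changed file and run only that slice:

    # Which test targets (transitively) depend on this changed source file?
    bazel query 'kind(".*_test", rdeps(//..., //payments/api:client.go))'
    # Run only those affected tests:
    bazel test $(bazel query 'kind(".*_test", rdeps(//..., //payments/api:client.go))')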
> Finally: the development experience, of course, is only part of the story: the full lifecycle of code and features continues onward into CI and code review and ultimately through deployment into production, where it will be further observed, debugged, and evolved. Writing about those systems would require further posts at least this long.
In case the author is around: I would love to read those!
They decided to keep the code on the local machine, but the language server on the remote one. That seems like a recipe for inconsistency. You only get relevant results from your language server once your code has synced.
The article mentions that the LSP itself already has baked-in support to enable editors to send chunks of unsaved edits to the language server (LS) as they happen.
What Stripe’s configuration introduced is that they used a remote LS instead of the default local LS. Regardless, VS Code already defers LSP communication until it feels idle, and developers are used to that. So I wouldn’t expect a remote LS to significantly impact the level of inconsistency that developers already accept when using a local LS.
On the other hand, there was so much code that running everything on your own laptop was essentially out of the question. Doing a git pull after a long vacation locked up your dev box for a hot minute while it checked all the types—doing the same thing on your MacBook would be painful at best.
The code syncs on every keystroke. Consistency isn't an issue unless you are having connection issues. And if you are then pretty much all development is broken anyways.
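For context, the sync layer mentioned elsewhere in the thread is built on Watchman; a minimal sketch of that kind of save-triggered sync (not necessarily how Stripe's tool works, and `./sync-changed-files` is a hypothetical script that receives the changed paths as arguments and rsyncs them to the devbox) could look like:

    # Watch the repo and fire the sync script whenever Ruby files change
    watchman watch ~/my-repo
    watchman -- trigger ~/my-repo sync-to-devbox '*.rb' -- ./sync-changed-files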
We’ve been using a hundred repositories and a hundred Go services in a local docker-compose setup that’s worked fairly well. CI runners can struggle if their disks can’t keep up with Docker.
It comes up occasionally that we should build a devprod setup for front-end folks so that the backend is abstracted away more.
Overall a lot of people prefer local dev because it gives them access to the entire stack, lets them run branch images easier, and has better performance than remote boxes.
So, from a gut feeling, that sounds right: finance is a pretty complicated domain with a lot of per-vendor interactions, and Shopify outsources its payment stuff to Stripe.
Also, on a headcount level, Google tells me Shopify has 3,500 employees to Stripe's 9,500. Obviously neither company is composed entirely of engineers, so this is a ballpark estimate.
GitHub feels like the real case where there might be a larger codebase. It's in the middle for employees (6,500), but it's existed longer than Stripe (though not as much longer as my gut feeling told me, interestingly).
It's possible to get stuck in merge hell where all your reviewers OK the PR but someone merged a conflicting change 2 seconds ago, or you've got a reviewer in Singapore while you're in SF and conflicts appeared overnight.
In general it was pretty rare, in my experience. The code bases were pretty well modularized.
PRs are not split into submodules/frameworks for distinct review purposes. Functionally, though, right now we have three distinct monorepos (not very 'mono', but we're working on it!) that represent our three main development stacks.
In our PR tooling, there's nothing that enforces/encourages scoping changes to a specific subset of any given repo. We generally encourage smaller PRs as a best practice, but a huge chunk of the benefit of a monorepo is that folks can make cross-cutting changes if/when necessary.
We have some custom goo on top of GitHub that manages the review flow. Specifically, we try to make it easy for folks to farm out reviews to the teams that own the code they're editing, so they don't have to hop into Slack and track down reviewers.
NB. What the article describes isn't a developer environment in the cloud. It's testing in the cloud. The editor in their model lives on the programmers' laptops, the editing happens there as well and so on. The code is deployed to cloud infrastructure for testing.
"I’ve described a lot of fairly-involved custom tooling; we needed enough engineers to build and maintain it, and enough “customer” engineers for that investment to pay off."
This is so important when deciding to re-invent the wheel. I've gotten bitten by this many times.
I wonder if there's a devbox-as-a-service tool out there. I use a MacBook Air for most of my work and would occasionally benefit from a beefier machine in the cloud. I just don't want to set up a machine, set up sync, etc.
Thanks. I wonder what the experience is like working on a very large codebase with or without a framework. E.g. Stripe vs Shopify.
Or if the framework is barely noticeable at that scale and doesn't really matter anymore. That's the impression I get for Instagram (which was built with Django).
This is an awesome writeup of the tools and culture issues you run into maintaining dev environments.
From the post, the problems that justified central dev boxes are roughly:
1. dependency / config mgmt / env drift on laptops
2. collaboration / debugging between engineers
3. compute scaling + optimization
4. supporting devs with updates and infra changes
The last one is particularly interesting to me, because supporting the dev env is a separate engineering role/task that starts small and grows into teams of engineers supporting the environment.
I'm helping build Flox.
We're working on these pain points by making environments (deps, vars, services, and builds) workable across all kinds of Mac/Linux laptops and servers.
1) a. Virtualize the package manager per-project.
b. Nix packages install across OS/arch pretty well.
2) Imperative actions like `flox install`/`flox upgrade` always edit a declarative env manifest.toml -- share it via git.
3) Fewer Docker VMs -- get more out of the dev team's MacBooks.
4) Reduce toil with versioned, shareable envs
--> less sending of ad-hoc config and brew commands to people (as mentioned in the post).
Just `git pull && flox activate` -- a rough sketch of that flow is below.
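To make that concrete, here's a minimal sketch of the flow, assuming the environment lives in a `.flox/` directory checked into the repo and using an arbitrary package as the example:

    # One person sets up and shares the environment:
    flox init                      # create a project environment
    flox install nodejs            # imperative install, recorded declaratively in manifest.toml
    git add .flox && git commit -m "Add shared flox environment"

    # Everyone else afterwards:
    git pull && flox activate      # drop into the shared env on any Mac/Linux box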
I think, on problem point #2, collab tools are advancing to the point where pairing on features, bugs, and env issues can be done without central SSH (e.g. tmate, VS Code Live Share, screen sharing, etc.). However, that does sort of fall apart on laptops for async debugging of env issues (e.g. when devprod is in the US and eng is in London).
Having universal telemetry on ephemeral cloud dev-boxes with a registry and all of the other DNS and SSH goodies could be the kind of infra to aspire to as your small teams run into more big-team problems.
In the Stripe anecdote, adopting the centralized infra created new challenges that their devprod teams were dedicated to supporting:
- international latency from central, US-based VMs
- syncing code to the dev boxes (https://facebook.github.io/watchman/)
- linting, formatting, generating configs (run it locally or serverside?)
- a dev workflow CLI tool dedicated to dev-box workflows and syncing with watchman's clock
- IaaS, registry, config, glue for all the servers
This is all very non-trivial work, but maybe there's a future where people can win some portability with Flox while they are small and grow into those new challenges when it's truly needed -- then their laptop environments just get a quick `flox activate` on some new, shiny servers or cloud IDEs.
I really like the author's notes on how using the Language Server Protocol across a high-latency link has great optimizations that work alongside the Watchman sync for real-time code editing.
Yet another replay of timesharing development experiences; I guess we need a couple more generations to count how many times the pendulum swings back and forth during a developer's lifetime.
"This scale – the scale of devprod, and in turn the scale of the overall organization, such that it could afford 10 FTEs on tooling – was a major factor in our choices"
Is basically the summary for most mono/multi repo discussions, and a bunch of other related ones.
It doesn't matter whether you have a monorepo or a multi-repo setup: you will need engineers on tooling to make it work if your project is large. There are pros and cons to both multi-repo and monorepo, with no one right answer (despite what some will tell you). They are different pros and cons, and which is best depends on your particular context.
Yeah that was my point. In the end both approaches can be fine (depends on your context). The real difference is that whatever choice you take, it will need the right investment in tooling and support.
Multirepo also comes with cost overhead. I think people talk about it somewhat less. I’ve worked at multirepo and monorepo places, both, before. My current company has a multirepo setup and it sure seems like it comes with plenty of tooling to fetch dependencies. That tooling has to be supported by FTEs.
+1. I'd go as far as to say that multi-repo probably needs as much, if not more, effort to keep functioning properly, but all that effort is better "hidden", so people assume monorepos are more work.
With a monorepo, it's common to have a team focused on tooling and maintaining the monorepo. The structure of the codebase lends itself to that.
With a multirepo codebase, it's usually up to different teams to do the work associated with "multirepo issues": orchestrating releases, handling dependencies, dev environment setup, etc. So all that effort just kinda gets "tucked away" as overhead that each team assumes, and isn't quite as visible.
Internally, they definitely do. I worked in Stripe's monorepo many years ago, and I am now working at a larger company with massive numbers of repos. The difference in pain has little to do with mono vs. multi, but with the capabilities of your tooling team.
If there's anything I'd say to low-level execs, the kind that end up with a few hundred developers under them, it's that mis-sizing the tooling team, in one way or the other, comes with total productivity penalties that will appear invisible, but will make everything expensive. Understanding how much of a developer's day is toil is very important, but few really try to figure that out.
I think a lot of this kind of thing comes about because, with a monorepo, you can actually see the problems to solve, whereas with polyrepos you can easily end up with the same N engineers firefighting the same problems K times across all of them.
You have different problems with both. Some problems are hidden in one or the other, but there is no single best answer (unless your project is small/trivial, which is what a lot of them are).
How does a payment service wind up with over 1,000 engineers?
I understand that "engineers" may not mean "developers"; it could be DevOps, site reliability, and all the bits and pieces that make up a large service provider, but over 1,000?
It's a fair question. I think you need ~10 or so to do the work, and 990 to bikeshed the horrors that come about when what should be a few 3 MB HTTPS APIs at $50-100/month gets split between 1,000 resume-hungry engineers with a taste for cloud complexity.
Probably has something to do with the products that go beyond accepting credit cards, listed on https://docs.stripe.com/products, though I suspect operating just the credit-card-accepting part is harder than you imagine.
This isn't really recommended practice, and there is nothing here which justifies having to maintain a huge codebase in a single folder, or in multiple folders inside one larger one.
I wouldn't be surprised if many people need a safari map, or README documentation in every single folder, to navigate a repository as large as Stripe's.
It sounds like the emergence of a new bad practice if you find yourself praising how large your codebase is.
> I wouldn't be surprised if many people need a safari map, or README documentation in every single folder, to navigate a repository as large as Stripe's.
No different to having thousands of smaller repos instead.
I personally dislike monorepos for very niche, in-the-weeds operational reasons (as an infra person), but their ergonomics for DX cannot be overstated.
The 'ergonomics for DX' benefit is that you can share code across projects without having to go down the path of creating a package/library pushed to some internal registry and pulled by each project, right?
Or are there any other aspects to the monorepo architecture that make it beneficial for large companies like that?
Just curious, I've never worked in such an environment myself.
To put it in the most general terms: It provides the same value that using a VCS has for a project, but applied to the entire company.
In a standalone project, would you accept a change that is incompatible with other code in the project? For example, would you allow a colleague to change a function in a way that breaks the call sites? No, you probably would not.
The attitude within monorepo shops is that this level of rigour should be applied to the entire company. Nobody should be able to make a change anywhere if it would break anything elsewhere, or they should only be permitted to do so with intention. There are caveats to this, but that is the general idea.
In addition to what you mentioned: the ability to atomically commit to a library and all of its consumers, and, for a change to a library, to run the tests of all of its consumers as well.
Every host running a particular commit is running the code you think it is. No submodules or internal packages. If you updated the Button component in the design system, when your commit is deployed, every service that gets deployed has the new button now.
I'd say there are four main advantages, summarizing what other comments are saying but also from my own experience:
- Atomic PRs. All the changes for a migration/feature living in one spot makes development much easier, especially when dealing with API changes and migrations.
- Single history. This is useful when debugging. A commit can more easily encapsulate the state of "the whole system", as opposed to a single part of it. This makes reverting, if necessary, easier.
- Environment consistency. Updating the linting tool, formatting tool, UI library, etc. is never a priority, so there's always drift, where an old repo gets stuck with old tools, dependencies, and an old environment.
- Not shipping your org chart is easier when everyone can see and work on the whole codebase as easily as possible.
Example: Service A requires version 1.1 of libFoo and libFoo 1.1 requires version 0.1 of libBar. But Service A also directly uses libBar version 0.2. Now you have a conflict.
If libFoo and libBar are internal code stored in a monorepo they're automatically version-compatible because there is only one version of both.
How do you coordinate deploying a change that requires six different repos to be deployed to six different systems at the exact same time? With a monorepo, you're still deploying to six systems, but at least there's only one commit SHA to keep track of.
Meta also has a massive monorepo accessed primarily through cloud devservers.
When several of the world’s most successful software companies use this approach, it’s hard to argue that it’s inherently bad. Of course it’s sensible to discuss what lessons apply to smaller companies who don’t have the luxury of dedicated tooling teams supporting the monorepo and dev environment.
Just because some successful companies use an approach doesn't make it a best practice. I have seen the nuisance of a monorepo firsthand: one that took almost 15 minutes to correctly switch branches on Intel machines (and spiked the CPU decently by sending Windows Defender into a panic). It has the decent benefit of easy code sharing, but build and test are soul-sucking experiences, and if someone accidentally runs an updated formatter or linter rule, the whole MR becomes a nightmare to review correctly (I once had one with 2k+ changes and had to ask for a rollback so they could commit only what they actually wanted to change).
> took almost 15minutes to correctly switch branches on intel machines
This can probably be fixed with trivial tuning. Just configuring Git to fetch only your branches would speed up the branch switching significantly.
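For example, a few common big-repo Git tunings (the remote, URL, and path names below are only illustrative); partial clone and sparse checkout in particular help the checkout and status side of things:

    # Fetch only the branch you care about instead of every ref:
    git config remote.origin.fetch "+refs/heads/main:refs/remotes/origin/main"

    # Partial clone: skip downloading blobs until they're actually needed
    git clone --filter=blob:none https://example.com/big-monorepo.git

    # Only materialize the directories you actually work on
    git sparse-checkout set services/payments

    # Let Git use a filesystem watcher to speed up `git status`
    git config core.fsmonitor true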
> build and test are soul sucking experiences
Why? It doesn't have to be. If you are going to build the entire monorepo, then yes, but this should only happen when you are running CI, and even then you can break down the builds into smaller components.
> the whole MR becomes a nightmare to correctly review
Not if you set up code ownership properly. You also need to think what happens in case of emergencies, so having a selected list of "super users" and users with permissions to bypass reviews is important.
It sounds like this company wanted a monorepo, but nobody invested any money or time in actually thinking about developer productivity. When that happens, of course it won't be good, because no project succeeds like this. The nice thing about a monorepo is that instead of 1,000 repos with tooling all over the place and no specialist to take care of them, you can have one repo with really good tooling and a team dedicated to just keeping it running smoothly. But if nobody is actually taking care of the monorepo, it will rot just like any other codebase.
“Someone autoformatted the whole thing under new settings at the same time as introducing a new feature” is hardly a monorepo problem. That could be a pain in the ass to review even in a single file. But the flip side, of someone cleanly wanting to do a mass autoformat or autorefactor, is much easier in a monorepo than in split repos.
Nothing you describe is inherent to monorepos. Git is slow, yes, but go use hg. Build and test are slow? That's a CI problem: you didn't allocate enough resources to the build system. Someone ran a formatter accidentally? That's that someone's mistake.
IMO monorepos are great, but the tooling is not there yet, especially the open-source tooling. Most companies using monorepos have their own tailored tools for it.