I have figured out that a monorepo mainly compensates for faults in maintaining separation between components (like ensuring a single direction of dependencies and no cycles, ensuring backwards compatibility, self-service, etc.) and, to some extent, for an out-of-control microservice craze.
Kinda like how giving unrestricted access to your PROD databases does improve efficiency, but at the cost of additional risk, deteriorating separation between applications, no APIs for users to self-serve, etc.
Microservices especially tend to fare poorly if you don't solve the static costs of maintaining a service. So rather than invest in proper tooling, let's just plop everything into a single repository; it will cover some problems a little bit.
Now, I don't want to say monorepo is bad by itself. The problem is when it stunts people's ability to maintain proper APIs and development process.
This means that developers are free to write code that is correct now rather than be limited by technical decisions made in the past, that may no longer be valid.
It also means that when you do update an API you are confronted with the technical reality of how the API consumers actually work. This may cause you to reevaluate and improve your mental model of the API.
If your API is siloed away in its own repo then you can easily get in a position of taking it in a direction that is not actually in line with what API consumers actually need.
In my company the former is handled by never deprecating any field (and then clients just get to deal with picking and choosing the ones they want). If a major schema change is required, spin up an entirely new service and tell people to migrate over and eventually turn off the old one when nobody's using it anymore.
Library changes can be done atomically: change the API, change the call sites, if tests pass you're done. One may opt to use the same strategy as w/ microservices here too.
Regardless, the dynamics of interacting w/ microservice API changes don't change based on whether you're on a monorepo or not. But a monorepo can help in the sense that some aspects of a service are version-controllable (namely the schema definition files) and it's in the clients' best interest to have those always up-to-date rather than having to remember to run some out-of-band syncing command all the time.
Every internal service would always hit the exact version it was compiled with, and you only need to worry about external API compatibility at that point.
For most use cases you can just get some scheduled downtime, though.
In the same vein, at my company we have a mechanism to specify package ownership, and the owners may opt to make themselves mandatory code reviewers for any incoming change.
IMHO, this is nicer than multi-repo because you get a lot more visibility into who's actually using what and you can enforce some level of accountability, which means you don't get into awkward situations where A made a breaking change, B uses A but never upgraded and now C is trying to deal with a newly discovered vulnerability affecting A and B.
A project may have ownership data, and if it does it defaults to making the owner an optional reviewer. But if the project doesn't have ownership data, ownership bubbles up the folder tree to a folder that does have such information.
We organize projects such that increasing folder depth also increases ownership specificity (e.g. at the project level, the folder structure implies a specific team has ownership, one level up is their business vertical, one level up is the cost center category and so on, all the way up to root, which represents "everyone"). With this scheme, we can reassign ownership in situations like when the sole owner of a project leaves the company, or if teams get restructured. And by not making review mandatory, we are able to unblock cases like landing a high priority security patch while the reviewer of one of the affected packages is on vacation.
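As a rough sketch of that bubbling-up rule (the folder names and the `OWNERS` table here are invented for illustration, not our actual tooling):

```python
from pathlib import PurePosixPath

# Hypothetical ownership table: folders (relative to repo root) mapped
# to an owning team. The root entry "" represents "everyone".
OWNERS = {
    "": "everyone",
    "infra": "infra-cost-center",
    "infra/payments": "payments-vertical",
    "infra/payments/billing-api": "billing-team",
}

def resolve_owner(path: str) -> str:
    """Walk from the changed file up toward the repo root, returning
    the most specific folder that has ownership data."""
    folder = PurePosixPath(path).parent
    for candidate in [folder, *folder.parents]:
        key = "" if str(candidate) == "." else str(candidate)
        if key in OWNERS:
            return OWNERS[key]
    return OWNERS[""]

print(resolve_owner("infra/payments/billing-api/server.py"))  # billing-team
print(resolve_owner("infra/payments/invoices/util.py"))       # payments-vertical
print(resolve_owner("README.md"))                             # everyone
```

The nice property is that deleting a project's ownership entry (say, when its sole owner leaves) automatically falls back to the vertical, then the cost center, then "everyone".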
In a multi-repo setup, one change becomes many commits and everybody sees only their piece of the jigsaw puzzle.
It does depend on everybody being adults and behaving sensibly; unfortunately there's no technology that can solve that problem.
Unfortunately, the world is not ideal. Some problems that are easily detected when multiple applications have their own teams, APIs, release schedules, and repositories stop being naturally detectable when every developer has access to every repository and can introduce changes in sync on both the client and the service side.
One company I worked for used a monorepo. Developers would shortcut the development process and omit some practices by modifying both client and server at the same time, in sync, without caring about backwards compatibility.
Then there was a huge outage, because the two changes required both the client and the service to be deployed at the same time. But you know, in a distributed system no two things happen at exactly the same time, so there was a short moment where the services were mismatched and some broken data was saved. A day later that broken data completely destroyed an important production batch, at a large loss for the company.
And while it is definitely fun, convenient, and efficient to be able to do just that, it also requires a little bit of care so that the whole system does not deteriorate over time.
I'll think about the next steps taken by the OP. Does anyone have any other practices they can recommend for managing these types of projects?
I said there were a number of frustrations, but that was the driver.
In the new repo format each repo produces a single library/module and can be versioned as a whole. When I run an application for dev, I can run a given library from a release or from live code (with a dev server serving ES modules from a configured path). This works well when one of the repos is not changing. But I end up developing in a significant number of repos at the same time, so I just configure it to run from live code rather than a release for all the libraries I'm working on. This feels like working with a monorepo, except I have the hassle of managing commits to all these repos. Long story short, right now I am working out of the master branch because I screwed up commits. You may have figured out by now that dev ops is not my strength.
When you make a version with lerna it auto-tags the commit with the version numbers of the components, so consumers can still depend on a specific version of a component.
I'm setting it up now at a new company and it's pretty amazing.
Honestly, the only way around these sorts of issues is to utilize automation in some form.
I've found that setting up repositories (like devpi, Artifactory, or Docker Registry) on a shared network location (it could be local if you work alone) and using CI/CD tools (like Jenkins) is the key. The goal is that you work on one portion of the code base at a time, and changes go through the standard validation processes so that you can pull in the updated package version when you work on something downstream. Making sure that the CI/CD environment _doesn't_ have access to other packages' non-versioned code is key to making sure things actually work as expected.
For example, if you have FooLib and you need an update in it for BarApp, you might branch FooLib 1.2.3 to 1.2.3-1-gabc1234d (the `git describe` of the commit) on `feat/new-thingy`. Even if BarApp v2.3.4-1-gaf901234 depends on that new branch, it shouldn't in any way be able to reference that branch in the CI/CD build process. How do you get around this? Good development -- finish the FooLib branch, get it working, merge it in with the updated version, and push the package (with the new version) to the CI/CD-accessible repository. At that point, when you push your BarApp change, it can actually build and not die. But until FooLib has a versioned update, BarApp's branch _shouldn't_ be able to build.
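A minimal sketch of that CI-side guard, assuming `git describe`-style suffixes mark unreleased commits (the function name and error message are hypothetical, not from any real tool):

```python
import re

# Only plain X.Y.Z versions count as released and immutable; anything
# like "1.2.3-1-gabc1234d" is a commit past the last tag and gets rejected.
RELEASE = re.compile(r"^\d+\.\d+\.\d+$")

def assert_released(name: str, version: str) -> None:
    """Raise if a dependency pin refers to an unreleased version."""
    if not RELEASE.match(version):
        raise ValueError(
            f"{name}=={version} is not a released version; "
            "merge and publish it before depending on it in CI"
        )

assert_released("FooLib", "1.2.3")  # fine, passes silently
try:
    assert_released("FooLib", "1.2.3-1-gabc1234d")
except ValueError as e:
    print(e)  # rejected: not a released version
```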
The statement of "But I want to work on the changes locally, in parallel" is valid. That's what local development is for -- giving you space to work on related things that don't impact the upstream codebase. You should have the option to utilize FooLib's branch code in your BarApp code locally, and you can often do that via things like `pip install` or `maven install` or whatever the relevant local install command is. At this point, the package still probably has the same version number, so the local build doesn't trigger issues. You can work on the two and tweak and twist as you want, but refrain from actually trying to push BarApp referencing FooLib's branch until it's actually in the repo.
This all takes a great deal of restraint and patience. The goal here is to make it just a tad harder to introduce problems somewhere, since you can't depend on something that hasn't been given the go-ahead. There might be a lot of "Updated FooLib requirement to v1.2.4" commits throughout your codebase, but why would you do that just off-hand? If you're doing it because of a security issue or bug, let that be known in the commit message. If you're doing it because you can utilize a new feature, your commit message won't be just "Updated FooLib"; it will likely be "Added Feature X2Y, updated FooLib to 1.2.4".
PHP I try not to touch much, simply because I've always had bad experiences. I know for a fact that there are decent ways to do it with build tools like Maven, setuptools, and Docker. Hell, I have used Docker as a way to introduce versioned dependency packaging, only needing to use Docker Registry (each dependent project does a multi-stage build, pulling in the dependencies via the versioned package images).
This is such a horrible practice. You're creating mountains of extra work, and encouraging devs to delay integration testing, which is certain to lead to cycles of rework. It also only 'works' on toy features. When you're building a complex feature that requires a few weeks of work and a few devs, it quickly breaks down, and further prevents early QA testing of the new feature itself.
Unfortunately this practice is often forced on people by reliance on the horrid SemVer scheme, which only makes any kind of sense for 3rd party dependencies, but is foisted on internal dependencies as well by many idiotic package managers, like Go mod or NPM.
Typical CRUD stuff should be like at least 80% purely functional business logic (that is 100% testable without integration) and 20% or less IO code. If you really need that integration to find all the rough edges and work out the bugs, you probably have too much surface area in your IO "tainted" code.
Java-style OOP really encourages this sort of thing by subconsciously conflating data state with functional methods. The whole "I needed a banana and you gave me a whole jungle" problem.
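The 80/20 split above is essentially "functional core, imperative shell"; a minimal sketch (all names and the `db` interface are invented for illustration):

```python
from dataclasses import dataclass

# Functional core: pure business logic, 100% testable with no integration setup.
@dataclass(frozen=True)
class LineItem:
    quantity: int
    unit_price_cents: int

def invoice_total_cents(items: list[LineItem], discount_pct: int = 0) -> int:
    """Pure function: no IO, no hidden state, trivially unit-testable."""
    subtotal = sum(i.quantity * i.unit_price_cents for i in items)
    return subtotal * (100 - discount_pct) // 100

# Imperative shell: the thin, IO-"tainted" 20% that you keep small.
def handle_request(db, order_id: str) -> None:
    items = db.load_items(order_id)               # IO in
    total = invoice_total_cents(items, discount_pct=10)
    db.save_total(order_id, total)                # IO out
```

If the shell stays this thin, the integration tests only need to prove the plumbing works; the rough edges live in the pure core, where plain unit tests find them.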
For some problems, the formal specification is tractable and even necessary. But for many complex problems, it is either not tractable (the input is too complex; the spec would need to be on the order of magnitude of componentB itself to actually specify the semantics) or it's just not worth it (componentB is only called by componentA).
I also want to note that I'm not talking about regular types when I say 'formally specifying the valid inputs and their semantics', though I'm sure dependent types could in principle achieve this. I'm talking about cases like components which communicate through script-like objects or configuration templates, etc.
Creating a version tags the commit with the version number for each package that's been updated, and it allows for the creation of pre-release versions if you have things that aren't ready for prime time.
Consumers can depend on a particular git commit by referencing the tag.
So there is one main branch that contains all of the commits, but different components are versioned independently and reference particular commits in the branch.
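If I recall correctly, Lerna's independent mode tags releases as `<package>@<version>`; here's a small sketch of mapping such tags back to per-component latest versions (the tag names themselves are invented):

```python
# Given tag names in the "<package>@<version>" format, work out the
# latest released version of each component on the main branch.
def latest_versions(tags: list[str]) -> dict[str, str]:
    latest: dict[str, tuple[int, ...]] = {}
    for tag in tags:
        name, _, version = tag.rpartition("@")
        parsed = tuple(int(p) for p in version.split("."))
        # tuple comparison handles 1.10.2 > 1.4.0 correctly
        if name and parsed > latest.get(name, ()):
            latest[name] = parsed
    return {n: ".".join(map(str, v)) for n, v in latest.items()}

tags = ["ui-kit@1.4.0", "ui-kit@1.10.2", "router@0.3.1"]
print(latest_versions(tags))  # {'ui-kit': '1.10.2', 'router': '0.3.1'}
```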
Not to mention, you often need to polish a release while developing large features for the next release; again, cases where you need branches.
Of course, you can also try to take the feature flag model, and avoid refactoring entirely. Unlikely to be a good strategy for a long lived product.
In the .net world your CI/CD pipeline should continuously build and publish NuGet packages of your common code as you make changes. Since the old versions are obviously still available, other parts of the system are not forced to be updated to the new version of the dependency.
I will say one problem I have is in refactoring the interfaces of my modules, which is what I seem to spend a lot of time on, at least at this stage of the project. When I update the bottom ones, I pretty much have to update the others in parallel.
The only major issue I face with git novices is making sure everyone on the team sets their machines to pull recursively.
We moved from multiple separate repos in platform/ to a single repo in platform2/ for a number of reasons:
- Make it easier to work across multiple projects simultaneously
- Increase code re-use (via common libs) rather than duplicate utility functions many times over
- Share the same build system
For "multi"-repo (not sure what it's really called), Fuchsia uses something called jiri, and there is still gsync.
All I'm saying is, beware of cargo cult thinking. Do something because you need it, it's practical, it's faster, not because someone else does it. I've had a few projects that were intentionally difficult because Someone decided to make it a microservices architecture, but in a monorepo to encourage the distributed monolith idea.
We wrote a new framework from scratch using a monorepo approach, with separate packages via Lerna. The problem there was tooling. Dependent builds were not supported and I've had to delete node_modules more times than I care to count. The article talks about some GitHub-specific problems (namely, the issues list being a hodge-podge of every disparate package). We tried ZenHub; it works OK, but it's a hack and it kinda shows. I've seen other projects organize things via tags. Ultimately it comes down to what the team is willing to put up with.
We eventually broke the monorepo out into multi-repos, and while that solved the problem of managing issues, now the problem was that publishing packages + cross-package dependencies meant that development was slower (especially with code reviews, blocking CI tests, etc).
Back to a monorepo using Rush.js (and later Bazel). Rush had similar limitations as Lerna (in particular, no support for dependent tests) and we ditched it soon afterwards. Bazel has a lot of features, but it takes some investment to get the most out of it. I wrote a tool to wrap over it and setup things to meet our requirements.
We tried the "multi-monorepo" approach at one point (really, this is just git submodules), and didn't get very good results. The commands you need to run are arcane, and having to remember to sync things manually all the time is error-prone. What's worse is that since you're dealing with physically separate repos, you're back to not having good ways to do atomic integration tests across package boundaries. To be fair, I've seen projects use the submodules approach and it could work depending on how stable your APIs are, but for corporate requirements, where things are always in flux, it didn't work out well.
Which brings me to another effort I was involved with more recently: moving all our multi-repo services into a monorepo. The main rationale here is somewhat related to another reason submodules don't really fly: there's a ton of packages being used, a lot of stakeholders with various degrees of commit frequency, and reconciling security updates with version drift is a b*tch.
For this effort we also invested in Bazel. One of the strengths of this tool is how you can specify dependent tasks, for example "if I touch X file, only run the tests that are relevant". This is a big deal, because at 600+ packages, a full CI run consumes dozens of hours worth of compute time, and we see several dozen commits a day. The problem with monorepos comes largely from the sheer scale: bumping something to the next major version requires codemods, and there's always someone doing some crazy thing you never anticipated.
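The "only run the relevant tests" idea boils down to a reverse-dependency closure over the package graph; a toy sketch (the package names are invented, and this is not how Bazel actually represents its graph):

```python
from collections import deque

# package -> packages it depends on
DEPS = {
    "app": {"ui", "api-client"},
    "ui": {"design-tokens"},
    "api-client": set(),
    "design-tokens": set(),
}

def affected(changed: str) -> set[str]:
    """A change to `changed` must re-run its own tests plus those of
    everything that transitively depends on it."""
    # invert the edges, then BFS from the changed package
    rdeps: dict[str, set[str]] = {p: set() for p in DEPS}
    for pkg, deps in DEPS.items():
        for d in deps:
            rdeps[d].add(pkg)
    seen, queue = {changed}, deque([changed])
    while queue:
        for dependent in rdeps[queue.popleft()]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

print(sorted(affected("design-tokens")))  # ['app', 'design-tokens', 'ui']
print(sorted(affected("api-client")))     # ['api-client', 'app']
```

At 600+ packages, the payoff is that a leaf-package change skips nearly the whole test suite, while a change to something foundational still fans out to everyone who depends on it.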
With that said, monorepos are not a panacea. A project from a sibling team is a component library and it uses a single-repo approach. This means a single version to manage for the entire set of components. You may object that things get bumped even when they don't need to be, but it turns out this is actually very well received by consumers, because it's far easier to upgrade than having to figure out the changelogs of dozens of separate packages.
I used a single-repo, monolith-but-actually-modular setup for my OSS project and that has worked well for me, for similar reasons: people appreciate curation, and since we want to avoid willy-nilly breaking changes, a single all-encompassing version scheme encourages development to work towards stability rather than features for features' sake.
My takeaway is that multi-repos cause a lot of headaches both for framework authorship and for service development, that single repos can be a great poor man's choice for framework authors, and that monorepos, with the appropriate investment in tooling, have good multiplicative potential for complex project clusters. YMMV.
The thing about projects getting bigger is orthogonal to whether people pick monorepos or not. Many project feature sets are simply complex in a way that you can't refactor your way out of. When I hear this argument that projects could somehow be made smaller, typically it's from someone who's never had to deal with the regressions of such a decision or someone who's never had to be accountable for their estimates. I've seen firsthand projects that got rewritten with the benefit of hindsight, proper staffing, management blessing and all the jazz, and still struggle to meet feature parity of their older counterparts. It's clearly not a question of being interested in putting in the work, or even having the budget for it.
From what I've seen, messiness is largely a function of experience. A lot of people simply don't have experience in writing libraries and/or architecting systems. I find that once they acquire the skills, quality of encapsulation improves a lot. FWIW, I think monorepos help with that transition because they provide the fast feedback loop of a single repo while a developer is learning the ropes of how to librarize, instead of getting bogged down waiting on slow code review/CI/publish feedback loops.
Should managing all this code be hard work? I don't blame people who want to skip that.
You don't need a monorepo. It's an anti-pattern that people resort to when their code is a tangled, tightly coupled mess.
If the modules are loosely coupled and high cohesion, multiple repos is the ideal approach.
The solution to the problem of having to 'constantly update dependencies' is not to bring them all into a single monorepo. The solution is to ensure that these dependencies handle separate concerns and have simple interfaces which allow them to be loosely coupled with the main project logic in such a way that they can be updated independently of each other.
If different module dependencies often need to be updated together whenever you add a new feature or fix a bug, it almost certainly means that you have a problem with coupling and/or cohesion. You don't need tools that make it easier to work with a tangled mess. The correct solution is to untangle the mess. Otherwise the mess will keep getting worse.
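The "simple interfaces" point above can be sketched as a narrow, stable contract between the main logic and a dependency (all names here are invented):

```python
from typing import Protocol

# The narrow interface: callers depend only on this contract.
class RateLimiter(Protocol):
    def allow(self, key: str) -> bool: ...

class FixedWindowLimiter:
    """One concrete implementation. It can be rewritten or swapped out
    without touching callers, as long as `allow` keeps its contract."""
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts: dict[str, int] = {}

    def allow(self, key: str) -> bool:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit

# Main project logic: coupled only to the RateLimiter protocol.
def handle(limiter: RateLimiter, user: str) -> str:
    return "ok" if limiter.allow(user) else "throttled"

lim = FixedWindowLimiter(limit=2)
print([handle(lim, "u1") for _ in range(3)])  # ['ok', 'ok', 'throttled']
```

When the interface stays this small, updating the limiter's internals never forces a lockstep change in the consumer, which is exactly the independence being argued for.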
Loose coupling over domain-specific data structures (and their related procedures) just means that errors that should be caught by a run-of-the-mill type checker are now found in production.
In fact, they're the only 2 rules that have consistently delivered value across all the languages that I've tried and for all different kinds of system that I've built. Closest thing possible to a silver bullet.
I find this kind of mindset that x is not possible to be defeatist.
Instead of acknowledging that programming is difficult and can take many decades to master, people prefer to pretend that programming is easy but unavoidably messy. People always try to come up with narratives which make it easier to accept mediocrity rather than work hard to keep improving themselves.
> The solution is to ensure that these dependencies handle
> separate concerns and have simple interfaces which allow
> them to be loosely coupled with the main project logic
> in such a way that they can be updated independently of
> each other.
A 'single' monorepo makes your problem easily solvable using automation and code. You can use a multirepo if you prefer to solve problems using human coordination.
If an API turns out to be insecure and has to be removed, or a mitigation put in place that changes its behaviour, that can be an unavoidable breaking change.
Either way, my point wasn't that "breaking changes always have to be made" but "avoiding making breaking changes because everything has to be deployed granularly slows you down and so does needing to make the same change in N places".
Any library team can instantly tell if their change has broken any downstream users, because a good CI system will just automatically run all tests of all relevant projects. That way a library developer can either iterate until all projects depending on it are green, or even proactively change all dependent projects and roll all fixes out in a single atomic commit.
This has proven incredibly beneficial to our development speed and the number of avoidable code conflicts that crop up.
Any proper (recent) package manager will do the same with a good CI, without any of the monorepo disadvantages (duplicated libraries, diamond dependency mess).
The only real advantage of a monorepo is that simultaneously editing multiple software components with API-breaking changes is made easier, much easier.
What you are describing is a reasonable approach if you can get the module boundaries right in advance. For more open ended software problems, that is impossible - you need the freedom to rearchitect and change responsibilities. That's the situation where a monorepo is much simpler than anything else.
Git isn't the best tool for THAT job. I've been saying the same thing for 10 years now, nobody wants to listen because a lot of developers only know Git.
Everyone starts with an idea of how the software should turn out. A vision where there is often remarkable agreement among people: software should be modular, with independent parts encapsulated in some way, and easy to change over time.
Then some people start from this vision by focusing on the ideals. Since stateless software is so much easier to reason about, let's build our architecture on the assumption that it should be stateless. The same goes for other ideals such as being side-effect-free and idempotent. Any parts that deviate from this vision are dirty little special cases and can be treated as such.
But software without state is useless. State, and side effects, are the whole reason for the software to exist.
Some people take the second approach and start with the state and side effects: how these are represented and stored, and how to allow for change over time. Then the rest of the software, the easy parts if you will, is sketched out to accommodate that design. These tend to be the same people who start by thinking in data structures and think the design of data is more important than the design of code.
Just as an example, sometimes I see people with monstrous Kubernetes-style architecture for a web app and then all state shoved in a Postgres in the back without even a thought. Well, in reality that's your whole application right there, in the back. There are a million ways to start stateless web workers, all perfectly fine, that's not where your energy should be spent.
Maybe the above is a simplification. I know for certain that I am in the latter camp. But over the years where I have found myself in disagreement over architecture, it is often with people I have later come to see as in the first camp. And this keeps coming back again and again, on all levels of software architecture.
In our monorepo, few changes are cross-module. But it's a grand day when we reduce the total package count by 20 by finally getting rid of some old project dependencies and can use a single test runner config and version across packages. I keep my sanity that way.
Meanwhile, our non-JS repos haven’t required as much attention in this area.