The thing is, a monorepo makes clear what multiple repos can obscure: if you have a software system, then you need all those things anyway. Multiple repos can hide the cost, but they don't eliminate it; in my experience, they multiply it. E.g. if someone makes a breaking change in a monorepo, he breaks the tests and he fixes them before his changes can be merged in; but with multiple repos someone can commit a breaking change and a dependent repo won't know about it until days, weeks or even months later; the person responsible for fixing it has no idea what really changed or why; and he has a broken build until he does figure it out.
> The ability to force updates on dependencies in a single big commit can be really worthwhile as long as you're willing to spend the time to build the tools and do the maintenance.
It's like maintaining your car: you can spend $20 to replace your oil every 3,000-5,000 miles, or you can spend thousands every 10,000 miles. Up to you.
Or the code might be safety-critical and updating it is introducing unnecessary changes (and hence, unnecessary danger).
Some tests may include hardware integration. For example, aircraft software may need some number of flight hours after significant changes. That's probably not going to be a part of the CI suite, and the changes will introduce a greater manual testing burden.
Isn't that the whole point of versioning? If there is a breaking change, only the repos using the latest version will see it, at which point they can submit a bug report before any of that code enters production.
I still agree that avoiding all of that is preferred.
This is absolutely true. Without a very strong culture around testing and deep, continuous investment in infrastructure and tools, a monolithic repo would quickly devolve to being a giant midden heap of code.
I would say that any large project needs a "very strong culture around testing and deep, continuous investment in infrastructure and tools"
Obviously not the way it should be, etc, but reality bites.
I don't think those are necessary artifacts of a monorepo. A huge chunk of code does not mean that every component there is connected.
push not only the updates to my library, but also the updates to all my dependents in a single commit.
Plus you do have perfect knowledge if your CI tools build everything on your commit, which they will do anyway if you have a monorepo.
Sometimes you get that contract wrong, and you don't realize it until you've shipped it. If you want to evolve an API by introducing a breaking change, you can do it at Google. You just have to migrate the clients with the API change. It means your APIs can both improve and stay lean, without needing to carry around legacy cruft, which is great.
It's not really an option in open source.
Between published manifests and search indexed source code, you would know who all your clients are, even without a monorepo.
It's as bad as you describe, but more organized :)
It allows a push model instead of a pull one.
You can know everything you will break ahead of time.
More generally, you can know the impact of a change (performance, etc) before you make it.
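As a rough illustration of "knowing everything you will break ahead of time": once every target lives in one dependency graph, the set of things a change can affect is just the transitive dependents of whatever you touched. A minimal sketch, assuming a hypothetical `DEPS` map with made-up target names (this is not how any particular build tool stores its graph):

```python
from collections import defaultdict, deque

# Hypothetical build graph: each target maps to the targets it depends on.
DEPS = {
    "lib/base": [],
    "lib/net": ["lib/base"],
    "app/mail": ["lib/net"],
    "app/search": ["lib/base"],
    "tools/lint": [],
}


def affected_targets(deps, changed):
    """Return every target that transitively depends on `changed`."""
    # Invert the graph so edges point from a dependency to its dependents.
    dependents = defaultdict(set)
    for target, requirements in deps.items():
        for requirement in requirements:
            dependents[requirement].add(target)

    # Breadth-first walk over the inverted graph.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


affected = affected_targets(DEPS, "lib/base")
print(sorted(affected))  # ['app/mail', 'app/search', 'lib/net']
```

Everything in that set is what CI would have to rebuild and test before the commit lands; anything outside it (here `tools/lint`) can be skipped.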
Is it possible to do it without a mono-repo? Theoretically, yes, through some large federated something or other. These don't really exist ;)
Basically, it lets you have a library team that moves the world forward without causing pain (at least, for those reasons :P).
It means the library authors must be knowledgeable experts on all projects using the library. How can you make changes to all the client code without knowing the codebase well? Make changes willy-nilly to a codebase you are not familiar with, on the basis that it uses your library?
That sounds like a recipe for disaster.
Or you need to get consensus and code review for each and every one of those projects and have all tests pass for all projects. This looks like slow-as-molasses progress. Submitting any change would always be an extremely big endeavour.
The alternative is to have a submit not affect other projects. You can do this in a monorepo by mandating that other projects not sync to head but to a given release tag of your library, and only move to a more recent tag on their own schedule, after they've tested the upgrade. IOW, pretty much how a packaged release would work. Pretty much how syncing to separate repos would work.
I fail to see what the benefits of a monorepo are. The downside is either that anyone can touch anything without being an expert, or that there is an extremely long and complex submission process. (And woe to you if someone submitted something in the meantime... you have to redo your testing and fixing again?!?)
The world at large has lived with separate repos and explicit releases for ages. That's how third-party libraries are handled in every single software team I've ever seen.
"It means the library authors must be knowledgeable experts on all projects using the library."
They must know enough to design good APIs or else you will fail anyway. If your APIs require expert knowledge of the system to use correctly, you've probably done it wrong.
"How can you make changes to all the client code without knowing the codebase well?"
Good APIs and good refactoring tools.
"Make changes willy-nilly to a codebase you are not familiar with, on the basis that it uses your library?"
With good tests and testing tools, I fail to see the issue.
"Or you need to get consensus and code review for each and every one of those projects and have all tests pass for all projects. This looks like slow-as-molasses progress. Submitting any change would always be an extremely big endeavour."
Nope, there are tools to manage the changelist process for large refactorings. My experience says velocity is not an issue in terms of things moving too slowly, pretty much ever.
"The alternative is to have a submit not affect other projects. You can do this in a monorepo by mandating that other projects not sync to head but to a given release tag of your library, and only move to a more recent tag on their own schedule, after they've tested the upgrade. IOW, pretty much how a packaged release would work. Pretty much how syncing to separate repos would work."
Try this sometime, and see what the lag time to upgrade in a large enough set of teams and environment is. It's "forever until someone breaks you".
"I fail to see what the benefits of a monorepo are."
I think they've been covered quite well.
You just think they don't happen in practice, despite people with experience saying they do.
That's okay. You don't have to believe them, but it's not clear what anyone could do to convince you :)
"The downside is either that anyone can touch anything without being an expert, or that there is an extremely long and complex submission process."
The first is an explicit feature. If everyone needs to be an expert to touch most of your codebase, that sounds like a bad codebase to me. The latter is simply not correct. The rate of change is many times faster than any separate repo process I've seen.
"(And woe to you if someone submitted something in the meantime... you have to redo your testing and fixing again?!?)"
If it's broken, it's broken, whether it's packaged or not. The only thing you are talking about is whether you can tell it's broken before you get to the next packaged release.
"The world at large has lived with separate repos and explicit releases for ages. That's how third-party libraries are handled in every single software team I've ever seen."
Yes, and honestly, it's been a complete and total disaster everywhere I've seen it, with tons of teams with forked, old, buggy versions, the inability to reuse code (or bugs) because of subtle third party dependency incompatibilities, etc.
I'm not sure why you claim it's a panacea.
Neither is, but not for the reasons you've given.
You can also use this exact same approach to create tasks on teams to upgrade their library if you've EOLed the version they're using. You can be as narrow (requiring everyone on the latest version) or as broad (having a band of versions allowed, in case the change is significant enough to require updates to how it is used, or if there's a major component of the system using an older version for performance-regression reasons you're still addressing on the improved library) as you want.
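To make the "band of versions" idea a bit more concrete, here is a minimal sketch, assuming each team's pinned library version is available as a tuple in a hypothetical `PINNED` map (team names and versions are invented for illustration). It lists the teams that would get an upgrade task for a given allowed band:

```python
# Hypothetical data: which version of the shared library each team pins.
PINNED = {
    "team-mail": (2, 4, 0),
    "team-search": (1, 9, 3),
    "team-ads": (2, 1, 0),
}


def upgrade_tasks(pinned, oldest_allowed, newest_allowed):
    """List teams whose pinned version falls outside the allowed band."""
    tasks = []
    for team, version in sorted(pinned.items()):
        # Tuples compare element by element, so (1, 9, 3) < (2, 0, 0).
        if not (oldest_allowed <= version <= newest_allowed):
            version_text = ".".join(str(part) for part in version)
            tasks.append(f"{team}: upgrade from {version_text}")
    return tasks


# Broad band: anything in the 2.x line up to 2.4.0 is still acceptable.
print(upgrade_tasks(PINNED, (2, 0, 0), (2, 4, 0)))
# ['team-search: upgrade from 1.9.3']

# Narrow band: everyone must be on exactly 2.4.0.
print(upgrade_tasks(PINNED, (2, 4, 0), (2, 4, 0)))
# ['team-ads: upgrade from 2.1.0', 'team-search: upgrade from 1.9.3']
```

Widening or narrowing the band is then a policy decision rather than a tooling change.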
This actually simplifies your build system, as you can reuse the same standard open source tools that come along with your language of choice, which often assume they're running at the root of the project and can do what they want to the surrounding directory structure. Build-and-test cycles take mere minutes instead of hours, being isolated to only the component you're working with (and in the case of a library update across projects, they are parallelized across build servers thanks to this isolation, so still much faster than the monorepo approach).
This also makes the codebase healthier, in general. Libraries and services need to be designed to actually have a clear purpose and interface; they can't cheat by grabbing source files from each other while breaking the expected interface, so refactoring their internals has a near zero probability of causing downstream issues. Services can be rewritten by standing up a parallel service, shadowing traffic, and validating equal responses between them. You can also fork and repurpose service code for another project and know exactly what code is required by it immediately.
These things are possible in monorepos, but they require a very high degree of diligence, and even more so in a mixed-language environment.
In general, breaking everything into a separate repository is madness (see the modern Node.js ecosystem), so new services should start out as "mini-monorepos" where everything new they need that isn't already an available library should be contained within the service. Once a few services seem to be doing the same sort of thing, then you look at the commonalities between their implementations, learn from past mistakes (that we all make), and make the library to share between them, and allow future services to not have to reimplement that.
(This also has the advantage of making it easier to open source said library, since it would contain zero references to proprietary services, and requires no cleaning of the git history, and to get services to use your library, you already need to document it clearly.)
Frankly, the world Google is in with their monorepo is an amazing feat of engineering built upon the initial tools they had available ~15 years ago, but it really isn't an ideal situation, and there are better, more flexible ways of doing things now. I'd be sad if our industry hadn't improved this over such a period of time.
Sometimes it is annoying that it lacks the flexibility, but it always means we just have to do the refactoring now, rather than let the code rot. A Lead Dev should always keep this in mind, choose tools that make the right way the easiest and possibly only way forward. A mono repo is an example of a tool that gives superb knowledge and flexibility, but it requires some serious discipline to avoid a rapidly growing complexity.
The product family we are developing contains a website, two mobile apps (iOS, Android), three PC applications (OS X, Windows and one legacy application) and software for embedded devices.
Each product lives in its own repository. Most repositories use one shared component as a submodule, and many products share a common platform, which is used as a submodule with the products built on top of it. The test automation core sits in its own repo. I built and maintain that test automation core and it's a pain.
Each product repository has its own functional tests that use the test automation core to make the tests actually do something. So whenever I make changes to the test automation core, I need to branch each product and use the feature branch from my core in it. Then I run all the tests in each repo and check that I did not break backwards compatibility. If I do break it, then I need to fix it through pull requests to possibly multiple teams.
I'm not the greatest git wizard in the world, so maybe someone else could easily maintain a good picture of the whole mess in their head, but for me this is a pain. And everyone else who maintains a shared component shares my pain.
A monolithic repo would not magically make all the pain in the world disappear, but it would be so much easier to just have one repo. That way I would only need to branch and PR once.
You can run integration tests in the dependent project across the interface of the dependency project, but tests in the dependency project should find 99% of breaking changes in the dependency.
So you can do whatever you want in any project, then test those changes. If your tests point out you've broken your contract with other projects, you decide to either solve that issue before releasing the change or cut a new major revision. Behavior that isn't covered by tests should basically be considered undefined, teams of dependency projects should contribute tests for things they want to rely on.
Either way the other project doesn't break -- it either updates to the new compatible version or stays behind on the last compatible release until it's been updated for the incompatible change. You should be confident having a robot automatically test projects with updated versions of dependencies and committing the change.
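The "robot" part can be sketched in a few lines. This is a minimal illustration, assuming a project's pins are a plain dict and that `run_tests` is whatever callable runs the project's suite against a set of pins; `try_bump` and `fake_tests` are hypothetical names, not part of any real tool:

```python
def try_bump(pins, dependency, candidate, run_tests):
    """Return updated pins if the tests pass with `candidate`, else the old pins."""
    trial = dict(pins)  # never mutate the currently committed pins
    trial[dependency] = candidate
    if run_tests(trial):
        return trial  # a real robot would commit this change
    return pins  # stay behind on the last compatible release


def fake_tests(pins):
    # Stand-in for a real test run: pretend this project has not yet been
    # updated for the incompatible 3.x line of a dependency called "dal".
    return pins["dal"][0] < 3


pins = {"dal": (2, 0)}
pins = try_bump(pins, "dal", (2, 1), fake_tests)  # compatible, so adopted
pins = try_bump(pins, "dal", (3, 0), fake_tests)  # incompatible, so skipped
print(pins)  # {'dal': (2, 1)}
```

The project never breaks: it either moves to the new compatible version or keeps its last good pin until someone updates it for the incompatible change.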
Of course, it's not always possible or desirable to have this degree of engineering around projects, but then the reality is that these aren't independent codebases, they are a single interdependent project with a single history and should be versioned/tested/released as such.
I could break the DAL further apart into 20 different pieces, one per application, but there is so much shared data access functionality between the applications, that it doesn't make sense.
If you can't come up with a small, stable interface between the 20 applications and the shared component, then you don't have 20 applications + a shared component, you have one big app that is painfully maintained as separate repos across a large cross-sectional arbitrary boundary.
If the history of the project is so interdependent as to be singular (everything is changed in lockstep across repos, effectively a single history) why not have a monorepo? If versioning them under the pretense that they are independent modules costs you so much effort, why do you labor to do so?
No. You just send one diff that changes all the teams' code and updates everything in lockstep, so at any point in your history, everything is compatible.
Instead of you going back and forth with multiple teams, you're bringing them together to comment on a change in one place. You synchronize on version control history instead of needing to wrangle multiple teams across multiple repositories, and you no longer need to deal with fallout for code change compatibility. You just make the change. Everywhere. In one shot.
So, in practice, for large-scale changes that affect many teams, we still try to break up large patches into multiple smaller steps, even working in a monorepo.
A single commit is nice for small patches that only affect a handful of teams, though.
I try my best to keep the head of the master branch in such a state that it can always be taken into use in all projects. Just last week one branch of one embedded device had a slight API change. Nothing big, but a backwards incompatible change.
I branched test automation core and made it work with the new API. All tests looked green and things looked nice. We agreed with the embedded device owner that we'll merge that change upwards soon. Soon like in one hour or so. I rushed and merged the test automation core changes to master.
At the same time I was working with another API change with one PC app. That branch was also looking good and those changes were merged upwards in both the test automation core and in the PC app.
Now my test automation core master head was compatible with everything else but one embedded device, the one with the small API change that looked good in all tests. For some reason business did not want that change to go live yet with the device, so now I had changes in my test automation core that made it incompatible with the released embedded device.
Yes, it was my mistake to rush with the merge. But because getting those changes upwards took two merges, one in the product itself and one in the test automation core, it was possible to get those out of sync. If we had used a monolithic repository, it would have been just one merge and such a thing would not have been possible.
Sure, not a huge thing but still an annoyance I could live without.
benefits of a monorepo
* change code once, and everyone gets it at once
* all third party dependencies get upgraded at once for the whole company
cons (if you are not Google)
* git checkout takes several minutes, times the number of devs in the company
* git pull takes several minutes, times the number of devs
* people running git fetch as a cron job, screwing up and getting into a weird state
* even after stringent code reviews, bad code gets checked in and breaks multiple projects, not just your team's project
* your IDE (IntelliJ in my case) stops working because your project has millions of files; it requires creative tweaks to only include the modules you are working on
* GUI-based git tools like Tower/Git Source don't work as they can't handle such a large git repo
Google has solved all the issues I mentioned above, so they are clearly an exception to this, but the rest of the companies that like to ape Google should stay away from a monorepo.
I feel like you are addressing teams of several hundred developers. Unless they commit large binary files each day, this is hardly an issue for smaller teams of a few dozen people.
> even after stringent code reviews, bad code gets checked in and breaks multiple projects, not just your team's project.
Revert such code immediately once it is detected by the CI, which is harder to do if the changes and their adjustments are spread across dozens of repositories. Also, please compare the ease of setting up CI for a hundred repositories with setting it up for a single one with tens of thousands of files.
> your IDE (IntelliJ in my case) stops working because your project has millions of files; it requires creative tweaks to only include the modules you are working on.
Monorepo does not mean you have to load everything in a single workspace. It means everything gets committed at once. If your tools cannot handle so many files or cannot be configured to work on subsets, blame your tools.
Yes, a monorepo presents some challenges, but handling hundreds of repositories is no better. Having done both, I prefer the former.
And yes, not everyone is Google, with thousands of developers and billions of lines of code.
It's not so black and white though. There's plenty of difficulty in keeping a mono-CI system running and being helpful.
* The CI service becomes a single point of failure for all developer productivity
* Running a full build on every commit while accounting for semantic merge issues (so serially) is non trivial.
Jane Street did a pretty good write up of how hard this can be: https://blogs.janestreet.com/making-never-break-the-build-sc...
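To show why "so serially" matters, here is a toy sketch of a merge queue, not taken from the linked write-up. It assumes a made-up model where the repository state is just the set of functions defined and the set of functions called, and "tests pass" means every called function is defined. Two changes that each pass against the same head can still conflict semantically, which only shows up if each one is tested against the latest accepted head:

```python
def apply_change(state, change):
    """Return a new state with `change` applied (toy model of a patch)."""
    defined = set(state["defined"])
    called = set(state["called"])
    if change[0] == "rename":
        _, old, new = change
        defined.discard(old)
        defined.add(new)
        # A rename also updates the call sites that exist at that moment.
        called = {new if name == old else name for name in called}
    elif change[0] == "call":
        called.add(change[1])
    return {"defined": defined, "called": called}


def run_tests(state):
    """Toy test suite: every called function must be defined."""
    return state["called"] <= state["defined"]


def process_queue(head, pending, apply_change, run_tests):
    """Test each change against the latest accepted head, one at a time."""
    rejected = []
    for change in pending:
        candidate = apply_change(head, change)
        if run_tests(candidate):
            head = candidate  # accepted: later changes build on this
        else:
            rejected.append(change)
    return head, rejected


HEAD = {"defined": {"f"}, "called": set()}
RENAME = ("rename", "f", "g")
NEW_CALLER = ("call", "f")

# Each change is fine on its own against the original head...
print(run_tests(apply_change(HEAD, RENAME)))      # True
print(run_tests(apply_change(HEAD, NEW_CALLER)))  # True

# ...but the second one breaks once the first has landed.
final_head, rejected = process_queue(HEAD, [RENAME, NEW_CALLER], apply_change, run_tests)
print(rejected)  # [('call', 'f')]
```

The cost is visible in the loop: each change waits for the full test run of the one before it, which is what makes this hard to scale.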
This is not really an issue, because it depends on your local project files, not on the size of the repository. You only have problems with the IDE in a large project such as Chrome. Most projects will have moderate sizes that can still be handled by Eclipse.
* Mapping the repository into the filesystem requires drivers which can add a whole dimension of fun if one has to support a larger team over multiple platforms.
* Doing replication is hard if you want uptimes of 99.9x%
Yes, done right it can work great. But usability does not degrade gracefully when things break.
CitC clients take seconds, many IDEs work (some better than others, with dedicated or volunteer support), and the internal GUI-based versioning, code reviewing, testing, documenting, and bug tracking tools also work at scale.
This has been a problem, despite being at Google.
1. Android, using shell and Make
2. ChromeOS, using Portage
3. Chrome browser, using Ninja
4. google3 (aka "the monorepo") officially using blaze (but often there were nested build systems - I remember one that used blaze to drive scons to build a makefile...)
The diversity of the build systems significantly steepened the learning curve when switching projects. During orientation, they told me "All code lives in the monorepo, and every engineer has access to all code", but this turned out to be not true at all. If anything it was the opposite: more build system diversity at Google than at other places I worked.
What you wrote reminded me of a feeling I had: that these projects are these weird exceptions, only nominally part of Google. They have their own policies (style guide, code review, etc), have their own hardware, etc. The Android building even had their own desks (non-adjustable!)
Good lord.. then again I have to use MSBuild. Why can't someone write one that consumes something like JSON and is async?
As mentioned elsewhere, Android and Chrome are huge exceptions in that they're a) open source and b) pretty much standalone products not closely tied to the rest of the Google ecosystem.
This sounds like horror to me: it's essentially a forced update to the latest version of all in-house dependencies.
Interesting article though. It feels like there's a broader lesson here about not getting obsessed with what for some reason have become best practices, and really taking the time to think independently about the pros and cons of various approaches.
Of course in practice, it depends on how important the app is. If you break gmail it's definitely getting rolled back. If it's just a one-off abandoned tool somewhere doing something they're not supposed to do anyway, or some tests that flake all the time, maybe it will be ignored.
If you are upstream of a lot of important code, you might not be able to move as quickly as you like and that's because you have a lot of responsibility.
Migrations need to be planned carefully. Unlike in open source, you cannot make an incompatible change, bump the major version number, and call it a day. Migrating the downstream dependencies is your job. Version numbers are mostly meaningless and compatibility is judged by actually compiling code and running tests.
You soon learn that if you can fix a bug without changing a public API it's a whole lot easier. But if you need to change an API, there's a way to do it.
This is really what kept things stable. If you maintained a common library and wanted to update it, you had to either keep things backwards compatible, write an automated refactor across the codebase (which wasn't too hard because of awesome tooling), or you would get dozens of angry emails when everything broke. If you relied on a library, you had to be sure to write solid e2e tests or your code might silently break and it would mostly be your fault.
I definitely miss the big G, so many resources were dedicated to engineer productivity and tooling. It's what makes the monorepo work.
Can you elaborate on this? Are they just days where the team goes "Ok, no need to stress about new features today, let's just catch up on test coverage / documentation"? If so, that sounds pretty wonderful.
(And I guess I'm putting words into the mouth of the earlier poster, sorry about that. This is from my own experience, not theirs!)
I just wonder about the repository performance and the everyday workflow.
There's a ton of fantastic artifact caching (that gets invalidated whenever source changes, which is frequently), distributed building, and some really nifty source code juggling.
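For readers who haven't seen this kind of caching, a minimal sketch of the general idea follows. It assumes artifacts are keyed by a hash of their input file names and contents, so any source change produces a new key and the old artifact is simply never looked up again. `ArtifactCache` and `fake_compile` are hypothetical names, and this says nothing about how Google's actual infrastructure is implemented:

```python
import hashlib


def cache_key(sources):
    """Derive a key from file names and contents (a dict of name -> bytes)."""
    digest = hashlib.sha256()
    for name in sorted(sources):  # sorted so dict ordering cannot change the key
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(sources[name])
        digest.update(b"\0")
    return digest.hexdigest()


class ArtifactCache:
    """Toy in-memory cache; a real one would be shared across build machines."""

    def __init__(self):
        self._artifacts = {}
        self.builds = 0  # how many times we actually had to compile

    def build(self, sources, compile_fn):
        key = cache_key(sources)
        if key not in self._artifacts:
            self.builds += 1
            self._artifacts[key] = compile_fn(sources)
        return self._artifacts[key]


def fake_compile(sources):
    # Stand-in for a real compiler: the "artifact" is just the total byte count.
    return sum(len(body) for body in sources.values())


cache = ArtifactCache()
sources = {"main.c": b"int main(void) { return 0; }"}
cache.build(sources, fake_compile)  # first build: compiles
cache.build(sources, fake_compile)  # unchanged sources: cache hit
sources["main.c"] = b"int main(void) { return 1; }"
cache.build(sources, fake_compile)  # changed source: new key, compiles again
print(cache.builds)  # 2
```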
This doesn't prevent build-breaking changes; it just makes the team using the library aware of what is broken, so they can have a fix ready when they want to upgrade their dependencies.
And god damn, I hate Mark Shuttleworth for using that name for his company.
Without reading it, I've used monorepos and multiple repos, and far prefer the former. What people don't get is that any system consisting of software in multiple repos is really a monorepo with a really poor unit & integration test story. The overhead of managing changes to inter-repo dependencies within the same system is utterly insane; I'd say that it's somewhere between n^2 and 2^n, where n is the number of repositories (the exact number depends on the actual number of dependency relationships between repos).
In fact, after these several years, I'm beginning to think that 'prefer' is not the word: monorepos appear, more and more, to be Correct.
Coincidentally, I have been reading "Building Microservices" from O'Reilly and I still don't see the benefits.
This is my analysis as well, after working with both.
I can see how it would be an advantage to use only one library version but it could also discourage upgrading since all the cost is upfront.
A client software in Java, another client in C#, another one in Obj-C and a server software in C++ will be hard to test together, will share no code, and often be developed in independent development cycles, often by independent teams.
Putting such things in monorepos would be stupid.
You can of course just press on regardless anyway, but that doesn't mean it's a great idea!
- This is a technique, and it's a toolset, but most importantly it's a commitment. Google could have split this up many times. In fact, this would have been the "easy" thing to do. It did not. That's because this strategic investment, as long as you keep working it, keeps becoming more and more valuable the more you use it. Taking the easy way out helps today -- kills you in the long run.
- This type of work isn't just as important as regular development; it's more important than regular development, because it's the work that holds everything else together.
- In order for tests to run in any kind of reasonable amount of time, there has to be an architecture. Your pipeline is the backbone you develop in order for everybody else to work.
- You can't buy this in a box. Whatever you set up is a reflection of how you evolve your thinking about how the work gets delivered. That's not a static target, and agreement and understanding is far more important than implementation. I'm not saying don't use tools, but don't do the stupid thing where you pay a lot for tools and consultants and get exactly Jack Squat. It doesn't work like that.
We don't know if the monorepo approach is worth it. Google believes it is (Facebook as well). Many others don't. The manyrepo approach also has advantages.
The article goes into detail about both benefits and drawbacks.
> The manyrepo approach also has advantages.
Care to elaborate on any of these? (aside from the obvious "works fast on my machine")
The problem is which version of a third-party dependency the various projects in the BIG repo should depend on.
The article mentions that: "To prevent dependency conflicts, as outlined earlier, it is important that only one version of an open source project be available at any given time. Teams that use open source software are expected to occasionally spend time upgrading their codebase to work with newer versions of open source libraries when library upgrades are performed."
So if you have a lot of external dependencies -- you need a dedicated team to synchronize them with all your internal projects.
In general, the rule was good, as it kept most teams up to date with the latest security patches, etc. It did, however, incur a fair bit of pain, as you frequently had to bump a myriad of dependents whenever bumping some package.
And npm packages were/are a real nightmare to manage due to that rule. Particularly when you have some popular npm packages that end up (transitively) depending on 3+ versions of a particular package.
The other thing is that third party packages are checked in as source. Everything is built from source at Google.
I mean, there's no way to prove something is actually used in those cases except for actually running the thing.
Do you just rely (and hope) on tests across all projects?
It usually goes like this:
- You make your API backward compatible.
- You use grep to try and find all the ways people are using it.
- You write and run a codemod to convert 70-90% of the call sites mechanically.
- You manually fix the remaining ones. This is usually a good opportunity to throw a hackathon-style event where a bunch of people sit in a room for a few hours and drive it down.
- You remove support for the previous version.
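The codemod step above can be sketched roughly like this. It is a deliberately naive, regex-based illustration assuming a hypothetical migration from `old_fetch(url)` to `fetch(url, timeout=30)`; real codemods usually work on a parsed syntax tree, and both function names are invented here. Simple call sites are rewritten mechanically and everything else is reported for the manual pass:

```python
import re

# Hypothetical migration: old_fetch(url) becomes fetch(url, timeout=30).
# Only calls whose arguments contain no nested parentheses are rewritten.
SIMPLE_CALL = re.compile(r"\bold_fetch\(([^()]+)\)")
ANY_USE = re.compile(r"\bold_fetch\b")


def run_codemod(source):
    """Rewrite simple call sites; return new text and leftover line numbers."""
    converted_lines = []
    leftovers = []
    for number, line in enumerate(source.splitlines(), start=1):
        new_line = SIMPLE_CALL.sub(r"fetch(\1, timeout=30)", line)
        if ANY_USE.search(new_line):
            leftovers.append(number)  # still mentions the old API: fix by hand
        converted_lines.append(new_line)
    return "\n".join(converted_lines), leftovers


SAMPLE = """\
page = old_fetch(url)
data = old_fetch(build_url(host))
callback = old_fetch
"""

converted, leftovers = run_codemod(SAMPLE)
print(converted)
# page = fetch(url, timeout=30)
# data = old_fetch(build_url(host))
# callback = old_fetch
print(leftovers)  # [2, 3]
```

The leftover list is what the people in the room work through; once it is empty, the old entry point can be removed.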
The mere thought of this makes me think of the dark ages and a time in my career that I'd rather forget. All hail statically typed languages. Or languages that can easily be tooled against.
I assume it's easier with RPC.
Perhaps the extra tools, automation, testing, etc. help to a large extent; I can see that being reasonable, but I don't see how they solve all the problems I have in mind.
Perhaps more so, if you've invested in all these automated tools, I am, perhaps (certainly?) ignorantly, not entirely certain what those tools inherently have to do with the choice of a monolithic codebase. Couldn't many of them work on a distributed codebase if they're automated? I mean, we're talking about "distributed" in the sense that it's all still in the one org here... I realise that in practice, this distinction between monolithic and distributed is possibly getting a bit academic...
A lot of these things depend upon the ability to quickly and easily see how a given piece of code is being used universally. Whether that means a single repo or simply a single unified index of all code and the ability to atomically commit to every repo is immaterial, but the latter sounds a lot like the former.
"The team is also pursuing an experimental effort with Mercurial, an open source DVCS similar to Git. The goal is to add scalability features to the Mercurial client so it can efficiently support a codebase the size of Google's. This would provide Google's developers with an alternative of using popular DVCS-style workflows in conjunction with the central repository. This effort is in collaboration with the open source Mercurial community, including contributors from other companies that value the monolithic source model."
Python is a high-level language, a bytecode language, a standard library, and a variety of interpreters and compilers. And a large community of shared libraries, forums, collaborators, etc.
Java is a high-level language, a bytecode language, a standard library, and a (smaller) variety of interpreters and compilers. Given their similarities, why would you expect that one language can, but the other cannot create a "scalable" software project?
The Google code-browsing tool CodeSearch supports simple edits using CitC workspaces. While browsing the repository, developers can click on a button to enter edit mode and make a simple change (such as fixing a typo or improving a comment). Then, without leaving the code browser, they can send their changes out to the appropriate reviewers with auto-commit enabled.
I don't know if it's the same, as I just heard of it from a YAPC talk a few days ago and haven't tried it, but it's called "Code Search" so it seems likely.
The tool that was open sourced by Russ uses the same technology/scheme that the original codesearch was built on.
Given Russ wrote code search (the service) as an intern in a few months, one would think you should be able to take the pieces and put the rest together.
As for cost to maintain it for the rest of the world --
Look, if you have, say a team of 3-4 people, and your mandate is mainly one to support internal developers, and there is plenty to do there, you just aren't going to end up keeping external folks happy.
This is likely true of almost anything in the software world.
Even if it works, people want to see it evolve and continue to get better, no matter what "it" is.
If the next question is "why is internal so different from external that this matters", I could start by pointing out that, for example, internally it doesn't need to crawl svn and other repositories to be useful. There are tons and tons of things you don't need internally but do need externally. What it takes for a feature to be "good enough" to be useful is also very different when you are talking about the world vs 20000 people.
So it's not really even a question of "cost to maintain" in some sense.
Almost all engineers have access to the "complete system," which is really Google's server-side software. Other repos like Android have historically been more locked-down, but there's been some recent effort to open them up within Google.
Presumably if you tried to copy all of the files, you'd first be rate-limited, then get an access denied, and lastly a visit from corporate security. I wouldn't want to try.
And there's no way it'd fit on a single hard drive.
We wanted to send them a script to help with the conversion process, and they flipped out and told us not to send them any code until we became authorized in some special capacity.
They then told us how, a week or so before, a coworker had sent himself some code so he could work on something from home (according to him at least), and within 20 minutes of emailing himself that code, security was already escorting him out the door.
That's 20 minutes for automated filters to catch it, security to review it, HR to process the termination paperwork, and security to go to his desk to escort him out.
Generally, you don't check out all the code you're using, just what you modify. You work in repos that exist virtually, because full checkouts would be huge.
For example, single-repository environments may require you to check out everything in order to do anything. And frankly, you shouldn’t have to copy The World most of the time. Disk space may be cheap but it is never unlimited. It’s a weird feeling to run out of disk space and have to start deleting enough of your own files to make room for stuff that you know you don’t care about but you “need” anyway. You are also wasting time: Subversion, for instance, could be really slow updating massive trees like this.
There is also a tendency to become too comfortable with commonality, and even over-engineer to the point where nothing can really be used unless it looks like everything else. This may cause Not-Invented-Here, when it feels like almost as much work to integrate an external library as it would be to hack what you need into existing code.
Ultimately, what matters most is that you have some way to keep track of which versions of X, Y and Z were used to build, and any stable method for doing that is fine (e.g. a read-only shared directory structure broken down by version that captures platform and compiler variations).
Dependencies are usually handled by a binary import/export system where you pull modules from a shared "build the world" system from a known checkpoint.
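As a rough illustration of "some way to keep track of which versions of X, Y and Z were used to build", here is a minimal sketch of a build manifest. The function name, the field names, and the library names and versions are all made up for the example; any stable record that captures versions plus platform and compiler would do the same job.

```python
import hashlib
import json


def build_manifest(dependencies, platform, compiler):
    """Return a stable record of what went into a build.

    `dependencies` maps a library name to the exact version used.
    All names here are hypothetical; this is not any real tool's format.
    """
    record = {
        "platform": platform,
        "compiler": compiler,
        "dependencies": dict(sorted(dependencies.items())),
    }
    # Serialize deterministically so identical inputs always hash identically.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"id": record_id, **record}


manifest = build_manifest(
    {"libY": "2.0.1", "libX": "1.4.2", "libZ": "0.9.0"},
    platform="linux-x86_64",
    compiler="gcc-4.9",
)
print(json.dumps(manifest, indent=2))
```

The `id` changes whenever any version, the platform, or the compiler changes, so two builds with the same `id` were made from the same declared inputs.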
Can you justify it with examples?
For examples of Google non-superiority (remember, this is a hacker/entrepreneurial forum, we should seek solidarity in outdoing behemoth entities with agility), I simply encourage you to think for yourself and not put any credence into lore, methodology, or tech simply because it comes out of Google. I see a 1:1 comparison between Android and the non-WinNT kernel Windows releases. Google put a festering pile of code out and allowed handset makers and carriers to basically never patch any type of vulnerability. The permissions model and app overreach are just barely now contained in Android 6... seven years after release. Chrome bundles tons of third party libraries... it's another moving train wreck with enough bodies to somehow deal with the naive vendoring, scope creep, and general upkeep, but it's still a nightmare to correctly build and package for an OS. By comparison I have immense respect for the Servo developers, who are making an interesting reach with far fewer resources than Google.
At Square, we have one main Java "monorepo", one Go "monorepo", and a bunch of Ruby repos. The Java repo is the largest, by a huge factor (10x the Go repo, for example).
The Java repo is already noticeably slow when doing normal git operations. And I'm told we've seen nothing of the pain the folks at Twitter have had with git: their monorepo is huge.
We periodically check to see how the state of Mercurial and Git is progressing for monorepo-style development. Seems like Facebook has been doing interesting stuff with Mercurial.
But I still miss the fantastic tooling Google has internally. It really is so much better than anything else I've seen outside.
We have ~90k lines of Java code. I don't think it will be a problem unless it grows tenfold. We spend time grooming it: removing old code, etc. I believe that is the case for most companies, unless you are Google, Facebook, Square, etc.
Here's the cloc of mozilla-central:
Language      files     blank    comment       code
SUM:         102542   2311779    2493273   13090450
Maybe we can change the link to that?
Someone gave an Internet Archive link elsewhere in the thread:
Thanks a lot for sharing!
For us non-Googlers, would you trade away the benefits of pip, gem, and npm for a single source of truth?
If my central package repo had genuinely reproducible builds, such that I could download the source, run the build process and know that I had the same output that the repo held, then I think that I would love to have a setup where:
- I could "add a dependency" by having an automated process that downloads the source, commits it as its own module (or equivalent) in my source repo, tags it with the versioned identifier of the package, and then builds it to make sure it matches what the package repo holds.
- I could make local modifications to the package if I needed to, and my source control would track my deviation from the tagged base version
- I could upgrade my package by re-running the first step, potentially merging it with my local changes.
Hmm, I think I just described the ultimate end state for Go package management...
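Two of the checks in that wish list can be sketched in a few lines: confirming that a local build matches what the package repo holds, and listing where the vendored copy deviates from the tagged base version. This is only an illustration under the assumption that the package repo publishes a SHA-256 digest per artifact; the function names are invented, and files are passed in as in-memory bytes to keep the example self-contained.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_reproducible(local_build: bytes, published_digest: str) -> bool:
    """True if the locally built artifact matches the published one."""
    return digest(local_build) == published_digest


def locally_modified(vendored_files, base_digests):
    """Return sorted paths that differ from the tagged base version.

    `vendored_files` maps path -> current content (bytes).
    `base_digests` maps path -> digest recorded when the package was imported.
    Added, changed, and deleted files are all reported.
    """
    changed = [
        path
        for path, content in vendored_files.items()
        if base_digests.get(path) != digest(content)
    ]
    changed.extend(path for path in base_digests if path not in vendored_files)
    return sorted(changed)


base = {"pkg/a.py": digest(b"print('a')"), "pkg/b.py": digest(b"print('b')")}
current = {"pkg/a.py": b"print('a')", "pkg/b.py": b"print('patched')"}
print(locally_modified(current, base))
```

In a real setup the source control system would do the deviation tracking itself; the point is only that both checks reduce to comparing content digests against a recorded baseline.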
And also that there be no walls between teams and projects and code sharing is a universal value across the company. At least outside of Chrome/Android, which are in their own worlds.
For the rest of us, we need the practical ability to make changes that might break dependencies. That in turn requires the ability to pin to specific old versions and to create alternate or forked incompatible versions, simply because it lets people develop in the direction they want without being tied down by others who want a different direction, or no direction at all because the code is no longer maintained.
The main reasons (as usually is the case) are social/political, not technical.
Let's suppose we have a tool that can
* automatically check out the HEAD of each dependent repo
* run a complete integration test suite across all the repos before any push/check-in.
This will work fine even with a multi-repo model, won't it?
Also, as mentioned earlier by others, the reason Google can do it is that Google can
* maintain a powerful cloud-based FUSE file system to support fast checkouts
* run automated build tests before any submit to ensure the correctness of every build
So they don't need to maintain multiple versions of a dependency (for the most part).
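The hypothetical multi-repo tool proposed above could look roughly like this: work out which repos depend on the one being changed, test each of them at HEAD, and allow the push only if all pass. The repo names and the `run_tests` callback are placeholders; actually checking out each repo and running its suite is left to whatever the callback does.

```python
from collections import deque


def dependents_of(changed, depends_on):
    """Return every repo that directly or transitively depends on `changed`.

    `depends_on` maps each repo to the list of repos it depends on.
    """
    # Invert the graph: for each repo, who depends on it?
    reverse = {}
    for repo, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(repo)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for repo in reverse.get(current, ()):
            if repo not in seen:
                seen.add(repo)
                queue.append(repo)
    return seen


def may_push(changed, depends_on, run_tests):
    """Test the changed repo and all its dependents; return (ok, failures)."""
    affected = sorted(dependents_of(changed, depends_on) | {changed})
    failures = [repo for repo in affected if not run_tests(repo)]
    return (not failures, failures)


graph = {"app": ["lib"], "lib": ["core"], "tools": ["core"], "core": []}
ok, failures = may_push("core", graph, run_tests=lambda repo: repo != "app")
print(ok, failures)
```

This only covers the gating logic. What it does not give you is the atomic commit: the change to `core` and the matching fix to `app` still land as separate pushes in separate repos.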
Say version N of the code is compiled and running all over the place, and you make a change to create N+1. Well, if you don't understand the ABI implications of building and running your N+1 client against version N servers (or any other combination of programs, libraries, clients, and servers), then you'll be in a mess.
And if you do understand those ABI boundaries well and can version across them, I'm not sure you need a monorepo much at all.
The article gave me the following ideas to extend GitLab:
Copy-on-write fuse filesystem https://gitlab.com/gitlab-org/gitlab-ce/issues/19292
Autocommit using auto-assigned approvers
CI suggesting edits https://gitlab.com/gitlab-org/gitlab-ce/issues/19294
Coordinated atomic merges across multiple projects https://gitlab.com/gitlab-org/gitlab-ce/issues/19266
huge corporation that provides IT services for the airline industry, handling the reservation process and the distribution of bookings.
Their software is probably among the biggest C++ code bases out there.
We were around 3000 developers, divided into divisions and groups.
Historically they were running on mainframes, and they were forced to have everything under the same "repository".
With the migration to Linux they realized that that approach no longer worked at the scale of the company, and every team/product now has its own repository.
All libraries are versioned according to the common MAJOR.RELEASE.PATCH naming, and patch-level upgrades are done transparently. However, major or release upgrades have to be specifically targeted.
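A minimal sketch of that upgrade policy, assuming plain numeric MAJOR.RELEASE.PATCH strings: a newer patch within the same major and release is picked up automatically, and anything else has to be requested explicitly. The class and function names are illustrative, not the company's actual tooling.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    release: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        """Parse a string such as '2.3.1' into a Version."""
        major, release, patch = (int(part) for part in text.split("."))
        return cls(major, release, patch)


def is_transparent_upgrade(current: Version, candidate: Version) -> bool:
    """True only for a newer patch level of the same major.release line."""
    same_line = (candidate.major, candidate.release) == (current.major, current.release)
    return same_line and candidate.patch > current.patch


def pick_upgrade(current: Version, available) -> Version:
    """Return the newest version that may be applied without explicit targeting."""
    candidates = [v for v in available if is_transparent_upgrade(current, v)]
    return max(candidates, default=current)


available = [Version.parse(s) for s in ("2.3.2", "2.3.4", "2.4.0", "3.0.0")]
print(pick_upgrade(Version.parse("2.3.1"), available))
```

Here 2.4.0 and 3.0.0 are ignored even though they are newer, because moving to them is a release or major upgrade that a team has to opt into.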
What is more important for them is how software communicates, which is through some versioned messages API.
There is also a team that handles all the libraries compatibility, and package them into a common "middleware pack". When I left around 2012 we had at least 100 common libraries, all versioned and ready to use.
financial software used in front/back office for banks.
We had one huge Perforce repo; I can't even begin to tell you what a pain it was.
You could work for a day on a project and then have to wait weeks for a slot to merge it into master.
Once you had a slot to merge your fix into master, chances were that the code had changed somewhere else in the meantime and your fix could no longer be merged. That led to a lot of fixes being done on a premerge branch, manually in the Perforce diff tool.
Also given the number of developers and the size of the repository, there was always someone merging, so you had to request your slot far in advance.
Maybe the problem was that the software itself was not modular at all, but this tends to be the case when you don't force separation of modules, and the easiest way is to have separate repositories.
Small proprietary trading company
We didn't have a huge code base, but there were some legacy parts that we didn't touch often.
We separated everything in different repos, and packaged all our libraries in separate rpms.
It worked very well and it eased the rebuild of higher-level projects. Where releasing a project previously took ~1h, with the libraries separated it took only 5 minutes. It worked well because we rarely changed the base libraries that everyone depended on.
For instance did everyone at Google migrate to Java 8 at the same time? That seems like a huge amount of work in a mono repo.
The presenter in the video linked in this thread says that this is very advantageous due to cross dependencies. I don't think that this is the correct way to handle a cross dependency.
I'd much rather handle it by abstracting your subset of the problem into another repository. Have some features that two applications need to share? That means you're creating a library in my mind. This is much better suited for something like git as you can very simply use sub-modules to your advantage.
Hell, you can even model this monolithic system within that abstracted system. Create one giant repository called "libraries" or "modules" that just includes all of the other sub-modules you need to refer to. You now have access to absolutely everything you need within the google internal code base. You can now also take advantage of having a master test system to put overarching quality control on everything.
This can be done automatically. Pull the git repo, update all the sub-modules, run your test platform.
I'd say that's a better way to handle it. Creating simple front end APIs for all of the functionality you need to be shared.
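The pull, update sub-modules, run tests sequence described above can be sketched as below. The git invocations are standard; the test command (`make test`) and the repository path are placeholders. Note that `git submodule update --init --recursive` checks out the commits recorded in the parent repo, so getting the latest commit of each sub-module would additionally need `--remote`.

```python
import subprocess


def update_commands(test_command):
    """Commands to refresh the umbrella repo and then run its test suite."""
    return [
        ["git", "pull", "--ff-only"],
        ["git", "submodule", "update", "--init", "--recursive"],
        list(test_command),
    ]


def run_all(commands, cwd, runner=subprocess.run):
    """Run each command in order inside `cwd`, stopping at the first failure.

    With the default runner, check=True makes a non-zero exit status raise
    subprocess.CalledProcessError, which aborts the remaining steps.
    """
    for command in commands:
        runner(command, cwd=cwd, check=True)


# Dry run: print the commands instead of executing them.
# "make test" and the path are hypothetical stand-ins.
run_all(
    update_commands(["make", "test"]),
    cwd="/path/to/libraries",
    runner=lambda command, **kwargs: print(" ".join(command)),
)
```

Dropping the `runner` argument executes the commands for real in `cwd`.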
> approximately two billion lines of code in nine million unique source files
So what are the other ~991 million files? I don't doubt that there's a lot of binary files, but what else? Also what does "unique" source files mean?
Is it just me or is there not actually a table there?
This is more of a rhetorical question. As a tech minimalist, the preferable answer by a long shot would be that, internally, Google is keenly aware of the severe bloat and technical debt of their codebase and has clear plans going forward to drastically reduce the scale, by more than 1000x at least, without sacrificing any of the features/bug-fixes/performance of any of the code.
Compare to, say, Apache alone: 1.8M lines of code. Or Riak: 250,000 lines of code.
Just because it sounds like a lot to you doesn't mean it actually is quite as much as you think.
It's not like anyone could be well versed in more than a tiny percentage of the entire codebase, so the gains from that kind of reuse just aren't there.
Google's operations probably require X lines of clean code added every day by 10% of the engineers (so ~5000 instead of ~50,000 engineers), but because of the sheer number of engineers Google has, who are expected to do something continuously to show performance, Google has ended up with ~10X or ~100X lines of daily bloat accumulation.
Sounds like a textbook example of too-many-hands-spoiling-the-broth.
Whoever needs a tool can cd to the source of the tool, build it right then and there, and run it.