Hacker News
Why Google stores billions of lines of code in a single repository (2016) (acm.org)
489 points by fagnerbrack on Dec 10, 2017 | 298 comments



I've worked with both a distributed repo model and a monorepo model and vastly prefer the distributed approach (given the right tooling). The trade-offs are complementary and no doubt with proper discipline you can try to maximize the benefits, while minimizing the downside. But here's what I don't like about working in a large monorepo:

1) Difficult to track changes to the code I'm interested in. Every day there are hundreds of changes in the repo and almost all of them have nothing to do with what I'm working on.

2) all sorts of operations take longer (pulling, grepping source, etc.) to support code I couldn't care less about.

3) Frequently have to update the world at once. Unless the repo can store multiple versions of the same module, then all the consumers have to be updated at once, even if it's inconvenient. Sometimes migrations are better done gradually.

4) Encourages sloppy dependency management. There are frequently unclear boundaries between software layers.

I'm sure people will say "if you're having those problems, you're doing it wrong" but the same thing could be said to people who find the distributed model problematic.


The trick is that Google have their own VCS, build tooling, automated refactoring tools, etc etc, specifically designed to deal with their monorepo. Nobody else has that - we're stuck with git and a complex landscape of tools for managing code in ad-hoc ways. As a result, with the tools we have, many repos is better than a monorepo - but perhaps if we had those tools, for some cases, a monorepo might be better than many repos.

Note that even where Google are forced to use git (e.g. Android, Chrome) they use a many-repo approach.


Google has that, Facebook has that, Microsoft partially has that, some investment banks have that too.

Everywhere I've seen mono repo, mono repo was better than multi repo.

They all built special tooling and have dedicated teams to support it.


I wish someone would create an open-source vcs that supports mono-repos at scale out of the box.



Maybe we could look at the problem from the other side. Create tools to manage multi-repos as if they were a single mono-repo. A docker-compose for git.


It's sort of intrinsically a pain since then you lose atomic, cross project commits.



Android does this with a tool named repo. It ties into gerrit and you can treat the whole Android project as one repo.
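For reference, repo is driven by an XML manifest that lists each git project and where it lands in the unified tree; a minimal sketch (remote URL and project names made up):

```xml
<manifest>
  <!-- Where all the individual git repos are hosted -->
  <remote name="origin" fetch="https://git.example.com/" />
  <!-- Branch to sync by default -->
  <default remote="origin" revision="master" />
  <!-- Each project maps one git repo into a path in the unified tree -->
  <project name="platform/build" path="build" />
  <project name="platform/frameworks/base" path="frameworks/base" />
</manifest>
```

`repo init -u <manifest-repo-url>` followed by `repo sync` then materializes the whole tree as one checkout.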


What is the problem with Subversion?


Are you proposing a subversion-based monorepo?


We get a lot of mileage out of Subversion as a monorepo. It certainly works better than people like to give it credit for.


Perforce Helix might be even better - it even has a DVCS model based on creating a "local server" that can fetch/push from a shared server asynchronously from use of that local server, and a hybrid model that allows for only parts of the repository to be hosted on your personal server, and other parts to follow the more traditional Subversion-like model. Things like exclusive locks on files that can't really be "merged" are also supported (for example, all your assets).

The only downside is that it's not open-source, and as a result has a much smaller community. It's free for up to 5 users, then "email us" for any more. But if a very flexible VCS model is something you need, it's the same as anything else you need to pay for.

Google used to use Perforce until they hit a certain scale, so it's likely it'll work for you until you hit that scale and can build your own tools too.


Well, it seems to fit the requirements better than git. Obviously, subversion is not used much. I would like to hear some experience reports on what the problem with it is.


We have two issues with Subversion:

- it requires a certain discipline: we need branching in our workflow and this is handled mostly by convention in a subversion repository. We have "branches" that were created by less careful colleagues by copying subdirectories of trunk to the branches folder.

- all the tooling developers fled to work on making git bearable. It seems that there is good money in sugarcoating git and none in making good tools for Subversion (awareness of branches in Jenkins, decent code review...). We have a budget, but that does not compensate for the lead that git has in that regard.

Other than that, subversion fits our needs. It just works.


Subversion is not used much anymore - just in case you entered the industry after its heyday.

Subversion was used in basically every open source project as a replacement for the previously dominant CVS.

Subversion was better than CVS, but still bad in many aspects; slow synchronization and poor branching and merging support come to mind.

Because of these shortcomings, and because the idea of decentralized versioning was coming up, many systems like git, mercurial, and others appeared, and git seems to be the most successful of these by now.


That being said, not sure how and if subversion has improved by now, didn’t use it for years...


What special tooling is required to deal with a monorepo that is not required for multi repo?


Must have: Tooling that can interact on a file or sub directory level. Git cannot do that.

Should have: Access control to view and change files on a subdirectory basis. Everyone can see the repo so you can't permission users per repo anymore. It's optional but these companies have that.

Recommended: Global search tools, global refactoring tools, global linting that can identify file types automatically and apply sane rules, unit test checks and on commit checks available out of the box for everything and that run remotely quickly, etc...

It's regular tooling that every development company should have, but only big companies with mono repos have it.

It's not that the tooling is needed to deal with the mono repo, it's that the tools are great and you want them. But they can't be implemented in a multi repo setup.

Think about it. How could you have a global search tool in a multi-repo setup? Most likely, you can't even identify what repos exist inside the company.
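If you can at least get every checkout under one parent directory, the brute-force version is just a recursive grep; a self-contained sketch with throwaway repos (real code-search tools index ahead of time instead):

```shell
set -e
# Stand in for "every repo in the company, cloned under one root":
tmp=$(mktemp -d)
mkdir -p "$tmp/repo1" "$tmp/repo2"
echo 'def parse_config(): pass' > "$tmp/repo1/main.py"
echo 'cfg = parse_config()'     > "$tmp/repo2/app.py"

# One grep across every checkout answers "who uses parse_config?":
grep -rn "parse_config" "$tmp"
```

The catch is exactly what the comment says: this only works if you can enumerate the repos and keep them all synced, which is the part that silently fails in practice.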

Makes me realize: if I ever go back to another tech company, the shit tooling is gonna make me cry.


IIRC, Bitbucket Enterprise has pretty decent global search. GitHub Enterprise doesn't seem to have much of any cross-repo tooling, which is one of my least favorite things about it.

Global refactoring seems a lot less necessary if you have clean separation among your processes. Maybe this is me coming from a more microservices perspective, but I'm inclined to say that needing to do a refactor that cuts across several different functional areas is a sign that things are becoming hopelessly snarled together.


Google has dedicated language, platform, library, etc. teams (I'm no longer there) that can push really huge refactoring changelists - for example, if they noticed that code had plenty of "if (someString == null || someString.empty())", they would replace it with something simpler.

Or if they found some bad pattern, they would pull it out too. I remember when a certain java hash map was replaced, and they replaced it across the codebase. It broke some tests (that were relying on a specific iteration order, which was wrong) - and people quickly jumped in and fixed them.

This level of coordination is great. And it's not just "let's do it today" - things are prepared in advance: days, weeks, months, even years if it has to be. With careful rollout plans, getting everyone aware, helping everyone get to their goal, etc.

It's also easy to establish code style guides and remove the bikeshedding over tabs/spaces, brace placement, switch/case statement styles, etc. Once a tool has been written to reformat (either an IDE, or other means), and another to check style and some semantics - then, whether people like it or not, they soon get on that style and keep going. There are more important things to discuss.


The idea of global refactoring is mostly that you can decide to modify a private API, and in the process actually update all the consumers of that API, because they all live in the same repo as the component they're consuming. (This is also the argument of the BSD "base system" philosophy, vs. the Linux "distro" philosophy: with a base-system, you can do a kernel update that requires changes to system utilities, and update the relevant system utilities in the very same commit.)
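That kind of atomic cross-consumer change can be sketched with stock tools against a throwaway repo (identifiers and layout made up; GNU sed assumed for `sed -i`):

```shell
set -e
# Tiny "monorepo" with a library and one consumer:
tmp=$(mktemp -d)
git init -q "$tmp/mono"
cd "$tmp/mono"
mkdir -p lib app
printf 'int oldName(void);\n' > lib/api.h
printf 'int x = oldName();\n' > app/main.c
git add .
git -c user.email=x@x -c user.name=x commit -qm init

# Rename the API and every consumer, then land it as one atomic commit:
git grep -lz oldName | xargs -0 sed -i 's/oldName/newName/g'
git -c user.email=x@x -c user.name=x commit -qam 'Rename oldName -> newName everywhere'
```

In a multi-repo setup the same rename becomes N commits across N repos, with an interval where the two sides disagree.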


Code search in bitbucket server is dismal. All punctuation characters are removed. This includes colons, full stops, braces and underscores. This makes it close to useless for searching source code.

Regarding global refactorings think new language features or library versions.


Bitbucket PM here. Thanks for the feedback!

Support for punctuation in search is something we knew wasn't ideal when we first added code search. As with all software, there were some technical constraints that made it hard to do.

We plan to support full stops and underscores in a future version and are exploring how best to handle more in the longer term. Our focus, based on feedback, is on "joining" punctuation characters to better allow searching for tokens. Support for a full range of characters threatens to blow out index sizes, but if we get more feedback on specific use cases we're always happy to consider them.


That boggles the mind. Why wouldn't they just ship Hound or something else based on the Go regex search backend?


There's always a reason ;)

Being a self-hosted product we have to make tradeoffs for the thousands of people operating (scaling, upgrading, configuring, troubleshooting...) instances. In short, we try to keep the system architecture fairly simple using available technology and keeping the broad skillsets of admins in mind.

It was a somewhat difficult call to add ElasticSearch for its broad search capability, but it being used for other purposes helped justify it. Adding Hound or the similar services we considered would have added more administrative complexity and wouldn't have provided for a broader range of search needs.

We continue to iterate on search, making it better over time.


A fair point, but I will just say that Hound is _astonishingly_ low maintenance. I set it up at my current employer like two years ago and have logged into that VM maybe twice in the entire time. It just hums along and answers thousands of requests a week with zero fuss.


You really need a good "search and replace", whether it's called a refactoring tool or something else.


> Must have: Tooling that can interact on a file or sub directory level. Git cannot do that.

I mean, when you get big, sure. But until you're big, git is fine. Working at fb, I don't use some crazy invocation to replace `hg log -- ./subdir`, I just do `hg log -- ./subdir`. Sparse checkouts are useful, but their necessity is based on your scale - the bigger you are, the more you need them. Most companies aren't big enough to need them.
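To make the sparse-checkout point concrete, stock git (2.25+) does ship the feature; a self-contained sketch against a throwaway two-project repo (names made up):

```shell
set -e
# Build a toy "monorepo" with two top-level projects:
tmp=$(mktemp -d)
git init -q "$tmp/mono"
cd "$tmp/mono"
mkdir -p projA projB
echo a > projA/a.txt
echo b > projB/b.txt
git add .
git -c user.email=x@x -c user.name=x commit -qm init

# Clone it, then restrict the working tree to just projA:
cd "$tmp"
git clone -q mono work
cd work
git sparse-checkout init --cone
git sparse-checkout set projA    # working tree now contains only projA
```

History and objects for projB are still there; only the working tree is narrowed, which is the part that matters for day-to-day file operations.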

> Should have: Access control to view and change files on a subdirectory basis. Everyone can see the repo so you can't permission users per repo anymore. It's optional but these companies have that.

Depends on your culture (and regulatory requirements). I prefer companies where anyone can modify anyone's code.

> Recommended: Global search tools, global refactoring tools, global linting that can identify file types automatically and apply sane rules, unit test checks and on commit checks available out of the box for everything and that run remotely quickly, etc...

I'd bump this up to `should have`. The power of a monorepo is being able to modify a lib that is used by everyone in the company, and have all of the dependencies recursively tested. Global search is required, but until you're big, ripgrep will probably be fine (and after that you just dump it into elasticsearch).


> Depends on your culture (and regulatory requirements). I prefer companies where anyone can modify anyone's code.

This is still true at Google, except for some very sensitive things. However, every directory is covered by an OWNERS file (specific or parent) that governs who needs to sign off on changes. If I’m an owner, I just need any one other engineer to review the code. If I’m not, I specifically need someone that owns the code. IMHO, this is extremely permissive and the bare minimum any engineering organization should have. No hot-rodding code in alone without giving someone the chance to veto.
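Google's internal OWNERS format isn't fully documented publicly, but the open-source Chromium convention looks roughly like this (addresses hypothetical):

```
# OWNERS file for this directory; any address listed here
# may approve changes under it.
alice@example.com
bob@example.com
# A narrower rule covering only build files:
per-file *.gn=build-team@example.com
```

Each directory can have its own file, and parent-directory owners apply recursively, which is what makes the "specific or parent" lookup work.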

> ripgrep, ElasticSearch

Having something understand syntax when indexing makes these tools feel blunt. SourceGraph is making a good run at this problem.


Eh, at least in FB, I see more unstructured querying.


Elasticsearch is too dumb. You need to use a parser and build a syntax tree to have a good representation of the code base. That's what facebook and google do on their java code.

Agree that any small to medium company could have a mono repo without special tooling. Yet they don't.

There are companies that care about development and there is the rest of the world.


Github uses Elasticsearch [1]. I agree that ES is too dumb by default, however the analysis pipeline can be customized for searching source code.

1. https://www.elastic.co/use-cases/github


Might I suggest using a tool designed for searching source code rather than dumping into elastic. Bitbucket, sourcegraph, github search or my own searchcodeserver.com

Unless designed to search source code most search tools will be lacking.


I had a bad time at Google and was glad to leave, but wow did I ever miss that culture of commitment to dev process improvement and investment in tooling. The next startup I joined was kind of a shocking letdown. It became clear pretty early on that nobody else there had ever seen anything like the systems at Google, couldn't imagine why they might be worth investing in, and therefore the level of engineering chaos we wasted so much time struggling with was going to be permanent.

The startup I'm working for now is roughly half ex-googlers, so it is a different story. Of course we can't afford Google level infrastructure, but there is at least a strong cultural value around internal tooling, and a belief that issues with repetitive or error-prone tasks are problems with systems, not the people trying to use them.


Worked at google for 2-3 years, mainly java, under google3. My thoughts: having things under a single repo, with a system like blaze (bazel), I can quickly link to other systems, or be prevented/warned that it's not a good idea (the system may be going deprecated, or be brand new, and you need visibility permission (which can be overridden locally)).

Build systems, release systems, integration tests, etc. - everything works easier - as you refer to things just by global path like names.

Blaze helps a lot - one language for linking protobufs, java, c++, python, etc., etc., etc.

Lately docs are going in it too, with renderers.

Best features I've seen: code search lets you jump by clicking through all references. Lets you "debug" directly against things running in servers. Lets you link specific versions, check history, changes, diffs.

GitHub is very far from this, if for nothing else then because it's not even possible for it to know how things are linked. Even if github.com/someone/somelibrary is used by github.com/someone-else/sometool, GitHub would not know how things are connected - is it CMake, Makefiles, .sln, .vcxproj? It may be able to guess, but that would be lies in the end... Not the case at google - you can browse things better than in your IDE - better than you could even produce this information for your IDE (a process that runs every so often updates it, using a huge MapReduce to do that).

Then local client spaces - I can just create a dir, open a workspace there, and virtually everything is visible from it (the whole monolithic depot) plus my changes. There are also a couple of other ways to do it (a git-like client, for one), but I haven't explored those.

What's missing? I dunno... I guess the whole overwhelming feeling that such a beast exists, and that it's already tamed by thousands of SREs, SWEs, managers, and just the most awesome folks.

I certainly miss the feeling of it all. I'm back to good ole p4, but the awesome company that I'm at also realized that a single depot is the way to go (with perforce, that is). We also have git, but our main business is game development, and the huge .tiff files, model files, etc. are what require perforce.

Also, ReviewBoard and now Swarm (the p4 web interface and review system) are nice so far. Not as advanced as what google had internally for review (no, it's not gerrit, I still can't get my head around that thing), but it's getting there.

One last point - a monotonically incrementing changelist number will always be easier to work with than random SHAs without order - you can build whole systems of feature toggles, experiments, and build verifications around it, like:

"This feature is present if built with CL > 12345, or with cherrypicks of CL 12340 and CL 12300." You may come up with ways to do this with SHAs too - but imagine what your configuration would look like. It's also easier to explain to non-eng people - it's just a version number.
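A toy version of such a CL-number gate (flag name and numbers made up; in reality the CL is stamped into the binary at build time rather than hard-coded):

```shell
# Changelist the running binary was built at (would be injected by the build):
BUILD_CL=12347

# A feature is "present" iff the build is at or past the CL that introduced it.
feature_enabled() {    # usage: feature_enabled <min_cl>
  [ "$BUILD_CL" -ge "$1" ]
}

if feature_enabled 12345; then
  echo "new-retry-logic: on"
else
  echo "new-retry-logic: off"
fi
```

The same predicate is meaningless with SHAs, since two commit hashes carry no order; you'd need an ancestry query against the repo instead of a single integer comparison.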


Sounds like an opportunity for Google cloud


Wouldn't it be better to just adjust the linter, refactoring tools, etc. to work on a multi-repo hierarchy? Most of them already mostly do.


What special tooling is required to deal with a monorepo that is not required for multi repo?

From my time at Google the first thing that came to mind was citc. But I couldn't remember if citc was publicly known, so I did an Internet search for "google citc". The first search result was this article.

"CitC supports code browsing and normal Unix tools with no need to clone or sync state locally."


Facebook is painfully non mono repo.


Dan Abramov's comment elsewhere in this thread says otherwise:

"I work at Facebook, and can confirm we keep all code in a monorepo".

https://news.ycombinator.com/item?id=15893184


Unless something drastic changed in the last year, I really doubt it. There is the fb frontend, the backend, the offline batch processing repo, and the instagram frontend repo. I think the phone apps have their own repos too? It was a giant mess, especially when you had to make changes that spanned repos, like introducing a new backend API and then depending on it, or changing logging formats.


> Note that even where Google are forced to use git (e.g. Android, Chrome) they use a many-repo approach.

Google uses many-repo approach for Android and Chrome because you cannot fit everything in a single git repo (well you can, but it will be a pain in the ass to work on that repo). Git is just not designed for huge repos. Google is also working on tools to make the many-repo of Android or Chrome work like a monorepo.


I have written my own tooling. It is doable. Just time-consuming. -- It took me about a year (2015) of my (limited) spare time.


Software A version 1 consumes format F1 and produces format G1 data, and software B version 1 consumes format G1 and produces H1.

To upgrade format G2 we must change both software A and B.

First, software B version 2 must accept both G1 and G2. To do this we may need to build software A version 2 and try them in a sandbox environment to gain confidence that ∀F1 we produce the correct G2. If F1 is complete, we may be able to do this exhaustively, but if F1 is sufficiently diverse, monte carlo simulation might be used.

Then, if there's a 1:1 relationship between A/B we can upgrade pairs.

If there's an N:M relationship, we need to upgrade all of the instances of software version B1 to B2 (at least within a shard). If you're running in a non-stop environment, this might have its own challenges. Only then can we begin the upgrade from A1 to A2.

Now:

Something, somewhere needs to record what and where we are in this journey. It is relatively straightforward how to do this with a monorepo, but it is very unclear how to do it with a distributed repository:

Almost everyone I know punts and uses some other golden record (like a continuous integration server, or a ticketing system, or an admin/staging system), and like it or not: that's your monorepo.


You can also design software A to produce both G1 and G2 side by side, deploy it, and then develop new software B against G2, submitting bug reports to project A when there’s a problem detected in G2.

If you’re doing the multirepo strategy it’s best imho to make the projects truly independent, as if they were developed by different companies. That way every project only needs to think about its own dependencies and consumers, and how to do migrations, without needing to have the big picture mapped out.


> You can also design software A to produce both G1 and G2 side by side

This can be impractical if G is a database table that is very large.

> it’s best imho to make the projects truly independent, as if they were developed by different companies.

One of our systems might cost £300k, so completely desynchronising them so that code paths can build both G1 and G2 simultaneously (allowing B to develop separately) means "simply" doubling the costs. That might put our team at a disadvantage against someone who figures out another way.


> This can be impractical if G is a database table that is very large.

If this is true, you have no choice, and must run things side by side while you convert to G2. Or shut everything down to make the migration atomic, which is increasingly not an option.


> you have no choice

It depends on your database.

If you imagine a simple postgres or mysql server and an "alter table G..." then you're right.

If (however) G1⊂G2 then a document store or a column-based database can usually partition the table somehow.


> This can be impractical if G is a database table that is very large.

Unless the new format G2 is a superset of the old format G1. Read the protobuf best practices to learn more about this.


Maybe using a ticketing system [or just call it a project management system] is the right abstraction level.

If A and B have nothing to do with each other - other than that, for some circumstantial reason, they consume data from each other - then why would we care if A or B starts to support a new output format?

If we want to do a format change for some reason - maybe it'll allow better security/traceability - then sure, make a project and track the tasks (like: make A able to produce/consume the new format, make B able to produce/consume the new format, deploy A2 and B2 to the test environment, promote to prod), but I don't see why you would track that at the source code versioning level.

A and B have separate tests to ascertain that they can deal with the new format, and then you do the integration testing, which might catch problems that should then be covered by unit tests in A or B. (Or in a fuzzer for said format.)


> Maybe using a ticketing system [or just call it a project management system] is the right abstraction level.

> If A and B has nothing to do with each other - other than for some circumstantial reason they consume data from each other, then why would we care if A or B starts to support a new output format?

First, even if the coordination between A and B is recorded in the ticketing system, the coordination between F and G is probably not.

> I don't see why you would track that at the source code versioning level.

Pretend F and G are tables in a database (or other data storage system) if that makes it easier.

Where is the schema stored? Who records the migration path?

Many people like to record migrations in a version control system, but it is tricky to link those migrations to the (otherwise) independent A and B.

If these are file formats, where does the code to consume and produce them live? Or network formats? The problem remains the same - do we break this up into additional libraries?

That there's a very real ordering between the releases of A and B that isn't properly encoded means we're relying on process diligence (as opposed to tooling) to be correct.


If tables, then if they are in the same DB, they should be in the same project.

If they are independent tables, then I don't care, show me the API between the projects.

If these are file/network/serialization/wire/in-memory/binary/codec formats, then there are conformance checkers (passive and active, like fuzzers). Those are separate projects, but they can be used like tools during testing and development.

Rely on tooling to make sure that the stated goal of the project is reached. (It now supports F or G or X,Y,Z formats. It supports output-format G by processing input-format F. If that's a project requirement, test it in that project.)

You can use a top-level repo for the integration tests. But there's no need to make it one flat repo.


> If tables, then if they are in the same DB, they should be in the same project.

Lock-stepping two otherwise unrelated applications because they both share support for a data structure is silly at best, and often impractical, especially if development for only one of the projects is "in-house". Consider the possibility that "A" is a commercial product produced by another company.

Anyway, it's my experience most software upgrades don't involve a schema change, so it's worth optimising for the common case, and supporting the difficult case.


> but it is very unclear how to do it with a distributed repository

Versioning through branching and tagging - while it has some drawbacks, at least the fact that you have to DO an operation and that this is not automatic - seems to solve this problem, and is not, in my eyes, a form of monorepo. You globally get more flexibility at the cost of a bit more repo-management work.

If the problem is retrieving the right version automatically, externals or submodules should be able to solve this problem. If A and B have no clear dependency direction, a top level repo might help.


> a top level repo might help.

This is the way I generally do it: A repository that represents my system/environment that has submodules for A1, A2, B1, and B2, and scripts for updating the environment.
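Set up against throwaway local repos, that layout looks roughly like the following (project names made up; recent git needs `protocol.file.allow=always` for local-path submodules):

```shell
set -e
# Two independent component repos standing in for A and B:
tmp=$(mktemp -d)
cd "$tmp"
for p in libA appB; do
  git init -q "$p"
  ( cd "$p" \
    && echo "$p" > README \
    && git add . \
    && git -c user.email=x@x -c user.name=x commit -qm init )
done

# The "system" repo pins each component at an exact commit:
git init -q system
cd system
git -c protocol.file.allow=always submodule add -q "$tmp/libA" libA
git -c protocol.file.allow=always submodule add -q "$tmp/appB" appB
git -c user.email=x@x -c user.name=x commit -qm "Pin libA and appB"
git submodule status   # each line records the pinned commit of one component
```

Bumping the system to a new combination of component versions is then itself a single, reviewable commit in the top-level repo.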


Not by any definition I can think of. How are you defining "monorepo"?


I will get a flurry of downvotes for saying it, but the cause of your first three problems is git, not the mono repo approach.

Git only allows checking out/committing/viewing the entire repo at once. Also, some git operations are superlinear in the number of files or revisions; they are slow on a large repo to the point of being unusable.

It's mandatory to have operations at a per-file or per-subdirectory level in a mono repo approach. Companies that have mono repos all built tooling to support it. CVS/SVN used to do that out of the box, but everyone hates them now.


> CSV/SVN used to do that out of the box but everyone hate them now.

It's "CVS", and it lacks a concept of a repository-wide version (except, maybe, a timestamp). A repository-wide version is - I guess - the single best reason to have a monorepo in the first place.

SVN is okay.

Also, `git log -- subdir/`.
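A self-contained demonstration of those path-scoped operations (directory names made up):

```shell
set -e
# Toy repo with two service directories:
tmp=$(mktemp -d)
git init -q "$tmp/r"
cd "$tmp/r"
mkdir -p services/auth services/billing
echo a > services/auth/a.txt
echo b > services/billing/b.txt
git add .
git -c user.email=x@x -c user.name=x commit -qm "initial import"

# A second commit that touches only one subtree:
echo a2 > services/auth/a.txt
git -c user.email=x@x -c user.name=x commit -qam "change auth only"

git log --oneline -- services/auth/      # both commits touched this path
git log --oneline -- services/billing/   # only the initial import
```

Path-limited log, diff, and checkout work fine in stock git; what it lacks at scale is a partial *clone/checkout* story of the kind the big monorepo VCSes were built for (sparse checkout narrows this gap in modern git).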


Why aren't submodules more widespread?

Building a monorepo out of submodules should solve these problems, shouldn't it?

Also, there is subtree.


Because submodules are hard to understand. They require that you understand git and that you can add one more level of abstraction.

Once you've done that, they are great!


Also hard to deal with. Lots of operations leave submodules in unclean / out-of-date states.

In general / light use, yeah, they're great. Unfortunately, they have a very large number of edge cases where they essentially require either a) everyone to be experts in the edge cases, or b) tons of new tooling (because existing tools won't take these steps for you).


That is... until you try to remove a submodule.

I like submodules, but the whole feature needs a lot more polish before it will be widely adopted.


That's fair. I think the important thing is to be able to have multiple versions of portions of your repo and then be able to version that. A branch of branches, so to speak.


> Every day there are hundreds of changes in the repo and almost all of them have nothing to do with what I'm working on.

Wouldn't the right tooling be able to show you changes to the slice of code you're interested in? I remember SVN would allow you to checkout just a single subdirectory, for example.

> all sorts of operations take longer (pulling, grepping source, etc.) to support code I couldn't care less about.

Makes sense about the pulling, but again, wouldn't grepping be configurable to only search where you need to?

> Frequently have to update the world at once. Unless the repo can store multiple versions of the same module, then all the consumers have to be updated at once, even if it's inconvenient. Sometimes migrations are better done gradually.

I'm not sure I understood you here, can you expand on that? Do you mean all the devs have to update their module? Why not use tagged/branched version of libs instead of working off trunk?

What kind of tooling do you use for the distributed approach?


Yeah, it's all about the tooling. The distributed system I used was when I worked at Amazon. Each "package" there has its own git repo, which can have multiple branches and the dependencies of each branch are versioned along with the branch. I wonder if at the end of the day it really matters whether something is a "monorepo" or not if the tooling provides the necessary abstractions to version things the way you need.


Did you use standard software for the versioning? We are looking for a solution for this at work and have come up blank so far.


Using different branches for different modules becomes hard in a monorepo when branching is a global operation. You can only have a single branch checked out in your working copy then.

The only solution I can think of is to create a copy within the monorepo to create de facto branches without regular VCS support. This would be kind of terrible.
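One partial mitigation in stock git is worktrees, which allow several working copies of the same repository on different branches at once (paths and branch names made up):

```shell
set -e
# Toy monorepo with one commit:
tmp=$(mktemp -d)
git init -q "$tmp/mono"
cd "$tmp/mono"
echo hello > file.txt
git add .
git -c user.email=x@x -c user.name=x commit -qm init

# Check out a second branch in a sibling directory, sharing the same repo:
git branch module-x-rewrite
git worktree add ../mono-module-x module-x-rewrite
git worktree list   # two checkouts of one repository, on different branches
```

This doesn't give per-module branching semantics, but it does remove the "only one branch checked out at a time" constraint without copying directories inside the repo.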


I am a firm believer in using processes and tools that make it hard to do stuff you should not be doing in the first place. Dependencies should be solved the way FAKE and Paket do it, not with a mono repo. It's the same story as many projects in a solution vs. few. With few projects you avoid wrestling with cyclic dependencies between projects; on the other hand, that is just a telltale sign that the overall structure is starting to deteriorate.

I still do not see the appeal of mono repos, since you're heavily dependent on discipline to not introduce spaghetti dependencies, where you fix one bug but introduce 4 new ones in unrelated parts of the code. Then you solve that with an if statement, and thus we've introduced a great deal of technical debt.


> 3) Frequently have to update the world at once

I’ve never worked in a monorepo, so may be wrong, but this point presumes a dependency on “latest” at all times. I’d assume the components in the mono repo still release versions to the various package management systems (maven, pypi, npm, etc), allowing dependencies to be more stable.

Has that not been the common experience of those that have worked in them? I see a lot of merit in having to update everything at once (less code rot, hopefully) but it does seem to have drawbacks (many have commented on these as well).


If the library/component is widespread, it can either be developed in a branch with specific versions tagged, or always developed in "trunk" mode with feature toggles (not always possible, but one can adjust) - e.g. certain features are disabled, and only need to be enabled after other pieces catch up, or after some specific time, etc.

While at google, we used that kind of development for the project I was in. Someone would push source code changes for new features, but preferably behind a flag (normally a command-line flag, driven by a configuration, like the one ksonnet has). The configuration file would say: enable this flag only if the binary was compiled at this CL version, and/or with these cherrypicks, or some other rule.

This also allows a feature to be quickly disabled by SRE, SWE, or other personnel if it's found to be not working well.
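As a toy sketch of that pattern (the config format, flag name, and version check are all invented here, not Google's actual tooling): a launcher passes a feature flag only to binaries built at or after the CL that supports it.

```shell
set -e
cd "$(mktemp -d)"

# Invented config: each line is "<flag> min_build=<CL>".
cat > flags.conf <<'EOF'
enable_new_ranker min_build=12345
EOF

BUILD_VERSION=12400   # stand-in for the CL the binary was built at
flag_args=""
while read -r flag cond; do
  min=${cond#min_build=}
  # Pass the flag only if this binary is new enough to support it.
  if [ "$BUILD_VERSION" -ge "$min" ]; then
    flag_args="$flag_args --$flag"
  fi
done < flags.conf

echo "would run: server$flag_args" | tee cmdline.txt
```

Disabling the feature is then just flipping a config line, with no rebuild or rollback needed.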


Both approaches can be made to work. For me, the overriding concern is simplicity and ease of configuration management, so I prefer something that on the surface looks like a monorepo. Somewhat paradoxically, in my attempt to solve the issues that you mention, I ended up scripting my commits and checkouts so I could place the repository in a set of git repositories -- so I have a distributed set of repositories under the hood that look like a monorepo to the people using it! Neat, huh?


(argh, I wasn't able to post this due to HN's over-eager 'submitting too fast' filter; at the time, I'd submitted all of two comments & one story within an hour — and then I had to restart my browser, which lost the text of this comment)

> 1) Difficult to track changes to the code I'm interested in.

What's wrong with 'git log $PATH'?
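For the curious, a quick demonstration in a scratch repo (paths and messages made up):

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@t \
       GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@t
git init -q
mkdir mine theirs
echo a > mine/a.txt;   git add .; git commit -qm "change to code I care about"
echo b > theirs/b.txt; git add .; git commit -qm "unrelated churn elsewhere"
# Only the commit touching mine/ is listed:
git log --oneline -- mine/
```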

> 2) all sorts of operations take longer (pulling, grepping source, etc.) to support code I couldn't care less about.

A different format could help here, as can different tools (e.g. ripgrep or ag instead of grep). The time spent on those operations has to be balanced with the time spent updating your code to deal with someone else's incompatible library changes, again, when the other person is on vacation and you have no idea what the new philosophy of his library is. And you don't have any choice about updating, because another one of your dependencies that you really must update has already been updated to rely on his changes.

> 3) Frequently have to update the world at once.

IMHO that's a feature, not a bug. The person or team responsible for breaking the world is responsible for fixing it, rather than getting to break the world, then pop off down to Barton-on-Sea for an extended holiday while everyone else in the company gets to update his code to use an entirely different idiom.

> 4) Encourages sloppy dependency management.

My experience has been that multiple repos tend to encourage sloppy dependency management, while a monorepo tends to encourage deliberative, collaborative, professional dependency management. That's just my own experience, and of course different organisations will differ.

> I'm sure people will say "if you're having those problems, you're doing it wrong" but the same thing could be said to people who find the distributed model problematic.

My own experience has been that multirepos tend to be like dynamic typing and monorepos tend to be like static typing: multirepos can in theory be done right, but in practice they never are, while monorepos work, but at the cost of people having to colour within the lines. Which makes sense for any particular organisation may actually be a function of its maturity: if a place is trying to move fast and break things, maybe multiple repos make sense; if it's trying to deliver quality software, maybe a single repo makes sense.


You could argue that all your points can be solved also by the right tooling.


1 and 2 are trivially fixable.

3 and 4 are pretty fundamental though, especially 3 - if you don't want to force everybody to keep up with head, you probably don't want to use a monorepo.


#1 in git is `git log <subdirectory>`


Large-scale refactorings are actually pleasant.

My team owns a framework and set of libraries that are widely used within the Google monorepo. We confidently forward-update user code and prune deprecated APIs with relative ease — with benefits of doing it staged or all-at-once atomically.

It's imperfect, but maintenance in distributed repositories is infinitely worse. Still, I remember the earlier days of the monorepo and keeping Perforce client file maps; that was a pain! https://www.perforce.com/perforce/r15.1/manuals/dvcs/_specif...


I attended a talk by one of the Google Guava (Java collections library) authors and he told us how they didn't have to worry about maintaining backward compatibility at all. When they made a breaking change they could check out all of the impacted Java code across Google, refactor it, verify that the tests still passed, and then commit everything in one shot. It's easy to understand the productivity advantages.


One challenge is latency in the generation of the codebase's identifier and callgraph search index (cf. Code Search and Kythe). We can run global tests across the entire monorepo, but that takes time. What happens if someone introduces a new usage of the old API immediately before our atomic refactoring, and what about pathological tests or flakes? This still necessitates doing some cleanups multi-stage: (1) mark the old API as deprecated (optionally announce it), (2) replace and delete legacy usages, and (3) delete the final trailing usages sometime soon thereafter, once the codebase has been reindexed.

Some languages and ecosystems are more tolerant of this problem than others. That said, incremental cleanup still has an advantage when bisecting regressions.

As I said, it is not perfect, but making broad-based changes quickly is relatively easy.

In my time maintaining open source, I never had these luxuries, which is why I said the monorepo is infinitely easier. Another consequence: if global cleanups are easy, perhaps that reduces the barrier to experimentation. Perfect is no longer the enemy of the good and the good enough. For me, I felt in open source where I had zero control over dependent code and its callgraph, the reverse was true: hesitance to publish something for fear of cost.


Facebook gave a conference talk not long ago about doing the same thing.

Interestingly, they only do that for Java code. Java has good analysis and refactoring tools.


>Interestingly, they only do that for java code.

Not sure what you mean. I work at Facebook, and can confirm we keep all code in a monorepo (or, rather, one of two big monorepos) rather than just Java code.

This lets us easily do React API changes: we can deprecate an API internally, and update all JS code that references the old APIs in a single commit.


I meant the refactoring. I've only ever seen it work on Java.

Other languages are much harder to process.


> Interestingly, they only do that for java code. Java has good analysis and refactoring tools

I can definitely see this.


You can do that with a many-repo, provided you have the right tooling. In fact, I'd argue Google's advantage is in the tooling they built around the repo, not the monorepo itself. E.g how fast you can find all your dependents in the whole repo.


Not really, as there isn't the ability to atomically commit your changes. With 70000+ full-time employees code is getting checked in all the time. Atomicity is extremely valuable.


It's like the people commenting about this forget what distributed means. You can have multiple repos, but you can still have a gateway / "source of truth" repo. You can run tests and whatever else on it just like you do in a "monorepo." The power behind Google's/Facebook's choice isn't and will never be the monorepo. It's specifically the tooling they built around their choice.


You can. You have many repos and one top level repository with submodules.

Edit: And although this is a multi step process, it still allows you to de-couple modules and work on them separately.
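A rough sketch of that layout with git submodules (the `protocol.file.allow` override is only needed because recent git versions block local-path submodules by default):

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@t \
       GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@t

# A child repo, developed independently.
git init -q child
(cd child && echo lib > lib.txt && git add . && git commit -qm "child code")

# The top-level "source of truth" repo pins child at an exact commit.
git init -q top
cd top
git commit -q --allow-empty -m "root"
git -c protocol.file.allow=always submodule add -q ../child child
git commit -qm "pin child"
git submodule status    # prints the pinned sha next to 'child'
```

Bumping the pin is an ordinary commit in the top repo, which gives you one place to record "these versions go together".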


The 70,000+ number you cited includes engineers and non-engineers but surely, the actual number of engineers that need commit access will be much lower.


Infinitely worse? That's a bit hyperbolic.


I agree, though I wouldn't say this is a fundamental aspect to distributed systems, mostly just a consequence of git being built with terrible merge and merge conflict tooling.


What changes can you do that you couldn't have done before, in a multirepo world? And I do mean could not have done - clearly the monorepo enjoys a few hundreds of thousands (millions?) of hours of effort that the multirepo did not.

i.e. what stops your current tools from `for each repo, run...`, or how is monorepo fundamentally more capable than building automated library management / releases / etc with the same level of tooling?


With a single commit you can change an API and all its users, run all the tests for all the dependent projects, etc., and you're done. All in a day's work, and no emails/communication necessary.

In a multi-repo world, people are probably linking against old revisions of your library, and against certain tags/branches etc. There is probably no overarching code search to find all users of the API. You're gonna have to grep the code and hope to find all uses. You might miss some repos/branches. Everyone has their own continuous integration/testing procedures, so you can't easily migrate their code for them. You're gonna have to support both APIs for probably months until you have persuaded every other user to upgrade to the latest 'release' of your code which supports the new API, before finally turning off the old API. The work involved in the migration is spread amongst all the project owners, which is probably much less efficient.

As others have said, it's the fully integrated version consistent codesearch with clickable xrefs across gigabytes of source code, cross repo code review, cross-repo testing, etc. which really makes a monorepo work well.


(edit: shortened dramatically. apologies, earlier wasn't all that useful.)

With the exception of cross-repo code review (I hadn't thought of that one - would be useful for multi-repo too, but I've honestly never seen a multi-repo tool for this, thanks!), this is all just the benefits of standardization, plus a massive injection of tooling enabled by the standards.

Standardization of projects brings huge benefits when it's done right, absolutely agreed. But that's entirely orthogonal to mono vs multi.


Not really. The point is that you can't have the same level of standardization in a multi repo set up.


    for repo in repos/*
    do
      thing
    done

?


Diamond dependency like issues crop up.

Imagine I have Repos A,B,C. A is a base repo. B and C depend on A, and C also depends on B. If I modify some API in A, and also update all the callsites in B and C, I also have to bump the version of A depended on by B and C, and also bump the version of B that C depends on, otherwise I'll get version mismatch/api compatibility breakages.

To make this work that means that nothing can depend on latest, everything has to have frozen dependencies, and you either need to manually, or via some system, globally track all of the dependencies across repos, and atomically update all of them on every breaking change.

In other words, you reinvent blaze/bazel at the repo level instead of the target level, and you have to add an additional tool that makes sure your dependencies can never get mismatched.

The monorepo sidesteps this issue by saying "everything must always build against latest".
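To make the bookkeeping concrete, here is the cascade as a toy filesystem sketch (the `VERSION`/`DEPS` file format is invented):

```shell
set -e
cd "$(mktemp -d)"
mkdir A B C
echo "version: 1"         > A/VERSION
echo "A: 1"               > B/DEPS     # B pins A at v1
printf 'A: 1\nB: 1\n'     > C/DEPS     # C pins both A and B

# A ships a breaking change and releases v2...
echo "version: 2" > A/VERSION

# ...which forces an ordered cascade of updates across the other repos:
sed -i.bak 's/^A: .*/A: 2/' B/DEPS                          # 1) bump A's pin in B
echo "version: 2" > B/VERSION                               # 2) release a fixed B
sed -i.bak -e 's/^A: .*/A: 2/' -e 's/^B: .*/B: 2/' C/DEPS   # 3) bump both pins in C
cat C/DEPS
```

One breaking change, three coordinated commits across three repos; in a monorepo all of that collapses into one atomic change.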


"everything must always build against latest" is perfectly enforceable on multirepo too, it's just that nobody does it.


>"everything must always build against latest" is perfectly enforceable on multirepo too, it's just that nobody does it

No, you cannot. That's my entire point. Here's a minimal example:

Repo one contains one file, provider.py:

    def five():
        return 5
Repo two contains one file, consumer.py.

    import provider  # assume path magic makes this work
    def test_five_is_produced():
        assert provider.five() == 5

    if __name__ == '__main__':
        test_five_is_produced()
I also have an external build script that copies provider and consumer from origin/master/HEAD into the same directory and runs `python consumer.py`.

Now I want to change `five` to actually be `number`, such that `number(n) == n`, i.e. I really want a more generic impl. What sequence of changes can I commit such that the tests will always pass, at any point in time?

There is no way to atomically update both provider and consumer. There will be some period of time, perhaps only milliseconds, but some period of time, at which point I can run my build script and it will pick up incompatible versions of the two files.

This is a reductive example, but the function `five` in this case takes the role of a more complex API of some kind.


or you give your CI the ability to read transaction markers in your git repo. e.g. add a tag that says "must have [repo] at [sha]+". dependency management basically. you can even do this after the commits are created, so you can allow cycles and not just diamonds.

but yes, cross-project commits are dramatically easier in a monorepo, I entirely agree with that - they essentially come "for free".
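A minimal sketch of such a marker (the `Requires:` trailer convention here is invented, not a standard git feature):

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@t \
       GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@t
git init -q

# The commit carries a machine-readable cross-repo requirement.
git commit -q --allow-empty -m "switch five() to number()

Requires: consumer-repo@4f2a9c1"

# CI side: read the trailer and pin the sibling repo before building.
git log -1 --format=%B | sed -n 's/^Requires: *//p' > pin.txt
cat pin.txt    # consumer-repo@4f2a9c1
```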


Didn't you just reinvent versioning and frozen dependencies? What you described is not always building at latest, it's building at latest except when there are issues at which point you don't build at latest and instead build at a known good version.

Consequences of this are, for example, that you cannot run all affected tests at every commit.


sure. I honestly don't see why that's a problem though, especially since "at every commit" can have clear markers for if it's expected to be buildable or not.

My point here is that you're describing a known problem with known solutions, and saying it's impossible. I'm saying it requires work, as does all this in a monorepo.

edit: to be technical: yes, you're correct, it can't always build at latest at every instant. Agreed. I don't see why that's necessary though. Simplifying, sure; necessary? No.


>sure. I honestly don't see why that's a problem though, especially since "at every commit" can have clear markers for if it's expected to be buildable or not.

The value from this is the ability to always know exactly which thing caused which problem. If you know things are broken now, you can bisect from the last known good state, and find the change that introduced a breakage. With multi-repo, you can't do that, since it's not always a single change that introduces a breakage, but a combination.

Ensuring that everything always builds at latest allows you to do a bunch of really cool magical bisection tricks. If you don't have that, you can't bisect to find breakages or regressions, because your "bisection" is

    1. now 2 dimensional instead of 1
    2. may/will have many false positives
That puts you in a really rough spot when there's a breakage and you don't have the institutional knowledge to know what broke it.
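For contrast, here is the one-dimensional case that an always-buildable history preserves: `git bisect` mechanically finding the first bad commit in a scratch repo (the "regression" is simulated by the file's contents).

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@t \
       GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@t
git init -q
for i in 1 2 3 4 5; do
  echo "$i" > n; git add n; git commit -qm "commit $i"
done

# Pretend commit 4 introduced the regression (n >= 4 is "bad").
git bisect start HEAD HEAD~4 > /dev/null
git bisect run sh -c 'test "$(cat n)" -lt 4' > bisect.txt
git bisect reset > /dev/null
grep 'first bad commit' bisect.txt
```

With two repos and no "always builds at HEAD" guarantee, there is no single ordered history to run this search over.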


No, you're back to "we can't build HEAD in a multirepo", which is fixable with CI rules. If you can, you can bisect exactly the same (well, with a fairly simple addition to bisect by time. `git bisect` is pretty simple, shouldn't be hard to recreate).

In any case, unless you have atomic deploys across all services, this is generally untrue. Bisecting commit history won't give you that any more in a monorepo than in a multirepo.


To your first point, I'mma need you to explain how you bisect across a poset, because that's what you just claimed you could do.

To your second point, nothing I've said has anything to do with deployment. We're still entirely in the realm of continuous integration.


How many people work on the repository tooling at Google?

I'm asking because i wouldn't know how to setup a mono repository at my 50 people Startup even if we deemed this to be necessary.


> I'm asking because i wouldn't know how to setup a mono repository at my 50 people Startup even if we deemed this to be necessary.

Sorry if this is a really dumb question. If you only have 50 people I'm assuming your codebase isn't that big, so why can't you just make a repo, make a folder for each of your existing repos, and put the code for those existing repos into the new repo?

I imagine there's a way to do it so that your history remains intact as well.


Yes, there is. Move the entire content of each repo to a directory and then force-merge them all in a single repo. I did this a few years ago with 4 small mercurials repositories that belonged together.
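Roughly like this (a sketch using `git merge --allow-unrelated-histories`, available since git 2.9; repo names invented):

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=t GIT_AUTHOR_EMAIL=t@t \
       GIT_COMMITTER_NAME=t GIT_COMMITTER_EMAIL=t@t

# Stand-ins for the existing repos.
for r in liba libb; do
  git init -q "$r"
  ( cd "$r"
    echo "code for $r" > main.txt
    git add . && git commit -qm "$r: initial code"
    mkdir "$r" && git mv main.txt "$r"/   # move content under a subdirectory
    git commit -qm "$r: move under $r/" )
done

# The new monorepo: merge each repo's full history in.
git init -q mono && cd mono
git commit -q --allow-empty -m "monorepo root"
for r in liba libb; do
  git fetch -q "../$r" HEAD
  git merge -q --allow-unrelated-histories -m "import $r" FETCH_HEAD
done

ls                  # shows liba and libb side by side
git log --oneline   # includes each repo's original commits
```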


You can do that easily with a subtree


For a 50 people startup, a Git repository will be usually enough. At my previous company we managed to do the monorepo approach easily with similar amount of people and GitHub.


Google's mono-repo is interesting though, in that you can check out individual directories without having to check out the entire repo. It's very different from checking out a bajillion-line git repo.


It's kind of interesting that nowadays people assume that version control system == git.

For a huge, non-open codebase there are some pretty large downsides to a fully distributed VCS in exchange for relatively few benefits.


Good point.

It's important to stress that Google uses Perforce and not git (at least for that monorepo, they use git/gerrit for Android).

A monorepo this size would simply not scale on git, at least not without huge amounts of hacks (and to be fair, Google built an entire infrastructure on top of Perforce to make their monorepo work).


Google doesn't use perforce anymore. It's been replaced with Piper, you can read about it in articles from about 2015 or so. Perforce didn't scale enough. I guess it's not clear to what extent Piper is a layer of infrastructure on top of perforce or actually a complete rewrite? I was never super sure. The articles appear to imply way more than a layer on top...

You are exactly right that git doesn't scale though; go see the posts on git that Facebook's engineers made while trying, only to be met with replies to the effect of "you're holding it wrong, go away, no massive monorepo here", at which point they made it work with Mercurial instead. Good read though, lots of good technical details. Can't find the link at the moment though :(, but it was from somewhere around 2012-13 ish.

Edit: here, looks like the original thread is deleted but here's the hn pointer: https://news.ycombinator.com/item?id=3548824


There's nothing wrong with saying "you're holding it wrong" if they're holding it in a way clearly contrary to the solution design. I don't fit in a toddler's car seat and if I tried, it's clearly my fault and not the seat engineer's. I doubt they'd want to accept my changes that would make it work worse for toddlers either.


Sure, if you don't care about people actually using your stuff you can ignore their requests. But Facebook and Google are now working on Mercurial rather than git, and Mercurial actually cares about ease of use (whereas git seems to revel in its obtuseness) and the Mercurial folks are looking at rewriting it, or parts of it in Rust to improve performance, which has always been the major issue.

If all those things continue I think the only reason to use git over hg would be github. How long until they decide to support Mercurial too and people abandon git?


> Sure, if you don't care about people actually using your stuff you can ignore their requests.

Yes. End of story. People will abandon things that don't support them for things that do, and those that want to continue using something that fits their application will do so. Nothing to see here; we get it, you don't like git -- don't use it if it doesn't fit your needs. However, don't expect those who do like it to go out of their way to please you. Just because there is a community developed around something and that something is open source does not mean they are required to accept whatever patches come their way -- often the best projects know what to keep out as much as what to let in. In this case, the git community has decided it doesn't want to do those things; more power to them.


>> Sure, if you don't care about people actually using your stuff you can ignore their requests.

I think you nailed the problem with Git here: it was created by one guy to support his pet project and as long as it works well for him all the other feature requests are low priority.


Agree completely, git is just not the tool for the job, the original thread (which I still can't find, gah), makes that pretty clear.




Mercurial (with lots of extensions) sits on top of Piper at Google. It doesn't replace it.


I thought it was Facebook that did the mercurial thing: https://code.facebook.com/posts/218678814984400/scaling-merc...


Actually that says they are working on improving Mercurial to the point where they can use it.


That article doesn't claim that. It only claims that mercurial is used within Google.


> monorepo this size would simply not scale on git

https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-large...


That's still not even close to Google's repository:

"The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google's entire 18-year existence. The repository contains 86 TB of data, including approximately two billion lines of code in nine million unique source files."

Source: https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...


Thanks for this quote.

It prompted me to do a quick afternoon experiment with how git would handle a billion lines of code:

https://news.ycombinator.com/item?id=15892518


As another user mentioned, many git actions scale linearly in the number of changes, not in the size of the repository. Try recreating the scaled repo, but say, in commits of 1000 lines each (ie. 200K commits), and see how long things take.


Did your experiment also do 40,000 changes per day (35 million commits, of varying sizes throughout the repo), and then see how that affects git performance? My (admittedly crappy) understanding of git is that it also scales on the commits, not just the raw file number/size count.


This is just the Windows codebase and is relatively small compared to almost the entirety of Google.


Google no longer uses perforce either. I believe it also stopped scaling. They now use Piper, which has a perforce like interface, but is not the same thing.

And there are other non perforce like Piper interfaces.


Could you please elaborate? I've only used svn and git, and the largest codebases I've worked on have only been about 150k lines of code.

What are the other ones and the main differences, really curious


Perforce is really common in a few domains because it handles 1TB+ repo sizes cleanly, has simple replication, locking of binary files and a good UI client for non-programmers.

Was pretty much used exclusively back when I was in gamedev, not sure if that's still the case.


For example, in svn, checking out only a subdirectory instead of the entire repo is pretty much the default way to use it.


git is designed from the ground up to be 100% distributed. This is useful for small and/or open source projects. It's 100% portable. You can fork and merge between different repos maintained by complete strangers.

Now, imagine you're a huge corporation. Your code consists of millions of files that have been edited millions of times. It's never going to be released to the public. It's never going to be forked, much less by a stranger. You're going to have only one main branch and main build ever, except for maintenance branches. The complete history of everything that has ever happened on that repo would take up many gigabytes, and developers are probably only ever going to need to look at and/or build locally 0.01% of that code themselves.

If you were going to design a version control system from scratch for the latter scenario, and you had never heard of git or any other existing VCS, how would you design it? Would you come up with something like git? Probably not. People would just have local copies of the minimum of what they needed to get their work done; anything else would call some server on the VPN they were always on. And you would probably come up with some specialized server architecture, with databases and such, that wasn't all that similar to the client side it would also need.


Such as? Then why has everyone switched to git? (Hint, because it is fundamentally built on more powerful ideas than what came before it)


Open source has moved to git, mostly because being standardized on one vcs made it easier for people to contribute.

A lot of companies don't use git.


No, before git the standard was svn, and before that, cvs. The switches happened even though there was an existing standard.


Having worked with all three, nothing but Stockholm syndrome would keep anyone from switching away from cvs. Likewise, the switch to git for open source happened (in my opinion) in large part because GitHub offered a far better experience than SourceForge, which was dominant at the time.


There was actually a brief period where Google Code was ascendant, but then GitHub was demonstrably investing more in collaboration.

I think one aspect of Git that is really important is forking, and having your own local commits. Merging commits and patches in svn was awful. You wouldn't ever allow someone random to join your svn repo, but if they can reasonably provide a patch, you could take it. Git makes that massively easier.


>There was actually a brief period where Google Code was ascendant,

brrrr


People switched to Git because Svn merging continues to suck and branching and tagging are implemented in the most inane way possible.


I didn't care about merging and tagging.

For me the main feature was distributed nature. SVN is OK on a gigabit corporate LAN with dedicated people to manage & maintain the servers + network. Anything less than that, and it becomes slow and unreliable.


Well clearly not "everyone" has since we're apparently talking about a company that hasn't.


While technically true due to some features of tooling, that is really only masking off part of the repo under a READ-ONLY directory.

Builds can (and usually do) depend on things that aren't part of your local checkout.

I'd say CitC is a much more accurate representation of the way Piper and blaze "expect" things to work.


The dependencies are still downloaded only on need though.


The dependencies aren't really "downloaded" at all. When you build something, the artifacts are cached locally, but the files you are editing generally speaking not actually stored on your machine. They're accessed on demand via FUSE.


This used to be done manually via "gcheckout" but that's long since been replaced. Users now don't do anything but create quick throwaway clients that have the entire repo in view.

Until very recently there was a versioning system for core libraries so those wouldn't typically be at HEAD (minimizing global breakage). Even that has been eliminated now and it's truly just the presubmit checks and code review process that keeps things sane.


> truly just the presubmit checks and code review process that keeps things sane.

also rollbacks :)


Microsoft is solving this for git with their GVFS.

Another issue with git monorepos is access control, does anyone know of good solutions for this, does GVFS solve this also?


Yeah that was what was nice about SVN. You could check out paths.


> Google's mono-repo is interesting though, in that you can check out individual directories

The same is true of svn, which many people like to bash nowadays, even in this discussion.


That's a pretty common feature in most non-DVCSes. It was nice having Perforce on the last game I worked on. The art directory was ~500GB and not fun to pull down even with a P4 proxy.


I work in a large company and I have used a central repository for six years and a distributed one for six years. I think a central repository is better. The benefits are:

1) Transparency. I can see what everybody else is doing and if somebody has an interesting project I can find it quickly. You can also learn a lot from looking at other peoples changes.

2) Faster. To check out the source code for the project I now work on takes an hour in the distributed system, while it only took 5 minutes in the centralized system.

3) Always backed up. All code that is checked into the central repository is backed up. It has happened twice that employees have left and code was lost because they only checked it in locally.

Many have only used CVS or SVN, which are horrible. I'd rather use Git or Mercurial, but Perforce is really good.


> 1) Transparency. I can see what everybody else is doing and if somebody has an interesting project I can find it quickly. You can also learn a lot from looking at other peoples changes.

This doesn't require a single central repository, just that all repositories live in a common location.

> 2) Faster. To check out the source code for the project I now work on takes an hour in the distributed system, while it only took 5 minutes in the centralized system.

What distributed repository management system do you use, and what centralized system did you use?

> 3) Always backed up. All code that is checked into the central repository is backed up. It has happened twice that employees have left and code was lost because they only checked it in locally.

As with point 1, this doesn't require a single central repository, just that all repositories live in a common location.


>1 Single repo

It's more of a matter of your tool to visualize change set history.

>2 Faster

This again is an issue with the tool quality. There needs to be meta git repos. Groups in Github and Gitlab attempt to create a shallow sense of that.

>3

Always push. That's not an issue that is resolved by a single central repo.


> This doesn't require a single central repository, just that all repositories live in a common location.

Even better, if every project includes a DOAP file (or something similar) and/or you publish commit messages using ActivityStrea.ms or something, you could easily have an interface that shows project activity around the organization, regardless of how many repositories and/or servers you use. Of course it's probably easier if all the repositories live in a common location...


I use git-svn to use a central repository. Let me list the advantages

1) Faster

There is no comparison. But let me count the ways

a) checking out stuff

It is faster than just downloading a directory using SVN.

b) just trying something out (ie. branch)

Creating a branch, making a few changes takes me seconds, and does not require me to change paths like it does for the svn victims I work with. Throwing it back out again takes seconds, and all operations are reversible for when I fuck up (which is often).

c) merging

Git's merging. Oh my God. In half the cases I just have to check stuff over, if that.

d) submitting

We use code review. Unlike most of the subversion folks I can easily have 5 co-dependent changes in flight (5 changes, each depending on the previous one) without going insane, and I have gone up to 13, not counting experimental branches. I observe around me that it takes a good developer to manage 2 with subversion. 5 is considered insane; I bet if I showed them that 13 were in flight at the same time, they'd have me taken away as a danger to humanity.

2) always backed up

Subversion doesn't back up until you commit and people don't commit anywhere near quickly enough ... The way people lose code around here 99.9% of the time is by accidentally overwriting their in-flight code contributions (the remaining 0.1% involves laptop upgrades and overenthusiastic developers. Even then cp -rp will just copy my environment and just work, and yet the same is absolutely not true for the subversion guys).

Now with Git, I commit every spelling fix I make, every semicolon I have forgotten, on occasion separately, other times with "--amend". And only then make my share of stupid mistakes, after committing, something that's technically not impossible on subversion but not practical, mostly because of code review ("just commit it" on subversion takes ~5 minutes in the very fast case (that requires a colleague dropping everything that very second, AND can't involve any actual code changes, as that trips a CI run that takes 3 minutes assuming zero contention), and 20-30 minutes is a more typical time (measured from "hey, I'd like to commit this", to actually in the repository). Committing on git takes me the time to type "<esc>! git commit % -m 'spellingfix'". The subversion commit time means that developers often go for weeks without committing. Weeks, as in plural.

I get that a git commit isn't the same thing as a subversion commit. But it does allow me to use the functionality of source control, and that's exactly what I'm looking for in a source control system. Subversion commit doesn't allow me to use source control without paying a large cost for it, that's what I'm getting at.

So I have backups guarding against the 99.9% problem (and an auto-backup script that does hourly incremental backups for the 0.1% case). The subversion guys are probably better covered for the 0.1% problem. Good for them !

3) actual version control

Git's branches, rebase, merge, etc mean I can actually work on different things within short time periods in the same codebase.

The fact that other developers are using subversion means I can have my own git hooks that I use for various automated stuff. Some fix code layout, some warn me about style mistakes, bugs, ... (you'd be surprised how much your reputation benefits from these). Some update parts of the codebase when I modify other parts. You have to be careful, as these are part of the reason subversion is so slow (esp. the insistence on CI; I hear a CI run at big G, which is required before code review can even happen, takes upwards of an hour on many projects, with some taking 8-9 hours).


The discussion wasn't really about SVN vs Git; you can have one or multiple repositories with either system.


You'd have the same problems with any other centralized versioning system the way companies use it these days (ie. with CI, and code review).


Not really. I work at google. I work on a leaf, so my CI takes < a minute. I also can send out multiple chained changes, in a tree, to multiple reviewers, and have them reviewed independently.

Certainly, CI takes a long time for certain changes, but those are changes that affect everything. You'd have the same problem in a multi-repo approach if you updated a repo that everything else depended on. At some point, you have to run all of the tests on that change.


Cool. I've wondered about Google's CI a lot, but there are a lot of horror stories online. Most people are complaining about it taking an hour for simple changes (something called "tap", I wonder what that stands for).

Chained code review changes, I refuse to believe that in Google version control (which is perforce according to Linus' git talk at Google) chained changes are easy. Branching in perforce is literally worse than SVN, it's a bit more like the old CVS model, and they've sort-of tried to get the SVN copy-directory model forced into the design afterwards. Also the tool support (merges ...) is bad compared to subversion and stone-age compared to Git's tools.

The one reason I keep hearing for using perforce is that perforce allows the administrator to "lock off" parts of the repository to certain users.

I've done branches and merges in Git, Subversion and CVS (and I've had someone talk me through one in Perforce, but I don't really know). Google's branch/merge experience is very likely to be somewhere between SVN and CVS, and those can accurately be referred to as "disaster" and "crime against human dignity". It's certainly not impossible, but it's very hard and you can't expect me to believe (normal developer) people can reasonably do that in Perforce.

Also: what would happen if you send out 20 chained commits, 10 of which are spelling corrections, 5 of which are trivial, compile-fixing bugs (forgot semicolon, "]" that should have been ")", etc ...), 2 of which are small changes to single expressions and 3 of which introduce a new function and some tests. Perforce, like subversion and cvs doesn't have any way of tracking stuff unless you commit it and you can almost never commit without CI and code review, so would you track changes like that, or would you just leave them in your client untracked until you're ready for a code review ?


>Cool. I've wondered about Google's CI a lot, but there are a lot of horror stories online. Most people are complaining about it taking an hour for simple changes (something called "tap", I wonder what that stands for).

Well, like I said, it's possible to modify things that have a lot of dependencies, at which point you run a lot of tests, but that would be true-ish anyway. Consider the hypothetical situation where you're modifying the `malloc` implementation in `/company/core/malloc.c`. Everything depends on this, because everything uses malloc. If you have a monorepo, you make this change and run (basically) every unit and integration test, and it takes a while.

Alternatively, if `core` is its own repo, you run the core unit tests, and then later, when you bump the version of `core` that everything else depends on, you run those tests too. The difference: if there's a rarely encountered issue that only certain tests exercise, in the monorepo you notice it immediately when you run all the tests, and can be sure that the malloc change is the breakage. In the multi-repo setup you only notice breakages when you update `core`, or maybe you don't notice at all, because it's only one test failing per package, and it could just be flakiness. So noticing it is harder, identifying the issue once you've decided there is one is harder, and now you need to roll back instead of just not releasing.

>Chained code review changes, I refuse to believe that in Google version control (which is perforce according to Linus' git talk at Google) chained changes are easy. Branching in perforce is literally worse than SVN, it's a bit more like the old CVS model, and they've sort-of tried to get the SVN copy-directory model forced into the design afterwards. Also the tool support (merges ...) is bad compared to subversion and stone-age compared to Git's tools.

Google no longer uses Perforce; we use Piper (note that this is a Google-developed tool called Piper, not the Perforce frontend called Piper; yes, this is confusing; afaik, Google's Piper came first). Piper is inspired by Perforce, but is not at all the same thing. (See CitC in the article.) The exact workflow I use isn't public (yet), but suffice to say that while Piper is Perforce-inspired, Perforce is not the only interface to Piper. This article even mentions a git-style frontend for Piper.

>Google's branch/merge experience is very likely to be somewhere between SVN and CVS, and those can accurately be referred to as "disaster" and "crime against human dignity". It's certainly not impossible, but it's very hard and you can't expect me to believe (normal developer) people can reasonably do that in Perforce.

Suffice to say you're totally mistaken here.

>Also: what would happen if you send out 20 chained commits, 10 of which are spelling corrections, 5 of which are trivial, compile-fixing bugs (forgot semicolon, "]" that should have been ")", etc ...), 2 of which are small changes to single expressions and 3 of which introduce a new function and some tests. Perforce, like subversion and cvs doesn't have any way of tracking stuff unless you commit it and you can almost never commit without CI and code review, so would you track changes like that, or would you just leave them in your client untracked until you're ready for a code review ?

So, Piper doesn't have a concept of "untracked". Well it does, in the sense that you have to stage files to a given change, but CitC snapshots every change in a workspace. Essentially, since CitC provides a FUSE filesystem, every write is tracked independently as a delta, and it's possible to return to any previous snapshot at any time. One way to think of this concept is that every "CL" is vaguely analogous to a squashed pull request, and every save is vaguely analogous to an anonymous commit.

This means that in extreme cases, you can do something like "oh man, I was working on a feature 2 months ago, but stopped working on it and didn't really need it, but now I do", and instead of starting from scratch, you can, with a few incantations, jump to your now-deleted client and recover files at a specific timestamp (for example: you could jump to the time that you ran a successful build or test).

>Also: what would happen if you send out 20 chained commits, 10 of which are spelling corrections, 5 of which are trivial, compile-fixing bugs (forgot semicolon, "]" that should have been ")", etc ...), 2 of which are small changes to single expressions and 3 of which introduce a new function and some tests.

I'd logically group them so that each resulting commit-set was a successfully building, isolated feature. Then, each of those would become its own CL and be sent for independent review.


I think you are confusing central/distributed with monorepo/multiple repos. Also distributed VCS doesn't imply that you don't have a central master somewhere.


perforce has such a janky ui though. whenever i try to do anything significant with my company's codebase, the whole application locks up for hours. i guess i need to learn how to use the cli.


This might not only be GUI vs cli, it can just be down to the granularity of your client mapping - if the p4 server thinks it needs to lock across large regions of depots it can go into the weeds.

I always try to have the absolute minimum in my client specs, but sometimes you do need to operate over the world.

The perforce docs are generally well written, worth looking at them.


yeah, the perforce UI is super easy to crash


The third problem can essentially be solved by doing all your production builds by checking out code from some central repository. If you follow that rule, then you guarantee you'll have the source code for every binary in production.

That way, you can still have a distributed repository (Git, Mercurial, etc.) if you want. Even if some code exists only in some developer's local repository, it's presumably not that big of a deal since that code can never have made it to production.
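A sketch of what the "build only from the central repo" rule can look like in practice (the tag and the `$CENTRAL_REPO` URL here are hypothetical):

```shell
# Production builds start from a fresh checkout of the central repo at a
# known tag, and record the exact commit the binary was built from.
git clone --branch v1.4.2 "$CENTRAL_REPO" build-src
git -C build-src rev-parse HEAD > build-manifest.txt
# ...run the real build against build-src and ship build-manifest.txt
# alongside the binary, so every artifact maps back to a commit.
```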


So as someone who previously worked for Google and now works for Facebook, it's interesting to see the differences.

When people talk about Google's monolithic repo they're talking about Google3. This excludes ChromeOS, Chrome and Android, which are all Git repos that have their own toolchains. Google3 here consists of several parts:

- The source repository itself, which is essentially Perforce. This includes code in C++, Java, Python, Javascript, Objective-C, Go and a handful of other minor languages.

- SrcFS. This allows you to check out only part of the repo and depend on the rest via read-only links to what you need.

- Blaze. Much like Bazel. This is the system that defines how to build various artifacts. All dependencies are explicit, meaning you can create a true dependency graph for any piece of code. This is super-important because of...

- Forge. Caching of built artifacts. The hit-rate on this is very good and it consumes a huge amount of resources given the number of artifacts produced. Forge turns build times for some binaries from hours (even days) into minutes or even seconds.

- ObjFS. SrcFS is for source files. ObjFS is for built artifacts.

This all leads to what is usually a pretty good workflow: the ability to check out directories if you want to modify them and just use the read-only version if you don't. You can still step through the read-only code with a debugger, however.

Now Facebook I have less experience with (<6 months) but broadly there are four repos: www, fbobjc, fbandroid and fbcode (C++, Java, Thrift services, etc). At one point these were Git but for various reasons ended up being migrated to Mercurial some years ago.

The FB case (IMHO) highlights just how useful it can be to have one repo. Google uses protobufs for platform independence. FB uses GraphQL at a client level and Thrift at the service level.

So one pain point is that, for example, you can modify a GraphQL endpoint in one repo but it's used by clients in others (i.e. mobile clients). There are lots of warnings about making backward-incompatible changes, some of them excessively pessimistic, because deterministically showing that something will break some mobile build in another repo is hard.

Google3 has fewer of these problems because the code is in the same repo. On top of that, Google has spent a vast amount of effort making it so the same build and caching systems can handle C++ server code as well as Objective-C iOS app code. If you're working on Google3 you basically compile very little to nothing locally.

Engineers on Android, Chrome and ChromeOS however compile a lot of things locally and thus get far beefier workstations.

At FB the mobile build system doesn't seem to be as advanced in that there is a far higher proportion of local building.

IIRC the Git people seemed to reject the idea of large code bases. Or, rather, their solution was to use Git submodules. There were (and maybe still are?) parts of the Git codebase that didn't scale because they were O(n). Apologies if I'm misspeaking here, but I peripherally followed these discussions on HN and elsewhere years ago as someone from the outside looking in, so I'm no authority on this.

The problem of course is that Git submodules don't give you the benefits of a single repo and I've honestly not heard anyone say anything good about Git submodules.

Just to stress, the above is just my personal experience and I hope it's taken as intended: general observations rather than complaints and definitely not arguing that one is objectively better than the other. There are simply tradeoffs.

Also, there are definite issues with Google3, like the dependency graph getting so large that even reading it in and figuring out what to build is a significant performance cost and optimization issue.


I have two main concerns when I see monorepos being used.

First, like in other areas, I see companies that want to "google scale" and blindly copy the idea of monorepos but without the requisite tooling teams or cloud computing background / infrastructure that makes this possible.

Second, I worry about the coupling between unrelated products. While I admit part of this probably comes from my more libertarian world view, I have seen something as basic as a server upgrade schedule that is tailored for one product severely hurt the development of another product, to the point of almost halting development for months. I can't imagine needing a new feature or a bug fix from a dependency but being stuck because the whole company isn't ready to upgrade.

I've read of at least one less serious case of this from google with JUnit

> In 2007, Google tried to upgrade their JUnit from 3.8.x to 4.x and struggled as there was a subtle backward incompatibility in a small percentage of their usages of it. The change-set became very large, and struggled to keep up with the rate developers were adding tests.

https://trunkbaseddevelopment.com/monorepos/#third-party-dep...


> I worry about the coupling between unrelated products.

I even worry about coupling among related products.

I could see monorepos working out well for a company that just does SaaS, and is able to get away with nice things like maintaining a single running version of the app, and continuous delivery.

Having mostly worked in companies that do shrinkwrap software or that allow different teams or clients to manage their own upgrade schedule, though, monorepo seems to me like a recipe for a codebase that is horribly resistant to change. Not just in the "big bang upgrades like JUnit4 are awful" ways described above, but also in a, "We never clean up old stuff, because most of the time when we try it breaks a bunch of other teams' code and we just nope out of that whole hassle, so barely-supported code sort of collects continuously, like dead underbrush in a forest that's never allowed to burn, until eventually it all explodes in a horrible conflagration," sort of way.

Seeing the list of things that Google keeps in a monorepo, vs things that Google keeps in Git repos, it seems like they might be thinking similarly. They've really only got a precious few products that typically run on non-Google-owned hardware, and apparently the major ones live outside the monorepo.


The dynamics actually played out very differently at Google. Because it was a monorepo with automated testing, if you didn't want other teams to break you when they change the dependencies, then you had better have a robust test suite.

Breaking changes would then lead to a discussion with your team, rather than your fruitlessly trying to binary search to find the commit that broke you.

Over time, the culture at Google became that all teams need to write tests at the unit, functional, and (usually) integration level.


>They've really only got a precious few products that typically run on non-Google-owned hardware, and apparently the major ones live outside the monorepo.

This depends on what you mean. Most/all consumer android applications don't run on google-owned hardware, but are in the monorepo.

That said, you're right that the whole "keep things up to date" thing is important. That's where tools like rosie and even bots come in.


> Seeing the list of things that Google keeps in a monorepo, vs things that Google keeps in Git repos, it seems like they might be thinking similarly. They've really only got a precious few products that typically run on non-Google-owned hardware, and apparently the major ones live outside the monorepo.

I always thought it was more that the things which take open source contributions are hosted in Git while the internal things would be hosted in Google3.


First, I agree with you about companies worrying prematurely or unnecessarily about “google scale”. You saw this a lot in the NoSQL hype days. You can go pretty darn far with even a single MySQL instance.

Second, source-level dependencies vs. binary-level dependencies is a choice and a commitment.

Even at large companies release schedules can really hinder you.

I didn’t hear about the JUnit issue but I can believe it. With code bases this large you have to get really good at static analysis (dynamic languages are your enemy here), tooling for refactoring, and just general hygiene of the code base.


For the first concern you have things totally flipped. A monorepo actually seems best suited for a small company, with a small codebase and a small set of services.

If anything, the stage of company where it makes most sense to have many small repos is when you have a large company with multiple unrelated products, services, teams, etc.


You're right that monorepos are just fine for small company. The problem is that at some point they don't scale without code review discipline and sophisticated tooling (such as Google's); thus there is a bottleneck where it becomes harder and harder to scale it until you get your tools right.


Yes, but that point is probably when you have hundreds of engineers, at which point you can afford to have (and probably will already have) a few engineers working on internal tooling.


>>> First, like in other areas, I see companies that want to "google scale" and blindly copy the idea of monorepos but without the requisite tooling teams or cloud computing background / infrastructure that makes this possible.

Mono repo will work fine for most small and medium companies without issue, even on top of git.

The need for special tooling and the performance issues will only pop up when you have millions upon millions of lines of code.


> broadly there are four repos: www, fbobjc, fbandroid and fbcode

This used to be true, but today these are all in fact the same hg repo (www as a possible exception, I'm unsure). The "sparse checkout" machinery disguises it, but for engineers working cross platform (e.g. React Native) it's routine to make commits that span platforms.


It would be really interesting to compare internal tools across both the biggest tech companies and various startups.

(All of it is kind of depressing when one's comparison is non-tech companies, though.)


I'd add that most technical companies today, unless at the scale of Google/Facebook, do not necessarily (and most likely don't) have very sophisticated tooling in place. You can imagine Jenkins is always in place and code is split into "self-contained" repos; I don't know if Google/Facebook use Jenkins, but I know Netflix certainly does.

I haven't heard much about Microsoft or Amazon, though I do know from a friend working at Apple that their tooling is not always consistent from team to team. I would appreciate it if someone from these other big tech companies could discuss their development workflow.

As an SRE/DevOps, I love working on internal tooling because I get to feel like I'm creating my own programming language - I can be creative but focus on solving problems in my domain.


> I don't know if Google uses Jenkins

Yes, Google runs hosted Jenkins internally.

[Video] https://www.youtube.com/watch?v=rJXmGGu1kf8

[Slides] https://www.cloudbees.com/sites/default/files/2016-jenkins-w...


With the caveat that Jenkins is not the main form of CI at Google -- that spot goes to TAP.


Why would it be interesting?

Startups don't need most of the things that big companies use. Trying to use them before you need them seems like an absurd waste of time.


Big, non-tech companies use completely crap tools in a lot of cases, compared to even 50 person startups. Google's a clear outlier in terms of tool quality, even for a tech giant, but I've seen some great in-house or newer tools used by a lot of startups, too. There are often huge differences in tool quality vs. task for companies of the same size/stage, too.


I respect that opinion, but it's different from my experience. My experience is that non-tech companies just don't care (i.e., we can do what we want), and that startups spend way too much time worrying about the tools they will need when they get as big as google. Instead of, you know, getting as big as google.


Why would you so willingly give up what are intended to be company secrets like that?


These are only secrets in the most tendentious sense. Half the people at Facebook worked at Google before anyway, and there are maybe a half dozen companies that could plausibly get the benefits of going Google-scale for source control. And they already have all their internal systems: if they lack all the same capabilities as Google, it's not because Google's systems are secret but because of other challenges.


Heh. Youngsters. They still think that ideas are the bottleneck. :)

Google could hand you the source code, and you still wouldn't be able to implement what they have and compete against them.

Execution is more important than almost anything else.

Occasionally you need to cough up a really clever idea. Those days are really rare, though.


Yeah, the monorepo / distributed division is almost a red herring. A key component of the division, though, is fault and responsibility in making something as relatively unimportant as the source control and build systems work well.

Google has put a lot of money and effort to make their system nice. Working in a mono repo without that much effort is very frustrating, doubly so because there's nothing individual teams can do about it. It's even worse if you can't make team-specific branches on the mono repo to try and isolate yourself from the steady stream of breaking changes elsewhere.

However if you're a team lucky enough to get out and do most things on your own git repo, then you're now the only ones responsible for making that better or worse. Fortunately there's a ton of open source to learn from and use, so taking control of your own team's destiny to get to a point better than before doesn't have to mean much work.


" Google has put a lot of money and effort to make their system nice. Working in a mono repo without that much effort is very frustrating, doubly so because there's nothing individual teams can do about it."

Sure there is. They can architect code in a way that it doesn't break heavily when other people do things. IE abstract things reasonably. They can test things well. and they can complain when other teams aren't doing either and it's making them less effective.


A refreshing comment. I feel like people often overlook the importance of execution. Though, basing one's execution off shaky ideas is not really the best either.


Because it's not really secret? Numerous articles have been written about this stuff for years and years and also it's not hard to get [G|X]ooglers to talk about this stuff candidly in casual convo.


GraphQL backwards compatibility has nothing to do with cross-repo constraints. It’s because there are millions of clients running old native code.


Is there a distributed execution backend like Forge at FB? If so, how similar is it to Forge?


>> There was (and maybe is?) parts of the Git codebase that didn't scale because they were O(n).

This is indeed true, the various efforts to upscale git (msft gvfs, etc) run into this and try to upstream things, but it is slow going.


I think that what is missing when using multiple git repos is the ability to make a code change that spans multiple projects. We're open to adding that to GitLab.


This will help if you don't want to run a monorepo, but many in the industry consider a monorepo suitable for their organisation. The biggest blocker with Gitlab in my opinion is not being able to only run a job if some specific folder was modified (https://gitlab.com/gitlab-org/gitlab-ce/issues/19232).


I agree that functionality would be great to have. We're planning it in the next 3 months, but we're also very open to someone contributing it earlier.


In places I've worked in before that needed to join concurrent changes across repositories, we would have the build system always build from a super-repo that had the other repositories tracked as submodules. Co-dependent changes were pulled in via a single commit in the super-repo.

This was kind of cumbersome to maintain TBH, and the fact that changes to different repos can be dependent on one another seems to strongly suggest that the code should be together in the same repo. Personally, I opt for mono-repos until I'm forced to change for whatever reason.
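Roughly, that super-repo arrangement looks like this (repo names and URLs invented): the super-repo pins each child repo at an exact commit, and a co-dependent change lands as a single super-repo commit that advances both pins together.

```shell
# Super-repo pins each product repo at an exact commit.
git submodule add https://example.com/repos/core core
git submodule add https://example.com/repos/service service
git commit -m "track core and service as pinned submodules"

# A co-dependent change lands as ONE super-repo commit that
# advances both pinned commits together (assuming both repos
# use 'main' as their default branch):
git -C core pull origin main
git -C service pull origin main
git commit -am "bump core+service together for cross-repo change"
```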


Is this not something that is already being worked on? Making commits across even submodules seems to have little support anywhere.


I think the monorepo is better but it's really hard to pull off without Google-like tooling and engineering practices. Git + meticulous dependency tracking and strict versioning conventions is probably the better move for most companies.


+1, I really dislike the "Let's do it because Google does it" mentality in software engineering. Google operates at a scale and encounters problems that the average company would probably never encounter.


If someone's interested, I made a summary of the paper a year back:

https://www.extreg.com/blog/2017/02/googles-ultra-large-scal...


It's a good write up.

To me, Piper is a monolithic version control system which is geared towards good engineering practices.

As far as I know there are only two such systems in use today and the other one is very dated and older than a lot of things out there.

When people say they have worked in a monolithic repo, they typically mean one repo under one of the open source version control systems, but none of these actually do or support what is needed when working with a monolithic repo AND modern/good engineering practices.

For that, a specialised VCS is required, and there are very few examples of that, none of which are open source software.

Git probably could be made to do this kind of stuff, but it would require some extensions to the DAG as well as extending its already verbose command-line set. But I think it is doable.

The question is who can do it? Most are probably under some strict NDAs.


How would they do CI/CD if it's all in one big repo?

I suppose you could do it if you had a very strict rule where absolutely everything that could affect a "unit" was inside its own directory (but never and nothing higher up than that "project root").

So you could check which sub-directory is affected within a commit and so on.
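That check can be done with plain git; a minimal sketch (assuming top-level directory == project) that a CI driver could use to decide which projects' pipelines to run:

```shell
# List the top-level directories touched by the latest commit;
# a CI script can then trigger only those projects' pipelines.
git diff --name-only HEAD~1 HEAD | cut -d/ -f1 | sort -u
```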


This is one part of a _very_ big answer, but Bazel (internally Blaze) lets you do reverse dependency querying [0] with the query language, i.e. "Bazel, give me the list of targets that depend on this target that I've just modified"

$ bazel query 'rdeps(//..., //foo/my:target)'

Of course, this query in the monorepo will take a long time or not work, because the target universe of "//..." is far too large. This is where other systems come in.

[0] https://docs.bazel.build/versions/master/query.html#rdeps


I imagine this is yet another thing that can be cached, but can you expand on 'other systems'?


I'm unsure about the confidentiality of those systems, so I'm erring on the side of not expanding.

However by deriving from first principles, yes, there is no reason to re-query the transitive closure of unchanged targets' reverse deps, so caching can happen here.


Blaze/bazel solves this. All build targets (buildable units) explicitly define their direct dependencies. If you modify something, you can then run all tests for all units that transitively depend on your change.

For surface level changes this is often quite small. For changes to core libraries, well, you run a lot of tests.


Oh, it works really well. One benefit is that all tests affected by your change are run at presubmit. Another is that everything is at head, so things like library bugs and security bugs are all handled naturally as part of new releases. This usually happens twice a week for most server binaries.


So long as the presubmits are configured correctly, and no one ignores failures and force submits. :)

It's usually quite nice though, and changes that break lots of projects are rolled back quite fast. Usually.


As much as people hate them, force submits are a fact of life.


Rollback in some cases happen automatically, even.


So, how long does the commit with the checks take?


Depends. Mobile tests that run on emulators are the worst. Unit tests finish relatively fast; integration tests that bring up servers tend to be slow (30-40 mins almost best case for the projects I'm working on). The cost of this gets amortized two ways: you can run immediate unit tests manually on the command line as the fastest signal. Then when you send the change for code review, presubmit runs. During code review you may choose to run them as you go. Eventually, when you submit, they run again.

If there have not been any changes to your commit/CL and there is an already-passing presubmit run, it will just skip the checks and submit.


What can you possibly be doing in an integration test that takes 40 minutes?


Integration tests bring up servers, and once you have tens or hundreds, this happens.


Oh wow. We have a test tenancy that's carried throughout production, so you make requests against real backends (read/write data in the test namespace, sometimes read-only production data). There's a proxy in front doing rate limiting, endpoint whitelisting, audit logging, emergency lockdown, etc. I never thought of deploying a whole separate environment just for integration testing.

Still, seems you could keep a handful of integration test environments always running? Time spent waiting your turn for one of them could well be less than time spent spinning up a whole bunch of servers.


There is an effort to make everything hermetic. Namespacing is hard, and not always possible, and touching production servers (and potentially crashing them) could cause significant revenue damage.

I don't think all tests should be hermetic - the effort to make that happen usually isn't worth the time it takes - but hey, that's what we are doing.


In a single integration test? That'd be pretty absurd.

At least in our project, each integration test has a certain amount of overhead. Some backends are fakes (when I request X, you provide Y), some are actually booted up with the test, e.g. persistence.

Multiply this across N integration tests, have lots of demand for the same CPUs, and you're up to 30-40 minutes of integration test time.

Though, that said, some integration tests can be crazy long if they have a lot of "waitFor" style conditions. "Do this, then wait for something to happen in backend Z. Once that's done, do this, and this, and this..."


>Multiply this across N integration tests,

But in theory with enough servers all the integration tests could be run in parallel. So it would only take as long as the longest single test.


The longest single test + the time it takes to setup the test environment (booting servers, etc).

Parallelizing tests has diminishing returns unless you manage to dramatically reduce the setup time.
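The diminishing returns are easy to see with a toy model (the five-minute tests, ten-minute setup, and greedy scheduler below are purely illustrative):

```python
def wall_time(test_minutes, shards, setup_minutes=10):
    """Wall-clock time to run tests on N shards, greedy longest-first.

    Every shard pays the fixed environment-setup cost before it can
    run anything, which is what caps the speedup.
    """
    loads = [0.0] * shards
    for t in sorted(test_minutes, reverse=True):
        # Assign each test to the currently least-loaded shard.
        loads[loads.index(min(loads))] += t
    return setup_minutes + max(loads)

tests = [5] * 24  # 24 five-minute tests: 120 min of serial test work

for n in (1, 4, 24):
    print(n, wall_time(tests, n))
# 1 shard: 130 min, 4 shards: 40 min, 24 shards: 15 min --
# at full parallelism, setup is two-thirds of the wall time.
```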


(Not at Google, but Twitter has its own monorepo) Generally submit queue takes 10-20 minutes, longer if you're changing a core library.

I find the larger factor is what kind of test has to run. Feature tests can take a while on CI if you're spinning up lots of embedded services (dependent services, MySQL, storage layers, etc).


What fraction of Google's code is in the monorepo these days? Android, Chrome, and ChromeOS are not, and those are certainly large projects.


I believe anything not open source, or reasonably expected to become open source.


Plenty of open source projects are in google3 (and frequently get merged out to the external repos).


Yep. Projects use Copybara to import/export OSS code. https://github.com/google/copybara


+1. Many OSS projects live in google3 (e.g. gRPC, protobufs, etc.)


I think that's an effect, not a cause. It's annoying to open source google3 code because it means untangling dependencies. Meanwhile Android has lots of SOC-specific and other development happening behind the wall. So the open source question seems orthogonal.

Google today has separate repos (android, chrome, chromeos, google3), each with its own build system: Gradle, gyp/ninja, Portage, Blaze. There's hysterical raisins, but I wonder if Google considers it to be a good thing these projects are so different, or a wart they would prefer to fix?


I'd bet pretty much everything that runs in Google data centers.


I scanned this, the usual mono/multi arguments. All made moot by BitKeeper's nested approach:

http://mcvoy.com/lm/bkdocs/nested.html

in an open source system http://bitkeeper.org

What nested brings to the table is the semantics of a mono repo with the advantages of a multi repo. The whole thing walks in lockstep, if you have a 3 week old version of the kernel and you add in the testing component (subrepo) then you get the 3 week old version of the testing component, all lined up with the same heads in the same tip commit.

I get it that git won but at least steal the ideas.

Edit: BTW, bk has a bk fast-import that usually works (it doesn't like octopus merges but other than that....)


The lazy me thinks maybe 1 kitchen sink repo is better.

Here's my problem:

I can be working on multiple projects at the same time. Each project has multiple modules (core, api, www, admin, android, etc -- I use microservices on Google App Engine). Sometimes, some modules have "feature" branches. Oh, did I tell you I work on both Desktop and Laptop?

The problem is syncing. Before traveling, I need to make sure the projects/modules I'll be working on the go all have latest commit from Desktop.

My question:

Is there some "dashboard/overview" for all Git projects? So I can quickly tell, "Ok, all projects are at latest commit, and oh I'm working on feature branch for project X and Y."


There is a dashboard called GitHub? Or what are you looking for?

I think if you want to make sure that all your changes are in, you need to do what most programmers do and learn to finish a programming session with a commit and push, just like you finish a sentence with a period. Once you're used to it, the chance of forgetting is really low.


GitHub.com doesn't tell you if you have uncommitted changes, or even unpushed commits.

(the desktop app does, but you need to check each project one by one)


Maybe 'myrepos' will do what you want. http://myrepos.branchable.com/


This is something I’ve been idly thinking about too.

I agree with regards to committing regularly, but sometimes life happens and one can forget.

I’ve been considering writing a python script that checks my local repos for uncommitted/unpushed changes and - now I think about it - perhaps also runs when I start a new terminal session just for good measure.
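A sketch of such a checker (the directory layout and output format are assumptions; it just shells out to git, so it only knows what your local remotes know):

```python
import subprocess
from pathlib import Path

def repo_status(repo: Path) -> dict:
    """Return branch name plus dirty/unpushed flags for one git repo."""
    def git(*args):
        return subprocess.run(
            ["git", "-C", str(repo), *args],
            capture_output=True, text=True
        ).stdout.strip()

    return {
        "branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        # Any uncommitted changes (staged or not)?
        "dirty": bool(git("status", "--porcelain")),
        # Any local commits not on any remote-tracking branch?
        "unpushed": bool(git("log", "--branches", "--not", "--remotes",
                             "--oneline")),
    }

def scan(root: Path):
    """Yield (path, status) for every git repo directly under root."""
    for child in sorted(root.iterdir()):
        if (child / ".git").exists():
            yield child, repo_status(child)

# Usage (e.g. from .bashrc, so it runs per terminal session):
#   for path, st in scan(Path.home() / "src"):
#       flags = [k for k in ("dirty", "unpushed") if st[k]]
#       print(f"{path.name} [{st['branch']}]: {' '.join(flags) or 'clean'}")
```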


Does anyone have "aha" moments to share about working in a single repository?

Stuff that just made you say, "Great, I wouldn't have been able to do that if it was in separate repositories!".


Once, I noticed that a minor piece of how Google's Python system tests could be very slightly cleaner and more consistent. It would be a tiny change, it seemed very safe, and it was easily accomplishable with a short sed script, but it'd also be a backward-incompatible change across the projects of hundreds of teams and thousands of build targets. I was able to make that change with only a few commits and without needing to bother most of those teams.

These sorts of small, general, large scale cleanup commits are quite common at Google, and they're encouraged. They help keep the codebase healthy. There are special groups that review them so that all of the individual teams affected don't have to bother, and there are tools to manage the additional testing and approval requirements for such a change.

At my previous company, making such a change would have been a major undertaking. I never would have considered a refactor of that scale without a critical need. They had thousands of packages, each of which had its own repository and an incredibly complex web of build and runtime dependencies. It was a nightmare, and fiddling to find a working sets of versions of internal dependencies took up way, way too much of my time each day.
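As a rough illustration of what such a mechanical, repo-wide cleanup pass looks like (the assertEquals-to-assertEqual rename below is a hypothetical example, not the actual change described above):

```python
import re
from pathlib import Path

# Hypothetical cleanup: replace a deprecated alias with its successor.
OLD = re.compile(r"\bassertEquals\(")
NEW = "assertEqual("

def cleanup(root: Path) -> "list[Path]":
    """Rewrite every .py file under root in place; return files touched.

    In a monorepo, root is the whole tree, so one pass covers every
    affected team; presubmit then runs their tests against the change.
    """
    changed = []
    for path in root.rglob("*.py"):
        text = path.read_text()
        new_text = OLD.sub(NEW, text)
        if new_text != text:
            path.write_text(new_text)
            changed.append(path)
    return changed
```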


For me, refactors were the biggest "aha" moment. On large-scale projects you can move a lot faster if you don't have to maintain backwards-compatible APIs. We use Facebook's version of a monorepo (Buck [1]) for iOS dev. It's really easy to change an API, see all of the upstream breakages, write tests, fix upstream, and submit a diff (pull request).

With a fragmented large code base you're in a world of hurt because you're dealing with versioning. There is no guarantee of when every other dependency will migrate to the latest code path.

But again, if you're in a 1-10 person team working on some trivial codebase, a monorepo might not be helpful. If you have 500 engineers working on a single codebase, tradeoffs change.

[1] - https://buckbuild.com


1. Huge changes that affect the API's of multiple sub-packages.

2. Having no friction to change anything makes you far more productive and ambitious.

3. Scripting at a org-level means you can automate things more easily and more in depth.

We run an entirely Node stack, so Lerna is what enables this in the first place. Given that, I'd never move to more than one repo if I could help it. It's almost all downside: more overhead/fragmentation, less control, more wasted time/mental overhead moving between things, and API friction that discourages ambitious change.

Only downside of monorepo is Github not supporting them well. If you want to release some sub-packages as OSS, or want to use GH to track issues you're stuck using one big repo to handle everything. I'd bet Github fixes this within the next year or so though.


I wonder what the bots commit in the monorepo. It's something that has made me curious for a while.


Skynet plans, nothing big.

Source: Googler.


Not much, mostly just prototype messaging apps.


Nothing too interesting. They're restricted so they can only update their own source code.


Except for that _one_ time. ZRH SRE has never been the same.


The clear advantage of a monorepo over mono-purpose repos is in managing technical debt. The goal is that you never build it up, because you immediately patch all references in one atomic change.

Example: let's say you introduce a breaking change in lib A that is used in libs B and C. The first problem is visibility: A's owners don't necessarily see that it is used in B and C. Second, the build should break immediately, not only when someone next builds B or C.


Tech debt isn't just breaking changes though, and a monorepo does nothing to curb all the other types of tech debt:

* Accumulation of FIXMEs

* Partial refactors cut short after a change in business reqs

* Quick hacks near release time

* "this could be done better if I had time"

* Overdue re-architecture after accumulated changes and additions

* Orphaned code

* Commented-out tests

* etc etc


Can someone further explain how the pre commit phase works? I don't get how/why "pre commits" work without feature branches?

How are my changes shared with the reviewer if there is no feature branch? Is my local code uploaded to that review tool mentioned in the article? And then what happens if the reviewer requests changes?

I probably did get this completely wrong, so thanks in advance for pushing me in the right direction.


You got it right. There is a separate set of tooling layered on top of the VCS, which maintains a sub-history of each commit. The tool (Gerrit, Phabricator, etc.) tracks this relationship between commits, and whatever is eventually merged into the repo.

This architecture assigns each line of code a nested history: the public commit log, and also the sub-history of each commit, which evolved during code review.

IMO it would be better if the code review changes were manifest in the public commit log (e.g. via feature branches), instead of being tracked separately. The code review layers add duplicative complexity.


Perforce (the origin of Piper) has a concept of changelists. Some changelists are submitted (committed) while others are not. So review works by uploading your changelist to Perforce, then pointing people at that changelist. It's like an unnamed feature branch that others can't easily base work on. Changelists do have a base, and you run "g4 sync" to essentially rebase onto the current head. Does that make sense?


When people talk about making a single global change to update all clients when they change an interface, are they updating unsubmitted changelists too?


No.

Unsubmitted changes at Google usually come in one of two flavors, short-lived (abandoned or submitted within a few days) or perpetual. The latter flavor is often for "I think we might want this". It's not uncommon for those to be completely rewritten if they're actually needed. There's usually a preference for submitting useful things (with tests!) and flag gating them to cut down on bitrot.

I have seen exceptions -- I reported a bug in a fiddly bit of epoll-related code and an engineer on my team had a multi-year-old fix -- he hadn't submitted it because he wasn't confident he'd found an actual bug. The final changelist number was more than double the original CL number (unsubmitted changes get re-numbered to fit in sequence when they're submitted -- the original number redirects to the final submitted version in our tooling).


Well, the act of submitting a changelist essentially runs a test suite which requires that the changelist has no merge conflicts with head and that the relevant tests pass. From that it follows that if someone changes an interface, you'll get either merge conflicts or test failures on your own changelist. Meaning: it's the changelist author's task to sync it up to the current head state, so the refactorers won't touch unsubmitted changelists.

It's pretty much the same as GitHub pull requests - changelists are supposed to be decently short-lived, and if the master code changes it's up to you to resolve conflicts and get back into a mergeable state.


Every CL (changelist) gets two numbers: an original (OCL) and a committed (CL) number. So CL numbers are monotonically increasing, but fewer than half are ever actually committed.

The OCL is assigned when the CL is first uploaded or sent for review; the CL number is assigned on commit.
