We shrunk our Javascript monorepo git size (jonathancreamer.com)
334 points by kwantaz 84 days ago | 213 comments



For those wondering where this new git-survey command is, it's actually not in git.git yet!

The author is using Microsoft's git fork; they added this new command just this summer: https://github.com/microsoft/git/pull/667


I assume full-name-hash and path-walk are also only in the fork (or in git HEAD)? I can't see them in the man pages or in the 2.47 changelog.


Yep. Path-walk is currently pending review here: https://lore.kernel.org/all/pull.1813.git.1728396723.gitgitg...

It more or less replaces the --full-name-hash option (again a very good cover letter that explains the differences and pros/cons of each very well!)


[flagged]


Oh for crying out loud.

"EEE" isn't a magic incantation, it's the name of an actual policy with actual tangible steps that their executives were implementing back when the CEO thought open source was the greatest threat to their business model.

Microsoft contributing to a project doesn't automatically make it EEE. For one thing, EEE was about adopting open standards in proprietary software. Microsoft during EEE didn't publish GPL code like this is.


Well, most of their extensions to VSCode are proprietary. When their dominance in software development becomes irreversible, it's obvious that they will close things down and create new sources of income. The incentives are clear.


VSCode is _their_ product. It doesn't make sense to say that they are EEEing their own product. EEE is when you take some existing open standard, support it in a proprietary product, and then extend it in proprietary ways, thereby taking over the standard. It doesn't apply to a product that you originally created.


.... Fork not an original creation

that's how effective eee is, you don't know (or likely care) where MS ripped all this code.

This is not an accident. It's the point.


Are you saying that VSCode is a fork of a non-Microsoft product? Which one?


So what? You can use VSCodium and the OpenVSX marketplace if you like, no one is stopping you. It DOES mean you won’t be able to use some extensions that are published exclusively on the VSCode marketplace but guess what? You’re not entitled to every extension being accessible from all the stores, and you’re even less entitled to demand that all extensions are open source.

If Microsoft want to develop some proprietary extensions for VSCode it’s fine, everyone has this right. It has nothing to do with EEE.


What has VS Code got to do with any of this?


Why? You do realize their fork is open source?

The fix described in this post has been submitted as a patch to the official Git project. The fix addresses a legitimate inefficiency in Git, and does nothing towards "embracing", "extending", or "extinguishing" anything.


Can you imagine their fork extending git with a feature which is incompatible with mainline git and then forcing users to switch to their fork via GitHub? I can, and it would give them the power to extinguish mainline git and force everything they want on their users (telemetry, licence agreements, online registration...). That might be the reason they're embracing git right now. The fork being open source doesn't help at all.

I'm not saying this shouldn't be merged, but I think people should be aware and see the early signs.


There is no fork. It's some new stuff they're working on and have sent patches to upstream git for (and will presumably get merged in due time – or at least, it's certainly written with the intent to get merged upstream).

https://lore.kernel.org/git/7d43a1634bbe2d2efa96a806e3de1f1f...


Sure, I can imagine. But this isn't what's happening.


you are in the early extend phase.

it will look good, until the extensions get more and more proprietary, but absurdly useful.


The extend phase starts when they make extensions which only work in their proprietary version. Putting extensive work into contributing them back is not the same.


Ok. There are a dozen examples of exactly this behaviour, and exactly this argumentation in response over the years.

Right now the most important thing for them is for people to start thinking the microsoft fork is the superior one, even if things are “backported”.


I note the conspicuous lack of examples, and it’s irrelevant in this case where they are working to get the changes merged upstream exactly the way, say, Red Hat might have something they work on for a while before it merges upstream.

VS Code is the most common example people have, but it’s not the same: that’s always been their project, so while I don’t love how some things like Pylance are not open, it’s not like those were ever promised as such, and the core product runs like a normal corporate open source project. It’s not like they forked Emacs and started breaking compatibility to prevent people from switching back and forth. One key conceptual difference is that code is inherently portable, unlike office documents 30 years ago – if VSC started charging, you could switch to any of dozens of other editors in seconds with no changes to your source code.

I would recommend thinking about your comment in the context of the boy who cried wolf. If people trot out EEE every time Microsoft does something in open source, it’s just lowering their credibility among anyone who didn’t say Micro$oft in the 90s and we’ll feel that loss when there’s a real problem.


Ok, examples:

* SMTP

* Kerberos (there was a time you could use KRB4 with Windows because AD is just krb4 with extensions: now you have to use AD).

* HTML (activex etc)

* CALDAV // CARDDAV

* Java's portability breakage

* MSN and AOL compatibility.

“oh, but it’s not the same”. It never is, which is why I didn’t want to give examples and preferred you speak to someone who knows the history, rather than rely on a tiny internet comment that is unable to convey proper context.


You understand in these cases the issue was not contributing back right?


Part of the game


Yeah… that was the issue.


No examples offered. And zero that I know of with respect to Git. This is how all open source development is done with big features - iterated on in a fork and proposed and merged in.

There are so many good things to criticize Microsoft for. When this is what people come with, it serves as a signal of emotion-based ignorance, and a reason to ignore them.


If you cry foul every time microsoft Embraces something, you'll be proven right about EEE a lot of the time.

But you'll also be wrong a lot of the time.

This is not the Extend in EEE. We might get there, and we should be generally wary of microsoft, but this doesn't show that we're already there.


Which examples?


VSCode is a prominent one that is on everyone's mind; it's starting its journey into extinguish.

For more examples I would consult your local greybeard, since the pattern is broad enough that you can reasonably argue that “this time, it's different”, which is also what you hear every single time it happens.


What is being embraced, extended and extinguished by vscode?


A lot of new and popular features in VSCode are only available in the official MS version of VSCode. Using any of the forks of VSCode thus becomes a lesser experience.

Microsoft Embraced by making VSCode free and open source. Then they Extended by using their resources to make VSCode the go-to open source IDE/editor for most use cases and languages, killing much of the development momentum for non-VSCode-based alternatives. Now they're Extinguishing the competition by making it harder and harder to use the ostensibly open source VSCode codebase to build competing tools.


From the wikipedia definition EEE goes like this:

> Embrace: Development of software substantially compatible with an Open Standard.

> Extend: Addition of features not supported by the Open Standard, creating interoperability problems.

>Extinguish: When extensions become a de facto standard because of their dominant market share, they marginalize competitors who are unable to support the new extensions.

As I see it, there is no open standard that Microsoft is rendering proprietary through VSCode. VSCode is their own product.

I see your point that VSCode may have stalled development of other open source editors, and has proprietary extensions... but I don't think EEE really fits. It's just competition.


To add to this, there are also official Microsoft extensions to VSCode which add absurdly useful capabilities behind subtle paywalls. For example, the C# extension is actually governed by the Visual Studio license terms and requires a paid VS subscription if your organization does not qualify for Visual Studio Community Edition.

I'm not totally sold on embrace-extend-extinguish here, but learning about this case was eyebrow-raising for me.


C# extension is MIT, even though vsdbg it ships with is closed-source. There's a fork that replaces it with netcoredbg which is open.

C# DevKit is however based on VS license. It builds on top of base C# extension, the core features like debugger, language server, auto-complete and auto-fixers integration, etc. are in the base extension.


It’s open-source at start, later it turns into open-core.


Is this fix evidence of that?


No, pure speculation


Not relevant, then.


Can you elaborate how exactly git is at risk here? These posts never do.


They will extend git so that it works extremely well with their proprietary products, and just average with other tools and operating systems. That's always the goal for MS.


You know who the maintainer of Git is right?


Junio Hamano. Or did you confuse git and GitHub?


This comment is downvoted, however you can be sure that managers in these corporations make these decisions deliberately - like half the time.

I find these insightful reminders. Use the vanilla free versions if the difference is negligible.


No, this is cathedral vs bazaar development


> For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.

What's up with folks in Europe that they can't clone a big repo, but others can? Also it sounds like they still won't be able to clone, until the change is implemented on the server side?

> This meant we were in many occasions just pushing the entire file again and again, which could be 10s of MBs per file in some cases, and you can imagine in a repo

The sentence seems to be cut off.

Also, the gifs are incredibly distracting while trying to read the article, and they are there even in reader mode.


> For many reasons, that's just too big, we have folks in Europe that can't even clone the repo due to it's size.

I read that as an anecdote; a more complete sentence would be "We had a story where someone from Europe couldn't clone the whole repo on his laptop for him to use on a journey across Europe because his disk was full at the time. He has since cleared up the disk and is able to clone the repo".

I don't think it points to a larger issue with Europe not being able to handle 180GB files...I surely hope so.


The European Union doesn't like it when a file gets too big and powerful. It needs to be broken apart in order to give smaller files a chance of success.


Ever since they enshrined the Unix Philosophy into law, it's been touch-and-go for monorepotic corporations.


People foolishly thought the G in GDPR stood for "general" when it's actually GIANT.


My guess is that “Europe” is being used as a proxy for “high latency, low bandwidth” – especially if the person in question uses a VPN (especially one of those terrible “SSL VPN” kludges). It’s still surprisingly common to encounter software with poor latency handling or servers with broken window scaling because most of the people who work on them are relatively close and have high bandwidth connection.


And given the way of internal corporate networks, probably also "high failure rate", not because of "the internet", but the pile of corporate infrastructure needed for auditability, logging, security access control, intrusion detection, maxed out internal links... it's amazing any of this ever functions.


Or simply how those multiply latency - I’ve seen enterprise IT dudes try to say 300ms LAN latency is good because nobody wants to troubleshoot their twisted mess of network appliances and it’s not technically down if you’re not getting an error…

(Bonus game: count the number of annual zero days they’re exposed to because each of those vendors still ships 90s-style C code)


Or high packet loss.

Every once in a while, my router used to go crazy with what seemed to be packet loss (I think a memory issue).

Normal websites would become super slow for any pc or phone in the house.

But git… git would fail to clone anything not really small.

My fix was to unplug the modem and router and plug back in. :)

It took a long time to discover the router was reporting packet loss, and that the slowness the browsers were experiencing had to do with retries, and that git just crapped out.

Eventually when git started misbehaving I restarted the router to fix.

And now I have a new router. :)


Sounds, based on other responders, like high latency high bandwidth, which is a problem many of us have trouble wrapping our heads around. Maybe complicated by packet loss.

After COVID I had to set up a compressing proxy for Artifactory and file a bug with JFrog about it because some of my coworkers with packet loss were getting request timeouts that npm didn’t handle well at all. Npm of that era didn’t bother to check bytes received versus content-length and then would cache the wrong answer. One of my many, many complaints about what total garbage npm was prior to ~8 when the refactoring work first started paying dividends.
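
To make the bytes-received versus content-length point concrete, here is a minimal Python sketch of the check a client could run before caching a download. It is not npm's code; `body_is_complete` and `cache_if_complete` are hypothetical names, and the sketch assumes `body` holds the raw bytes exactly as they came off the wire.

```python
def body_is_complete(headers: dict, body: bytes) -> bool:
    """Return True if the body length matches the declared Content-Length."""
    # Header names are case-insensitive, so normalise before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    declared = lowered.get("content-length")
    if declared is None:
        # Nothing to compare against (e.g. a chunked response).
        return True
    return len(body) == int(declared)


def cache_if_complete(cache: dict, url: str, headers: dict, body: bytes) -> bool:
    """Store the body only when it passed the length check (hypothetical cache)."""
    if not body_is_complete(headers, body):
        return False  # truncated transfer: do not poison the cache
    cache[url] = body
    return True


cache = {}
print(cache_if_complete(cache, "pkg.tgz", {"Content-Length": "5"}, b"hello"))
print(cache_if_complete(cache, "bad.tgz", {"Content-Length": "5"}, b"hel"))
```

A truncated response is rejected instead of being cached as if it were the right answer, which is the failure mode described above.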


I can actually weigh in here. I work from Australia for another team inside Microsoft with a large monorepo on Azure DevOps. I pretty much cannot do a full (unshallow) clone of our repo because Azure DevOps cloning gets nowhere close to saturating my gigabit wired connection, and eventually, due to the sheer time it takes, something will hang up on either my end or the Azure DevOps end, to the point where I just give up.

Thankfully, we do our work almost entirely in shallow clones inside codespaces so it's not a big deal. I hope the problems presented in the 1JS repo from this blog post are causing similar size blowout in our repo and can be fixed.


The repo is probably hosted on the west coast, meaning it has to cross the Atlantic whenever you clone it from Europe?


> What's up with folks in Europe that they can't clone a big repo, but others can?

They might be in a country with underdeveloped internet infrastructure, e.g. Germany))


I don’t think there’s any country in Europe with internet infrastructure as underdeveloped as the US. Most of Europe has fibre-to-the-premises, and all of Europe has consumer internet packages that are faster and cheaper than you’re gonna find anywhere in the U.S.


There's (almost) no FTTH in Germany. The US used to be as bad as Germany, but it has improved significantly and is actually pretty decent these days (though connection speed is unevenly distributed).

Both countries are behind e.g. Sweden or Russia, but Germany by a much larger margin.

There's some trickery done in official statistics (e.g. by factoring in private connections that are unavailable to consumers) to make this seem better than it is, but ask anyone who lives there and you'll be surprised.


The east has fibre everywhere, but the west is still a developing country(side). Shipping code on a truck would be faster, if you are not on some academic fibre net


upd: silly mistake - file name does not include its full path

The explanation probably got lost among all the gifs, but the last 16 chars here are different:

> was actually only checking the last 16 characters of a filename

> For example, if you changed repo/packages/foo/CHANGELOG.md, when git was getting ready to do the push, it was generating a diff against repo/packages/bar/CHANGELOG.md!


Derrick provides a better explanation in this cover letter: https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitg...

(See also the path-walk API cover letter: https://lore.kernel.org/all/pull.1786.git.1725935335.gitgitg...)

The example in the blog post isn't super clear, but Git was essentially taking all the versions of all the files in the repo, putting the last 16 bytes of the path (not filename) in a hash table, and using that to group what they expected to be different versions of the same file together for delta compression.

Indeed, the example in the blog doesn't work as written, because the common suffix of foo/CHANGELOG.md and bar/CHANGELOG.md is only 13 chars, but you have to imagine the paths have a longer common suffix. That part is fixed by the --full-name-hash option: now you compare the full path instead of just the last 16 bytes.

Then they talk about increasing the window size. That's kind of a hack to work around bad file grouping, but it's not the real fix. You're still giving terrible inputs to the compressor and working around it by consuming huge amounts of memory. So it was a bit confusing to present it as the solution. The path-walk API and/or --full-name-hash are the really interesting parts here =)
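
A toy Python sketch of the grouping problem, in case it helps. It is a heavy simplification rather than git's actual packing code: the "hash" is just the last 16 characters of the path, objects are sorted by (group key, size descending), and each object only looks back `window` entries for another version of the same file. The package names and sizes are made up.

```python
PACKAGES = ["foo", "bar", "baz"]  # hypothetical package names


def make_objects():
    """Four versions of CHANGELOG.md per package, as (path, size) pairs."""
    objects = []
    for index, name in enumerate(PACKAGES):
        for version in range(4):
            path = f"packages/{name}-utils/CHANGELOG.md"
            objects.append((path, 1000 + 10 * version + index))
    return objects


def suffix_key(path):
    return path[-16:]  # stand-in for the old "last 16 chars" grouping


def full_key(path):
    return path  # stand-in for grouping on the full path


def delta_candidates(objects, key, window):
    """Count objects that see another version of the same path in the window."""
    ordered = sorted(objects, key=lambda obj: (key(obj[0]), -obj[1]))
    hits = 0
    for i, (path, _size) in enumerate(ordered):
        previous = ordered[max(0, i - window):i]
        if any(other == path for other, _ in previous):
            hits += 1
    return hits


objs = make_objects()
print("suffix key, window 1:", delta_candidates(objs, suffix_key, 1))
print("suffix key, window 3:", delta_candidates(objs, suffix_key, 3))
print("full key,   window 1:", delta_candidates(objs, full_key, 1))
```

With the suffix key all three changelogs land in one group and interleave by size, so a small window never sees the same file twice. Widening the window recovers the matches at the cost of more comparisons, while grouping on the full path finds them with the smallest window.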


Thank you! I ended up having to look at the PR to make any sense of the blog post, but your explanation and links make things much clearer.


I'll update the post with this clarity too. Thanks!


I wish they had provided an actual explanation of what exactly was happening and skipped all the “color” in the story. By filename do they mean path? Or is it that git will just pick any file with a matching name to generate a diff? Is there any pattern to the choice of other file to use?


+1


> file name does not include its full path

No, it is the full path that's considered. Look at the commit message on the first commit in the `--full-name-hash` PR:

https://github.com/git-for-windows/git/pull/5157/commits/d5c...

Excerpt: "/CHANGELOG.json" is 15 characters, and is created by the beachball [1] tool. Only the final character of the parent directory can differntiate different versions of this file, but also only the two most-significant digits. If that character is a letter, then this is always a collision. Similar issues occur with the similar "/CHANGELOG.md" path, though there is more opportunity for differences in the parent directory.

The grouping algorithm puts less weight on each character the further it is from the right-side of the name:

  hash = (hash >> 2) + (c << 24)

The hash is 32 bits. Each 8-bit char (from the full path) in turn is added to the 8 most significant bits of the hash, after shifting any previous hash bits to the right by two bits (which is why only the final 16 chars affect the final hash). Look at what happens in practice:

https://go.dev/play/p/JQpdUGXdQs7

Here I've translated it to Go and compared the final value of "aaa/CHANGELOG.md" to "zzz/CHANGELOG.md". Plug in various values for "aaa" and "zzz" and see how little they influence the final value.
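
For anyone who would rather poke at it locally, here is a rough Python rendering of just the update rule quoted above. It is a sketch, not git's function: anything else the real implementation does is left out, `name_hash` is my own name for it, and the paths are only illustrative.

```python
def name_hash(path: str) -> int:
    """Apply hash = (hash >> 2) + (c << 24) to each byte, kept to 32 bits."""
    h = 0
    for byte in path.encode("utf-8"):
        # Older bytes slide two bits to the right; the new byte lands in the top 8 bits.
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


a = name_hash("aaa/CHANGELOG.md")
z = name_hash("zzz/CHANGELOG.md")
print(f"{a:#010x} {z:#010x} difference={abs(a - z)}")

# Hypothetical paths that share more than 16 trailing characters.
foo = name_hash("packages/foo-utils/CHANGELOG.md")
bar = name_hash("packages/bar-utils/CHANGELOG.md")
print(f"{foo:#010x} {bar:#010x} difference={abs(foo - bar)}")
```

In this sketch the leading "aaa" versus "zzz" only moves the lowest few bits, and for ASCII paths like these, characters more than 16 from the end can shift the value by at most one through a carry, so sorting on it puts the two package changelogs right next to each other.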


Sounds like it needs to be fixed to FNV1a


No, the problem isn't the hash. It does what it was designed to do. It's just that it was optimal for a particular use case that fits the Linux kernel better than Microsoft's use case. Switching the hash wouldn't improve either situation. If you want to understand this deeper, see the linked PRs.


Thanks for the deep dive!


File name doesn’t necessarily include the whole path. The last 16 characters of CHANGELOG.md is the full file name.

If we interpret it that way, that also explains why the path-walk solution solves the problem.

But if it’s really based on the last 16 characters of just the file name, not the whole path, then it feels like this problem should be a lot more common. At least in monorepos.


It did shrink Chromium’s repo quite a bit!


yes, this makes sense, thanks for pointing it out, silly confusion on my part


I was also bugged by that. I imagine that the meta variables foo and bar are at fault here, and that probably the actual package names had a common suffix like firstPkg and secondPkg. A common suffix of length three is enough in this case to get 16 chars in common as "/CHANGELOG.md" is already 13 chars long.
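
A quick Python check of that arithmetic, using the made-up package names from this comment (`common_suffix` is just a helper written for this illustration):

```python
def common_suffix(a: str, b: str) -> str:
    """Return the longest string that both a and b end with."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]


for left, right in [
    ("foo/CHANGELOG.md", "bar/CHANGELOG.md"),
    ("firstPkg/CHANGELOG.md", "secondPkg/CHANGELOG.md"),
]:
    shared = common_suffix(left, right)
    print(f"{left} vs {right}: {shared!r} ({len(shared)} chars)")
```

The foo/bar pair shares only the 13 characters of "/CHANGELOG.md", while the Pkg pair reaches the full 16.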


Sorry about the gifs. Haha. And yeah I guess my understanding wasn't quite right either reading the reply to this thread, I'll try to clean it up in the post.


I just tried this on nixpkgs (~5GB when cloned straight from Github).

The first option mentioned in the post (--window 250) reduced the size to 1.7GB. The new --path-walk option from the Microsoft git fork was less effective, resulting in 1.9GB total size.

Both of these are less than half of the initial size. Would be great if there was a way to get Github to run these, and even greater if people started hosting stuff in a way that gives them control over this ...


The article mentions Derrick Stolee, who did the digging and shipped the necessary changes. If you're interested in git internals, shrinking git clone sizes locally and in CI, etc., Derrick wrote some amazing posts on the GitHub blog:

https://github.blog/author/dstolee/

See also his website:

https://stolee.dev/

Kudos to Derrick, I learnt so much from those!


> Large blobs happens when someone accidentally checks in some binary, so, not much you can do

> Retroactively, once the file is there though, it's semi stuck in history.

Arguably, the fix for that is to run filter-branch, remove the offending binary, teach everyone to use git-lfs for binaries and get them set up, force push, and help everyone get their workstation to a good place.

Far from ideal, but better than having a large not-even-used file in git.


There's also BFG (https://rtyley.github.io/bfg-repo-cleaner/) for people like me who are scared of filter-branch.

As someone else noted, this is about small, frequently changing files, so you could remove old versions from the history to save space, and use LFS going forward.


The main issue is not a binary file that never changes. It’s the small binary file that changes often.


filter-repo is the recommended way these days:

https://github.com/newren/git-filter-repo


It’s easier to blame Linus.


Hacking Git sounds fun, but isn't there a way to just not have 2,500 packages in a monorepo?


Code line count tends to grow exponentially. The bigger the code base, the more unreasonable it is to expect people not to reinvent an existing wheel, due to ignorance of the code or fear of breaking what exists by altering it to handle your use case (ignorance of the uses of the code).

IME it takes less time to go from 100 modules to 200 than it takes to go from 50 to 100.


Yeah, have 2500 separate Git repos with all the associated overhead.


Can’t we split the packages into logical groups and maybe have 20 or 30 monorepos of 70-100 packages? I doubt that all the devs involved in that monorepo have to deal with all the 2500 packages. And I doubt that there is a circular dependency that requires all of these packages to be managed in a single monorepo.


People act like managing lots of git repos is hard, then run into monorepo problems requiring them to fix esoteric bugs in C that have been in git for a decade, all while still arguing monorepos are easy and great and managing multiple repos is complicated and hard.

It's like hammering a nail through your hand, and then buying a different hammer with a softer handle to make it hurt less.


> all while still arguing monorepos are easy and great

I don't know anyone who says monorepos are easy.

To the contrary, the tooling is precisely the hard part.

But the point is that the difficulty of the tooling is a lot less than the difficulty of managing compatibility conflicts between tons of separate repos.

Each esoteric bug in C only needs to be fixed once. Whereas your version compatibility conflict this week is going to be followed by another one next week.


At Amazon, there is no monorepo.

And the tooling to handle this is not even particularly conceptually complicated - a "versionset" is a set of versions - a set of pointers to a particular commit of a repository. When you build and deploy an application, what you're building is a versionset containing the correct versions of all its dependencies. And pull requests can span across multiple repositories.

Working at Amazon had its annoyances, but dependency management across repos was not one of them.
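
To make the "set of pointers" idea concrete, here is a minimal Python sketch. It is purely illustrative and not Amazon's tooling: the `VersionSet` class, its methods, and the repo and commit names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionSet:
    """An immutable set of (repository, commit) pins."""
    pins: frozenset

    def commit_for(self, repo: str) -> str:
        return dict(self.pins)[repo]

    def with_changes(self, changes: dict) -> "VersionSet":
        """Return a new versionset with several repos moved in one step."""
        merged = dict(self.pins)
        merged.update(changes)
        return VersionSet(frozenset(merged.items()))


base = VersionSet(frozenset({("lib-a", "a1"), ("app", "b1")}))
# A change spanning two repos becomes one new versionset; the old one is untouched.
candidate = base.with_changes({"lib-a": "a2", "app": "b2"})
print(base.commit_for("lib-a"), candidate.commit_for("lib-a"))
```

Because each versionset is immutable, a build always sees one consistent combination of commits, and rolling back is a matter of pointing at the previous set.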


> And pull requests can span across multiple repositories

This bit is doing a lot of work here.

How do you make commits atomic? Is there a central commit queue? Do you run the tests of every dependent repo? How do you track cross-repo dependencies to do that? Is there a central database? How do you manage rollbacks?


That's exactly the problem. At least tooling can solve monorepo problems. But commits that should span multiple repos have no tooling at all. Except pain. Lots of pain.


Don't forget that git was made for Linux and Linux isn't a monorepo and works great with tens of thousands of devs per release


> Linux isn't a monorepo

I assume you meant to write "is" there?


Changing 100 CI pipelines is a giant pain in the ass. The third time I split the work with two other people. The 4th time someone wrote a tool and switched to a config file in the repo. 2500 is nuts. How do you even track red builds?


This was exactly my first thought as well. This seems like an entirely self-manufactured problem.


When you have hundreds of developers you’re going to get millions of lines of code. That’s partly Parkinson’s Law, but also we have not fully perfected the three-way merge, encouraging devs to spread out more than intrinsically necessary in order to avoid tripping over each other.

If you really dig down into why we code the way we do, the “best practices” in software development, about half of them are heavily influenced by merge conflict, if not the primary cause.

If I group like functions together in a large file, then I (probably) won’t conflict with another person doing an unrelated ticket that touches the same file. But if we both add new functions at the bottom of the file, we’ll conflict. As long as one of us does the right thing everything is fine.


This is one of the interesting benefits of https://www.unison-lang.org/ . A codebase of immutable functions inherently cannot have merge conflicts.


Thanks for this post. Really interesting and a great win for OSS!

I've been watching all the recent GitMerge talks put up by GitButler and following the monorepo / scaling developments - lots of great things being put out there by Microsoft, Github, and Gitlab.

I'd like to understand this last 16 char vs full path check issue better. How does this fit in with delta compression, pack indexes, multi-pack indexes etc ... ?


> Really interesting and a great win for OSS!

Are they going to be opening a merge request to get their custom git command back in git proper then?



Nice to see that Microsoft is dog-fooding Azure DevOps. It seems that more and more Azure services only have native connectors to GitHub so I actually thought it was moving towards abandonware.


Having someone in arms reach to help out that knows the inner workings of Git so much must be a lovely perk of working on such projects at companies of this scale.


Certainly being in an org which has close ties to entities like GitHub helps, but any team in any org with that number of developers can justify the cost of bringing in a highly specialized consultant to solve an almost niche problem like this.


> we have folks in Europe that can't even clone the repo due to it's size

Officer, I'd like to report a murder committed in a side note!


They call him Linux Torvalds over there?


> We work in a very large Javascript monorepo at Microsoft we colloquially call 1JS.

I used to call it office.com.. Teams is the worst offender there. Even a website with a cryptominer on it runs faster than that junk.


We were all impressed with google docs, but office.com is way more impressive.

Collaborative editing between a web app, two mobile apps and a desktop app with 30 years of backwards compatibility, and it pretty much just works. No wonder that took a lot of JavaScript!


We use MS Teams at my company. The Word and Excel in the Windows Teams app are so buggy that I can almost never successfully open a file. It just times out and eventually shows a "please try again later" message nearly every time. I've uninstalled and reinstalled the Teams app four or five times trying to fix this.

We've totally given up any kind of collaborative document editing because it's too frustrating, or we use Notion instead, where, for all its faults, at least the basic stuff like loading a bloody file works...


This is specific to your company’s configuration - likely something related to EDR or firewall policies.


I'm the one who set it up. It's a small team of 20 people. I've done basically no setup beyond the minimum of following docs to get things running. We've had nonstop problems like this since the very start. Files don't upload, anytime I try to fix it I'm confronted with confusing error messages and cryptic things like people telling me "something related to EDR". What the hell is EDR? I just want to view a Word doc.

I've come to realize that Teams should only be used in large companies who can afford dedicated staff to manage it. But it was certainly sold to us as being easy to use and suitable for a small company.


EDR: https://en.wikipedia.org/wiki/Endpoint_detection_and_respons...

I mentioned that because security software blocking things locally or at the network level is such a common source of friction. I don’t think Teams is perfect by any means but the core functionality has been quite stable in personal use, both of my wife’s schools, and my professional use so I wouldn’t conclude that it’s hopeless and always like that.


Thank you, I appreciate the support. But this doesn't explain the intermittent nature of the issues. For example, just now I tried to open a word file. I got the error message. But then I tried several times and restarted the app twice, and eventually the file did load. It just took five+ minutes of trying over and over.

I also had to add a new user yesterday, so I went to admin.microsoft.com in Edge. 403 error. Tried Chrome and Firefox. Same. Went back to Edge and suddenly it loaded. Then, like an idiot, I refreshed: 403 error again. Another five or six refreshes and it finally loaded again and I was able to add the new user. There are never any real error messages that would help me debug anything; it's just endless frustration and slowness.


Really it's anyone using teams on older or cheaper hardware.


So you’ve tested this with clean installs on unfiltered networks? Just how old is your hardware? It works well on, say, the devices they issue students here so I’m guessing it’d have to be extremely old.


> [...] and it pretty much just works.

I beg to differ. Last time I had to use PowerPoint (granted, that was ~3 years ago), math on the slides broke when you touched it with a client that wasn't of the same type as the one that initially put it there. So you would need to use either the web app or the desktop app to edit it, but you couldn't switch between them. Since we were working on the slides with multiple people you also never knew what you had to use if someone else wrote that part initially.


could it be a font issue?


If I remember correctly I had created the math parts with the windows PowerPoint app and it was shown more or less correctly in the web app, until I double clicked on it and it completely broke; something like it being a singular element that wasn't editable at all when it should have been a longer expression, I don't remember the details. But I am pretty sure it wasn't just a font issue.


That's the thing, though, the compat story is terrible. I can't say much about the backwards one, but Microsoft has started the process of removing features from the native versions just to lower the bar for the web one catching up. Even my most Microsoft-enamoured colleagues are getting annoyed by this (and the state of all-MS things going downhill, but that's another story)


> That's the thing, though, the compat story is terrible.

It really is. With shared documents you just have to give up. If someone edits them on the web, in Teams, in the actual app or some other way like on iOS, it all goes to hell.

Pages get added or removed, images jump about, fonts change and various other horrors occur.

If you care, you’ll get ground into the earth.


To be fair, we were impressed with Google Docs 15 years ago. Not saying office.com isn't impressive, but Google Docs certainly isn't impressive today. My company still uses GSuite, as I don't like being in Microsoft's ecosystem and we don't need any advanced features of our office suite but Google Docs and the rest of the GSuite seem to be intentionally held back to technology of the early 2010's.


Google Docs certainly hasn't changed much in the last 5-10 years. I wonder if that's an intentional choice, or if it is because those who built it and understand how it works are long gone to work on other things.


Actually I did see a few long awaited improvements landing in gdocs lately (e.g. better markdown support, pageless mode).

I think they didn't deliver many new features in the early 2020s because they were busy with a big refactoring from DOM to canvas rendering [0].

[0] https://news.ycombinator.com/item?id=27129858


No more development? Time for Google to kill Google Docs!


What's impressive is that MS has such well trained customers that it can get away with extremely buggy and broken web apps. Fundamental brokenness like collaborative editing frequently losing data and thousand cuts of the more mundane bugs.


You must be kidding about "just works". There are so many bugs in word and excel that you could spend the rest of your life fixing. And the performance is disastrous.


> No wonder that took a lot of JavaScript!

To the point where they quickly found the flaws in JS for large codebases and came up with TypeScript. I think. It makes sense that TS came out of the Office for the web project.


Hey, I worked with Jonathan on 1JS a while ago (on a team, Excel).

Just a note: OMR (the Office monorepo) is a different (and actually much larger) monorepo than 1JS (which is big on its own).

To be fair I suspect a lot of the bloat in both originates from the amount of home grown tooling.


I thought Microsoft had one monorepo. Isn't that kind of the point? How many do they have?


The point of a monorepo is that all the dependencies for a suite of related products are all in a single repo, not that everything your company produces is in a single repo.


Most people use the "suite of related products" definition of monorepo, but some companies like Google and Meta have a single company-wide repository. It's unfortunate that the two distinct strategies have the same name.


Teams is the running version of that repository... It is hard for them even to store on git.


> we have folks in Europe that can't even clone the repo due to it's size.

What is it about Europe that makes it more difficult? That internet in Europe isn't as good? Actually, I have heard that some primary schools in Europe lack internet. My grandson's elementary school in rural California (population <10k) had internet as far back as 1998.


Let's pretend you didn't write the last 2 sentences...

first of all "internet in Europe" makes close to zero sense to argue about. The article just uses it as a shortcut to not start listing countries.

I live in a country where I have 10Gbps full-duplex and I pay 50$ / month, in "Europe".

The issue is that some countries have telecom lobbies which are still milking their copper networks. Then the "competition committees" in most of these countries are actually working AGAINST the benefit of the public, because they don't allow 1 single company to start offering fiber, because that would be a competition advantage. So the whole system is kinda in a deadlock. In order to unblock, at least 2 telecoms have to agree to release fiber deals together. It has happened in some countries.


What European countries still don't have fiber?

//Confused swede with 10G fiber all over the place. Writing from literally the countryside next to nowhere.


If you really need it pointed out, take it from a German neighbor: Telekom is running some extortion scheme or so here. Oh we could have gotten fiber to our house already ... if we paid them 800+ Euro! So we rather stick with our 100MBits or so connection that is not fiber but copper. If the German state does not intervene here, or the practices of ISPs and whoever has the power to build fiber changes, we will for the foreseeable future still be on copper.

Then there are villages which were promised fiber connections, but somehow switching to fiber left them with unstable Internet and often no Internet at all. Saw some documentary about that; it could be fixed by now.

Putting fiber into the ground also requires a whole lot of effort opening up roads and replacing what's there. Those costs they try to push to the consumers with their 800+ Euro extortion scheme.

But to be honest, I am also OK with my current connection. All I worry about is it being stable, no packet loss, and no ping spikes. Consistently good connection stability is more important than throughput. Sadly, I cannot buy any of those guarantees from any ISP.


FWIW, Sweden subsidized fiber digging but we still had to pay 2000 EUR to get it connected.

Government will pay the extra fees, which can easily end up close to 10000 EUR due to large distances.

If all you need to pay is 800 EUR, then I don't understand what is your issue? Just pay it.


Is 800 euros that bad? In the US, we were quoted $10k a few years back. Even if fiber is already at the road, $800 is probably a fair price just to trench the line from the road to your home and install an entry point. If they provide free installation, then they have to make up the cost by raising your rates.


I think private households paying 800 Euro for what should be public infrastructure, being milked by ISPs is pretty bad.


Germany.

Deutsche Telekom is the former monopoly that was half-privatized around 1995 or something. The state still owns quite a large stake of it.

They milk their ancient copper crap for everything they can while keeping prices high.

They are refusing useful backbone interconnects to monopolize access to their customers (Actually they are not allowed to refuse. They just offer interconnections only in their data centers in the middle of nowhere, where you need to rent their (outrageously priced) rackspace and fibres because there is nothing else. They are refusing for decades to do anything useful at the big exchanges like DECIX).

And if there should ever be a small competitor that on their own tries to lay fibre somewhere, they quickly lay their own fibre into the open ditches (they are allowed to do that) and offer just enough rebates for their former copper customers to switch to their fibre that the competitor cannot recoup the investment and goes bankrupt. Since that dance is now known to everyone, even the announcement of Telekom laying their own fibres kills the competitors' projects there. So after a competitor's announcement of fibre rollout, Telekom does the same, project dead, no fibre rollout at all.

Oh, and since it is a partially-state-owned former monopoly/ministry, the state and competition authorities turn a blind eye to all that, when not actively promoting them...

Then there is the problem of "5G reception" vs. "5G reception with usable bandwidth". A lot of overbooking goes on, many cells don't have sufficient capacity allocated, so there are reports of 4G actually being faster in many places.

And also, yes, you can get 5G in a lot of actually populated areas. But you certainly will pay through the nose for that, usually you get a low-GB amount of traffic included, so maybe a tenth of the Microsoft monorepo in question. The rest is pay-10Eur-per-GB or something.


It is almost as bad as you say, except that I recently noticed several instances of competitors offering cheaper fiber than Telekom and surviving. Still, overall fiber buildout is low, like... I looked it up, reportedly 36% now.


Wait, I live in that area. Does that mean I'm allowed to lay my own fiber into their open ditches too, or do they have special rights no one else has?


Afaik the special right is granted to everyone providing fibre services to the public to be informed about any ditches on public ground being dug and getting the opportunity to throw their fibre in before the ditch is closed again.


Germany, GP's situation smells like their policies.


I pay 42USD for 250Mbit in a larger Swedish city. What is that magic ISP I should be using?


Change landlord. I used to pay about 100 SEK for Bahnhof in Svenska Bostäder before I moved away. It came with public IP and everything.


Sounds like you are already using a magic ISP (rural USA here).


They’re probably downloading from a server in the states, being much further away makes a big difference with a massive download.


This.


I've experienced interruptions mid-clone (with no apparent way to resume them) when trying to clone repos on unreliable connections, and perhaps a similar issue is happening with connections between continents.


The only reliable route I’ve found is to use SSH clone. HTTPS is lousy and as you mention, is not resumable. Works fine in Antarctica even over our slower satellite. Doesn’t help if you actually drop, but you can clone to a remote and then rsync everything over time.


The issue is cloning a super huge repo over crappy protocols across an ocean, especially when VPNs get added to the problem.


Most European countries have connections with more bandwidth and lower base latency for cheaper than the US, it's not a connection issue. If there was an issue it's that the repo itself is hosted on the other side of the world, but even so the sidenote itself is odd.


I wouldn't say it's odd at all - it's basically what's justifying actually trying to solve the problem rather than just going "huh... that's weird..." then putting it on the backlog due to it not being a showstopper.

This sort of thing has been a problem on every project I've worked on that's involved people in America. (I'm in the UK.) Throughput is inconsistent, latency is inconsistent, and long-running downloads aren't reliable. Perhaps I'm over-simplifying, but I always figured the problem was fairly obvious: it's a lot of miles from America to Europe, west coast America especially, and a lot of them are underwater, and you're sharing the conduit with everybody else in Europe. Many ways for packets to get lost (or get held up long enough to count), and frankly it's quite surprising more of them don't.

(Usual thing for Perforce is to leave it running overnight/weekend with a retry count of 1 million. I'm not sure what you'd do with Git, though? It seems to do the whole transfer as one big non-retryable lump. There must be something though.)
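
One possible answer for Git, as a rough sketch rather than anything from the article: clone shallowly first, then deepen the history in slices, retrying each slice, so a dropped connection only costs the slice in flight. `git clone --depth`, `git fetch --deepen` and `git rev-parse --is-shallow-repository` are existing git options; the helper names, slice size and retry counts below are made up for illustration.

```python
import subprocess
import time


def git(*args, cwd=None):
    """Run a git command and report whether it succeeded."""
    return subprocess.run(["git", *args], cwd=cwd).returncode == 0


def is_shallow(repo):
    """Ask git whether the repository still has truncated history."""
    out = subprocess.run(
        ["git", "rev-parse", "--is-shallow-repository"],
        cwd=repo, capture_output=True, text=True,
    )
    return out.stdout.strip() == "true"


def with_retries(step, attempts=10, pause=5.0, sleep=time.sleep):
    """Call step() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if step():
            return True
        sleep(pause)
    return False


def clone_in_slices(url, dest, slice_size=500, max_rounds=10_000):
    """Shallow-clone, then deepen history slice by slice (hypothetical helper)."""
    if not with_retries(lambda: git("clone", "--depth=1", url, dest)):
        return False
    for _ in range(max_rounds):
        if not is_shallow(dest):
            return True  # full history has arrived
        if not with_retries(lambda: git("fetch", f"--deepen={slice_size}", cwd=dest)):
            return False
    return False
```

This doesn't help if the tip commit alone is enormous, since the first shallow clone is still one lump.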


In most EU countries we have multi-gigabit internet (for cheap too). Current offers are around ~5 GBIT speeds for 20 bucks a month.


Sadly, I'm in Germany. Which is a third world country when it comes to decent connectivity. They are rolling out some fiber now in Berlin. Finally. But very slowly and not to my building any time soon. Most of the country is limited to DSL speeds. Mobile coverage is getting better but still non existent outside of cities. Germany has borders with nine countries. Each of those have better connectivity than Germany.

I'm from the Netherlands where over 90% of households now have fiber connections, for example. Here in Berlin it's very hard to get that. They are starting to roll it out in some areas but it's taking very long and each building has to then get connected, which is up to the building owners.


> Mobile coverage is getting better but still non existent outside of cities.

According to the Bundesnetzagentur over 90% [1] of Germany has 5G coverage (and almost all of the rest has 4G [2]).

[1] https://www.bundesnetzagentur.de/SharedDocs/Pressemitteilung...

[2] https://gigabitgrundbuch.bund.de/GIGA/DE/MobilfunkMonitoring...


Those statistics are a half-truth at best.

The "coverage" they are reporting is not by area but by population. So all the villages and fields that the train or autobahn goes by won't have 5G, because they are in the other 10% because of their very low population density.

And the reporting comes out of the mobile phone operators' reports and simulations (they don't have to do actual measurements). Since their license depends on meeting a coverage goal, massive over-reporting is rampant. The biggest provider (Deutsche Telekom) is also partially state-owned, so the regulators don't look as closely...

Edit: accidentally posted this in the wrong comment: Then there is the problem of "5G reception" vs. "5G reception with usable bandwidth". A lot of overbooking goes on, many cells don't have sufficient capacity allocated, so there are reports of 4G actually being faster in many places.

And also, yes, you can get 5G in a lot of actually populated areas. But you certainly will pay through the nose for that, usually you get a low-GB amount of traffic included, so maybe a tenth of the Microsoft monorepo in question. The rest is pay-10Eur-per-GB or something.


I usually lose connectivity on train journeys across Germany. I'm offline most of the way. Even the in train wifi gets quite bad in remote areas. Because they depend on the same shitty mobile networks. There's a stark difference as soon as you cross the borders with other countries. Suddenly stuff works again. Things stop timing out.

I also deal with commercial customers that have companies in areas with either no or poor mobile connectivity and since we sell mobile apps to them, we always need to double check they actually have a good connection. One of our customers is on the edge of a city with very spotty 4G at best. I recently recommended Starlink to another company that is operating in rural areas. They were asking about offline capabilities of our app. Because they deal with poor connectivity all the time. I made the point that you can get internet anywhere you want now for a fairly reasonable price.


When I travel in Germany I use a Deutsche Telekom pay as you go SIM in a 5G hotspot, and generally get about 200Mbit throughput, which is far higher than you can expect any place you're staying to provide. It's €7 a day (or €100 a month) but it's worth it to avoid the terrible internet.


Oh, that is an incentive for them not to improve anything. Wouldn't want customers to stop purchasing mobile Internet for 100 Euro a month.


Well, good for you. On my side of Europe, I pay €50 for a cheap 50Mbps line (1 month cancellation notice period). I could get a slightly cheaper 100Mbps from a predatory provider for €20 for the first 6 months, but then it goes up to €50, they pull BS about not being able to cancel even if you move because your new location is also in their coverage area (over garbage copper), and it suffers at least 20 outages per month, while there are other providers with much cheaper rates and better service.

Some of the EU is still suffering from Telekom copper barons.


Not in the UK. Still on 80Mbit VDSL here.


You must be unlucky, according to Openreach "fibre broadband is already available in more than 96.59 per cent of the UK."


Is that "fibre" or "full fibre"?

They lied a lot for a good few years saying "OMG fibre broadband!" when in reality it was still copper for the last mile, so that "fibre" connection in reality was some ADSL variant and limited to 80/20Mbps.

Actual full fibre all the way from your home to the internet is I think still quite a way behind. Even in London (London! The capital city with high density) there are places where there are no full fibre options.


According to ThinkBroadband's tracking [1], the headline figures are 85.20% of premises are gigabit capable (FTTP/FTTH/Cable [DOCSIS]) with 71.86% being full fibre.

[1]: https://www.thinkbroadband.com/news/10343-85-gigabit-coverag...


Maybe myself and my friends are lucky as we're all on ftth


Only a few I know are on ftth. I guess I live in a fairly affluent area in Zone 3 which is lower density than average - zero flats etc, all just individual houses so perhaps not worth their effort rolling out


Coming next year apparently. I won’t hold my breath.


I and many I know have Gb fiber in the UK


At least here in Western Europe, in general the internet is great. Though coverage in rural areas varies by country.


Some countries in Europe (even Poland) definitely offer faster Internet and for cheaper than the US, and without most of the privacy issues that US ISPs have.


I was not sure what this meant either. I know personally I have downloaded and uploaded some very very large files transatlantic (e.g. syncing to cloud storage) with absolutely no issues, so not sure what they are talking about. I guess perhaps there are issues with git cloning such a large amount of data, but that is a problem with git and not the infrastructure.

FWIW every school I've seen (and I recently toured a bunch looking at them for my kids to start at) all had the internet and the kids were using iPads etc for various things.

Anecdotally my secondary school (11-18y in UK) in rural Hertfordshire was online in the 1995 region. It was via I think a 14.4 modem and there actually wasn't that much useful material for kids then to be honest. I remember looking at the "non-professional style" NASA website for instance (the current one is obviously quite fancy in comparison, but it used to be very rustic and at some obscure domain). CD-based encyclopedias were all the rage instead around that time IIRC - Encarta et al.


Effective bandwidth can be influenced by round-trip time. Fewer IPv4 addresses means more NAT, with more delay and yet another point where occasionally something can go wrong. Last but not least, there are some areas in the EU like the Canary Islands where the internet feels like going over a satellite link.
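
To put a rough number on the round-trip point: a single TCP connection can have at most one window of data in flight per round trip, so its throughput is capped at about window / RTT. A back-of-the-envelope sketch, where the 64 KiB window and the RTT figures are illustrative assumptions (real stacks scale the window well beyond this):

```python
def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound for one TCP stream: one window per round trip, in Mbit/s."""
    return window_bytes * 8 / rtt_seconds / 1_000_000


WINDOW = 64 * 1024  # assumed unscaled receive window, in bytes

# Assumed round-trip times; actual values depend on the route.
for label, rtt in [("same continent", 0.020), ("transatlantic", 0.150)]:
    print(f"{label}: at most {max_throughput_mbps(WINDOW, rtt):.1f} Mbit/s")
```

The cap is per connection and doesn't depend on how fat the pipe is, which is why one long single-stream transfer suffers with distance even on a fast line.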


The problem is probably that the repo is not hosted in Europe.


My knowledge is a bit outdated, but we used to say:

* in America, peering between ISPs is great, but the last-mile connection is terrible

* In Europe, the last-mile connection is great, but peering between the ISPs is terrible (ISPs are at war with each other). Often you could massively improve performance by renting a VPS in the correct city and routing your traffic manually.


> > we have folks in Europe that can't even clone the repo due to it's size.

> I have heard that some primary schools in Europe lack internet.

Maybe they lack internet but teach their pupils how to write "its".


I'm surprised they are actually using Azure DevOps internally. Creating your own hell I guess.


I find the “Boards” part of DevOps doesn’t work well for us as a small org wanting a less structured backlog, but for components like Pipelines and the Git repositories it’s neither here nor there for us.

What aspects of Azure DevOps are hell to you?


Some examples, in no particular order.

Hampering the productivity:

- Review messages get sent out before review is actually finished. It should be sent out only once the reviewer has finished the work.

- Code reviews are implemented in a terrible way compared to GitHub or GitLab.

  - Re-requesting a review once you have implemented the proposed changes? Takes a single click on GitHub, but cannot be done in Azure DevOps. I need to e.g. send a Slack message to the reviewer or remove and re-add them as reviewer.

  - Knowing which line of code a reviewer was giving feedback on? Not possible after the PR got updated, because the feedback of the reviewer sticks to the original line number, which might now contain something entirely different.

- Reviewing the commit messages in a PR takes way too many clicks. This causes people to not review the commit messages, letting bad commit messages pass and thus making it harder for future developers trying to figure out why something got implemented the way it did. Examples:

  - Too many clicks to review a commit message: PR -> Commits -> Commit -> Details

  - Comments on a specific commit are not shown in the PR

- Unreliable servers. E.g. "remote: TF401035: The object '<snip>' does not exist.\nfatal: the remote end hung up unexpectedly" happens too often on git fetch. Usually works on a 2nd try.

- Interprets IPv6 addresses in commit messages as emoji. E.g. fc00::6:100:0:0 becomes fc00::6💯0:0.

- Can not cancel a stage before it actually has started (Wasting time, cycles)

- Terrible diffs (can not give a public example)

- Network issues. E.g. checkouts that should take a few seconds take 15+ minutes (can not give a public example)

- Step "checkout": Changes working folder for following steps (shitty docs, shitty behaviour)

- The documentation reads as if their creators get paid by the number of words, but not for actually being useful. Whereas GitHub for example has actually useful documentation.

- PR are always "Show everything", instead of "Active comments" (what I want). Resets itself on every reload.

- Tabs are hardcoded (?) to be displayed as 4 chars - but we want 8 (Zephyr)

- Re-running a pipeline run (manually) does not retain the resources selected in the last run
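
On the IPv6-as-emoji item above: I have no idea how Azure DevOps actually implements shortcodes, but a naive `:name:` substitution reproduces the symptom, and checking whether a token parses as an IP address first is one way around it. A minimal sketch; the one-entry shortcode table and the function names are hypothetical.

```python
import ipaddress

# Hypothetical shortcode table with a single entry (the "100" emoji).
SHORTCODES = {"100": "\U0001F4AF"}


def naive_emojify(text: str) -> str:
    """Replace every ':name:' occurrence, wherever it appears."""
    for name, emoji in SHORTCODES.items():
        text = text.replace(f":{name}:", emoji)
    return text


def is_ip_literal(token: str) -> bool:
    """True if the token parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def safer_emojify(text: str) -> str:
    """Leave space-separated tokens alone when they are IP addresses."""
    return " ".join(
        tok if is_ip_literal(tok) else naive_emojify(tok)
        for tok in text.split(" ")
    )


print(naive_emojify("route fc00::6:100:0:0 added"))
print(safer_emojify("route fc00::6:100:0:0 added :100:"))
```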

Security:

- DevOps does not support modern SSH keys, one has to use RSA keys (https://developercommunity.visualstudio.com/t/support-non-rs...). It took them multiple years to allow RSA keys which are not deprecated by OpenSSH due to security concerns (https://devblogs.microsoft.com/devops/ssh-rsa-deprecation/), yet no support for modern algos. This also rules out the usage of hardware tokens, e.g. YubiKeys.

Azure DevOps is dying. Thus, things will not get better:

- New, useful features get implemented by Microsoft for GitHub, but not for DevOps. E.g. https://devblogs.microsoft.com/devops/static-web-app-pr-work...

- "Nearly everyone who works on AzDevOps today became a GitHub employee last year or was hired directly by GitHub since then." (Reddit, https://www.reddit.com/r/azuredevops/comments/nvyuvp/comment...)

- Looking at Azure DevOps Released Features (https://learn.microsoft.com/en-us/azure/devops/release-notes...) it is quite obvious how much things have slowed down since e.g. 2019.

Lastly - their support is ridiculously bad.


> I'm surprised they are actually using Azure DevOps internally. Creating your own hell I guess.

Even the hounds of hell may benefit from dogfooding.


houndfooding?


Ain't nothing but a hound dog.


Oh hey I know that name, Stolee. Fellow JSR grad here.


> those branches that only change CHANGELOG.md and CHANGELOG.json, we were fetching 125GB of extra git data?! HOW THO??

Unrecognized 100x programmer somewhere lol


I recently had a similar moment of WTF for git in a JavaScript repo.

Much much smaller of course though. A raspberry pi had died and I was trying to recover some projects that had not been pushed to GitHub for a while.

Holy crap. A few small JavaScript projects with perhaps 20 or 30 code files, a few thousand lines of code for a couple of 10s of KBs of actual code at most had 10s of gigabytes of data in the .git/ folder. Insane.

In the end I killed the recovery of the entire home dir and had to manually select folders to avoid accidentally trying to recover a .git/ dir as it was taking forever on a poorly SD card that was already in a bad way and I did not want to finally kill it for good by trying to salvage countless gigabytes of trash for git.


People who use git in monorepos don't understand git


I think the title misses the "Honey, " part


better question - does the changelog need to be checked in the first place?


They fixed a bug on a tool that is widely used. In what world is questioning why an organization is checking in a file that you have no context on a “better question”.


Paraphrasing meat of the article:

- When you have multiple files in the repo which have the same trailing 16 characters in the repo path, git may wrongly calculate deltas, mixing up between those files. In here they had multiple CHANGELOG.md files mixed up.

- So if those files are big and change often, you end up with massive deltas and inflated repo size.

- There's a new git option (in Microsoft git fork for now) and config to use full file path to calculate those deltas, which fixes the issue when pushing, and locally repacking the repo.

```
# Needs a git build with path-walk support (Microsoft's git fork for now)
git repack -adf --path-walk
git config --global pack.usePathWalk true
```

- According to a screenshot, Chromium repacked in this way shrinks from 100GB to 22GB.

- However AFAIU until GitHub enables it by default, GitHub clones from such repos will still be inflated.
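
To illustrate the trailing-characters point, here is a rough Python sketch modelled on the classic name hash in git's pack-objects code as I understand it; treat the exact bit arithmetic as an assumption, and note that the package paths are invented. Each new character lands in the top byte and older ones fade out two bits per step, so only roughly the last 16 characters of the path still matter.

```python
def name_hash(path: str) -> int:
    """Approximation of git's classic pack name hash (32-bit)."""
    h = 0
    for byte in path.encode():
        if byte in b" \t\n\v\f\r":  # whitespace is skipped
            continue
        # Older characters shift down two bits; the newest takes the top byte.
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


# Invented example paths: two changelogs in different packages, one other file.
a = name_hash("packages/alpha-button/CHANGELOG.md")
b = name_hash("packages/beta-button/CHANGELOG.md")
c = name_hash("packages/alpha-button/src/index.ts")

print(f"{a:#010x} {b:#010x} {c:#010x}")
print("changelog vs changelog:", abs(a - b))
print("changelog vs index.ts: ", abs(a - c))
```

The two changelogs end up with (almost) the same hash, so they sort next to each other as delta candidates even though they belong to unrelated packages, while the other file lands far away.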


I don't think GitHub, or any other git host, will have objections to using it once it's part of mainline git?

Also, thank you for the TLDR!


> I don't think GitHub, or any other git host, will have objections to using it once it's part of mainline git?

Fixing an existing repository requires a full repack, and for a repository as big as Chromium it still takes more than half a day (56000 seconds is about 15h30). Even if that's an improvement over the previous 3 days, it's a lot of compute.

From my experience of previous attempts, trying to get GitHub to run a full repack with harsh settings is extremely difficult (possibly because their infrastructure relies on more loosely packed repositories). I tried to get that for $dayjob's primary repository, whose initial checkout had gotten pretty large, and got nowhere.

As of right now, said repository is ~9.5GB on disk on initial clone (full, not partial, excluding working copy). Locally running `repack -adf --window 250` brings it down to ~1.5GB, at the cost of a few hours of CPU.

The repository does have some of the attributes described in TFA, so I'm definitely looking forward to trying these changes out.


Wouldn't a potential workaround be to create a new barebones repository and push the repacked one there? Sure, people will have to change their remote origin but if it solves the problem that might be worth the hassle?


It breaks the issues, PRs, all the tooling and integration, …

For now we’re getting by with partial clones, and employee machines being imaged with a decently up to date repository.


> in Microsoft git fork for now

Wait, what? Has MS forked git?


MS has had their fork of git for years, and they contributed many performance features for monorepos since then to the mainline.


Companies fork Git in order to work on things internally until they are ready to be proposed for inclusion into Git itself. I’m pretty sure that GitHub and GitLab (and?) do the same thing.

These are not forks-going-their-own-way forks.


Thank you to the AI that summarised the article. ;-)


Did anybody else shudder at "Shrunked"?


Shrunken, shrunked ain't no language I ever heard of.


Honey, I done shrunked them kids


English is my third language, also yes.


Shrank


Would be correct if it is "We shrank", but from my poor memory of the terminology that is the transitive form, shrunken is the intransitive form. But once again from my poor memory.


I've spoken English as my native language for almost five decades and I've never seen/heard the word "shranked" before.

This surely cannot be correct. Even the title of the linked article doesn't use "shranked". What?


Commonly (since ca. 19th century), shrank is used as the past tense of shrink, shrunk as the past participle, and shrunken as an adjective. The title of the linked article uses "shrunk" as past tense and the submitted title was changed to "shrunked" for some reason. "Shranked" was not mentioned anywhere. (But "shrinked" has had some use in the past.)


I was in the pool!


I think I prefer shrunked in this context.


Shrinky dinky


Honey, I shrunk the git!


[flagged]


As a German, I assumed he's talking about poor connection speeds.


You Germans have slow internet speed? Why’s that?


https://youtu.be/W1ZZ-Yni8Fg?si=493ozTdkEsXJnPpB does a good job of explaining it.


Size doesn't matter, it's how you use it (no invalid diffs on paths sharing trailing part).


They're not actually smaller. It just looks like it because they're further away.


the gif memes were very distracting...



