We've been really pleased that Microsoft chose to put their @types packages into the npm registry rather than a separate, closed system, and in general happy with Microsoft's support of node and npm. We're confident we can make the new features of VSCode work; we just need to work with Microsoft to tweak the implementation a little.
This was an honest mistake on their part, and we caught it in time that there was very little impact visible to any npm users.
Fun fact: at its peak, VSCode users around the world were sending roughly as many requests to the registry as the entire nation of India.
From my outside perspective, it doesn't seem like a mistake on their part at all. Later in the thread you say this accounted for 10% of traffic, mostly 404s. This is (I assume) a hell of a lot of requests, but given npm's position as developer infrastructure, I don't think they could have reasonably expected to melt it. It would have been good of them to give a heads up, but I don't think I'd start assigning blame to the Code team.
I feel like the npm team have once again failed to own their problems and instead tried to push the blame elsewhere. This is just an outside perspective, but I really feel like it would have been more honest and accurate to at least admit to the possibility that npm isn't perfect, and "blame" (which I'm not sure is even a helpful concept in this instance) is shared between parties more equitably.
Once we determined 404s were the problem we put mitigation in place that worked fine, but the problem of request volume remained: the 10% figure I gave was at a 5% rollout of VSCode. A full rollout would therefore have meant the registry became 3x bigger overnight and two thirds of that would have been 404s to VSCode users. At that point the issue is financial, not technical, which is another reason the rollback happened.
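The arithmetic behind that 3x figure can be sketched as follows. This assumes the 10% is measured relative to baseline registry traffic and that the extra requests scale linearly with the share of users on the new release; the function and variable names are made up for illustration.

```python
from fractions import Fraction


def project_full_rollout(extra_share: Fraction, rollout: Fraction) -> tuple[Fraction, Fraction]:
    """Scale the extra traffic seen at a partial rollout up to 100% of users.

    extra_share: extra requests as a fraction of baseline registry traffic.
    rollout: fraction of users who had the feature when that was measured.
    Returns (total traffic as a multiple of baseline,
             share of that total caused by the feature).
    """
    # Assumption: extra traffic grows linearly with the number of users.
    extra_at_full = extra_share / rollout
    total = 1 + extra_at_full
    return total, extra_at_full / total


# 10% extra traffic observed at a 5% rollout, as described above.
growth, feature_share = project_full_rollout(Fraction(10, 100), Fraction(5, 100))
print(f"{growth}x baseline, {feature_share} of all requests from the feature")
```

Under those assumptions the extra traffic at full rollout is 200% of baseline, which gives a registry three times its current size with two thirds of requests coming from the feature.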
Many times I've seen someone on HN write a negative/flaming reply to a comment, which then nets a bunch of further agreement and consensus, and the original commentator is nowhere to be seen.
You quickly responded and fully acknowledged the faux pas (nuking any negative consensus), then you replied twice more, and one of those replies was to a request for technical info.
Microsoft maintain the @types scope. Instead of providing their own metadata endpoint listing available typings to filter requests on, they lazily opted to just mass bombard a repository they maintain, hosted on a free service they don't fund, for any and all possible package names, even though they themselves maintain the list of packages and should know in advance which don't exist.
Edit: Answered at https://news.ycombinator.com/item?id=12861118
Microsoft publishes a list of known good declaration files for popular npm packages to npm, under the scope @types: https://www.npmjs.com/~types
The 1.7 release of VSCode helpfully tries to automatically load type declarations for any npm package you use by requesting the equivalent declaration package under @types. When the package exists this is fine, because it's cached in our CDN.
What they forgot to consider is that most CDNs don't cache 404 responses, and since there are 350,000 packages and fewer than 5,000 type declarations, the overwhelming majority of requests from VSCode to the registry were 404s. This hammered the hell out of our servers until we put caching in place for 404s under the @types scope.
We didn't start caching 404s for every package, and don't plan to, because that creates annoying race conditions for fresh publishes, which is why most CDNs don't cache 404s in the first place.
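A rough sketch of what scoped 404 caching might look like, purely for illustration. This is not npm's or any CDN's actual implementation: the class name, the TTL value, and the `origin` callback are all hypothetical, and caching hits forever is a simplification.

```python
import time
from typing import Callable, Dict, Optional


class ScopedNegativeCache:
    """Cache hits for every package, but cache misses (404s) only under one scope."""

    def __init__(
        self,
        origin: Callable[[str], Optional[str]],  # hypothetical origin lookup; None means 404
        negative_scope: str = "@types/",
        negative_ttl: float = 300.0,  # assumed value, in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._origin = origin
        self._scope = negative_scope
        self._ttl = negative_ttl
        self._clock = clock
        self._hits: Dict[str, str] = {}
        self._misses: Dict[str, float] = {}  # package name -> expiry time
        self.origin_calls = 0

    def get(self, name: str) -> Optional[str]:
        if name in self._hits:
            return self._hits[name]
        expiry = self._misses.get(name)
        if expiry is not None and self._clock() < expiry:
            return None  # cached 404, origin is not contacted
        self.origin_calls += 1
        doc = self._origin(name)
        if doc is not None:
            self._hits[name] = doc
            self._misses.pop(name, None)
        elif name.startswith(self._scope):
            # Only misses inside the scope are remembered, and only briefly.
            self._misses[name] = self._clock() + self._ttl
        return doc
```

The TTL bounds how long a freshly published package under the scope can still look missing, which is the race condition mentioned above; names outside the scope always go back to the origin on a miss.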
There are any number of ways to fix this, and we'll work with Microsoft to find the best one, but fundamentally you just need a more network-efficient way of finding out which type declarations exist. At the moment there are few enough that they could fetch a list of all of them and cache it (the public registry lacks a documented API for doing that right now, but we can certainly provide one).
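To make the "fetch a list and cache it" option concrete, here is a minimal client-side sketch. Since the registry has no documented endpoint for this, `fetch_all` is a stand-in for whatever call would eventually return the names; it, the class name, and the one-hour refresh interval are all assumptions.

```python
import time
from typing import Callable, FrozenSet, Iterable, Optional


class TypingsIndex:
    """Local copy of the set of packages that have @types declarations."""

    def __init__(
        self,
        fetch_all: Callable[[], Iterable[str]],  # hypothetical: returns every typed package name
        max_age_seconds: float = 3600.0,  # assumed refresh interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_all = fetch_all
        self._max_age = max_age_seconds
        self._clock = clock
        self._names: FrozenSet[str] = frozenset()
        self._fetched_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age:
            # One bulk request replaces thousands of per-package lookups.
            self._names = frozenset(self._fetch_all())
            self._fetched_at = now

    def has_typings(self, package: str) -> bool:
        self._refresh_if_stale()
        return package in self._names
```

A client would only request `@types/<package>` from the registry when `has_typings(package)` is true, so lookups for packages without declarations never leave the machine.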
Might I suggest having a Bloom filter containing all the existing type declarations (which would be quite small) and only querying the registry if the Bloom filter reports the type declaration as a positive.
Since the filter can be really small it will probably scale a lot better than a complete list of all type-declarations, and a new filter could be downloaded by the clients every now and then.
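For anyone unfamiliar with the idea, a minimal Bloom filter sketch follows. It uses the standard sizing formulas and SHA-256 based double hashing; the class itself, the 1% false-positive target, and how the filter would be distributed to clients are assumptions, not anything npm or VSCode ships.

```python
import hashlib
import math
from typing import Iterator


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        bits = -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        self.num_bits = max(8, math.ceil(bits))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd so it is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Sized for 5,000 names at a 1% false-positive rate, the bit array comes to about 6 KB. A client would query the registry only when `name in bloom` is true, so roughly 1% of packages without declarations would still produce a 404 while the rest would be filtered out locally.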
Sounds like a good CDN-busting DDoS vector.
In general, it is an extremely efficient response. It took a huge number of users all hammering on the same set of 404 handling routes to get our attention, and we were able to handle the load, though it wasn't trivial to do so. The end user impact was minimal.
If it hadn't been a known-good actor, we had some options to shut down the flood a bit more forcefully, but we didn't want to inadvertently cause errors for vscode users. Like my colleagues have said in this thread already, we really dig what VSCode is doing, and as operational fires go, this one got put out very swiftly and did very little harm.
All that being said, knowing the npm devops team, this will no doubt be a source of insights for making the registry even more resilient in the future :)
At which point they would be back to the annoying race conditions for fresh publishes, no?
Can you speak to why it is so expensive on the NPM side to serve a 404? Would a bloom filter like another commenter mentioned be helpful?
It's refreshing to read actual engineers' writing. After this, going back to tear-jerking snark-filled twitter and medium gnashing of teeth will be hard.
You optimize for the use patterns you anticipate or see in normal usage, because, well, see famous saying about premature optimization. The use pattern we see most often is people installing from pre-determined lists in package.json, so 404s aren't all that common ordinarily.
Requests over time where the user is in India ~= requests over time where the user is a VSCode user
Overall, when you "tested" something but it still breaks in production and requires a rollback, it's usually a sign that your testing strategy could use some improvement. What is the point of testing if it doesn't prevent failures from happening?
... also ... not going to lie, this was the first time we've gotten to test several of the checks and balances we have in the npm registry which I was jazzed about :)
On that note, however, respectfully I believe that features which have the potential of hitting the registry so hard should first be beta tested on a private registry and only then moved on to the high-traffic serving CDNs of npm.
And 10% of the daily traffic is from India??? Whoa, every day is a school day.
Well, 17% of the world's population lives in India, so that doesn't seem surprising.
You're right that our handling of 404s was naive, and that's definitely something we'll be improving as a result of what we've learned from this incident.
This will likely lead to more fault tolerant systems on both projects and hopefully more collaboration & features in the future.
VSCode is used by a non-negligible number of users, and seems to rely on npm to operate at its best. It would have been good etiquette to let npm know, even though they couldn't forecast this exact situation.
However, it isn't bad etiquette and I'm sure Microsoft could get in touch with the devs. Interesting thought.
I'm not sure I would call my feature "great" if it could have brought down npm.
The feature was so great that npmjs couldn't keep up with it, it was yuuuuuge!