Slightly off topic, but it's so odd/annoying to me that they used a GIANT TikTok logo image as the header photo for this article. Then the only ref to TikTok after is:
> A large number of corrupted packages use names related to game cheats, free resources, and social media platforms, such as "free-tiktok-followers" and "free-xbox-codes," to entice users to click the links and direct them to multiple well-designed phishing webpages.
So just because the spammers are using the TikTok brand/name in their spam (they also mention "xbox", but didn't use the Microsoft logo), they used THAT as the image? Not the NPM logo or something much more relevant? I know it's minor, but it rubs me the wrong way as very "sus" and clickbaity: when the article gets shared on social sites, that image will most likely be picked up as the Open Graph image. You also know it's most likely clickbait because the average person is much more likely to click a link with the TikTok logo and "phishing" in the title than one with an NPM logo they may not even recognize.
The lead is buried at the bottom of the article (imo).
> "While being flooded with spam is never good, it gets immediately noticed and mitigated. It's harder for open source projects to spot and stop rare one-offs"
This is the real problem that NPM and other ecosystems face. A determined attacker trying to "poison" a popular Open Source package just has to pose as a maintainer long enough to succeed[0].
Defeating these types of attacks will require rethinking how we establish trust in packages. Projects like Deno are one approach (fork the ecosystem) while projects like Packj (mentioned elsewhere here), Socket.dev, and LunaTrace[1] are taking the other angle (make it harder to install malware).
It's hard to say right away which approach is better. (Probably a hybrid of both, realistically.) It's just non-trivial to fix this in one fell swoop. It's messy.
It's hard for the enterprise world because they need to mirror everything internally. (That's a doable problem and I believe Google proxies all Golang installs already.)
There are still issues with things like history being rewritten if you're pinned to a tag instead of a commit... the tag can be moved and the hash behind it can change, and you can't necessarily tell what changed unless you still have a copy of the original code.
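For npm specifically, one mitigation is pinning a Git dependency to a full commit SHA rather than a tag. A sketch (the owner name and SHA below are placeholders, not a real repo or commit):

```json
{
  "dependencies": {
    "left-pad": "github:some-owner/left-pad#<full-commit-sha>"
  }
}
```

A tag reference like `#v1.3.0` can be silently repointed by whoever controls the repo; a commit SHA is content-addressed and can't change without the hash changing too.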
That said, I feel like the bigger challenge here is with transitive dependencies (and a large total number of dependencies) because they create more places for malware to "hide", so to speak.
If every time I bump a package, it bumps 50 others... am I really going to notice/review the changes in all of those? Odds are I'm just trying to ship some feature or do something else! The libraries are rarely the center of my focus. (Maybe you get a PR with a security fix that you land without thinking. Surprise, malware!)
It's an insidious problem without a very clear cut solution. Do we stop using Open Source code? Probably not. So how do we adapt? TBD!
I don't understand how a registry eliminates any of your critiques. The packages on registries come from git anyway, by way of an unknown packaging process, rather than getting the source directly.
> If every time I bump a package, it bumps 50 others... am I really going to notice/review the changes in all of those?
Is there a difference between registry and codehost here?
> That said, I feel like the bigger challenge here is with transitive dependencies
This seems to be more a result of the stdlib and ecosystem, Node & NPM being the poster child for this
> There are still issues with things like history being re-written if you're pinned to a tag instead of a commit...
The Go team runs a sumdb server to catch this, and NPM does something to prevent tag drift, though we don't do this with our container images. It is trivial to move an image tag on docker hub or any image registry.
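For container images, the analogous fix is pulling by digest instead of tag; a digest is content-addressed and can't be moved the way a tag can (the digest below is a placeholder):

```dockerfile
# Tag reference: whoever controls the repository can silently repoint it.
# FROM python:3.11
# Digest reference: content-addressed and immutable.
FROM python:3.11@sha256:<placeholder-digest>
```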
You should have a go.sum committed next to the go.mod file, which also checks that the code at the tag is the same as you expect, so the sumdb isn't strictly needed if you trust the hash you originally downloaded and committed to git.
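The trust-on-first-use idea behind go.sum can be sketched like this (a simplified Python illustration; real go.sum entries use Go's module dirhash format rather than a plain SHA-256 of the bytes):

```python
import hashlib

def verify_pinned(artifact_bytes: bytes, pinned_sha256: str) -> bool:
    """Recompute the artifact's hash and compare it to the hash recorded
    at first download (the role go.sum plays for Go modules)."""
    return hashlib.sha256(artifact_bytes).hexdigest() == pinned_sha256

original = b"module source at tag v1.2.3"
pinned = hashlib.sha256(original).hexdigest()  # committed alongside go.mod

assert verify_pinned(original, pinned)              # unchanged tag: ok
assert not verify_pinned(b"retagged code", pinned)  # rewritten history: caught
```

Even if upstream moves the tag, the recomputed hash no longer matches the one you committed, and the install fails loudly instead of silently picking up new code.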
> It's hard for the enterprise world because they need to mirror everything internally.
We explored this. NPM and PyPI present the same issues as Go; the sheer volume and maintenance make it practically not worth it.
---
What about the additional accounts that a registry/hub adds to the software supply chain, accounts that have repeatedly been shown to be vulnerable to account takeover? Adding this extra middle layer to the supply chain seems like an inefficiency and increases risk.
No. The tradeoffs don't make any sense. With a registry you have some sort of guarantees, mirrors, etc. The same isn't true with individually owned repositories.
I find it a bit strange that NPM and PyPI decided to have a single global namespace for packages. When there aren't very many packages it is more convenient, but you very quickly run into issues of ownership of the names.
This happened before during the glory days of SourceForge, where project names were global, with all the resulting control and misdirection issues. Thankfully GitHub made it a two-level namespace, OWNER/PROJECT, which vastly reduced the problems. IMHO the package repositories will need to do something similar.
I can understand not liking it, but I don't understand why you find it strange: global namespaces are the norm in language package distribution. RubyGems, NPM, PyPI, and Crates all use flat namespaces (either by default or as their only supported mode).
Namespace partitioning has advantages, but it also doesn't solve squatting problems on its own: anybody (or any process) that's fooled by `requests` vs `pyrequests` is also going to be fooled by `some-owner/requests` vs. `requests-org/requests` (the latter being the fake one).
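That said, registries can at least flag confusable names automatically. A toy sketch with stdlib string similarity (the package list and threshold here are made up for illustration):

```python
from difflib import SequenceMatcher

# A stand-in for the registry's list of high-profile packages.
POPULAR = ["requests", "numpy", "django"]

def lookalikes(candidate: str, threshold: float = 0.8) -> list[str]:
    """Return popular names the candidate closely resembles (possible squats)."""
    return [p for p in POPULAR
            if p != candidate
            and SequenceMatcher(None, candidate, p).ratio() >= threshold]

print(lookalikes("pyrequests"))  # flags "requests"
print(lookalikes("flask"))       # nothing close enough
```

A check like this catches `pyrequests`-style squats at publish time, though, as above, it does nothing about a convincingly named owner in a partitioned namespace.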
I agree that namespacing doesn't solve all the problems, but I do think it significantly reduces the attack surface. It is sort of like a checksum for a package, combining the publisher name with the package name instead of having just bare package names.
Agreed! The problem is that the publisher's name isn't always clear: a lot of open source maintainers (myself included) publish major packages under their own name; a less familiar user could easily be confused into choosing the "official looking" name instead.
I think the most "complete" solution here needs to involve code signing.
I actually agree, but that's not namespacing in the way that the GP meant: Perl has curated topical namespaces that multiple unrelated users share, rather than identity or organization-owned namespaces.
It has namespaces, but for the most part packages live in the global namespace: most people publish there and most programmers install from there. Namespaces aren't unused or useless on NPM, but the fact that most packages people use aren't namespaced means they don't do much for security.
Plug: We've been building Packj [1] to detect such malicious, abandoned, typo-squatting, and other "weak links" in your software supply chain. Supports auditing as well as install-time sandboxing of PyPI/NPM/Ruby packages. It carries out static/dynamic/metadata analysis to look for "suspicious" attributes such as spawning of shell, use of SSH keys, network communication, use of decode+eval, mismatch of GitHub code vs packaged code (provenance), change in APIs across versions, and several more.
How would you deal with malicious packages that randomize or time delay their malicious behavior?
Say you have a package that only triggers the evil activity one out of 10,000 evaluations. It'd have a 99.99% chance of passing validation but could still wreak havoc on a heavily used client-side application.
Ditto for a package that simply delays its malicious payload until some point in the future, after the scan has occurred. I guess this is a bit easier to test for by future-dating the system clock, but again, what if it's a random window? (e.g. every other day for five minutes, at 4:59pm UTC).
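Back-of-envelope: if the payload fires with probability p on any given run, then N independent dynamic-analysis runs all miss it with probability (1 - p)^N. Even 10,000 observed runs still miss a 1-in-10,000 trigger about a third of the time:

```python
p = 1 / 10_000  # chance the payload triggers on a single evaluation

for runs in (1, 100, 10_000):
    miss = (1 - p) ** runs
    print(f"{runs:>6} runs: {miss:.2%} chance the trigger is never observed")
```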
You're right; a sophisticated bad actor can employ time- or event-based hiding techniques. However, these limitations apply to dynamic/runtime analysis only. Packj performs static+dynamic+metadata analysis. As static analysis scans every line, it will flag use of risky APIs (e.g., file delete, net send, process fork) regardless of when they are invoked at runtime.
That said, obfuscated code (e.g., base64 encoded) or runtime payloads (e.g., downloaded from pastebin) will not be analyzed by static analysis, but it will tell you if the latest package version uses base64 or exec APIs in the code (file/line) so you can take a deeper look. We have some ideas on lightweight runtime sandboxing, and will eventually implement them.
This is a valid concern. That is why we're building this in the open. You can take a look at the code. We welcome your feedback/suggestions and code contributions to add new security checks/policies.
>That is why we're building this in the open. You can take a look at the code.
...You mean, in the same manner everyone should have been auditing code all along, but hasn't, thereby creating the problem you're trying to solve?
It's alright. You're trying, and there's nothing wrong with that answer, but the people you're trying to sell a shovel to don't want to dig. That's the chief problem.
> people you're trying to sell a shovel to don't want to dig.
That's exactly where a tool like Packj can help. It can cut the digging time down significantly. Auditing hundreds of direct/transitive dependencies manually is impractical, but Packj can quickly point out a "risky" behavior (file and line number). Using obfuscation to hide malicious code itself is a suspicious behavior, which is flagged. After all, we have to start somewhere.
You’re not wrong, but at some point the dependency problem will land on someone’s desk in the legal department, and rules will be pushed back across the hall into the engineering department.
It’s not a bad idea to stake out a product that’s mature and ready to go when that happens. It doesn’t even have to be foolproof. It just needs to be earnest enough to check off some boxes and show that “best practice” effort was made.
Yes, strace works on Linux. macOS has dtrace, which we will integrate. The Packj sandbox should, however, also work under WSL on Windows. Maybe someone can test it and share contributions.
But won't this require your target users turning off SIP [1] on Macs to enable dtrace? A lot of dtrace tools require SIP to be disabled before they can produce any meaningful output.
More supply-chain attacks, though these (clickbait package names, monetized with referral links) are certainly less sophisticated and less dangerous than the wave that hit PyPI (typos of common package names, monetized by altering crypto addresses in your clipboard / paste buffer). As they say, the best time to start mitigating these attacks was 10 years ago; the next best time is today.
I don't quite agree with your ancient Chinese proverb here. While it is ideal, we don't live in an ideal world.
The world would have to be aware of these attack vectors and tactics to be able to mitigate them. That's not to say they haven't existed in concept for the last 10 years (many have), just that very few were maliciously exploited across the industry until the last few years. (Also see the exponential increase in supply chain attacks in the last couple of years alone, and the vectors/tactics used.)
Security tends to be a cat-and-mouse game in my experience. To take the Sun Tzu perspective, the best defense is to subdue the enemy without fighting. That requires understanding the strengths and weaknesses of both yourself and your enemy, and developing a defense that is flexible and adaptable in the moment while hardened for the future.
To be specific, I was thinking that around the same time we started earnestly working on browser sandboxing and fine-grained extension/app privileges, we should have been doing the same thing for our packages. When Joe Normal tries to install an extension, Chrome tells him it wants to access his microphone; we could have had a similar thing here. “tennsserflow.py requests permission to modify the root filesystem”, “PERM: free-fortnite-skin asks to open URLs. grant / deny / uninstall ?”, etc.
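A minimal sketch of what that install-time gate could look like (the manifest format, permission strings, and package entries here are entirely invented; the package names come from the examples above):

```python
# Hypothetical per-package permission manifests, declared by the package
# and enforced by the package manager at install time.
MANIFEST = {
    "tennsserflow.py": {"requests": ["fs:write:/"]},        # wants root fs access
    "free-fortnite-skin": {"requests": ["net:open-url"]},   # wants to open URLs
}

# Permissions this project has already granted.
ALLOWED = {"net:open-url"}

def prompt_needed(pkg: str) -> list[str]:
    """Permissions the user must explicitly grant/deny before install."""
    return [p for p in MANIFEST[pkg]["requests"] if p not in ALLOWED]

assert prompt_needed("tennsserflow.py") == ["fs:write:/"]   # would prompt Joe
assert prompt_needed("free-fortnite-skin") == []            # already granted
```

The hard part, as the reply below this notes, is enforcement: the declaration is trivial, but making a dynamic runtime actually honor it is not.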
This needs support at the language/runtime level to be even remotely robust. Especially when we’re talking about highly dynamic and popular languages like JS/Python etc.
I'm now double- and triple-checking when I add a package to the manifest. You have to check not only that it's the right package, but also that it has the number of downloads you're expecting.
The problem with using downloads as a proxy for "safety" is that a bad actor can use CI infra/bots/scripts to inflate the number. This is why in Packj [1] we check multiple attributes (e.g., dependents).
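As a toy illustration of why a single metric is gameable (these thresholds and weights are invented for this sketch and are not Packj's actual heuristics):

```python
# Invented thresholds, for illustration only.
def risky(downloads: int, dependents: int, days_since_release: int) -> bool:
    """Downloads alone are cheap to inflate with bots; dependents are harder
    to fake because other real packages must declare a dependency on you."""
    if downloads > 100_000 and dependents == 0:
        return True   # popular with zero ecosystem footprint: suspicious
    if days_since_release < 2 and dependents == 0:
        return True   # brand-new and unreferenced: tread carefully
    return False

assert risky(downloads=500_000, dependents=0, days_since_release=30)
assert not risky(downloads=500_000, dependents=1_200, days_since_release=30)
```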
I just tried this and the output is very very very noisy. It treats any dependent package that hasn't been updated in a while as a "RISK", and it manually tries to resolve each GitHub repository (and runs into API request limits).
We've made it highly customizable [1]. For some folks, a package that has not been updated for a while could be classified as abandonware. You can turn off the alerts that do not apply to you by commenting out in .packj.yaml file.
We're working on making it behave more like email spam filtering. Will post more soon. Meanwhile, we welcome changes that can make it less noisy for the community.
[Edit:] one can specify a GitHub API token to get around rate limiting. We'd deeply appreciate it if you could create an issue for this on the repo.
Most engineers blindly trust things when another engineer or vendor has signed off on/certified them. There's simply not enough time in the world to go through the certification data for every single part you use in a project.
We trust that there has been a testing process that validates that things do what they say they will do. Admittedly when dealing with non-OSS vendors you can sue them if something goes badly wrong.
It’s perfectly normal to trust your peers in a collaborative setting regardless of the profession.
The issue is rather that this is only loosely a collaborative setting. Which makes things kind of worse, because you can _often_ get away with simply trusting everyone, and there are economic and practical pressures to do so.
I really don’t like the design of npm one bit. Projects should specify their dependencies, not packages. For example, if one of my dependencies uses leftPad, then my project should have to specify leftPad as a dependency too. This would encourage much better behaviour with dependencies, because the costs would be crystal clear: if a package has 50,000 dependencies, you’d probably avoid installing it. It would also discourage lots of very small packages, or you might even choose to vendor a version of leftPad into your codebase rather than let someone else be in charge of what’s in it. Maybe the ship has sailed for JS/TS now, but it would also discourage needless package churn.
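Under that model, a manifest would have to name even the packages your dependencies use. Something like this hypothetical package.json (the UI library name is invented; `left-pad` is the real npm name of leftPad):

```json
{
  "dependencies": {
    "some-ui-lib": "2.1.0",
    "left-pad": "1.3.0"
  }
}
```

Here `left-pad` appears even though only `some-ui-lib` uses it. A library that pulled in 50,000 transitive dependencies would make its true cost impossible to miss, because every one of them would have to show up in your manifest.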
Under the npm model you are trusting the authors of your direct dependencies with judicious choice of transitive dependencies, and then those authors in turn trust the authors of their dependencies. In essence npm is a system that relies on a web of trust.
A system like that cannot give any modicum of control, because typically a real world project will involve hundreds of package publishers, any one of which can break that trust with a poisoned patch version of a transitive dependency (or can be hacked), and it is effectively impossible to review all those dependencies every time they do a patch version bump. So developing with npm is always an act of faith that everyone will do the right thing. The remarkable part is how rarely it breaks down.
It’s really only a web of trust in a very naive sense.
Most JS OSS is MIT licensed. There are no guarantees. It’s on _you_ to ensure the software works.
Secondly people don’t really “trust” packages. They trust the usage, weekly downloads and quality of docs. They just assume this implies that it’s going to be OK, or that when deps break it’s nobody’s fault. Shit happens.
What problem does this try to solve? That a lot of packages are being downloaded?
You could already do what you're describing today, but odds are, you don't, because you value your time more than you value "having small amount of packages" for whatever reason.
I’m just saying that building a system that encourages and hides dependencies leads to bad software engineering, in my opinion. Elixir does it the way I’ve outlined, and I think that’s one of the many decisions they got right.
Small amount of packages might very well lead to curated and well maintained libraries that do many things (as opposed to having 1 library per string function).
Did you read the article? This attack was not about naming malicious packages similar to legit ones or common typos of them.
> A large number of corrupted packages use names related to game cheats, free resources, and social media platforms, such as "free-tiktok-followers" and "free-xbox-codes," to entice users to click the links and direct them to multiple well-designed phishing webpages.
Those honestly sound like they're just using NPM as a platform to host their spam. All the spammers care about is that they get to host their content on a trusted domain with a keyword-dense URL; there's no expectation that anyone actually install the package.
I feel that that is one of those ideas that sound great in theory, but is impractical in reality. There are way too many packages with similar names already but are completely valid.
What I do wonder is why it’s always NPM that’s hit with these kinds of things, and not, say, Maven. Arguably the attack surface of Maven is much larger and more intrusive (lots of enterprises), but the mechanisms of package distribution are completely different. There is a shitload of scans and sanity checks happening when you push a package to a Maven repo such as Sonatype, which I don’t see as much with NPM (and also PyPI, for that matter).
(fwiw, I am responsible for package distribution of all APIs of a reasonably popular timeseries database)
Most of these package systems being attacked run arbitrary code on your system when you install the package in order to allow native extensions to compile. Maven/Java simply downloads a (relatively) inert zip archive that your IDE might do some static analysis on to provide autocomplete.
Along with all the scanning and whatnot, I think that’s the biggest reason you see attacks primarily on npm, PyPI, and to an extent RubyGems.
If you don't know the author, signatures do nothing. Anybody can sign their package with some key. Even if you could check the author's identity, that still does very little for you, unless you know them personally.
It makes a lot more sense to use cryptography to verify that releases are not malicious directly. Tools like crev [1], vouch [2], and cargo-vet [3] allow you to trust your colleagues or specific people to review packages before you install them. That way you don't have to trust their authors or package repositories at all.
That seems like a much more viable path forward than expecting package repositories to audit packages or trying to assign trust onto random developers.
The .NET ecosystem has this [1], so far adoption has been low. Similar to HTTPS until LetsEncrypt, we need something that’s more convenient if we want developer adoption.
You have now just transferred risk. You have done nothing to mitigate it, and created more work for those of us that actually do read code.
Then again, if someone wants to pay me to do nothing but read and vouch for code, hell, why not. As QA I basically do that anyway. Will have to think on this.
I don't see it as "transferred risk". As individuals we already have people we trust and people we don't know. This system allows you to get trust in your dependencies without requiring to trust the specific rando that authored the specific library you want to use.
Yes ultimately someone you trust has to vouch for the library for you to trust it. That's unavoidable either way. Those systems only allow for that person to be anyone instead of just the author.
The signatures would have to be rooted to a trust cert model (with a way for devs to verify that the "Microsoft" they're pulling a package from is the real "Microsoft").
Hypothetically, this could be done: put up a public key at a known domain, sign packages with the corresponding private key, and if the package manager grows a "fetch the key from the URL the package provides and check the signature" step, we end up rooting trust in packages in trust in domain-name ownership; not too shabby. We still need devs to actually bother to do the check and, like, read the domain name to make sure they haven't accepted authorization from "micr0soft.com", but the end result is a distributed signing solution that doesn't require any centralized authority (except, indirectly, the SSL cert issuers and DNS providers of the world).
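A sketch of that flow with the crypto stubbed out (every name here is invented; `signature_matches` stands in for real signature verification, e.g. Ed25519, and the dict stands in for an HTTPS fetch of a key from a well-known path):

```python
from dataclasses import dataclass

# What a client would fetch from https://<domain>/.well-known/pkg-signing-key,
# stubbed as a dict. Note the lookalike domain serves a perfectly valid key.
WELL_KNOWN_KEYS = {
    "microsoft.com": "pubkey-A",
    "micr0soft.com": "pubkey-B",
}

@dataclass
class Package:
    name: str
    claimed_domain: str
    signature: str  # stand-in for a real cryptographic signature

def signature_matches(pkg: Package, pubkey: str) -> bool:
    # Stand-in for real verification with the domain's published key.
    return pkg.signature == f"signed-with-{pubkey}"

def verify(pkg: Package, expected_domain: str) -> bool:
    # The human step: the claimed domain must be the one we actually trust.
    if pkg.claimed_domain != expected_domain:
        return False
    pubkey = WELL_KNOWN_KEYS[pkg.claimed_domain]
    return signature_matches(pkg, pubkey)

good = Package("widgets", "microsoft.com", "signed-with-pubkey-A")
fake = Package("widgets", "micr0soft.com", "signed-with-pubkey-B")

assert verify(good, "microsoft.com")
assert not verify(fake, "microsoft.com")  # lookalike domain rejected
```

Note that the lookalike package has a cryptographically valid signature; only the domain comparison catches it, which is exactly why the "read the domain name" step can't be skipped.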