I hope to automate a large part of the review in the next few months; however, I think this is a good argument for never making it fully automatic. Having a human sanity check submissions isn't a terrible idea if we can keep the workload down.
Certainly this doesn't prevent a malicious author from posting a legitimate package and then changing the contents to be malicious, but that can be somewhat solved by turning off automatic updates.
Thanks for keeping Package Control high quality, I know it's highly appreciated :-)
One step to mitigate things like this as well would be to have some sort of "crowd-sourcing" command in the package manager program... like "npm flag coffe-script" or something like that to alert repository maintainers of possible issues.
Perhaps you could make this safer by adding an automatic check for how much the package has changed since the last version? And at least warn the user when they want to update?
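A rough sketch of what such a check could look like, in Python. It assumes the client has both versions unpacked as path-to-text mappings; `changed_fraction`, `should_warn` and the 50% threshold are made up for illustration, not anything Package Control actually does.

```python
import difflib


def changed_fraction(old_files, new_files):
    """Return the fraction of lines (0.0 to 1.0) that differ between two releases.

    Both arguments map a file path to that file's text content.
    """
    total = 0
    unchanged = 0
    for path in set(old_files) | set(new_files):
        old_lines = old_files.get(path, "").splitlines()
        new_lines = new_files.get(path, "").splitlines()
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
        # Lines that appear, in order, in both versions of this file.
        unchanged += sum(block.size for block in matcher.get_matching_blocks())
        total += max(len(old_lines), len(new_lines))
    return 0.0 if total == 0 else 1.0 - unchanged / total


def should_warn(old_files, new_files, threshold=0.5):
    """True when more than `threshold` of the package changed (hypothetical cutoff)."""
    return changed_fraction(old_files, new_files) > threshold
```

A small legitimate package that gets rewritten would trip this too, so it only makes sense as a warning, not a block.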
http://incolumitas.com/data/thesis.pdf section 5 "Practical implications". Just wanted to point out that, in case you skipped it, it's worth a read; there are some interesting proposals there that are worth discussing with package manager maintainers.
I particularly like the preemptive approach of auto-blacklisting common typos by simply monitoring the number of times a specific nonexistent package is requested over time (5.10). So if a lot of people regularly attempt to install the nonexistent package "reqeusts", it could signal that it's a common typo and should be blacklisted to prevent malicious use in the future. False positives could always be sorted out manually by communicating with the package manager maintainers.
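Roughly, the 5.10 idea could look like the Python sketch below. `MissTracker` and its threshold are hypothetical names and values; a real registry would also want a time window and some defense against people deliberately inflating the count.

```python
from collections import Counter


class MissTracker:
    """Counts install requests for names that are not registered."""

    def __init__(self, threshold=1000):
        self.threshold = threshold  # hypothetical cutoff, would need tuning
        self.misses = Counter()
        self.blacklist = set()

    def record_request(self, name, registered):
        """Record one install request; return True if `name` is now blacklisted."""
        if name in registered:
            return False
        self.misses[name] += 1
        if self.misses[name] >= self.threshold:
            self.blacklist.add(name)
        return name in self.blacklist
```

Anything that lands on the blacklist would then go to the maintainers for the manual false-positive check mentioned above.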
- The package name is something a lot of people regularly attempt to install, but it doesn't exist (per above)
- The package name is 1-2 chars off from the name of another package which has more than X downloads
- The package is frequently installed then uninstalled in a short time
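The third signal in that list is the least obvious to compute. A minimal Python sketch, assuming the registry somehow sees both install and uninstall events (as far as I know most registries only see downloads, so this is a big assumption); the event tuple layout and `quick_uninstall_rate` are invented for illustration.

```python
def quick_uninstall_rate(events, package, window_seconds=600):
    """Fraction of installs of `package` undone by the same client within the window.

    `events` is an iterable of (client_id, package, action, timestamp) tuples,
    sorted by timestamp, where action is "install" or "uninstall".
    """
    pending = {}  # client_id -> timestamp of that client's latest install
    installs = 0
    quick = 0
    for client_id, name, action, timestamp in events:
        if name != package:
            continue
        if action == "install":
            installs += 1
            pending[client_id] = timestamp
        elif action == "uninstall" and client_id in pending:
            if timestamp - pending.pop(client_id) <= window_seconds:
                quick += 1
    return quick / installs if installs else 0.0
```

A high rate for a name that is also 1-2 characters off a popular package would be a decent reason to queue it for review.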
I think that this clearly falls under the heading 'naming issue.' People know what they want, but do not enter it properly.
I can't think of a 100% solution off-hand, which isn't surprising, because it's a hard problem.
pmontra's suggestion to use typo blacklisting ain't a bad idea. Maybe some sort of reputation-per-name could help?
I wonder if you could do something similar here - enter the name of the package and a code of some sort. I haven't thought this through in a lot of detail.
That doesn't work with arbitrary names because they are, well, arbitrary.
This could get mildly annoying every once in a while when there are legitimate non-clashing names. A better metric/typo recognition technique is probably possible. Or else some manual process for requesting exceptions (maybe with a tiny fee to help fund the overall project) would also address this problem.
EDIT: Just downloaded and read the thesis abstract. The author actually suggests the first idea: "The analytical part generates ideas for countermeasures that allow repository maintainers or users to detect typosquatting attacks in the future. For this purpose potential typosquatting candidates could be generated for each legitimate package name with the help of the Levenshtein distance algorithms or Bayesian networks. Another option that can be considered is the Metaphone algorithm."
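To make the candidate-generation idea concrete, here is a Python sketch that enumerates every name one edit away from a legitimate package name (deletion, adjacent swap, substitution, insertion). `typo_candidates` and `ALPHABET` are my own illustrative choices, not taken from the thesis, and the Bayesian network and Metaphone options aren't covered.

```python
import string

# Characters assumed to be allowed in package names (an assumption; real
# registries each have their own rules).
ALPHABET = string.ascii_lowercase + string.digits + "-_"


def typo_candidates(name):
    """Return names one deletion, adjacent swap, substitution or insertion away."""
    splits = [(name[:i], name[i:]) for i in range(len(name) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    swaps = {left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1}
    replaces = {left + ch + right[1:]
                for left, right in splits if right for ch in ALPHABET}
    inserts = {left + ch + right for left, right in splits for ch in ALPHABET}
    return (deletes | swaps | replaces | inserts) - {name}
```

A maintainer could subtract the set of already registered names from `typo_candidates("requests")` and reserve or watch whatever is left.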
Who would use that?
Package managers have humans to deal with edge cases (removing malicious packages, investigating package errors, etc.) and this is no different. It wouldn't significantly increase their burden because only a small fraction of package names should require human validation.
- Maintainership can change over time.
- Multiple people may trade off releasing a package, but it's still the same package.
- There may be multiple repos (consider you may want to run a local company repo for non-redistributable modules).
I imagine in the end, one of the better approaches to the installation name typo problem might be to scan the code for what packages are required (utilizing as much specific information as possible), and then confirm that each one exists as a local package that can be installed, or offer to install it. Package installers should be able to take a source file or files, and install modules listed within. This won't solve all cases (dynamically determined and loaded modules may be a problem still), but it will solve quite a bit of them.
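For Python at least, the scanning half of this is easy to sketch with the standard `ast` module. `imported_modules` and `missing_modules` are hypothetical helper names, and the sketch only sees static imports, so the dynamically loaded modules mentioned above are still missed.

```python
import ast
import importlib.util


def imported_modules(source):
    """Return the top-level module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports, which are not installable packages.
            names.add(node.module.split(".")[0])
    return names


def missing_modules(source):
    """Return imported top-level modules that cannot be found locally."""
    return {name for name in imported_modules(source)
            if importlib.util.find_spec(name) is None}
```

One catch: the import name is not always the name you install under, so an installer would still need a trusted mapping from module name to package name.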
GitHub allows transferring of repos to another "namespace" (username), and will even forward requests from the old one to the new one for a while (how long I'm not sure...)
Thinking about it a bit more that kind of "mutability" might not be the best idea in a package manager...
Still, I think the namespaces can help more than they hurt if the platform is designed with them in mind, as even "namespace-less" systems still suffer from some of those issues like wanting to rename a package or split it up into multiple smaller packages.
For a while I bumped into projects that tried to follow the old Linux model of even/odd version numbers for telegraphing API stability. Long term support and backported security enhancements are another case where maybe the guys working on new functionality are exactly the wrong people to take responsibility.
There could also be some other cool tricks you could apply (This is the first time you are installing a package from "Maintainer", would you like to continue?)
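A minimal Python sketch of that first-time prompt, as a trust-on-first-use list kept per machine. The JSON trust file and the function names are assumptions for illustration; no package manager I know of stores it this way.

```python
import json
from pathlib import Path


def _load_trusted(trust_file):
    """Read the set of maintainers already accepted on this machine."""
    path = Path(trust_file)
    return set(json.loads(path.read_text())) if path.exists() else set()


def needs_confirmation(maintainer, trust_file):
    """True if nothing from `maintainer` has been accepted here before."""
    return maintainer not in _load_trusted(trust_file)


def remember_maintainer(maintainer, trust_file):
    """Record that the user accepted a package from `maintainer`."""
    trusted = _load_trusted(trust_file)
    trusted.add(maintainer)
    Path(trust_file).write_text(json.dumps(sorted(trusted)))
```

The installer would call `needs_confirmation` before running anything, and only call `remember_maintainer` after the user says yes.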
The maintainer-level confirmation could be of slight assistance to advanced users, but it's no panacea.
For example, on the Python Package Index five people have authorization to publish a new Django release. Creating a "Django" org namespace wouldn't help, since someone could typo the org name and hit a squatted malicious version (and that's almost certainly what it would end up being; our GitHub org is named "django").
It's pretty mind blowing how big of a blind spot package installers are. I guess running everything inside, e.g., a Docker container/VM would be a partial interim solution for the paranoid?
It's a bit better - there is only one possible source of compromise rather than everyone on the network path. Given that npm/pip likely keep archives of all packages uploaded, it would be much harder (perhaps impossible) to attack someone secretly this way, at least in the long term.
Good package managers require signing of uploads (e.g. Maven Central requires every package to have a GPG signature; Debian goes further, and requires your key to be signed by an existing member of the organization). If the client checks the signatures you end up with a system that's perhaps actually secure.
A signed package doesn't really tell you that much. In the best case scenario it tells you the package you're installing in fact came from developer X and contains code Y (which you kinda already know since you have the source code). This works as long as you know and trust developer X, or did your due diligence reading through the code (which you can already do today).
I can't think of an end solution that wouldn't have to rely on network effects and social proof, which strikes me as rather fragile. Maybe formal verification and AI can help, but that's a long way off (?)
I'm curious to hear your opinion about a combination of digital signing with e.g. keybase/blockchain + reputation system, a sandboxed development environment (mitigates the "short con" risk) and a sandboxed production environment, with the minimum set of permissions required to operate (as well as auditing of course).
Call me pessimistic but I don't see developers taking on the extra friction given the status quo. Though a major data breach or two might change things, as I'm sure we'll find out sooner or later.
It does raise the barrier to entry, but it would prevent typosquatting and regular namesquatting.
EDIT: Does any major package manager provide a "did you mean" functionality, offering a list of actual package names similar to what you typed?
and then also have perfect memory of all packages and notice that a new package is too (for some value of "too") similarly named to some already existing one... even if e.g. both are a correct dictionary word.
Which Debian has, because submitting a new package is a much more involved process than sudo apt-get publish.
I used the Ruby code at the beginning of http://stackoverflow.com/questions/16323571/measure-the-dist... to calculate the distance between the package names at page 60 of the thesis and their typos.
The maximum is 2.
I checked some similar package names from a Gemfile.lock of a project of mine. Unfortunately the two gems hike and hirb are also at distance 2. Probably many short names are close with this metric.
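For anyone who wants to repeat this without the Ruby snippet, here is a Python sketch of plain Levenshtein distance. I'm assuming the linked answer computes the standard insert/delete/substitute variant, in which swapping two adjacent letters costs 2.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("hike", "hirb"))          # 2
print(levenshtein("reqeusts", "requests"))  # 2
```

With four-letter names, two substitutions already change half the word, which is why short names collide so easily under this metric.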
A combination of the two approaches could be ok: knowing that a name was blacklisted should be an indicator that it's not a good name, regardless of its distance from any other name, plus an approval of the maintainers for distance 2.
But a blacklist could generate another type of squatting, with people trying to pre-blacklist perfectly legit names. Only one thing is sure: there is more work to do for the maintainers and this extra friction is not good.
Edit: the distance suffers from the same problem.
I see what you did.
Also, it doesn't point out that the bigger threat is that this is wormable.
The acknowledgements mention that two of the university advisers and a PyPI admin consented to the "notification program".
Still, people with good intentions have been prosecuted and convicted for less. I would be very concerned for this student.
That is a crime under the CFAA in the USA. Not sure what it is in Germany/EU.
The prosecuting attorney is going to tell a jury of twelve of your non-technical "peers" that it is hacking.
Your client can either go to trial for seventeen thousand two hundred and eighty nine counts of felony hacking, and risk half a million years in prison, or they can plea bargain to 5 years in prison and a felony on their record.
Or your client can hang himself, but I'm pretty sure a federal prosecutor counts that as a win too.
Anyway, this is all part of why I always try to build inside a container, or at least in a virtualenv where I don't need to sudo the install.
>17000 computers were forced to execute [unauthorized] arbitrary code
Certainly a crime in the US, not sure about Germany.
Nice execution though!
What packages do this?
I was thinking that a simple way this would be illegal in the US would be
"[accessing] a computer without authorization or exceeds authorized access, and thereby obtains information from any protected computer"
See a2C here: https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act#C...
I'd assume you can make a decent case that the person only authorized the installation of a piece of software, not the gathering of identifying information.
IP addresses can be used as identifying information especially when paired with a timestamp.
Being an American citizen living in the US I would not want my name on this paper.
If I intentionally leave an infected USB drive on the ground, and someone picks it up and sticks it into their computer, am I liable?
Seems like it could go either way.
The two solutions here are user-local packages (pip --user, for example) and virtual environments.
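A small Python sketch of a guard a build script could run before installing anything, to check that it is in one of those two situations. The function names are hypothetical, and the venv test relies on `sys.base_prefix`, which the standard `venv` module sets.

```python
import os
import sys


def in_virtualenv():
    """True when the interpreter is running inside a venv-style environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def running_as_root():
    """True when the process has root privileges (False where geteuid is missing)."""
    return hasattr(os, "geteuid") and os.geteuid() == 0


def avoids_system_wide_install():
    """True if an install here would not need to touch system-wide site-packages as root."""
    return in_virtualenv() or not running_as_root()
```

This limits the damage to what your own user can do; it does nothing about malicious code reading your files.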
I wonder what kind of steps we can take to prevent this risk.
That way authors can continue to use any name they want, and the emphasis is on letting installers know that they might be installing the wrong package.
That'll be fun to automate around in puppet or ansible.
Now that there's a strategy for finding fakers:
1) You have an attacker-defender arms race. The attacker will always be one step ahead of the defender.
2) You have the extra burden of keeping up in this race, otherwise your security feature is a facade. At best, this is useless. At worst, it lulls your users into a false sense of security.
As the attacker, my next strategy is to create a bunch of agents (<10K should be enough) to download my typo packages.
Your move, defender ;)
But seriously, my point has less to do with the particular tactics of the adversaries and more to do with how the proposed strategy of automatically detecting potential typos invites gaming.
One's dev environment should be treated as a place where remote code execution is a high probability, and we need better tools to partition that from high value data.
I think most languages these days are a bit smarter and avoid this beginner mistake (for various reasons).
The only way this is 'solved' is if some third party authority hands out top level names and refuses to register names that are similar to other names for some definition of similar. The number of levels between top level and package name is irrelevant.
There's another solution (as Debian does): auditing what the package itself does, so that you don't allow malicious code into the repository.
While attacking a single package would be possible, covering any interesting amount of "typo"-space would require registering huge amounts of namespaces.
If package manager developers are smart, the allocation of namespaces is also handled externally and associated with some cost (e.g. domain names).
Therefore these kinds of attacks become impractical.
Package managers like these approach social networks, which has many advantages but carries the disadvantage of opening users to attacks that resemble social network phishing attacks. We could mitigate this by rolling back to package managers with higher barriers to entry, but I think that is not likely to happen.
You clearly would prefer to use a more adjudicated, managed package manager, with a higher barrier to publish and stronger rules about naming. That's a reasonable thing to want, but it would be better if you didn't act like people who want something which conflicts with that goal are stupid.
That's something that can be flagged for manual review before it gets too far.
But if you are targeting a package `someuser/popularpackage`, can you not just register your own malicious `popularpackage` under a typo namespace like `smoeuser`?
They can see someone registering popular package names under something with a similar namespace and can flag them for manual review (which can be done for namespace-less packages, but there will be much more noise), they can apply things like "This is the first time you are installing a package from 'smoeuser' would you like to continue?", or even require adding a specific namespace "out of band" depending on how paranoid it wants to be.
> "This is the first time you are installing a package
> from 'smoeuser' would you like to continue?"
And unless the account name of the package maintainers is brought front-and-center, you aren't necessarily going to know it shouldn't be different until it's too late.
I. I've installed anything from the author "foo" before
   on this machine, implying that I trust "foo".
   A. On a system with namespaced packages, I attempt to
      install "fpp/bar". I've never installed anything
      from the author "fpp" before, so I get a prompt.
   B. On a system without namespaced packages, I attempt
      to install "bsr".
      1. If "bsr" is by an author I trust, then it will be
         installed. This will be confusing, but is not a
         security vulnerability, because this author is
         already running code on my machines.
      2. If "bsr" is by an author I don't trust, then I get
         a prompt, as in scenario I.A.
II. I've never installed anything from the author "foo"
    before on this machine.
    A. On a system with namespaced packages, I attempt to
       install "fpp/bar". The system prompts me, as in
       scenario I.A., but because I expect this prompt I
       don't bother reading it and blindly accept it. The
       prompt does reiterate the name of the author, but if
       I didn't catch the typo the first time, there's
       little chance I'll catch the typo this time.
       Remember: the value of the prompt is not the
       reiteration of the name, it's in its unexpected
       nature, because research has repeatedly shown
       that users, even power users, do not bother
       reading routine prompts (this is why, e.g., Chrome
       no longer allows users to bypass the enormously
       scary warning page that appears when a secure site
       has a certificate error). My system gets owned.
    B. On a system without namespaced packages, I attempt
       to install "bsr". The system prompts me, as in
       scenario I.A., but because I expect this prompt I
       don't bother reading it and blindly accept it. My
       system gets owned.
* Perhaps many of those are automated build systems, which would also explain the high number of systems with admin access (for example, if you use travis without docker, every build runs in a clean vm with admin access).
* People download one package and install it multiple times? Seems unlikely
Any other ideas?
sudo pip install lumpy (instead of numpy)
Ran it again because it 'didn't work'
But even this just sweeps the problem under the carpet. You could still, for example, have a requests package which just installs the request package, works as expected, and just sends the request/response to your own server from time to time, i.e. only when HTTP basic auth is used.
You can also make this the default, with npm config set ignore-scripts true (and then --ignore-scripts false at install time if you wish to run them).
Consider the following:
requests - a python package for making HTTP requests.
requestr - a python package for a fictional startup that allows you to send requests to your nearest and dearest.
Given they both could be typos of each other:
1) How do we determine which one to use? What if someone accidentally also tries "requestd", somewhere between the two?
2) How do we apply the principle of least surprise - I asked to install requests, and everything installed just fine, but now I can't import it?!
$ pip install requestr
Package "requestr": did you mean "requests"? [Y/n]
(reason for this warning: similar spelling and requests is much more popular)
Pass --no-spell-warnings to disable this feature.
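The mock-up above could be driven by something like this Python sketch, which combines spelling similarity from the standard `difflib` module with a popularity gap. `did_you_mean`, the 0.8 similarity cutoff and the 100x popularity factor are all hypothetical; pip has no such feature as far as I know.

```python
import difflib


def did_you_mean(requested, downloads, min_ratio=0.8, popularity_factor=100):
    """Suggest a similarly spelled, much more popular package, or None.

    `downloads` maps package name to download count.
    """
    requested_count = downloads.get(requested, 0)
    matches = difflib.get_close_matches(
        requested, list(downloads), n=5, cutoff=min_ratio)
    candidates = [
        name for name in matches
        if name != requested
        and downloads[name] >= popularity_factor * max(requested_count, 1)
    ]
    # Prefer the most downloaded of the look-alikes.
    return max(candidates, key=downloads.get, default=None)
```

Because the suggestion only fires in the direction of the far more popular name, installing "requests" would never prompt about "requestr", which answers the "which one to use" question for the common case.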
https://rubygems.org/gems/bundle Total downloads 1,800,600
Source (empty) at https://github.com/will/bundle and interesting README.
https://rubygems.org/gems/bundler Total downloads 92,116,090
It's almost 2%.
In Python, "pytables" (should be "tables") and "skimage" (should be "scikit-image") come to mind.
Error: Cannot find module 'x'
npm install x