
Typosquatting programming language package managers - xrstf
http://incolumitas.com/2016/06/08/typosquatting-package-managers/
======
wbond
We've gotten flak from package developers submitting new packages to Package
Control [0] because all additions to the default channel are hand reviewed.
Part of this process is to prevent accidentally close package names, to try
and encourage collaboration and to encourage developers to actually explain
what their package does and how to use it.

My hope is to be automating a large amount of the review in the next few
months, however I think this is a good argument for never having it be fully
automatic. Having a human sanity check submissions isn't a terrible idea if we
can keep the workload down.

Certainly this doesn't prevent a malicious author from posting a legitimate
package and then changing the contents to be malicious, but that can be
somewhat solved by turning off automatic updates.

[0] [https://packagecontrol.io](https://packagecontrol.io)

~~~
SCdF
Hey Will,

Thanks for keeping Package Control high quality, I know it's highly
appreciated :-)

~~~
notduncansmith
Another grateful Package Control user here.

------
Mahn
> In the thesis itself, several powerful methods to defend against typo
> squatting attacks are discussed. Therefore they are not included in this
> blog post.

[http://incolumitas.com/data/thesis.pdf](http://incolumitas.com/data/thesis.pdf)
section 5 "Practical implications". Just wanted to point out that in case you
skipped it it's worth a read, some interesting proposals there that are worth
discussing with package manager maintainers.

I particularly like the preemptive approach of auto-blacklisting common typos
by simply monitoring the number of times a specific nonexistent package is
requested over time (5.10). So if a lot of people regularly attempt to install
the nonexistent package "reqeusts", it could signal that it's a common typo and
should be blacklisted to prevent malicious use in the future. False positives
could always be sorted out manually by communicating with the package manager
maintainers.
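
A minimal Python sketch of that monitoring idea, not the thesis's actual design. The class name `MissTracker`, the threshold of 3, and the sample names are all made up for illustration, and it counts misses forever rather than over a time window.

```python
from collections import Counter

# Hypothetical threshold: how many failed lookups make a name "a common typo".
BLACKLIST_THRESHOLD = 3


class MissTracker:
    """Counts install requests for names that do not exist in the index."""

    def __init__(self, existing_names, threshold=BLACKLIST_THRESHOLD):
        self.existing = set(existing_names)
        self.threshold = threshold
        self.misses = Counter()
        self.blacklist = set()

    def record_request(self, name):
        """Return True if the package exists; otherwise count the miss."""
        if name in self.existing:
            return True
        self.misses[name] += 1
        if self.misses[name] >= self.threshold:
            self.blacklist.add(name)
        return False

    def can_register(self, name):
        """Blacklisted names are refused until a maintainer reviews them."""
        return name not in self.existing and name not in self.blacklist


tracker = MissTracker({"requests"})
for _ in range(3):
    tracker.record_request("reqeusts")
tracker.record_request("brand-new-lib")

print(tracker.can_register("reqeusts"))       # False
print(tracker.can_register("brand-new-lib"))  # True
```

A real index would presumably expire old counts and let maintainers whitelist false positives by hand.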

~~~
nailer
You'd Bayesian that.

\- The package name is something a lot of people regularly attempt to install,
but it doesn't exist (per above)

\- The package name is 1-2 chars off from the name of another package which
has more than X downloads

\- The package is frequently installed then uninstalled in a short time

------
zeveb
Reminds me of the quote, 'there are only two hard things in computer science:
naming things, cache invalidation and off-by-one errors.'

I think that this clearly falls under the heading 'naming issue.' People know
_what_ they want, but do not enter it properly.

I can't think of a 100% solution off-hand, which isn't surprising, because
it's a hard problem.

pmontra's suggestion to use typo blacklisting ain't a bad idea. Maybe some
sort of reputation-per-name could help?

~~~
blowski
Banks have a similar problem when people write cheques or set up standing
orders. You have to put a name and the account number.

I wonder if you could do something similar here - enter the name of the
package and a code of some sort. I haven't thought this through in a lot of
detail.

~~~
Klathmon
Or just refer to packages by 2 names.

    
    
        Maintainer/PackageName
    

It solves so many problems, this included.

~~~
kbenson
This is all _half_ of a much larger problem, which is package identification.
Perl 6 specced out[1] quite a bit of a future system to handle a lot of this,
and I believe a lot of it is now implemented. A few things you need to
consider:

\- Maintainership can change over time.

\- Multiple people may trade off releasing a package, but it's still the same
package.

\- There may be multiple repos (consider you may want to run a local company
repo for non-redistributable modules).

I imagine in the end, one of the better approaches to the installation name
typo problem might be to scan the code for what packages are required
(utilizing as much specific information as possible), and either confirm that
each exists as a local package that can be installed or offer to install it.
Package installers should be able to take a source file or files, and install
modules listed within. This won't solve all cases (dynamically determined and
loaded modules may be a problem still), but it will solve quite a bit of them.

1:
[http://design.perl6.org/S11.html#Versioning](http://design.perl6.org/S11.html#Versioning)

~~~
Klathmon
Those are some good points, and I guess in my head I'm thinking of how Github
does repos on their site as my "example".

Github allows transferring of repos to another "namespace" (username), and
will even forward requests from the old one to the new one for a while (how
long I'm not sure...)

Thinking about it a bit more that kind of "mutability" might not be the best
idea in a package manager...

Still, I think the namespaces can help more than they hurt if the platform is
designed with them in mind, as even "namespace-less" systems still suffer from
some of those issues like wanting to rename a package or split it up into
multiple smaller packages.

~~~
kbenson
I'm not arguing for no namespaces, much the opposite. I'm arguing that the
whole way most languages implement modules is fairly haphazard, and that that
leads to this problem. If you review the link I included previously, you can
see some examples of how you could definitively specify a particular module
version. E.g.

    
    
        use OldDog:name<Dog>:auth<cpan:JRANDOM>:ver<1.2.1>;
    

This would use Dog from the CPAN repository, author JRANDOM, and version
1.2.1, and namespace it as OldDog. You could also just "use Dog;" to use the
canonical Dog package from the canonical sources (in order). If we could just
point our package manager at this source code and it could determine "Hmm, you
have a Dog module of that version, but not that author and repo, and you have
a Dog module from that repo and author but not that version. Looks like we
need to install it." that would leave us in a much better place, both for code
using definitive versions of packages, and admins/programmers installing
packages and making sure they get the right one, if it's been defined.

------
szx
When you think about it, how different is the destructive potential of an
npm/pip install from curl | bash that (some) people tend to froth at the mouth
about?

It's pretty mind blowing how big of a blindspot package installers are. I
guess running everything inside e.g. a Docker container/VM would be a partial
interim solution for the paranoid?

~~~
lmm
> When you think about it, how different is the destructive potential of an
> npm/pip install from curl | bash that (some) people tend to froth at the
> mouth about?

It's a bit better - there is only one possible source of compromise rather
than everyone on the network path. Given that npm/pip likely keep archives of
all packages uploaded, it would be much harder (perhaps impossible) to attack
someone secretly this way, at least in the long term.

Good package managers require signing of uploads (e.g. maven central requires
every package to have a GPG signature; Debian goes further, and requires your
key to be signed by an existing member of the organization). If the client
checks the signatures you end up with a system that's perhaps actually secure.

~~~
szx
Signing is definitely part of the answer but there's still the question of
trust.

A signed package doesn't really tell you that much. In the best case scenario
it tells you the package you're installing in fact came from developer X and
contains code Y (which you kinda already know since you have the source code).
This works as long as you know and trust developer X, or did your due
diligence reading through the code (which you can already do today).

I can't think of an end solution that wouldn't have to rely on network effects
and social proof, which strikes me as rather fragile. Maybe formal
verification and AI can help, but that's a long way off (?)

------
eudox
I'm a fan of the approach of personally submitting projects to the repository
maintainer (e.g. through GitHub issues), and having the maintainer personally
approve them.

It does raise the barrier to entry, but it would prevent typosquatting and
regular namesquatting.

EDIT: Does any major package manager provide a "did you mean" functionality,
offering a list of actual package names similar to what you typed?

~~~
philjackson
That's a massive burden on the poor person who has to ok the package -
especially at NPM's scale, for example.

~~~
seldo
We believe npm's scale is a direct result of having the lowest ceremony to
publish a package. Turning the dial in the direction we did has pros and cons.

------
baby
After watching this awesome Defcon talk
[https://www.youtube.com/watch?v=YqxaKGA9Lnc](https://www.youtube.com/watch?v=YqxaKGA9Lnc)
I wondered if there were any use cases for bit/typo squatting in crypto. This
is a pretty cool one! Not crypto but interesting nonetheless :)

------
pmontra
Probably the maintainers of the package managers know which typos their users
make, because of the 404s in the logs or equivalent errors. A preventive action
could be starting to blacklist any name resolving to 404. If somebody
eventually tries to upload a package in the blacklist, a maintainer should
check the code and whitelist the name. Obviously people can be very creative
with typos and with squatting and there is no real protection against
mistakes.

~~~
utexaspunk
Might it work to mandate that the name of an uploaded package have a minimum
Levenshtein distance (or similar calculation) from the names of all the
existing packages? Then you wouldn't have to worry about maintaining a
blacklist.
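
A rough Python sketch of such a check, using plain Levenshtein distance. The function names and the minimum distance of 3 are arbitrary choices for illustration; note that a plain transposition like "reqeusts" counts as distance 2, and that short names end up very close to each other.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def too_close(new_name, existing_names, min_distance=3):
    """Return the existing names closer than min_distance to new_name."""
    return [n for n in existing_names if levenshtein(new_name, n) < min_distance]


existing = ["requests", "numpy"]
print(too_close("reqeusts", existing))  # ['requests']
print(too_close("flask", existing))     # []
print(levenshtein("libm", "libc"))      # 1
```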

~~~
wycats
That would mean that, for example on crates.io, you couldn't create a `libm`,
because `libc` is already very popular. I don't think that works.

~~~
utexaspunk
True, Levenshtein isn't the best algorithm for the purpose. Is there an
algorithm that takes key proximity into account? Like, 'libm' and 'libc' are
sufficiently different to preclude typos, but 'lib[n/j/k]' or 'lib[x/d/f/v]'
are not?
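
One way this could look in Python: an edit distance where substituting a physically neighbouring key is cheaper than any other substitution. This is only a sketch. It assumes a US QWERTY layout with a typical row stagger, ignores digits and punctuation, and the 0.5 cost and 1.5 key-width radius are invented numbers.

```python
import math

# Assumption: US QWERTY letter rows with the usual horizontal stagger.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]
KEY_POS = {ch: (col + offset, row)
           for row, (keys, offset) in enumerate(ROWS)
           for col, ch in enumerate(keys)}


def adjacent(a, b):
    """True if two letter keys are physical neighbours on the assumed layout."""
    if a == b or a not in KEY_POS or b not in KEY_POS:
        return False
    return math.dist(KEY_POS[a], KEY_POS[b]) < 1.5


def substitution_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if adjacent(a, b) else 1.0


def typo_distance(s, t):
    """Edit distance where substituting a neighbouring key costs 0.5, not 1."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        cur = [float(i)]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1.0,          # delete cs
                           cur[j - 1] + 1.0,       # insert ct
                           prev[j - 1] + substitution_cost(cs, ct)))
        prev = cur
    return prev[-1]


print(typo_distance("libc", "libm"))  # 1.0
print(typo_distance("libc", "libx"))  # 0.5
```

A registry could then reject new names whose `typo_distance` to a popular package falls below 1.0, which would let `libm` through but flag `libx`.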

~~~
sqeaky
Key proximity on which of the hundreds of keyboard layouts?

~~~
utexaspunk
Good question... I'd imagine your standard QWERTY makes up a sizeable majority
of programmers, but then I have no data to back that up... :)

------
Mizza
This seems like pretty unethical research to me.

Also, doesn't point out that the bigger threat is that this is wormable.

~~~
throwawaysocks
There was no actual intrusion, so this feels like fair game to me. Especially
since mitigating a very possible attack vector is a direct result of running
the experiment. Still, hopefully the researchers got an IRB to sign off on the
experiment setup...

~~~
placeybordeaux
The research got computers to execute code on them without authorization and
extracted information from them.

That is a crime under the CFAA in the USA. Not sure what it is in Germany/EU.

~~~
billyhoffman
"Your honor, my client created and published a software library. The so-called
victims here wrote code that specifically referenced my client's software
library, by name mind you. My client in no way compelled or solicited the
victims to do so. Now how can that be called 'without authorization?'"

~~~
random28345
> "Your honor, my client created and published a software library. The so-
> called victims here wrote code that specifically referenced my client's
> software library, by name mind you. My client in no way compelled or
> solicited the victims to do so. Now how can that be called 'without
> authorization?'"

The prosecuting attorney is going to tell a jury of twelve of your non-
technical "peers" that it is hacking.

Your client can either go to trial for seventeen thousand two hundred and
eighty nine counts of felony hacking, and risk half a million years in prison,
or he can plea bargain to 5 years in prison and a felony on his record.

Or your client can hang himself, but I'm pretty sure a federal prosecutor
counts that as a win too.

~~~
cplease
Yup. I'm surprised an advisor would sign off on this thesis.

------
PeterisP
Part of the problem is the many packages that require sudo permissions to
install - IMHO that should be an exceptional case, but it isn't.

~~~
nneonneo
Packages often require sudo in order to install to the global interpreter -
it's a security hazard otherwise. Imagine a Python package which overrides the
sys module. If it didn't require sudo, anyone could install it and compromise
Python for everyone else (or, for instance, compromise setuid programs).

The two solutions here are user-local packages (pip --user, for example) and
virtual environments.

------
cormacrelf
And 'npmjs.org' is misspelled as 'npmsjs.org' in the introduction. Nice.

------
nichochar
Wow, this is a very good study and explanation of what typo squatting is, and
I really liked how he proved its effectiveness.

I wonder what kind of steps we can take to prevent this risk.

~~~
trungaczne
I think we will have to rely on crypto hash in some form. Similar to download
checksum. It won't be convenient, but it will be safe(r).

~~~
bpicolo
That doesn't really save you from typos

~~~
trungaczne
I was thinking something along the line of a mandatory hash/checksum along
with the name of the software you are trying to install from a package
manager. It does not have to be very long, just enough to avoid common
collisions.
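
To make that concrete, here is a small Python sketch of one possible scheme: the user types `name#code`, where the code is a short tag derived from the name itself, so a mistyped name no longer matches its code. The `name#code` syntax, the 8-character length, and both function names are invented for this example; no existing package manager is being described.

```python
import hashlib


def check_code(name, length=8):
    """Short hex tag derived from the package name (hypothetical scheme)."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:length]


def parse_spec(spec):
    """Split 'name#code' and return the name only if the code matches it."""
    name, sep, code = spec.partition("#")
    if not sep:
        raise ValueError("expected 'name#code'")
    if code != check_code(name):
        raise ValueError(f"check code does not match package name {name!r}")
    return name


good = "requests#" + check_code("requests")
typo = "reqeusts#" + check_code("requests")  # right code, mistyped name

print(parse_spec(good))  # requests
try:
    parse_spec(typo)
except ValueError as err:
    print("rejected:", err)
```

A tag derived from the publisher's key or the release contents would bind more than just the spelling, at the cost of changing between releases.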

------
ysavir
Instead of blacklisting, why not respond with a "You requested package ABD,
but we think you might mean package ABC. Enter 'yes' to continue or anything
else to start over."
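
A quick Python sketch of that prompt, using the standard library's `difflib.get_close_matches` for the spelling comparison. The download numbers, the 0.8 similarity cutoff, and the "100 times more popular" rule are placeholders, not anything a real package index does.

```python
import difflib

# Hypothetical download counts standing in for real index statistics.
DOWNLOADS = {"requests": 1_000_000, "requestr": 120, "flask": 500_000}


def suggest(requested, downloads=DOWNLOADS, cutoff=0.8, popularity_ratio=100):
    """Return a much more popular, similarly spelled name, or None."""
    others = [n for n in downloads if n != requested]
    candidates = difflib.get_close_matches(requested, others, n=3, cutoff=cutoff)
    own = downloads.get(requested, 0)
    for name in candidates:
        if downloads[name] >= popularity_ratio * max(own, 1):
            return name
    return None


def confirm_install(requested, ask=input):
    """Return the name to install, or None if the user backs out."""
    better = suggest(requested)
    if better is None:
        return requested  # nothing suspicious, no prompt needed
    answer = ask(f"You requested package {requested}, but we think you might "
                 f"mean package {better}. Enter 'yes' to continue: ")
    return requested if answer.strip().lower() == "yes" else None


print(suggest("requestr"))  # requests
print(suggest("requests"))  # None
```

Passing `ask` as a parameter keeps the prompt testable and gives automation a place to plug in a fixed answer.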

That way authors can continue to use any name they want, and the emphasis is
on letting installers know that they might be installing the wrong package.

~~~
VLM
"You requested package ABD, but we think you might mean package ABC. Enter
'yes' to continue or anything else to start over."

That'll be fun to automate around in puppet or ansible.

~~~
voltagex_
I hope you're using a local package cache for puppet or ansible or even
specifying via hash (think git commit)

------
zmanian
We need operating system vendors to give us a mechanism for easily creating
and managing sandboxed dev environments.

One's dev environment should be a place where remote code execution is a high
probability and we need better tools to partition that from high value data.

------
airless_bar
This only seems to be an issue for languages where packages reside in a global
namespace, like Python, Rust etc.

I think most languages these days are a bit smarter and avoid this beginner
mistake (for various reasons).

~~~
tatterdemalion
This is obviously not true. If `serde` resided at `erickt/serde` (as the
counterproposal for Rust would've had it), I could create `erict/serde` or
`erick-t/serde` or any other variations of erickt's handle.

The only way this is 'solved' is if some third party authority hands out top
level names and refuses to register names that are similar to other names for
some definition of similar. The number of levels between top level and package
name is irrelevant.

~~~
zardeh
Well, you could also solve it by saying that the post slash names are unique.
ie. There can't exist zardeh/serde if erickt/serde already exists. Then the
author-name works as a logical checksum, and you aren't any worse off than you
were with a global namespace.
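
Roughly, in Python (a toy registry with made-up class and method names, just to show the rule): the part after the slash is globally unique, and the author part only has to match on lookup.

```python
class Registry:
    """Toy index where the part after the slash must be globally unique."""

    def __init__(self):
        self.owner_of = {}  # package name -> author handle

    def register(self, author, package):
        current = self.owner_of.get(package)
        if current is not None and current != author:
            raise ValueError(f"{package!r} already belongs to {current!r}")
        self.owner_of[package] = author

    def resolve(self, spec):
        """Accept 'author/package' only if the author matches the owner."""
        author, _, package = spec.partition("/")
        if self.owner_of.get(package) != author:
            raise LookupError(f"no package {spec!r}")
        return spec


registry = Registry()
registry.register("erickt", "serde")
print(registry.resolve("erickt/serde"))  # erickt/serde

try:
    registry.register("erict", "serde")  # squatting on a similar handle
except ValueError as err:
    print("refused:", err)

try:
    registry.resolve("erict/serde")      # typo in the author part
except LookupError as err:
    print("not found:", err)
```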

~~~
kibwen
The purpose of a namespace is to make it possible to disambiguate two
otherwise identical identifiers. If you force package names to be unique
across all namespaces, then you don't have namespaces at all, you just have a
single global namespace where you're forced to prepend an author name to the
package name.

~~~
zardeh
I know, I wasn't suggesting this as a namespacing solution, but instead a
typo-prevention one.

------
bennofs
Did anyone else find it surprising that the number of total requests (45334) is
so much higher than the number of _unique_ total requests (17289)? It is more
than twice the number of unique requests!

Possible explanations:

* Perhaps many of those are automated build systems, which would also explain the high number of systems with admin access (for example, if you use travis without docker, every build runs in a clean vm with admin access).

* People download one package and install it multiple times? Seems unlikely

Any other ideas?

~~~
Guillaume86
I think he forgot to define a baseline (could be wrong, I didn't read the
paper). He should have generated a few packages with a completely innocent
name (and maybe some packages with just a GUID as a name) to see how many
downloads / installs they get too.

------
mirekrusin
with npm there should be at least an option which prompts for Y/N/A when
package has preinstall hook.

but even this just tries to put the problem under the carpet. you could still for
example have requests package which just installs request package, works as
expected, just sends request/response to your own server from time to time.
ie. when there's http basic auth used only.

~~~
seldo
It is possible to disable install hooks at install time by running npm install
with --ignore-scripts.

You can also make this the default, with npm config set ignore-scripts true
(and then --ignore-scripts false at install time if you wish to run them).

------
mbroshi
Maybe this is overly naive, but when I make a typo in the Google search bar,
it doesn't even search for my typo-ed term (even if it would have gotten some
hits), it searches for what I actually meant to type. Can't package managers
have a similar feature?

~~~
abstractbeliefs
The main problem is when you really did mean to search for the typo term.
There's no inherent problem in two packages having similar names.

Consider the following:

requests - a Python package for making HTTP requests.

requestr - a Python package for a fictional startup that allows you to send
requests to your nearest and dearest.

Given they both could be typos of each other:

1) How do we determine which one to use? What if someone accidentally also
tries "requestd", somewhere between the two ?

2) How do we apply the principle of least surprise - I asked to install
requests, and everything installed just fine, but now I can't import it?!

~~~
ekimekim

        $ pip install requestr
    
        Package "requestr": did you mean "requests"? [Y/n]
        (reason for this warning: similar spelling and requests is much more popular)
    
        Pass --no-spell-warnings to disable this feature.

------
ryanmarsh
So last week my client discovered there's a gem named bunlder... _sigh_

~~~
pmontra
There is a gem called bundle which doesn't do anything but prevent a
typosquat

[https://rubygems.org/gems/bundle](https://rubygems.org/gems/bundle) Total
downloads 1,800,600

Source (empty) at
[https://github.com/will/bundle](https://github.com/will/bundle) and
interesting README.

[https://rubygems.org/gems/bundler](https://rubygems.org/gems/bundler) Total
downloads 92,116,090

It's almost 2%.

~~~
rspeer
I think the authors here missed an opportunity for even more effective
squatting like that: cases where the name you import, name you type at the
command line, or name you commonly call the package by is different from the
name in the repository.

In Python, "pytables" (should be "tables") and "skimage" (should be "scikit-
image") come to mind.

~~~
nathancahill
Yeah. I think it's becoming a reflex for programmers when they get an import
error like:

    
    
        Error: Cannot find module 'x'
    

to quickly type:

    
    
        npm install x

------
jogjayr
I thank my stars every time I get a "Package not found" error due to a typo,
because I'm reminded that it could have been much worse.

------
jwilk
Trying to parse the title made my head hurt. It should be "Typosquatting
software package names" or something.

------
tbrock
The homebrew model where packages and changes to packages are reviewed takes
care of this problem quite nicely.

------
andrewstuart
Ouch. This really hurts. So hard to protect against human error.

------
sheerun
Glad to hear bower is stated to be safe in this regard :)

------
irremediable
Really cool applied research. If I get the time, I'll check out the author's
thesis.

------
optimuspaul
I'm confused.. is it 17 computers or 17000 computers? inconsistent use of
decimals in this article.

~~~
cialowicz
17000\. In Europe a common decimal format is #.###,##. See here:
[https://en.wikipedia.org/wiki/Decimal_mark#Examples_of_use](https://en.wikipedia.org/wiki/Decimal_mark#Examples_of_use)

