

Why Package Signing is not the Holy Grail - donaldstufft
https://caremad.io/blog/packaging-signing-not-holy-grail/

======
ChuckMcM
I think this is a good introduction to some of the depth of the problem. As
with all big problems, you start by breaking them down into what you can and
cannot do.

So in the field of package signing, you can, if you choose to, nominate a
single signature authority. This is basically what Microsoft does: they are
the signature authority for all keys that sign things that go into Microsoft
products. When you choose a signature authority, that choice becomes the first
lemma in your calculus: "I trust <foo>."[1] Once you have that lemma you can
build on it with statements like, "I trust Microsoft, and Microsoft trusts
Dell, so I will trust that the key Microsoft says is Dell's really is Dell's."

Now you've created a transitive trust relationship, in that you not only trust
Microsoft's internal processes, but you trust that it will do a good job of
auditing Dell's before it gives its blessing to Dell.

Now Don makes a minor error when he says, "PyPI allows anyone to sign up and
make a release which makes verifying authors an unmanageable problem." The
issue is that PyPI doesn't create a durable relationship between people you
trust and someone you don't know.

Because of this you have to assume there are bad actors who are making
packages, and simply not install them. This isn't an "unmanageable problem";
it is an "inconvenience." (Not being snarky, trying to be precise.) You can
run a grocery store where everyone puts the money they owe into the cash
register before they leave: there are no checks on the cash register, and
everyone is trusted to do the right thing. The store going out of business
isn't a "problem"; it is the expected outcome. This comes up in policy debates
as well, where person A wants their free speech rights but also doesn't want
person B to make remarks they consider hateful (thus impinging on person B's
free speech rights). It's the 'incompatible constraints' problem.

There are a number of interesting zero-knowledge proof techniques in
cryptography: things which allow someone you know nothing about (zero
knowledge) to prove they are a particular person. Starting from there you can
create durable audit trails of actions (useful in digital cash systems, among
others). This also allows you to identify bad actors, sadly after the fact,
because you can make statements about the person who was identified and the
package they produced. You cannot say anything about motive (they may have
built their package on a compromised machine), but you can trace the action
back to where it entered the system and, with sufficient audit trails, back
it out of the system.
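The audit-trail idea can be sketched with a simple hash chain, where each
entry commits to everything before it. This is a toy Python illustration (not
any real system's format): tampering with an earlier entry breaks every later
link, which is what makes tracing an action back, and backing it out, possible.

```python
import hashlib
import json

def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous link together with this record's contents."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append(log: list, record: dict) -> None:
    prev = log[-1]["hash"] if log else "0" * 64
    log.append({"record": record, "hash": entry_hash(prev, record)})

def verify(log: list) -> bool:
    """Recompute every link; tampering anywhere upstream breaks the chain."""
    prev = "0" * 64
    for entry in log:
        if entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True

log = []
append(log, {"actor": "alice", "action": "upload", "package": "foo-1.0"})
append(log, {"actor": "bob", "action": "upload", "package": "bar-2.1"})
assert verify(log)

log[0]["record"]["actor"] = "mallory"  # rewrite history...
assert not verify(log)                 # ...and every later link fails
```

Real systems (digital cash ledgers, transparency logs) add signatures on top
of this structure, but the chaining alone is what gives the durable trail.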

As you can imagine these systems are challenging to get right, challenging to
build, and take an expertise that is generally highly valued so rarely
available "for free."

But working on challenging problems has its own rewards and so I recommend it
highly. It is always a workable approach to start with assumptions (lay them
out and keep them front and center) to gain an understanding of the more
challenging aspects, and then work from there. Since you aren't building life
support systems you may find that just good auditing trails is enough if they
allow you to back up to any previous state. Sometimes using a centralized
authority gives you enough to build a system around it. I highly recommend
Schneier's "Applied Cryptography" for a pretty approachable take on these
topics.

[1] People will attempt to derail this example with "but I don't trust
Microsoft!" or some such. Which is fine, pick an authority you do trust and
use that as a place holder.

~~~
dllthomas
> You can run a grocery store where everyone puts the money they owe into the
> cash register before they leave: there are no checks on the cash register,
> and everyone is trusted to do the right thing. The store going out of
> business isn't a "problem"; it is the expected outcome.

Kind of not the main point here, but I've been in a store that was left
unattended and there was a basket of money out for people to make their own
change. As far as I know it's still there...

~~~
ChuckMcM
The drug store was kind of like that in Endicott when I was a kid: when the
druggist was out for lunch, a sign asked you to put what you owed in the
basket next to the register. I asked my grandmother why people didn't just
take stuff out of the store and she said, "Oh no, everyone likes the druggist;
no one would want to inconvenience him like that." The community was small
enough that the trust metric was simply that we liked the guy behind the
counter.

It isn't like that today, and Endicott is quite a bit larger than it was then,
and people are sadly a lot less neighborly than they were when I was a kid
(could be nostalgia though).

I believe the trust issue was "resolved" by locking the store when the
pharmacist was out, and eventually raising prices so that they could pay the
salary of someone to be there to help.

And so it is with the PyPI community: everyone is trusting everyone not to do
anything bad. That is a perfectly reasonable strategy; you just have to
accept the risk that comes along with it. Sometime in the future something bad
_could_ happen if someone chooses to violate that trust.

Now in a classic vulnerability analysis you'd ask "What is the motive or
payoff?" You don't put a steel vault door on your home because chances are if
somebody wants into your house that badly they are going to come through the
window. You accept a certain amount of risk, perhaps you have a home safe for
really valuable stuff and insurance for the rest.

~~~
dllthomas
Indeed. As long as we're recommending Schneier's books, his most recent
("Liars & Outliers") would also seem quite relevant.

------
rlpb
Bootstrapping trust is hard.

Debian is a good example here. A requirement of being a Debian Maintainer or
Developer is that you must have met real existing Debian Developers in person
and they must have signed your key (after checking your identity)[1].

This certainly does make it harder to become a DM or DD. But it gives us the
Debian keyring, which is the distro trust implementation for Debian as
explained in the article.

However, despite the difficulty in bootstrapping developers, Debian have
achieved it. Thanks to them, there are many more people in the strong set[2],
and you can probably find one near you[3]. Can we use this to bootstrap other
communities, and end up in a situation where it's normal to already have been
"introduced" into the strong set if you're a software author in any project?

[1]: [http://www.debian.org/devel/join/nm-step2](http://www.debian.org/devel/join/nm-step2)
[2]: [http://pgp.cs.uu.nl/plot/](http://pgp.cs.uu.nl/plot/)
[3]: [http://wiki.debian.org/Keysigning/Offers](http://wiki.debian.org/Keysigning/Offers)

~~~
lifeisstillgood
That's an interesting idea - there could easily be a Python core keyring, and
each library / package could connect off that trust circle.

Interesting -

~~~
rlpb
The catch is that signing a key does not usually convey "trust", just
"identity". In Debian, trust is assigned at the end of a successful
application process to becoming a Debian Maintainer or Developer, and this
trust is conveyed for a particular identity with presence in the official
Debian keyring. So in Debian, trust is centralized.

However, it doesn't have to be exactly like this. I can imagine a
decentralized system built on the same tooling. For example: you could require
three certifications from existing members in the set, bootstrapped from
Guido, for inclusion in the trusted keychain. (You might additionally limit
the maximum degrees of separation from Guido to assist with auditability).
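That decentralized rule (a minimum number of certifications, plus a bounded
distance from a root key) boils down to a graph check. Here is a toy Python
model with hypothetical names; real tooling would walk actual GPG
certifications, and would also require that the certifiers are themselves
already members, which this sketch skips:

```python
from collections import deque

# who-signed-whom: certifications[key] = set of keys that certified it
certifications = {
    "guido": set(),
    "alice": {"guido", "carol", "dan"},
    "bob":   {"alice", "carol", "dan"},
    "eve":   {"bob"},  # only one certification
}

def distance_from(root: str) -> dict:
    """Breadth-first search over the certification graph, root outward."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        cur = queue.popleft()
        for key, signers in certifications.items():
            if cur in signers and key not in dist:
                dist[key] = dist[cur] + 1
                queue.append(key)
    return dist

def admitted(key: str, root: str = "guido",
             min_sigs: int = 3, max_depth: int = 3) -> bool:
    """Require min_sigs certifications and bounded separation from root."""
    if key == root:
        return True
    dist = distance_from(root)
    return (len(certifications.get(key, ())) >= min_sigs
            and dist.get(key, max_depth + 1) <= max_depth)

assert admitted("alice")     # three certifications, one hop from the root
assert admitted("bob")       # three certifications, two hops
assert not admitted("eve")   # too few certifications
```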

------
IgorPartola
The problem is how permissive PyPI is. I can upload a package called django2
with the description that this is the new generation of the Django framework
and run arbitrary code on a whole lot of machines of many unsuspecting
developers. The reason Linux distributions do not generally have this problem
is that there is an air gap between the upstream developer and the repository
in the form of a maintainer who reads the code and verifies that it is not
malicious. Incidentally, GitHub has the same problem as PyPI. With good SEO I
can create a repo called django and have it be the first thing that new
developers find.

One possible solution to this is to have a core group of maintainers verify
the most popular packages. I place my trust in the maintainer and the
maintainer signs a specific version of the package. This will not scale to
100% of the packages, but that is fine. As long as django, flask, psycopg2,
etc. are signed I can take responsibility for reviewing django-pretty-snarfs-
with-smiles myself. That is essentially our trust model with GitHub already:
you blindly download large official looking projects but read through the code
of small stuff (right?). Perhaps after a while some developers become trusted
and get to have their signing process fast-tracked.

At the end of the day if I could have one improvement it would be a big fat
warning that pip gives you that says "what you are about to install is likely
malicious code. Do not do it without verifying that it is not by manually
downloading it and reading the source."

BTW, simply running "pip install foo" will let foo's setup.py run arbitrary
code on your machine. It is such a giant security hole by design that I cringe
every time I have to use it.
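To make that concrete: pip has to execute the whole setup.py just to read a
package's metadata, so any module-level code runs immediately, as the
installing user. A hypothetical (and deliberately harmless) build script:

```python
# setup.py -- hypothetical build script for an "innocuous" package.
# pip executes this entire file before anything is "installed", so
# everything at module level runs with the installing user's privileges.
import getpass

print("this ran during `pip install`, as user:", getpass.getuser())
# a malicious author could read ~/.ssh, exfiltrate API tokens, etc. here

from setuptools import setup

# The legitimate-looking metadata comes last; by the time pip reaches
# it, the code above has already executed.
setup(name="innocuous-package", version="0.1")
```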

Edit: another potential model is for PyPI to punt the verification onto
outside entities, such as GitHub, Bitbucket, etc. Basically, instead of
building my packages on my (potentially compromised) machine, I would tag a
version of it on GitHub and instruct PyPI to build it. It is easy to establish
trust between PyPI and GitHub, so the guarantee you get with PyPI is "this
package came from exactly this source" where you can read the source before
installing it. Like I said above, GitHub also has a trust model problem but
this would reduce the amount of work one would need to do to verify a package
from two places to one.

~~~
pekk
Suppose that 'pip install foo' did not run a setup.py. Now eventually you are
going to run some code from foo. If you weren't, why are you installing it? At
that time, you could have 'arbitrary code' running on your machine.

~~~
IgorPartola
My point is that you don't even get a chance to review the source before it
runs. pip runs in trusted code automatically, without any verification
whatsoever. At least when you run 'git clone' that does not happen.

~~~
donaldstufft
The recently released pip 1.4 allows you to install from wheels, which do not
use setup.py and execute no code from the package during install.

~~~
IgorPartola
Yes, except I have seen very few packages that are not sdists. Another issue
is that unless the package maintainer explicitly disables it on PyPI, pip will
go searching over the package's home page for a newer version. This is just
broken, since even if PyPI were to implement some type of package
verification, example.com/~/devfoo/ might not. Often these pages do not even
have HTTPS set up.

As you said, this is a complex problem, but I think the first step is to take
care of the easy things. Let me know if you'd like to chat off HN about this.

------
zobzu
Nope. Package signing, in the ways the article describes, means that if you
trust the dev once, you will trust him for every subsequent package, no
matter where you get it from.

The conclusion is especially _bad_.

 _" we'll get a solution where the end user has the relationship with the
source of trust and not the package author."_

That's extremely dumb, on many levels. I'll go with the possible most
disturbing.

If you don't care about the author and only the source, the author just has to
put a backdoor in the code. DONE.

That's why you have to trust the author and NOT the source. The author is the
first creator of the code and thus the person with the most ability to sign
the code.

Package signing ensures the package has not been modified (because, you see,
replacing SHA hashes is trivial: `sha1sum file > thehashfile.txt`).
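The parenthetical is worth spelling out: a bare checksum published next to the
download proves nothing, because whoever can swap the file can swap the hash
file too. Only a signature made with a key the attacker does not hold breaks
that symmetry. A minimal sketch:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

original = b"legitimate package contents"
published_hash = sha256_hex(original)

# An attacker controlling the mirror replaces BOTH the package and the
# hash file sitting next to it (the `sha1sum file > thehashfile.txt` step).
tampered = b"package contents plus a backdoor"
swapped_hash = sha256_hex(tampered)

# The victim's integrity check still passes...
assert sha256_hex(tampered) == swapped_hash
# ...even though the contents are not what the author published.
assert swapped_hash != published_hash
```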

Yes, you'll have to trust someone at some point. Then you'll build more trust
as time goes on. That's how it is. When you get your Debian package, you trust
the Debian devs, a bunch of them. When you trust an Arch package, you trust
that the devs you trust make good trust decisions, and you trust them as well
(Arch uses the "web of trust").

When you trust an SSL server, you trust that your browser made the right trust
decisions, and that the CA made the proper checks, and that the server owner
is trustable.

When you turn on your computer you trust that intel didn't backdoor their
microcode.

And so on. Trust is an infinite "issue" and securing the transport will never
ensure the source is correct. Never.

~~~
donaldstufft
> That's why you have to trust the author and NOT the source. The author is
> the first creator of the code and thus the person with the most ability to
> sign the code.

Source of Trust, not Source of The Thing You Downloaded. The author would
still sign the package; the question is how you get from where you are to
trusting that person. The way that browsers, most (all?) Linux distributions,
Microsoft, etc. work is by hard-baking a list of trust roots. This has the
effect we see in the modern CA system: because the list is hard-baked and the
trust relationship is between the "author" and the source of trust, you can't
reasonably distrust, say, Verisign without breaking a significant portion of
the internet. It's about Trust Agility, not about trusting the place you
downloaded the package from. It's an idea similar to
[http://convergence.io/](http://convergence.io/).

------
nadaviv
Bitcoin offers a neat solution for public key verification. I'm going to use
it in a Bitcoin-related service that I'm about to launch. Here's an
explanation from the security page:

    
    
        #### Public key verification
        To prevent an attacker from modifying our published Bitcoin public key,
        it's permanently embedded into the Bitcoin blockchain in a way that is
        [nearly impossible](https://en.bitcoin.it/wiki/Weaknesses#Attacker_has_a_lot_of_computing_power)
        to modify (and becomes exponentially more difficult as time goes by).
    
        The public key can be verified by taking the following procedure:
    
        1. Take the SHA256 of the domain name ("****.com")
        2. Create a Bitcoin address using that hash as the private key
        3. Find the first transaction with that address as its *output address*
        4. The *input address* of that transaction is our public key
    
        If it's ever required to change the public key, the announcement
        will be signed with the old public key.
    

Software packages could use the package name instead of a domain name, or the
authors can attach the public key to their usernames and use it to sign all
their software.

------
zdw
Two comments:

1\. Signing should be different from source integrity. For example, signing
the manifest that contains a bunch of hashes should be sufficient - you don't
necessarily need to sign the entire compressed package. Newer package
management systems like the Illumos IPS work this way (which is not without
its faults). The veneration of the "church of tarball" needs to go away -
we've been living in a DVCS world for the past few years and it's awesome
compared to the dark ages of before.

2\. We need some sort of heuristics to help make decisions. For example - a
diff between versions is much easier to audit, and having statistics on
contributors might be quite useful. For example, red flags might go up if a
previously slow-moving project had a massive number of commits from a new
developer - this might just be new blood getting into a project, or a
maliciously intended merge.
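A heuristic like that could be as simple as flagging releases whose commit
volume, or share of commits from previously unseen contributors, is far above
the project's baseline. A toy sketch; the thresholds are invented for
illustration, not tuned:

```python
def release_red_flags(history, new_release, known, ratio_threshold=3.0):
    """history: per-release commit counts for past releases.
    new_release: contributor -> commit count for the release under review.
    known: contributors seen in earlier releases."""
    baseline = sum(history) / len(history)
    total = sum(new_release.values())
    flags = []
    if total > ratio_threshold * baseline:
        flags.append("commit volume well above project baseline")
    newcomers = [c for c in new_release if c not in known]
    if newcomers and sum(new_release[c] for c in newcomers) > 0.5 * total:
        flags.append("majority of commits from first-time contributors")
    return flags

# A slow-moving project (about 10 commits per release) suddenly gets a
# 60-commit release dominated by an unknown developer:
flags = release_red_flags([8, 12, 9, 11],
                          {"newdev": 55, "maintainer": 5},
                          known={"maintainer"})
assert len(flags) == 2
```

Nothing here proves malice - as the comment says, it might just be new blood -
but it tells the eyeballs where to look first.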

In the end, eyeballs are needed. We need to make it easier for those eyeballs
to go further on less.

~~~
zobzu
1\. When you sign a file.. _oh snap_. It generates a checksum and signs the
checksum. That's how signing _works_. That's exactly the same as what you
propose. In fact, the way you propose, without a custom implementation of
the signing algorithm, is generally going to generate a hash of the list of
hashes and then sign that. Slower, more memory.

The only advantage is if you need to sign a _remote_ file _locally_. You can
save time by not having to transfer the file if you sign only the list of
hashes (and of course, you have to trust that the hashes were transported
properly, so your trust is less perfect, because it is further away from the
original).

2\. The heuristic is the number of signatures you trust. Having thumbs
up/down just makes bots happy. As for malicious merges: yes, "eyeballs" are
the only decently reliable solution, but not only is this mission impossible,
they'll probably still miss some smart backdoors in the code. All you need is
an off-by-one (or whatnot). You can, however, sign commits, and blame whoever
introduced the issue whenever it's discovered/public (and decide whether it
was a legitimate error or not).

Note: it's still a good idea to have automatic checks so that all the "easy
stuff" is filtered out, of course.

~~~
donaldstufft
There's a difference between signing a tarball in its entirety (in which case
you'd get only one checksum) and checksumming every file inside the tarball
and then signing _that_ file.
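The distinction can be made concrete: build a manifest of per-file hashes and
sign only that small file, as IPS-style systems do. In this sketch, HMAC
stands in for a real public-key signature (GPG, etc.) purely to keep the
example self-contained:

```python
import hashlib
import hmac

def build_manifest(files: dict) -> str:
    """files: path -> bytes. One hash line per file, manifest-style."""
    lines = ["%s  %s" % (hashlib.sha256(data).hexdigest(), path)
             for path, data in sorted(files.items())]
    return "\n".join(lines)

SIGNING_KEY = b"stand-in for the author's private key"

def sign(manifest: str) -> str:
    # HMAC as a placeholder for a real detached signature.
    return hmac.new(SIGNING_KEY, manifest.encode(), hashlib.sha256).hexdigest()

package = {"foo/__init__.py": b"VERSION = '1.0'\n",
           "foo/core.py": b"def run(): pass\n"}

manifest = build_manifest(package)
signature = sign(manifest)

# Verifier re-hashes each file, rebuilds the manifest, checks ONE signature.
assert hmac.compare_digest(sign(build_manifest(package)), signature)

# Tampering with any file changes its hash line, so the signature fails
# without ever signing (or even re-assembling) the whole tarball.
package["foo/core.py"] = b"def run(): backdoor()\n"
assert not hmac.compare_digest(sign(build_manifest(package)), signature)
```

Signing the tarball whole gives the same integrity guarantee for the archive,
but the manifest approach lets a verifier check individual files.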

~~~
zobzu
How do you think gpg works?

------
charlesap
Well crap. Source code reputation is exactly the problem we are trying to
solve at [http://rputbl.com](http://rputbl.com) (sorry, just a splash page for
now.)

We are nowhere near close to launch yet and there's a chicken-and-egg problem
of having a sufficiently large database of hashes to make an arbitrary source
code file reputation check worth your time, but we have some ideas about what
to do about that too.

------
enaeseth
If the package repository acted as a certificate authority, and generated
distributors' certificates by verifying the distributor's appropriate
_virtual_ identity (GitHub, BitBucket, DNS), then I think you can at least be
pretty sure that the person making the release is someone who would have had
access to commit to the project's source code anyway.

Is that at least "good enough"? If the root certificate for the package repo's
CA were itself signed by a real-world CA, then I think what you end up
trusting is the security of the repo, of GitHub, and of the project's
developer(s). Projects with multiple committers could have several of them
receive certificates, and require a quorum of signatures before a release is
published, to mitigate the risk of any one core dev being compromised.
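The quorum rule is straightforward to state in code. A toy model with a
placeholder signature check; a real system would verify actual certificates
rather than trusting a boolean:

```python
def release_approved(signatures, authorized, quorum=2):
    """signatures: committer -> bool (did their signature verify?).
    authorized: set of committers holding release certificates."""
    valid = {who for who, ok in signatures.items()
             if ok and who in authorized}
    return len(valid) >= quorum

authorized = {"alice", "bob", "carol"}

# Two of three core devs signed the release: published.
assert release_approved({"alice": True, "bob": True}, authorized)

# One compromised dev alone cannot push a release out.
assert not release_approved({"alice": True}, authorized)

# A signature from someone outside the authorized set never counts.
assert not release_approved({"mallory": True, "alice": True}, authorized)
```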

To keep package names trustable, I think the only sane scheme would be using
the project's URL on the service which was used to prove its identity; Rails
would be "[https://github.com/rails/rails"](https://github.com/rails/rails"),
though surely you could use aliases (github:rails/rails) to reduce typing.

~~~
jrochkind1
As the OP says, if the same organization -- and even likely the same _server_
-- is being the certificate authority and distributing the signed releases....

> _However in this model if someone is able to send you a malicious package
> they are also likely able to send you a malicious key._

------
jrochkind1
I agree it's a problem without an easy solution. It's not just a problem with
python, it's exactly the same thing with rubygems (where it's been getting
similar discussion, especially after a rubygems vulnerability a few months
ago).

But: _The elephant in the room when talking about package signing is what
exactly we are trusting._

I think it's actually relatively clear. When I install "rails", I want to know
that it _really did_ come from the "rails team", and not from a third party
man in the middle.

That's the most that can be expected, and that's sufficient. There's no way to
technologically ensure that the rails team itself isn't intentionally
including malware in their release. And of course no way to technologically
ensure that the release doesn't have bugs or vulnerabilities.

The goal is just ensuring the release is really from who it says it's from.
Which is of course hard enough already, for the reasons dealt with in OP, and
because who is "the rails team" exactly anyway?

------
VLM
"This isn't an already solved problem nor is it an easy to solve one."

This isn't even a defined problem, much less solved or easily solved. A goal
would help a lot. I searched the article and the comments for the word "goal"
and found nothing. It is an OK laundry list of several tools and techniques
and their strengths and weaknesses in general. Most of the tools listed do
pretty easily solve certain problems; however, how those easily solved
problems relate to the undefined problem remains unclear. That is not the
fault of the tools. Also, it is not an exhaustive list of all possible
techs/tools. Perhaps the topic he has not googled for yet which meets the
unspecified goal is "SSL certs" or "SSH host keys in DNSSEC-secured zones".

Without a defined goal, rudderless drifting is unavoidable. This applies to a
lot more than computational system designs.

Reading between the lines and doing some creative writing about something I
find interesting, I am guessing what he's asking for is some kind of mandatory
peer review/code audit of git commits aka the dude who writes code is never
allowed to merge the code, at minimum, and the dude who writes unit tests is
never allowed to write code, and all devs have to be in the "big" GPG WoT off
a major public keyserver not just one little project. Because I have no idea
what problem he's trying to solve, this particular solution to a problem,
although somewhat interesting, may not have anything to do with his actual
goal.

The irony is hilarious because there is at least some disagreement on what the
"Holy Grail" really is, outside this discussion. So yes, "Package signing is
not the Holy Grail" because we have no idea what the Holy Grail Really Is both
archeologically/historically and WRT python package distribution. Maybe both
are just a coffee cup.

------
peterwwillis
An entire economy (e-commerce) is tied into the premise of bundling your keys
with the first download (the browser) or OS install (browser shipped with OS).
So far, not a lot of problems with this method, for most people. Why not try
it? Can't be worse than no signing at all.

~~~
donaldstufft
There actually are a good number of problems with this method, even in the
browser space.

For one, which key do you bundle? If it's a central root key, see the part
about Linux, as that's essentially what they do. Also, the part about not
having the workforce that the for-profit CA vendors have is relevant here too.

~~~
peterwwillis
tl;dr no, there is no problem with it in the browser space

Step 1. Red Hat creates a CA and ships key with OS.

Step 2. Red Hat creates a keystore that allows any developer to add their own
key.

Step 3. Package is created.[1]

Step 4a. Package is signed by Red Hat when they create the package.[2]

Step 4b. If 4a not followed, Package is signed by the package creator using
their key from the Red Hat keystore.

Step 5. User downloads package.

Step 6. User installs package. Package signature is verified.[3]

[1] As with all Linux distros, packages are either created/approved by a
distribution release manager or made by 3rd parties completely independent of
the distro.

[2] Just like with browsers and certs, you can install a 3rd-party CA for a
3rd-party package if you want, but the key distribution is left up to the user
to do safely. Most people don't do this.

[3] Again, just like with browsers and certs, a package signature is either
signed by a CA or by a developer. If it's CA-signed, the OS already has the
cert, and it is verified. If it's developer-signed, the developer's public
cert is shipped with the package. Some crypto math tells you whether this
developer cert was created by the CA or is a phony.
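The two verification paths in [3] can be modeled in a few lines. This is a toy
chain check, with dictionaries standing in for real X.509 certificates; the
actual "crypto math" (signature verification) is deliberately omitted:

```python
# Toy model: a "cert" records who issued it and carries the issuer's
# cert for the next hop. Real systems verify cryptographic signatures;
# here we only model the chain of trust.
OS_TRUSTED_CAS = {"redhat-ca"}

def cert_trusted(cert: dict, extra_cas: frozenset = frozenset()) -> bool:
    """Walk the issuer chain up to a CA the OS (or user) trusts."""
    trusted = OS_TRUSTED_CAS | extra_cas
    while cert is not None:
        if cert["issuer"] in trusted:
            return True
        cert = cert.get("issuer_cert")  # next hop up the chain
    return False

# Step 4a: Red Hat signed the package directly.
rh_signed = {"subject": "vim.rpm", "issuer": "redhat-ca",
             "issuer_cert": None}
assert cert_trusted(rh_signed)

# Step 4b: a developer cert issued by the Red Hat keystore ships with
# the package; the chain still ends at the CA the OS already has.
dev_cert = {"subject": "frank", "issuer": "redhat-ca", "issuer_cert": None}
dev_signed = {"subject": "foo.rpm", "issuer": "frank",
              "issuer_cert": dev_cert}
assert cert_trusted(dev_signed)

# A self-signed "phony" cert never chains to a trusted CA.
phony = {"subject": "mallory", "issuer": "mallory", "issuer_cert": None}
assert not cert_trusted({"subject": "bad.rpm", "issuer": "mallory",
                         "issuer_cert": phony})
```

The `extra_cas` parameter corresponds to footnote [2]: the user can install a
third-party CA, but distributing that key safely is their problem.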

Red Hat can do this for virtually no cost; they just need to host a public web
service that lets you create a signed key. They already host tons of free
public services, so I don't see how this would be an issue for them. Not to
mention someone could host a distro-independent service that does this exact
same thing, and every distro could include its CA.

The only "problem" here is people have wacky expectations of trust. Bob
creates a distro, Sally creates some software, and Frank creates a package of
the software for the distro. You have to trust all three of them - which you
can do by accepting all of their signed keys.

But realize that there is no "easy" way to trust Frank. Frank's essentially a
stranger. We don't trust Frank in the browser world, so I don't know why we're
expected to trust him with our packages.

------
lifeisstillgood
The more I think about this the more confused I become

\- the obvious first step is for PyPI to generate a nonce for each upload
request. If that is stored on the original developer's source repo then we
know that they control that repo. And?

\- send the nonce via a side channel - say, email. OK, the owner of the repo
also has access to that email, which is slightly more helpful.
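The nonce idea could be sketched like this: PyPI issues a random token per
upload and checks that the same token appears at a location only the repo
owner controls. All names here are hypothetical, and the repo "fetch" is a
stand-in for an HTTPS request to e.g. a well-known file in the repository:

```python
import hmac
import secrets

issued = {}  # upload_id -> nonce PyPI handed out

def issue_nonce(upload_id: str) -> str:
    nonce = secrets.token_hex(16)
    issued[upload_id] = nonce
    return nonce

def fetch_from_repo(repo: dict) -> str:
    # Stand-in for fetching a nonce file published in the source repo.
    return repo.get("pypi-nonce", "")

def verify_control(upload_id: str, repo: dict) -> bool:
    expected = issued.get(upload_id, "")
    found = fetch_from_repo(repo)
    return bool(expected) and hmac.compare_digest(expected, found)

nonce = issue_nonce("foo-1.0")
repo = {"pypi-nonce": nonce}          # developer commits the nonce
assert verify_control("foo-1.0", repo)

# This only proves control of the repo at upload time -- it says nothing
# about whether the code itself is malicious (the commenter's point).
assert not verify_control("foo-1.0", {"pypi-nonce": "guessed"})
```

Which is exactly the limit the comment runs into: the check binds the upload
to the repo, not the code to any notion of "not malicious".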

But each turn I take I ask myself what am I trying to trust? That the code is
not malicious? That's a third party test process. Who will do that?

So firstly ask the right question ...

