
Codehash.db – A public database for software and firmware hashes - andrewdavidwong
https://github.com/rootkovska/codehash.db
======
cjbprime
This is like a version of Certificate Transparency for software. Attempts have
already been made to port CT to software ("Binary Transparency"), and I like
them better than this approach.

Specifically, you can demand a CT receipt from your downloads, proving that
the download's existence has been made public.

Without that, and in a scheme like this one, it's still possible to simply
target someone with tailored malware and assume they won't bother to check its
hash against one of these databases.

With CT, the download client itself can automatically refuse a download that
has not been publicly announced.
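A toy sketch of that refusal logic (the log here is just an in-memory set, and all names are made up; a real Binary Transparency client would verify a signed inclusion proof from the log rather than do a lookup):

```python
import hashlib

# Hypothetical in-memory stand-in for a public transparency log.
published_hashes = set()

def publish(artifact: bytes) -> None:
    """Vendor side: announce an artifact by publishing its hash."""
    published_hashes.add(hashlib.sha256(artifact).hexdigest())

def accept_download(artifact: bytes) -> bool:
    """Client side: accept only artifacts that were publicly announced."""
    return hashlib.sha256(artifact).hexdigest() in published_hashes
```

The point is that the check runs on the client, so a vendor can't serve one user a tailored artifact without that artifact appearing in the public record.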

~~~
andrewdavidwong
Fair enough, but this system isn't intended for end users who, as you point
out, are unlikely to bother checking the hashes of their downloads. Quoting
Joanna Rutkowska:

"Also, in case it wasn't clear: the primary audience for such a DB should be
developers or admins (e.g. IT department in a large organization), I think.
Not users. Users are always somehow fated to trust the 'last mile' vendor, and
there is little feasibility in implementing any form of trust distribution for
them."

[https://secure-os.org/pipermail/desktops/2016-November/000147.html](https://secure-os.org/pipermail/desktops/2016-November/000147.html)

~~~
cjbprime
Your quote's incorrect, in my opinion. Under CT, users are able to mistrust
the issuing CA -- we start by assuming they want to give us malware (in the CT
case, MITM certs) and trust them only to the extent that they are distributing
publicly announced and tracked artifacts to us. This happens at the end user's
computer, when their browser refuses to accept a cert with no CT announcement
attached. This all happens in running Chrome browsers today.

If other software (e.g. your Linux distro) similarly checked for publicly
announced artifacts (e.g. an offered package upgrade) then you would be
protected against targeted malware from your last mile vendor. The malware has
to be either offered to everyone (ensuring detection) or no-one.

I think the CT mechanism is simply better than a system that "isn't intended
for end users", because a CT mechanism protects both administrators _and_ end
users.

~~~
andrewdavidwong
I don't speak for Joanna, but I interpret that quotation as saying something
like:

"Users are always fated to trust the 'last mile' vendor because the last mile
vendor (e.g., Google Chrome) has control over what the user sees and does
(i.e., sends and receives). If your Chrome browser is compromised or
malicious, it can silently ignore the fact that no CT announcement is attached
to a cert. In this sense, the user is fated to trust Chrome.

"Moreover, there's little feasibility in implementing any form of _trust
distribution_ for them, but this is not to say that there's little feasibility
in implementing a system that keeps them relatively secure. Users running a
non-malicious, non-compromised instance of Chrome do not have any form of
trust distribution, since they place all their trust in Chrome (though they
probably don't realize it). Nonetheless, Chrome may be keeping them relatively
secure as long as it's working properly."

~~~
cjbprime
Thanks, "last mile" confused me because it implies a transfer. With CT applied
to software updates, I think you really could be suspicious-by-default of your
software vendor.

Do you have any thoughts on why codehash.db should exist, versus pushing the
same hashes to a CT log and having clients check for CT announcements? Seems
like CT is a clear improvement.

~~~
andrewdavidwong
I honestly don't know enough about CT to have an opinion.

------
JoshTriplett
Interesting idea. The Software Heritage project
([https://www.softwareheritage.org/](https://www.softwareheritage.org/)) has
the goal of doing this for all software source code; perhaps they might be
interested in extending that to binaries as well? That seems compatible with
their goal of preservation.

~~~
andrewdavidwong
Software Heritage looks excellent, but it sounds like the two projects may
have different goals. It sounds like Software Heritage is focused on
collecting, preserving, and sharing code (and, as you say, potentially
compiled software), whereas codehash.db is focused on allowing people to
securely authenticate it after they've obtained it through some other means.

------
pmorici
How is this different than the NSRL and why wouldn't you use that instead?

~~~
andrewdavidwong
The main difference seems to be that the NSRL does not include PGP signatures
(or any substitute), so there's no way to verify that the hashes are
authentic, in the sense that the hashed software is bitwise identical to the
software that the developer intended to distribute. This is precisely the
problem that codehash.db is designed to solve. Without any way to verify the
authenticity of the hash values, we have to rely on the authority of the NSRL
itself. (In addition, the fact that the NSRL appears to have close ties to the
U.S. government might make it even harder for some people to trust it.)
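The verification step that a signed hash enables can be sketched like this (the function name is invented, and the signature over the hash list is assumed to have already been checked separately, e.g. with `gpg --verify`; only the file comparison is shown):

```python
import hashlib

def matches_signed_hash(path: str, signed_hex_digest: str) -> bool:
    """Check that a local file is bitwise identical to the one the
    developer hashed, given a SHA-256 digest whose authenticity was
    already established via the developer's PGP signature."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large artifacts don't have to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest() == signed_hex_digest.lower()
```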

~~~
zxv
The NSRL dataset has signatures that are typically used to verify both
integrity and veracity.

[http://www.nsrl.nist.gov/RDS/rds_2.54/split-hash.txt](http://www.nsrl.nist.gov/RDS/rds_2.54/split-hash.txt)

Alleging that the NSRL is untrustworthy is inconsistent with the track record
of the NSRL and NIST scientists.

Please be aware that there are thousands of forensic experts who have relied
on the NSRL over the last decade or more as a basis for testimony in court.
Those experts verify hashes for everything they do, and for every case, and as
a result there has been a significant amount of independent peer review of the
contents.

While Codehash.db provides a hash for a package, the NSRL provides hashes for
individual installed files.

This in no way diminishes the value of the Codehash.db design. They target
different use cases.

~~~
andrewdavidwong
> The NSRL dataset has signatures that are typically used to verify both
> integrity and veracity.
>
> [http://www.nsrl.nist.gov/RDS/rds_2.54/split-hash.txt](http://www.nsrl.nist.gov/RDS/rds_2.54/split-hash.txt)

Can you explain this signature scheme? I'm not familiar with it. The link you
provided just appears to show hashes and sizes for a file that has been split
into four pieces.

> Alleging the NSRL is untrustworty is inconsistent with the track record of
> the NSRL and NIST scientists.

I'd just like to point out that neither I nor anyone else here has alleged
that.

> Please be aware that there are thousands of forensic experts who have relied
> on the NSRL over the last decade or more as a basis for testimony in court.
> Those experts verify hashes for everything they do, and for every case, and
> as a result there has been significant amount of independent peer review of
> the contents.

I'm genuinely glad to hear that! That's good to know.

> While Codehash.db provides a hash for a package, the NSRL provides hashes
> for individual installed files.

I don't think that's necessarily true. Codehash.db is open to hashes for
anything (source code, ISO, package, binary installer).

> This in no way diminishes the value of the Codehash.db design. They target
> different use cases.

Likewise, my remarks aren't meant to be in any way derogatory toward the NSRL.
As far as I'm concerned, it's OK if they do, in the final analysis, target the
same use case. If that's the case, the best solution should be adopted,
whichever one that turns out to be. :)

------
vcdimension
This would be great for intrusion detection if there were some tools that
users could use to automatically query the database, and repository
maintainers could use to upload hashes.

~~~
jonstewart
There are a fair number of tools that do just this, either with NIST's NSRL or
with commercial hash sets, the notable one being Bit-9. Bit-9 is an order of
magnitude or two larger than the NSRL (which is itself several orders of
magnitude larger than this database).

------
n1000
Don't package managers like Homebrew in principle do something similar? Would
be interesting to join forces, I guess.

~~~
dom0
Automated package building usually looks like this:

    wget -O - http(s)://... | tar xf -
    make
    package
    sign_with_gpg
    upload_to_ftp

Only if a package maintainer gets involved is there a chance that release
signatures are actually verified. But even then, a whole lot of upstream
projects just don't sign their releases. Some distros don't sign their
packages, either. Or even their ISOs (iirc Linux Mint only started doing this
fairly recently).

Also, "web of trust" only works for a tiny subset of people. If I'm a "lone
wolf" FOSS developer, my key won't be signed by anyone, there won't be any WoT
to verify. Downstream packagers just have to swallow that or TOFU.

~~~
n1000
Sorry, I am new to this, but my understanding is that homebrew will verify the
hash before installing, right?

~~~
dom0
It probably does, but what good does that do if the source code wasn't
transported securely and verified from the upstream to the packager?

Disclaimer: I don't know anything about brew packaging practices. Maybe they
always require verification. Maybe they don't.
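That gap can be sketched concretely (names invented for illustration). A formula-style pinned checksum only authenticates what the *packager* downloaded and hashed; it says nothing about whether that tarball matched what upstream actually intended to release:

```python
import hashlib

# Hypothetical digest the packager pinned after their own download.
PINNED_SHA256 = hashlib.sha256(b"upstream-1.0.tar.gz bytes").hexdigest()

def install_ok(downloaded: bytes) -> bool:
    """Accept an install only if the download matches the pinned digest.
    Protects the user->packager hop, not the upstream->packager hop."""
    return hashlib.sha256(downloaded).hexdigest() == PINNED_SHA256
```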

------
deavmi
Cool idea.

