It allows you to "pin" dependencies by specifying the sha256sum of the jar you're expecting.
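Conceptually the check is tiny: hash the downloaded bytes and compare against a digest committed alongside your build. A minimal Python sketch of the idea (the coordinates and pinned digest here are made up for illustration, not gradle-witness's actual implementation):

```python
import hashlib

# Hypothetical pin table: artifact coordinates -> expected SHA-256 hex digest.
PINNED = {
    "org.example:widget:1.0": hashlib.sha256(b"fake jar bytes").hexdigest(),
}

def verify_artifact(coordinates: str, jar_bytes: bytes) -> bool:
    """Return True only if the downloaded bytes match the pinned digest."""
    expected = PINNED.get(coordinates)
    if expected is None:
        return False  # unpinned dependencies are rejected outright
    actual = hashlib.sha256(jar_bytes).hexdigest()
    return actual == expected
```

Because the expected digest lives in version control, a MITM on the download channel can't substitute a jar without the build failing.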
I am totally happy donating $10 to Whisper Systems for this work instead of being forced to donate $10 to the Apache Foundation (although a worthy cause) to get HTTPS access to Maven Central.
> When JARs are downloaded from Maven Central, they go over HTTP, so a man in the middle proxy can replace them at will. It’s possible to sign jars, but in my experimentation with standard tools, these signatures aren’t checked. The only other verification is a SHA1 sum, which is also sent over HTTP.
If I understand correctly, SHA-256 is part of the SHA-2 family of hash algorithms, and like SHA-1, when used alone it is subject to length extension attacks.
SHA-384 is also a member of the SHA-2 algorithm family, but is immune to length extension attacks because it runs with an internal state size of 512 bits -- by emitting fewer bits than its total internal state, length extensions are ruled out. (Wikipedia has a great table clarifying all these confusing names and families of hashes.) Other hashes like BLAKE2, though young, also promise built-in immunity to length extension attacks. mdm is immune to this because the relevant git data structures all include either explicit field lengths as a prefix, or are sorted lists with null terminators, both of which defuse length extension attacks by virtue of breaking their data format if extended.
Not that it's by any means easy to find a SHA-256 collision at present; but should collisions be found in the future, a length extension attack will increase the leverage for using those collisions to produce binaries that slip past this verification. An md5 Collision Demo by Peter Selinger is my favourite site for concretely demonstrating what this looks like (though I think this publication by CITS mentions the relationship to length extension more explicitly).
(I probably don't need to lecture to you of all people about length extensions :) but it's a subject I just recently refreshed myself on, and I wanted to try to leave a decent explanation here for unfamiliar readers.)
I'm also curious how you handled management of checksums for transitive dependencies. I recall we talked about this subject in private back in April, and one of the concerns you had with mdm was the challenge of integrating it with existing concepts of "artifacts" from the maven/gradle/etc world -- though there is an automatic importer from maven now, mdm still requires explicitly specifying every dependency.
Have you found ways to insulate gradle downloading updates to plugins or components of itself?
What happens when a dependency adds new transitive dependencies? I guess that's not a threat during normal rebuilds, since specifying hashes ahead of time already essentially forbids loose semver-ish resolution of dependencies at every rebuild, but if it does happen during an upgrade, does gradle-witness hook into gradle deeply enough that it can generate warnings for new dependencies that aren't being watched?
This plugin looks like a great hybrid approach that keeps what you like from gradle while starting to layer on "pinning" integrity checks. I'll recommend it to colleagues building their software with gradle.
P.S. is the license on gradle-witness such that I can fork or use the code as inspiration for writing an mdm+gradle binding plugin? I'm not sure if it makes more sense to produce a gradle plugin, or to just add transitive resolution tools to mdm so it can do first-time setup like gradle-witness does on its own, but I'm looking at it!
Edited: to also link the wikipedia chart of hash families.
tl;dr: The length extension property of the SHA-2 family has nothing to do with collisions. If you are afraid of future cryptanalytic breakthroughs regarding the collision resistance of SHA-2, use the concatenation of SHA-256 and SHA3-256.
You are mightily confused about the connection between length extensions and collisions. (Bear with me, I know you are already familiar with length extensions, but I need to introduce the notation to explain your misunderstanding. It is also nice to introduce the other readers to the topic.) All cryptographic hash functions in actual use are iterative hash functions. That is, the message is first padded to a multiple of the block size and then cut into blocks b_1,...,b_n. The core of the hash function is a compression function c (or maybe we should better call it a mix-in function if we also have the SHA-3 winner Keccak in mind) that takes the internal state i_k and a block b_k+1 and outputs the new internal state i_k+1: i_k+1 = c(i_k, b_k+1). The hash function has a pre-defined initial state i_0. Finally, there is a finalization function f that takes the last internal state i_n (the state after processing the last message block) and outputs the result of the hash function. MD5, SHA-1, and the SHA-2 family have the problem that the finalization step is just the identity function, i.e. i_n is used directly as the output of the hash function. Thus, given only the output, you can continue the hash chain directly and calculate the hash of a longer message b_1,...,b_n,b_n+1,...,b_n+m.
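For unfamiliar readers, this iterative structure (and why an identity finalization invites length extension) can be demonstrated with a toy, deliberately insecure Merkle-Damgård-style hash; everything here is made up for illustration, and padding is ignored just as in the explanation above:

```python
BLOCK = 4        # bytes per block
IV = 0x12345678  # pre-defined initial state i_0

def compress(state: int, block: bytes) -> int:
    # Toy mix-in function c(i_k, b_k+1); definitely NOT cryptographically secure.
    for b in block:
        state = (state * 31 + b) & 0xFFFFFFFF
    return state

def toy_hash(msg: bytes, state: int = IV) -> int:
    # Identity finalization: the last internal state IS the output,
    # which is exactly what makes length extension possible.
    assert len(msg) % BLOCK == 0  # ignore padding to keep it simple
    for i in range(0, len(msg), BLOCK):
        state = compress(state, msg[i:i + BLOCK])
    return state

m1, m2 = b"AAAABBBB", b"CCCC"
h1 = toy_hash(m1)
# An attacker who knows only h1 (not m1!) can still compute hash(m1 || m2)
# by resuming the chain from the published output:
assert toy_hash(m2, state=h1) == toy_hash(m1 + m2)
```

A hash with a one-way finalization function (or truncated output, like SHA-384) breaks this trick because the published output no longer reveals the full internal state.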
Normally this is not that interesting (you could have just calculated the hash of the long message directly, and avoided problems with padding that I'm ignoring to keep it simple). But in some situations you might not actually know b_1,...,b_n. For example, if you calculate an authentication tag as t = h(k||m), where k is a secret key you share with the API and m is the message you want to authenticate. As t is published, an attacker can pick up the internal state i_n = t and continue the chain with an extension m2, calculating t2 = h(k||m||m2), which authenticates m||m2. Twitter had this problem with its API: http://vnhacker.blogspot.de/2009/09/flickrs-api-signature-fo... . If you want to do authentication tags right, use HMAC: https://en.wikipedia.org/wiki/Hash-based_message_authenticat...
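In Python, the standard-library hmac module implements this correctly, so there's no reason to roll h(k||m) by hand (key and message below are just examples):

```python
import hmac
import hashlib

key = b"shared-secret-key"
msg = b"user=alice&count=10"

# HMAC-SHA256: the nested-key construction defeats length extension
# even though SHA-256 itself is length-extendable.
tag = hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify(key: bytes, msg: bytes, tag: str) -> bool:
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)  # constant-time comparison
```

Unlike the naive h(k||m) tag, an attacker who appends `&admin=1` to the message cannot produce a valid tag for the extension without knowing the key.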
Now, a collision describes the case where two messages b_1,...,b_n and d_1,...,d_m produce the same hash, i.e. the output of the finalization function is identical. SHA-2 has no finalization function, and I would not believe that collisions in SHA-3 would only be possible in the finalization step (the squeezing step in the terminology of the Keccak sponge function), if they are possible at all, so I'll ignore finalization as a source of collisions for the moment. Thus a collision means that after processing the k blocks of b_1,...,b_k and the l blocks of d_1,...,d_l, the internal state is identical. As the hash function is iterative, we can continue both block lists with identical blocks e_1,...,e_j and still have the same internal state. As the state is identical, it does not matter which finalization function we apply; the output is identical, and b_1,...,b_k,e_1,...,e_j and d_1,...,d_l,e_1,...,e_j hash to the same value. Thus, whenever you have one collision in the internal state, you can produce infinitely many colliding messages. This property is independent of the finalization function, and BLAKE2, SHA-384, and Skein, as well as SHA-3, allow such attacks (once one collision is found). If anything, a fancy finalization function can make things worse, by mapping two different internal states to the same output. But because of the problems with length extension attacks (see Twitter above), a one-way finalization function is a standard requirement for hash functions nowadays.
If you are really concerned about future cryptanalysis of SHA-2, you can spread the risk over several hash functions, each of which has to be broken to break the security of the overall construction. These are called hash combiners. The simplest one is the concatenation of two hashes: SHA-256(m)||SHA3-256(m). This one is collision resistant if either SHA-256 or SHA3-256 is collision resistant. There are also combiners for other properties, like pseudorandomness, and also multi-property combiners. See for example http://tuprints.ulb.tu-darmstadt.de/2094/1/thesis.lehmann.pd... .
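The concatenation combiner is a one-liner with Python's hashlib (both digests are available in the standard library since Python 3.6):

```python
import hashlib

def combined_digest(data: bytes) -> str:
    # SHA-256(m) || SHA3-256(m): producing a collision here requires a
    # SIMULTANEOUS collision in both hashes, since both halves must match.
    return hashlib.sha256(data).hexdigest() + hashlib.sha3_256(data).hexdigest()
```

The cost is a 64-byte digest instead of 32, which is a cheap price for hedging against a break in either family.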
It amplifies collision finding by finding an r-collision in time log2(r) * 2^(n/2), if r = 2^t for some t. Figure 1 in section 3 is a very intuitive picture of how.
The third paragraph of section 5 states that the attack is inapplicable on hashes with truncated output.
Although, now that you mention it, my dusty crypto knowledge is not enough for me to explain why this technique would be inapplicable on the internal states of a decomposed hash even if it's inapplicable on the final output. I shall have to read more.
I agree that if one can find a way to apply an HMAC it should also obviate concerns around length extension, and also that combining hashes (carefully) should be able to break only when both are broken.
They actually refer to the large internal state size that makes the generic attack infeasible (for a state size of n bit you need 2^(n/2) many tries to find a collision on average).
> in the second, the attacker needs collisions in the full internal state of the hash function, rather than on the truncated states.
But as both SHA-256 and SHA3-256 have internal state sizes >= 256 bits, these are definitely enough for the foreseeable future to protect against the generic attack. More interesting is the question of whether you can combine specialized cryptanalysis of two different hashes to build multicollisions. Apparently you can, at least for MD5 and SHA-1: http://www.iacr.org/archive/asiacrypt2009/59120136/59120136....
Edit: of course, the question of how to determine which keys to trust is still pretty difficult, especially in the larger Java world. The community of Clojure authors is still small enough that a web of trust covering a majority of authors could be established face-to-face at conferences.
The situation around Central is quite regrettable though.
No one uses Clojars on its own, so if an attacker were able to perform a MITM attack, they could inject a spoofed library into the connection to Central even if the library should be fetched from Clojars.
Brian, are you speaking as a representative of Sonatype, or as a third party?
The reality of cross build injection has been discussed for many years, I even linked to an XBI talk in my blog post announcing the availability of SSL.
The reality is that prior to moving to a CDN, it was going to be pretty resource-intensive to offer SSL at the scale of traffic we were seeing. The priority at that time was ensuring higher availability and providing multiple data centers with worldwide load balancing.
On our first CDN provider, they could not perform SSL certificate validation and thus were themselves susceptible to a MITM attack. So the decision at that point was to run SSL off of the origin server. We wanted to make it essentially free but wanted to ensure that the bandwidth was available for those that cared to use it, hence the small donation.
The situation is different today with our new CDN, they can validate the certificates all the way through and that's how we intend to deploy it.
We won't be able to enable full HTTPS redirection for all traffic, since this would cause havoc in organizations that are firewall-locked and for tools that don't follow redirects. Each tool would need to adopt the new URL. I've already suggested this change occur in Maven once we launch.
We strongly do not believe that you should entrust your private key to anyone else for signing, which is what others have done to make it easy....yet less secure.
This is exactly why I built mdm: it's a dependency manager that's immune to cat memes getting in ur http.
Anyone using a system like git submodules to track source dependencies is immune to this entire category of attack. mdm does the same thing, plus works for binary payloads.
Build injection attacks have been known for a while now. There's actually a great publication by Fortify where they even gave it a name: XBI, for Cross Build Injection attack. Among the high-profile targets even several years ago (the report is from 2007): Sendmail, IRSSI, and OpenSSH! It's great to see more attention to these issues, and practical implementations to double-underline both the seriousness of the threat and the ease of carrying out the attack.
Related note: signatures are good too, but still actually less useful than embedding the hash of the desired content. Signing keys can be captured; revocations require infrastructure and online verification to be useful. Embedding hashes in your version control can give all the integrity guarantees needed, without any of the fuss -- you should just verify the signature at the time you first commit a link to a dependency.
Why can't we embed the hash of the dependencies we need
in our projects directly?
With code signing, you can (or hypothetically could; I don't know if anyone does this) check that the latest version is signed by the same key as the previous version -- whereas just pinning the hash wouldn't allow that.
I agree pinning the hash is useful if the signing key is captured.
If large amounts of your code change across different projects at the same time, those projects don't have a very stable API, nor are they evidently developing separately, so there's no major reason to pretend they are isolated or to introduce a binary release protocol between them. Projects like this will probably find the greatest ease of operation by just sticking to one source repository -- otherwise, commits for a single "feature" are already getting smeared across multiple repositories, and stitching them back together with some hazy concept of "latest" at a particular datetime isn't helping anyone.
The biggest indicator of over-coupled projects that are going to face friction is when ProjectCore depends on ProjectWhirrlyjig, but the Whirrlyjig test code still lives in ProjectCore. This tends to make it very difficult to make releases of ProjectWhirrlyjig with confidence, since they won't be tested until getting to ProjectCore. If projects are actually maintaining stable features on their own in isolation, this shouldn't be what your flow looks like.
Projects that are well isolated generally don't seem to have a hard time committing to stable (if frequent) release cycles. Furthermore, it actually encourages good organizational habits, because it actively exerts pressure against making changes that would cross-cut libraries or make it difficult to create tests isolated to a single project.
In contrast, tools that regularly update to "the latest" version invariably seem to bring headaches down the road.
Getting "the latest" is ambiguous. It means that your build will not be reproducible in any automated way, whether it's one week or one hour from now. It's a moving target. Can you do a `git bisect` if something goes wrong and you need to track down a change?
Getting "the latest" also doesn't take into account branching. This is something a team I'm currently on is poignantly aware of: feature branches are used extensively, and when this concept spans projects, we found "latest" ceases to mean anything contextually useful.
If you're working on projects where a CI server is actually part of the core feedback loop (say your test suite has gotten too unwieldy for any single developer to run before pushing), then fetching "latest" can be helpful during development. But even if jenkins informs you the build is green, it's important to remember this won't be reproducible in the future; you should make an effort to get back to precisely tracked dependencies as soon as possible.
mdm deals with this by letting you use untracked versions of your dependency files... but it will consistently show that in `git status` commands, so that you A) know that your current work isn't reproducible by anyone else and B) everyone is encouraged to make an effort to get back on track ASAP.
The tiff mentioned in the article was interesting to read.
I also feel like in the case of something like a package manager, this potentially harms the wider community in ways that charging for features in a specific piece of software doesn't.
Freemium models often suck because of stuff like this. But if the "users" would just consider it normal to pay money then we wouldn't have crazy things going on where people providing critical infrastructure services need to figure out how to "convert" their "users." Instead, say, every professional Java shop would pay $100 a year or so for managed access. Projects that want to use it like a CDN so their users could download would pay a fee to host it.
They have bills to pay. They'll cover them one way or the other. If we pay directly at least we know what the game is.
 They could be inserting advertising into the jars. Hey, at least it would still be a "free" service, right?
I would've much rather had a Makefile. Build scripts and package managers need to be separate.
This. Especially when there are broken links, you're gonna have a bad (and long) time.
Curious if anyone knows of any well-done takes on this, either way. (If I'm actually wrong, I'd like to know.) (I fully suspect there really is no "right" answer.)
So, jcenter is a Java repository in Bintray (https://bintray.com/bintray/jcenter), which is the largest repo in the world for Java and Android OSS libraries, packages, and components. All the content is served over a CDN, with a secure HTTPS connection.
JCenter is the default repository in Groovy Grape (http://groovy.codehaus.org/Grape), is built into Gradle (the jcenter() repository), and is very easy to configure in every other build tool (maybe except Maven), and it will become even easier very soon.
Bintray has a different approach to package identification than the legacy Maven Central. We don't rely on self-issued key pairs (which can actually be generated to impersonate anyone, and are never verified in Maven Central). Instead, similar to GitHub, Bintray gives a strong personal identity to any contributed library.
If you really need to get your package to Maven Central (for supporting legacy tools) you can do it from Bintray as well, in a click of a button or even automatically.
Hope that helps!
I am also not sure how you figured out those are fake downloads. For sure the script that DDoSes Bintray from China won't use Groovy, but it's still a valid download. Not for showcasing how popular Groovy is (they factor out those things when talking about the numbers), but for the raw statistics, for sure. The file was downloaded, wasn't it?
edit: and with websites everywhere routinely providing SSL, it seems crazy that it has to be a paid feature for such a critical service.
$ curl http://get.example.io | sh
Now I am wondering what tool actually uses those .asc files that I have to generate using mvn gpg:sign-and-deploy-file when I upload new packages to sonatype...
I wrote an article about mitigating this attack vector a while back which might be useful: http://gary-rowe.com/agilestack/2013/07/03/preventing-depend...
edit: Looked up the answer myself. Lein downloads whatever key the signature claims to be made with from public keyservers. How does this provide any additional security over not bothering to verify signatures?
But clearly the job isn't finished; even if Clojure developers do a good job of signing packages and signing each others keys, (which is not generally true today) it still needs to distinguish between signed packages and trusted packages. Hopefully the next version can add this. But as with anything that requires extra steps from the developer community, a thorough solution is going to take time.
Did you know that Xine, the media player, has a similar thing behind the scenes? I didn't.
How hard would it be to just mirror it to S3 and use it from there via HTTPS?
Now ask: how hard would it be to pay the bandwidth charges, assuming it were a public bucket. I don't know the answer, but it's a much more interesting question.
It occurs to me that BitTorrent technically solved the problem of high bandwidth costs long ago; millions of people transfer 1.1TB files around every day without worrying about bandwidth costs at all.
Can we come up with a similar system for jars? Why are we still relying on central servers for this at all?
The point is, why do you care that your repo is local, or that your jars are secured, if you got the Maven tool itself in binary form from a server you don't control?
That is the whole point of Linux distros' package managers. It is not only about dependencies. It is about securing the whole chain and ensuring repeatability.
Maven's design, unlike ant's, forces you to bootstrap it from binaries. Even worse, Maven itself can't handle building a project _AND_ its dependencies from source. Why would the rest of the infrastructure matter then?
Yes, Linux distros build gcc and ant using a binary gcc and a binary ant. But it is always the previous build, so at some point in the chain it ends with sources and not with binaries.
And this is not about Maven's idea and concept. If it had depended on only a few libraries and had a simple way of building itself, instead of needing the binaries of half of the stuff it is supposed to build in the first place (hundreds) just to build itself, this would not be a problem.
I don't think so. The first versions of GCC were built with the C compilers from commercial UNIX from AT&T or the like. The first Linux systems were cross-built under Minix. At some point you'll go back to a program someone toggled in on the front panel, but we don't have the source for all the intermediate steps, nor any kind of chain of signatures stretching back that far.
> And this is not about Maven's idea and concept. If it had depended on only a few libraries and had a simple way of building itself, instead of needing the binaries of half of the stuff it is supposed to build in the first place (hundreds) just to build itself, this would not be a problem.
Any nontrivial program should be written modularly, using existing (and, where necessary, new) libraries. Having a dependency manager to help keep track of those is a good thing. I don't see that it makes the bootstrap any more "binary"; gcc is built with a binary gcc for which source is available. Maven is built with a binary maven and a bunch of binary libraries, source for all of which is available.
For external libraries, a "distribution repository" is a file store for a bunch of different projects. It typically stores released binaries for distribution (libfoo.1.5.pkg, libfoo.1.6.pkg, libbar.2.3.pkg, etc...), but could also contain clones of external source repos (libfoo.git/, libbar.hg/, etc...).
Which brings us to the other meaning - a "source repository" is the version-controlled store for the source of a single project.
The repo for external libraries is a distro repo, where the repo for your project is a source repo.
If you're checking the code for multiple projects into a single SCM, why bother maintaining separate source repos at all? Why don't we all use one giant source repo for all projects everywhere? Just check out "totality.git" and use the /libfoo/ and /libbar/ subdirectories. And in your internal company branch, add an /ourproject toplevel for your own code?
When you have answered that question, you will realise why we keep separate projects in separate source repos/SCMs.
Note that you will probably want to publish different "releases" of your own project to an internal distro repo for your internal clients to use, e.g. ourproject.1.2.pkg, ourproject.1.3.pkg, etc...
Well, I think I've found your problem. :-/
I had a close call with nearly installing/building some Java packages a couple of weeks ago, and due to reasons I eventually decided to try and find a different solution. Looks like the bullet I dodged was bigger than I thought.
By not downloading everything from Maven Central in real time. Companies usually run their own repository, and builds query that one. Central is queried only if the company-run repository is missing some artifact or they want to update libraries. How much bureaucracy stands between you and company-run repository upgrades depends on company and project needs.
As for production, does anyone compile stuff in production? I thought everyone ships compiled jars there. You know exactly what libs are contained in that jar; no information is missing.
There are other schools of thought, like pinning remote repos to a specific commit id. These are better than nothing, but still depend on 3rd-party repos, which I think is too risky for production code. It is great for earlier stages of a project, when you are trying to work out the libraries you will use and also need to collaborate.
There are also other benefits like automatically downloading and linking sources and documentation (more relevant if using an IDE).
And to be clear, just HTTP here is not the issue. It's HTTP combined with the lack of package signing. apt runs over HTTP, but it's a pretty secure system because of its effective package signing. Package signing is even better than HTTPS alone, since it prevents both MITM attacks and compromise of the apt repository itself.
In fact, apt and yum were pretty ahead of their time with package signing. It's a shame others haven't followed their path.
It's available under MIT licence: https://github.com/gary-rowe/BitcoinjEnforcerRules
- charge some token amount of money to projects (harms the ecosystem, probably not a good idea)
- charge some amount for projects to host old versions, or for users to access old versions (same idea as the first, just less so)
- charge for access to source jars
- paid javadoc hosting
- rate-limiting for free users (the "file locker" model; particularly effective at convincing people sharing an office IP into paying up)
It's a pip wrapper that expects you to provide hashes for your dependencies in requirements.txt.
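For reference, pip itself later absorbed this idea natively via its hash-checking mode (`pip install --require-hashes -r requirements.txt`), where each requirements.txt line carries its expected digest, roughly like this (the package name and digest below are placeholders, not real values):

```
# Install with: pip install --require-hashes -r requirements.txt
somepackage==1.0.0 \
    --hash=sha256:<64-hex-char digest of the expected wheel or sdist>
```

In this mode pip refuses to install anything unpinned, including transitive dependencies, which mirrors what gradle-witness does for the Gradle world.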
There was a lightning talk at PyCon this year, it seems super easy to use (though admittedly I'm not using it regularly yet).
Just don't do this. There is no such thing as a free lunch (or wifi).
It should be no different from shipping broken code. You can't just say, "oh, well we offer a premium build that actually works, for users that want that." Everybody needs it.
Evernote made this mistake initially when SSL was originally a premium feature. They fixed it.
Granted, there are degrees of security but protection from MITM attacks is fundamental. (Especially for executable code!)
UPDATE: @weekstweets just deleted the tweet I was referencing where he described security as a premium feature "for users who desire it" or words to that effect.
If the answer is no, then the smart developer has no financial incentive to do so, and every reason to segment security out as a premium feature.
Maybe MITM vulnerability counts as broken code. But as always, markets win. I don't think the status quo will change until users consider security assurances worth hard dollars.
This reinforces my priors that there is very little "free-as-in-beer" software.
Think through that a little more and I think you'll find there is long-term ROI in the form of customer trust and goodwill. You'll buy the product because it works and won't hurt you, and basic security should be part of "won't hurt you".
Right! And that's actually a valid model. Ask HP about what it'll cost to upgrade the firmware on your enterprise server..
 I had a boss who insisted that TJ Maxx was going to collapse because of their security holes. Nope.