Yes, we probably would have gotten better usability scores if we had developed our own PGP client with a better UX and tutorials. We selected Mailvelope because it was rated highly on the EFF's secure messaging scorecard [1] and we were exploring how the state of the art in PGP software actually performs with end users.
Matthew Green and I had a bet for the last year, which just ended, over libotr's security; I bet him that nobody would find a sev:hi flaw in it all year, and, of course, won, because at this point all the low-hanging fruit in libotr has been shaken out.
That bet was a reaction to the release of the EFF scorecard, which at the time gave Cryptocat(†) a perfect score but dinged ChatSecure, which is a libotr client, for not having an audit done.
I told Matthew Green I'd write something up about the bet, and about what did get reported to me about libotr; I'll probably spend a few thousand words there critiquing the scorecard. A brief outline, though:
* There are places where the scorecard is factually misleading. For instance: there seems to be no coherence to what "source code available for inspection" means; it still lists Telegram as being open source!
* It's oversimplified in misleading ways as well. Systems which technically have the capability of verifying peers are given big green checkmarks even when that feature is so broken as to be useless. And, of course, there's the "been audited recently" checkmark, which, as anyone familiar with software security auditing will tell you, means absolutely fuck-all (again: ponder the fact that libotr, which has been a high-profile target for something like a decade and is more or less frozen stable, was counted as "not audited", while projects that got a 1-week drive-by from a firm specializing in web security got a big green checkmark).
* What does "security design properly documented" even mean? Where's the methodology behind the chart? A few paragraphs of text aimed at laypeople isn't a documented methodology! The one place they eventually did add documentation --- "what's a good security audit" --- tries to explain a bunch of stuff that has almost nothing to do with the quality of software security inspection, and then throws up its hands and says "we didn't try to judge whether projects got good audits". Why? Why didn't they consult any named outside experts? They could have gotten the help if they needed it; instead, they developed this program in secret and launched it all at once.
* The project gives equal time to systems that nobody uses (at one point, Cryptocat was near the top of the list, and ChatSecure was actually hidden behind a link!), and the list is sorted alphabetically, so that TextSecure, perhaps the only trustworthy cryptosystem on it (with the possible exception of the OTR clients), is buried at the bottom.
* If the point of this chart is to educate laypeople on which cryptosystem to use, how is anyone supposed to actually evaluate it? They don't really say. Is it ok to use Jitsi's ZRTP, despite missing the "recent audit" checkbox? What about Mailvelope, which is missing the forward-secrecy checkbox? Can anyone seriously believe it's a better idea to use Telegram or Cryptocat, both flawed ad-hoc designs, than TextSecure or ChatSecure?
I guess I can't be brief about this after all. Grrr. This scorecard drives me nuts.
I am not saying that these flaws in any way impacted your particular research project.
What would be a better way to produce such a scorecard?
Is there already any collection of common criteria established by computer scientists and accepted by experts, or any kind of standardized set of requirements for secure software produced by leading security authorities, from which one could extract data for a compact visual comparison like the EFF scorecard?
Could you provide or point me to a link or any material that compares "the security" of products and offers an understandable, industry-accepted categorization?
Isn't it a bit strange that a small organization of non-computer-scientists produced something that has been painfully missing for at least 50 years? Isn't it clear that a first attempt at such a thing must fail, and that this can only be a prototype for a process that should be adopted and worked out by people who understand what they are doing?
Isn't it a bit strange that no such scorecard has been produced by an international group of universities and industry experts, with transparent documentation of the review process and plenty of room for discussion of different paradigms?
The EFF scorecard painfully demonstrates the omissions of multiple generations of security experts who have failed to establish a clear definition of what security actually means, how to discuss it, and how to arrive at something that could be called a "review" in the scientific sense of the word.
It is totally clear that Apple and Microsoft have very different ideas about security than the OpenBSD developers do, but it would still be of great value to have a space where people could follow those discussions and compare the results of different approaches and solutions to security-related problems.
The EFF scorecard carries the embryonic idea of a global crypto discussion, review, comparison, and knowledge site that could also serve as a great resource for non-crypto people and students to learn about the field. The valuable information you and other experts drop here and there in HN threads and on various mailing lists should be visible in a place that collects all of it and allows for open public discussion, so people can learn to decide what security means for them.
There is not. Software security is a new field, cryptographic software security is an even newer field, and mainstream cryptographic messaging software is newer still.
The problem with this flawed list is that it effectively makes endorsements. It's better to have no criteria at all than a set that makes dangerously broken endorsements.
[1] https://www.eff.org/secure-messaging-scorecard