
Ask HN: Most comprehensive CVE database? - chatmasta
The official https://cve.mitre.org/ database seems like it is lacking a lot of
information. There is a lot of useful data around CVEs that is not in any
centralized source (e.g. referenced changesets, PoC scripts, github projects,
etc) that could be collated into a master database of CVEs that would be very
useful.

Has there been any attempt at making the most comprehensive CVE database
possible?

Specifically, data I would be interested in for each CVE, where available:

1) The source diff of the patch applied to fix the vulnerability, i.e. the
"before/after" of the critical section of code

2) The bindiff of the binary before and after patching the vulnerability

Is this available anywhere?

The reason I ask is because I think this could lend itself to a really cool
machine learning system for identifying unknown vulnerabilities. The trick
would be to use the code from previous vulnerabilities to determine what
"vulnerable" code looks like, and then find code similar to it.

rough idea:

DATA COLLECTION PHASE:

1) Collect the before/after code of all previous vulnerabilities

2) Use the before/after code to identify the "critical section" that caused
the vulnerability

3) Convert the "critical section" to its AST representation

TRAINING PHASE:

1) Determine the best ML algorithms to use for comparing AST representations

2) Using labeled inputs of "vulnerable" and "safe" AST representations, train
the ML system to recognize a "vulnerable" AST

IDENTIFICATION OF NEW VULNERABILITIES PHASE:

1) Download open source code bases

2) Somehow prioritize which code to convert to AST

3) Convert code to AST and feed to ML system to determine likelihood of
"vulnerability"

4) Apply some combination of static and manual analysis to verify the
vulnerability

5) Use results as further feedback to train the ML system

Is anyone familiar with something like this existing?
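
To make the data collection phase concrete, here is a minimal sketch of steps
2 and 3, assuming the vulnerable code is Python and that "critical section"
means "any function the patch touched". The function names and the
BEFORE/AFTER sample are made up for illustration; a real pipeline would need a
parser for the target language (most CVEs are in C/C++) and a smarter notion
of critical section.

```python
import ast
import difflib
from typing import List, Set

# Hypothetical before/after pair: the patch adds a bounds check.
BEFORE = """\
def read_item(items, index):
    return items[index]

def helper():
    return 1
"""

AFTER = """\
def read_item(items, index):
    if not 0 <= index < len(items):
        raise IndexError(index)
    return items[index]

def helper():
    return 1
"""


def changed_line_numbers(before: str, after: str) -> Set[int]:
    """Return 1-based line numbers in `before` that the patch touched."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changed: Set[int] = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changed.update(range(i1 + 1, i2 + 1))
        elif tag == "insert":
            # Pure insertion: blame the line just before the insertion point.
            changed.add(max(i1, 1))
    return changed


def critical_functions(before: str, after: str) -> List[ast.FunctionDef]:
    """Return the functions in `before` whose line span overlaps the patch."""
    changed = changed_line_numbers(before, after)
    hits = []
    for node in ast.walk(ast.parse(before)):
        if isinstance(node, ast.FunctionDef):
            span = range(node.lineno, node.end_lineno + 1)
            if any(line in changed for line in span):
                hits.append(node)
    return hits


def node_type_sequence(node: ast.AST) -> List[str]:
    """Flatten an AST into node-type names, one possible ML feature."""
    return [type(n).__name__ for n in ast.walk(node)]
```

Calling critical_functions(BEFORE, AFTER) selects only read_item, and
node_type_sequence turns it into a list of node-type names. That flat list
throws away the tree shape, so it is only a starting point for the training
phase, not a claim about which representation works best.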
======
chatmasta
I'm curious to hear people's thoughts on this. It's an idea I've been playing
with in my head for a while, but most of it falls way outside my expertise.

There are definitely a lot of challenges with it, mainly false positives (e.g.
a doubly nested for-loop with a dozen conditionals might look like a
vulnerable AST even though it sits in a non-critical portion of code). But I
think the
central idea of training ML algorithms based on existing vulnerabilities would
lead to a very efficient way of finding NEW vulnerabilities. At the very
least, it could provide an efficiency boost to tools like fuzzers, by
directing them to the critical portions of code. Also, it does not have to be
limited to open source code. It could also disassemble the vulnerable binary
and the patched binary, and compare their ASM instructions. In fact, this
might even yield a higher signal than the AST method.
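
A rough sketch of the binary variant, assuming some disassembler has already
produced one instruction string per line for the same function in both
binaries. The two listings below are invented, and patch_delta and mnemonics
are hypothetical helper names, not part of any existing tool.

```python
import difflib
from typing import List, Tuple

# Hypothetical disassembly of one function before and after the patch.
VULNERABLE = [
    "mov eax, [rdi+rsi*4]",
    "ret",
]
PATCHED = [
    "cmp rsi, rdx",
    "jae fail",
    "mov eax, [rdi+rsi*4]",
    "ret",
]

Delta = Tuple[str, List[str], List[str]]


def mnemonics(listing: List[str]) -> List[str]:
    """Keep only the opcode of each instruction, dropping the operands."""
    return [line.split()[0] for line in listing if line.strip()]


def patch_delta(vulnerable: List[str], patched: List[str]) -> List[Delta]:
    """Return (tag, old instructions, new instructions) for each changed run."""
    matcher = difflib.SequenceMatcher(None, vulnerable, patched, autojunk=False)
    return [
        (tag, vulnerable[i1:i2], patched[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Here patch_delta(VULNERABLE, PATCHED) reports a single inserted run holding
the cmp/jae pair, i.e. the bounds check the patch added. Diffing on
mnemonics() instead of full instruction strings would make the comparison less
sensitive to register allocation, at the cost of losing operand information.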

~~~
pboutros
This is really interesting - and extremely ambitious. Having played with the
CVE database, I'd caution you that it was surprisingly difficult (for me) to
parse. The content is mildly unstructured, so that would be one of the first
things I would look to figure out before building the retrieval system for the
affected code.
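
As a small illustration of that first parsing step, here is a sketch that
pulls fix-commit links out of free-text fields. The record layout (a dict with
"description" and "references" keys), the placeholder ID, and the URLs are all
made up for the example and are not the actual CVE schema; the regex only
covers GitHub-style commit URLs.

```python
import re
from typing import Any, Dict, List

# Matches GitHub-style commit links such as .../owner/repo/commit/<hex sha>.
COMMIT_URL = re.compile(
    r"https?://github\.com/[\w.-]+/[\w.-]+/commit/[0-9a-f]{7,40}"
)

# Hypothetical record; real CVE entries are structured differently.
SAMPLE_RECORD: Dict[str, Any] = {
    "id": "CVE-0000-0000",
    "description": (
        "Out-of-bounds read in example-lib; fixed in "
        "https://github.com/example/example-lib/commit/0123abc4"
    ),
    "references": [
        "https://github.com/example/example-lib/commit/0123abc4",
        "https://example.org/advisory/1",
    ],
}


def fix_commit_urls(record: Dict[str, Any]) -> List[str]:
    """Collect unique commit URLs from the description and reference list."""
    texts = [str(record.get("description", ""))]
    texts.extend(str(ref) for ref in record.get("references", []))
    found: List[str] = []
    for text in texts:
        for url in COMMIT_URL.findall(text):
            if url not in found:
                found.append(url)
    return found
```

Each URL returned by fix_commit_urls(SAMPLE_RECORD) would then be a candidate
for fetching the before/after code. Records that only link to an advisory page
or a mailing list post would still need manual or per-site handling.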

