
SourceDNA (YC S15) finds hidden security and quality flaws in apps - katm
http://venturebeat.com/2015/08/11/sourcedna-launches-searchlight-a-developer-tool-to-find-coding-problems-in-any-app/
======
tptacek
I'm not unbiased when it comes to Nate, who is one of my older friends,
because he's dragooned me into being an advisor for SourceDNA. I've promised
to donate all proceeds from his venture to charity, unless it returns enough
to buy me a private jet, in which case I'm going to buy a private jet and then
donate the rest to charity. I almost quit Matasano to join him; the day after
I flew out to work out a role, we got the acquisition offer, and I had to
stay.

Nate is way underselling himself. He's essentially not only acquired most of
the contents of most of the app stores, and not only decompiled them, but has
then built up a comparative analytics framework that can answer questions
based on code similarity (as a first order of available facts) and behavior
(as a sort of second-order thing).

I'm really curious to see what ideas other people would have for this kind of
data set. If you could answer virtually any question about the behavior of
any/every app in the app store, what would you do with that capability?

Also: people should ask him questions about how this stuff works. It's really
neat.

~~~
NateLawson
Thanks, high praise indeed. I'm super curious too: what would you want to know
if you could see inside any app in the app stores?

~~~
munin
I would be interested in pointing a (distributed, imprecise) symbolic executor
at each app to gain a sense of the "state depth" of that app, and then
correlating that with bugs discovered and rate of code churn in the app, and
comparing the "state depth" across apps in a given 'vertical' and all apps
globally.

~~~
NateLawson
Interesting, thanks. We've discussed providing some kind of code quality
score. Also, we can definitely do something interesting by comparing how other
apps have performed that share code features with yours. The ideal would be a
predictor that makes recommended changes based on the insights we've learned
from other code we've seen.

------
NateLawson
Hi, I'm happy to discuss how we're finding hidden flaws in millions of apps
that even developers didn't know about. We've built a really cool binary code
search engine that has indexed the structure and behavior of apps. Our engine
allows us to quickly find apps that exhibit particular problems, such as
calling a broken API or using a version of a library that has a vulnerability.

I need to write more about how it works. We translate the app code into an
intermediate language (like LLVM bitcode) and index features derived from both
the structure (callgraph/control flow graph) and syntax (opcodes) of each
function. This allows you to search for snippets of code that match particular
patterns or discover the relationship between modules by assessing the
similarity of each. Since we use an IL, we can match code cross-platform.
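
To make the indexing idea concrete, here is a minimal sketch of deriving
features from a function's syntax (opcode n-grams) and structure (control flow
graph shape) and comparing two functions by set overlap. Everything in it is
an assumption for illustration: the toy opcode names, the `cfg` dictionary
layout, the choice of 3-grams, and Jaccard similarity are not SourceDNA's
actual IL, feature set, or matching algorithm.

```python
from collections import Counter


def opcode_ngrams(opcodes, n=3):
    """Return the set of n-grams over a function's opcode sequence."""
    return {tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)}


def cfg_shape(cfg):
    """Summarize a CFG as a sorted tuple of (in-degree, out-degree) per block.

    cfg maps a block id to the list of its successor block ids
    (a hypothetical layout chosen for this sketch).
    """
    indeg = Counter(dst for succs in cfg.values() for dst in succs)
    return tuple(sorted((indeg[b], len(succs)) for b, succs in cfg.items()))


def function_features(opcodes, cfg):
    """Combine syntactic and structural features into one feature set."""
    feats = {("ngram", g) for g in opcode_ngrams(opcodes)}
    feats.add(("cfg", cfg_shape(cfg)))
    return feats


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy inputs: a diamond-shaped CFG and a single-block CFG.
diamond = {0: [1, 2], 1: [3], 2: [3], 3: []}
single = {0: []}

f_a = function_features(
    ["load", "load", "add", "store", "cmp", "br", "ret"], diamond)
# Same function with one extra call inserted before the return.
f_b = function_features(
    ["load", "load", "add", "store", "cmp", "br", "call", "ret"], diamond)
# An unrelated function.
f_c = function_features(["push", "call", "pop", "ret"], single)

print(jaccard(f_a, f_b), jaccard(f_a, f_c))
```

In this toy case the slightly modified function still scores well above the
unrelated one, which is the property a code search index needs.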

I'd love to talk about it here if you have questions.

~~~
geographomics
Does it work against intentionally obfuscated binary code? Anti-piracy
measures and suchlike.

~~~
NateLawson
Yes, in two ways. First, it's often useful to target malware or other systems
by the common packer code they use. This doesn't require deobfuscating
anything.

More interestingly, the matching algorithms we developed are by design
resistant to many common obfuscation schemes. For example, ProGuard renames
variables and functions and discards dead code. But other aspects of the code
that we index, including control flow structures and data references, survive.
We've designed our matching engine to apply a combination of all
these factors to be resilient to changes in subsets of them.
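
As a rough illustration of why combining several kinds of features helps, the
sketch below scores two functions as a weighted average of per-channel set
overlap. The channel names, weights, and feature strings are all hypothetical
and chosen only for this example; it models renaming alone (not dead-code
removal) and is not the actual matching engine.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical feature channels and their relative weights.
WEIGHTS = {"names": 1, "cfg": 2, "data_refs": 2}


def combined_similarity(f1, f2, weights=WEIGHTS):
    """Weighted average of per-channel Jaccard similarity."""
    total = sum(weights.values())
    score = sum(w * jaccard(f1[ch], f2[ch]) for ch, w in weights.items())
    return score / total


original = {
    "names": {"parseHeader", "readChunk", "verifySignature"},
    "cfg": {"loop:2", "branch:5", "blocks:9"},
    "data_refs": {"str:Content-Length", "str:gzip", "const:0x1f8b"},
}

# A renaming obfuscator replaces identifiers but leaves the control flow
# structure and data references in place.
renamed = dict(original, names={"a", "b", "c"})

unrelated = {
    "names": {"drawFrame", "loadTexture"},
    "cfg": {"loop:1", "branch:2", "blocks:4"},
    "data_refs": {"str:shader.vert"},
}

print(combined_similarity(original, renamed))    # names channel contributes 0
print(combined_similarity(original, unrelated))  # no channel overlaps
```

Because the names channel carries only part of the weight, wiping it out
lowers the score but does not break the match; defeating the match would
require disturbing several channels at once.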

Compilation is one of the simplest forms of "obfuscation", so we started by
dealing with different levels of optimization that discard code or change the
opcodes.

If a program is self-decrypting, then we have to apply standard unpacking
techniques to get back to a reasonable format before ingesting it. Nothing
magic there.

~~~
munin
What if the obfuscation removes the ability to construct a control flow graph,
but is not self-decrypting?

~~~
NateLawson
We depend on a huge number of program features, so losing the CFG alone isn't
enough to throw off matching (data references, function layout, linker
behavior, and much more are included).

You're right that, at some point, you've destroyed enough of the features that
matching will fail. For that case, you'll always need targeted techniques,
such as for virtualizing obfuscators. Since there's a big performance penalty
when you use them, almost no legitimate mobile apps are willing to pay the
battery and speed cost.

