Welcoming Semmle to GitHub (github.blog)
256 points by johns on Sept 18, 2019 | 104 comments

The linked blog post [0] and the new security marketing page [1] both have a little more detail on what this actually means.

Basically, Semmle offers a static analysis tool that operates on your source code as a graph (from what I understand) and points out bugs and security holes in your code. Github is now offering that for free on repos at all tiers.

[0] https://github.blog/2019-09-18-securing-software-together/

[1] https://github.com/features/security

Semmle is basically datalog over source code.
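
To make the "Datalog over source code" idea concrete, here is a toy sketch in Python, not QL itself. The call facts and function names are invented for illustration; the point is only that facts extracted from code, plus recursive rules evaluated to a fixpoint, give you queries like "everything that can reach memcpy".

```python
# Toy Datalog-style evaluation over hypothetical "calls" facts.
# Each fact is a (caller, callee) pair; none of these names come from a real codebase.
calls = {
    ("handle_request", "parse_input"),
    ("parse_input", "copy_buffer"),
    ("copy_buffer", "memcpy"),
    ("log_event", "format_message"),
}


def transitive_closure(edges):
    """Evaluate the two rules below until no new facts are derived.

    reaches(X, Y) :- calls(X, Y).
    reaches(X, Z) :- reaches(X, Y), calls(Y, Z).
    """
    reaches = set(edges)
    while True:
        derived = {
            (x, z)
            for (x, y) in reaches
            for (y2, z) in edges
            if y == y2
        }
        if derived <= reaches:  # fixpoint: nothing new was derived
            return reaches
        reaches |= derived


reaches = transitive_closure(calls)

# "Query": which functions can end up calling memcpy, directly or indirectly?
callers_of_memcpy = sorted(x for (x, y) in reaches if y == "memcpy")
print(callers_of_memcpy)  # ['copy_buffer', 'handle_request', 'parse_input']
```

A real engine would index the facts and evaluate rules incrementally rather than rescanning every pair on each round.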

For what it works for, it works nicely. But it is not a panacea.

Security vulnerability finding is almost certainly the wrong target for Semmle - I am unsure why they are trying to push that angle. There are much better stories in things like refactoring and understanding. (I say this having overseen a number of deployments for various reasons, some successful, some not)

I've seen coworkers run semmle queries across the entire Windows OS codebase and find hundreds of issues which were/could result in security vulnerabilities. They've also leveraged it for variant analysis. If I'm not mistaken, the security teams are the largest internal users of Semmle at Microsoft.

You're right though, it's not a panacea, and it could probably be great for other uses too.

Any idea what the false positive rate is?

If you use the out of the box rules for any of these tools the false positive rate will usually be pretty high. The trick is to write custom rules that are more tailored to your code.

So how much is involved in writing the rules, and at the end of it, what is the net false positive?

It took me a few months to get decent results with a low false positive rate. We haven’t had the tooling in place long enough to give hard stats, but our aim is a false positive rate of less than 25%. Another great thing these tools provide (if the results are valid) is that they let more junior team members, or developers less familiar with security issues, understand the vulnerabilities found, because they display a nice call flow graph/diagram that shows source to sink.

Nothing is a panacea. Things that help move the needle without requiring tons of time or effort are useful and valuable. I'm really glad to see more efforts in this area.

While it's true that there is no panacea, Semmle will not move the needle on vulnerability finding. This I have extensive data on.

(I mean this in terms of capability, not sudden popularity)

It would move the needle on a bunch of other things. It is a good tool for sure (and I'm very happy for them); I just think they will disappoint people by pressing this particular narrative, and wouldn't with a different one.

I'd be curious as to why you think that. Are you able to provide more detail on that claim? I have extensive experience with Semmle and my experience drastically differs from yours.

With certain languages and a strong and diverse ruleset, Semmle has its strengths. In particular, with native code (C, C++) and decent rules I have seen Semmle be very successful at finding certain classes of bugs.

The Datalog part is interesting! Do they have a bunch of rules to make graph queries work nicely, like Datomic pull syntax or maybe some pattern matching syntactic sugar? Is the underlying thing still an EAVT store? Is any of that information publicly available?

It looks like they have reasonable docs on their query language, in particular https://help.semmle.com/QL/learn-ql/about-ql.html#properties... has some info on the QL language.

https://help.semmle.com/lgtm-enterprise/user/help/generate-d... says "LGTM generates a database for each commit stored in a repository. Each database is a relational database that represents the structure of the codebase for a specific revision, or snapshot, of the code.", though a triple store could qualify as relational here. I couldn't find much more than that about the implementation details though.

Right. I found those docs but they didn’t look like datalog queries at all. Of course that doesn’t mean they don’t compile down to datalog :)

I would love to hear more about this data, as everyone I know who has used it for vulnerability finding has very good things to say. Semmle have also demonstrated its capabilities with some high profile examples.

Who have you talked to about it? Outside Mozilla and Microsoft?

I’d rather not name drop individuals or companies, partly because I don’t know that these entities want their business relationships public, but neither of the companies you named. I’ve also used it myself (you can too, at lgtm.com). I’m aware that MS is a customer, but I don’t think I’ve talked to anyone there about their experiences with it. As far as static analysis goes, which is inherently limited, it’s far better than anything else I’ve tried (which is most of them).

Fair! Thanks!

Could you share the details of your experience? My experience has been quite the opposite.

I have been using Semmle daily to automate much of the vulnerability discovery process and I am extremely satisfied.

We run it over millions of lines of Java code and have not yet run into scale or perf problems.

Developing custom queries and defining security invariants in a logic language is, quite honestly, a joy.

Semmle does not scale, both in terms of their index design and their overall system design. This makes it poorly suited for truly global program properties and much better suited for things like refactoring.

It's being run frequently across the entire Windows OS repo. I have heard there is more work to be done to make it scale better, but it can scale.

Am Microsoft.

Mountains were moved to make it scale, but that has been achieved. Semmle can scale with work - it just takes a lot of effort and code.

Yeah, I think that is at least one of the issues. For us, for what it provides, it is not worth the time/effort vs just building our own tools or other options.

What had to be done to make it scale?

I was not part of this effort but I did scale another static analysis tool for industrial size codebases and have a patent on it.

To simplify a bit, you can think of most static analysis algorithms in terms of graph problems where nodes are statements and functions and edges are flow of control and calls. On large codebases the number of edges, nodes, and calculated data is just too big to keep in memory. The trick is to break the graph intelligently into parts, calculate some sort of summary information for each of them (distributing the work between CPUs or computers), then move up to the supergraph of graphs and perform higher-level calculations on it.
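
A minimal sketch of that partition-and-summarize idea, in Python. Everything here is an assumption for illustration: the tiny flow graph, the "part.node" naming, and the choice of plain reachability as the analysis. It is not how any particular commercial tool works; it only shows per-part summaries being stitched into a smaller supergraph.

```python
from collections import deque

# Hypothetical flow graph: node -> successors. Names are "part.node".
EDGES = {
    "a.src": ["a.x"],
    "a.x": ["a.out"],
    "a.out": ["b.in"],   # edge crossing from part a to part b
    "b.in": ["b.y"],
    "b.y": ["b.out"],
    "b.out": ["c.in"],   # edge crossing from part b to part c
    "c.in": ["c.sink"],
    "c.sink": [],
}
ENDPOINTS = {"a.src", "c.sink"}  # the nodes the final query asks about


def part_of(node):
    return node.split(".", 1)[0]


def reachable(graph, start):
    """Plain breadth-first reachability over a successor map."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Edges that cross part boundaries, and the nodes the supergraph must keep.
CROSS = [(u, v) for u, succs in EDGES.items() for v in succs
         if part_of(u) != part_of(v)]
BOUNDARY = ENDPOINTS | {n for edge in CROSS for n in edge}


def summarize(part):
    """Summary of one part: which boundary nodes reach which, using only local edges."""
    local = {u: [v for v in succs if part_of(v) == part]
             for u, succs in EDGES.items() if part_of(u) == part}
    return {b: sorted((reachable(local, b) & BOUNDARY) - {b})
            for b in local if b in BOUNDARY}


def build_supergraph():
    supergraph = {}
    for part in sorted({part_of(n) for n in EDGES}):
        # Each summarize() call only touches one part, so it could run on its own worker.
        supergraph.update(summarize(part))
    for u, v in CROSS:
        supergraph.setdefault(u, []).append(v)
    return supergraph


supergraph = build_supergraph()
source_reaches_sink = "c.sink" in reachable(supergraph, "a.src")
print(source_reaches_sink)  # True
```

Interior nodes such as `a.x` and `b.y` never appear in the supergraph, which is where the memory saving would come from on a real codebase.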

I was looking at it earlier and the query syntax seems awesome. I’ve spent the last few months writing custom rules (queries) for Fortify SCA, which is another static code analysis tool, and I must say, Semmle seems like it’s a lot easier to use.

Static code analysis tooling is never the end all be all for vulnerability research, but it does let you express vulnerability patterns for implementation type vulnerabilities and find them at a mass scale (that is if your rules/queries are legit).

> Security vulnerability finding is almost certainly the wrong target for Semmle

Uh, I'm not sure why you believe this is an effective retort, perhaps you would like to explain?

>This list provides details about security vulnerabilities discovered by the Semmle Security Research Team using Semmle QL.

Clearly, it works.

You want to see how long a list I can make for you for grep?

How would this list compare to other methods?

You’re wrong on that, security teams at the major tech companies love it, especially for variant analysis. Ask your coworkers at Google! One of which recently left to become Semmle’s Chief Security Officer.

I know Fermin, quite well, and I know the state of using it at Google. I technically am the one paying for the contract at this point!

I've also met repeatedly about it with all Google customers over the past few months, both those currently using it and those that stopped.

Prior to that, I had met with most that used it but stopped, at the point they started using it. There were several large scale attempts/efforts to use semmle in various ways, by various teams.

I'm trying to be as nice as possible here, since, as I said, I think it is a great tool for a lot of things, and I've been a strong supporter of this kind of technology for those cases (for years, in fact, as I'm sure some folks at Semmle can tell you). So I'd rather not burn it all down, which I expect is what would happen if I did a point-by-point explanation of everything bad about it.

So I will instead reiterate my claim, and take my downvotes ;)

Interesting - I’d like your unfiltered take, but I totally understand your position. I’ve heard very positive feedback from one security person at Google, and I don’t know Fermin, but taking a CSO role there would suggest to me that he believes in it.

If you are comfortable sharing more, I’d be curious what you found that it struggles with, without burning it to the ground. It’s possible that different security teams use it in different ways, and it might be more suitable depending on your expectations. It’s also possible that the feedback I heard was during the honeymoon period, and practical issues outweigh the utility once you use it more.

But I’ve seen real 0-days found with it, first hand and second hand, so I’m having trouble reconciling that with your account that it’s not useful for security.

"and I don’t know Fermin, but taking a CSO role there would suggest to me that he believes in it."

Sure, but Fermin was also offered a fairly ridiculous amount of money and a serious promotion :)

I mean, he doesn't not believe in it, of course, but I also think most folks would have taken the role in his situation.

IE it's not the kind of offer that really required a lot of faith

I'll try to write a bit more later after I think about how to frame it :)

Fermin here. Danny, I loved working with you at Google but let me correct you about my promotion and money. Not true.

I respect your point of view around our technology. You may like it or not (some folks love it), but please do not make statements about me you do not really know :)

And to be clear, I believe this technology makes security researchers scale on different aspects. At least I had first hand experience with this and our goal is to make security easy for non security folks... this technology enables us to do this.

Happy to sync in private over a coffee!


All good Danny, we had good times at Google... let's remember those :)

Coffee offer is still there!

Danny, I am the CEO and founder of Semmle. I will refrain from arguing about the value of our product and technology. However, I must correct your statement about Fermín which is utterly false. He took a huge pay cut to come to Semmle. Please stick to facts when talking about people.

So I did read a whitepaper about static analysis at Google, and how it was largely self-serve - let developers run the tools and fix what it tells them to as they see fit. I’m wondering if it was under this model where you found it was not useful. I would not expect it to provide much value in that scenario, and would not be surprised by your feedback.

If your data is closer to a model where security bug hunters, whose sole job is to find vulnerabilities and audit code, used it and deemed it not useful, then yes, I am at odds with your claim. Admittedly, that’s a pretty niche set of customers. If you don’t learn Semmle QL, and you aren’t writing queries, it’s probably not for you.

Here is our experience building and using program analysis as part of our product security efforts at facebook: https://engineering.fb.com/security/zoncolan/.

It's run in self-service mode (output to developers) and guided mode (output to the product security oncall of security engineers), and is used ad hoc to power up manual security reviews. Depending on the accuracy of each rule and the impact of the pattern of security flaw the rule finds, it is promoted to ultimately output to developers directly.

It finds about a third of the security vulns we unearth each year.

That’s been my approach as well. An astonishingly large number of companies think they can buy an off the shelf static analysis tool and pipe the default output to developers. That’s counterproductive. A very small percentage of developers will understand the output, be able to assess the exploitability/severity, and care about fixing it. One might think you could then just have them take the “better safe than sorry” approach and fix everything, but FP rates for all of the commercial tools make that completely untenable. At the same time, you can’t expect to convince small teams of developers to model everything out and define sources/sinks using some obscure DSL that they have to learn. But, there are classes of issues that are extremely high impact, but only low accuracy static analysis rules can find the candidates. It’s that part in the middle that you don’t want to throw out, but you need security experts to vet. Other cases with high confidence checks are appropriate to short circuit straight to the devs, but it’s a bad first step.


I think there's some great feedback here, for anyone at Semmle thinking about how to develop the tool further.

Annoyingly, now that the GP post is deleted, the context to my comment looks different, and I can't delete my comment.

Since the GP has been deleted I'll respect that and not reference specifics, but I want to clarify for any passers by, much of the GP comment I replied to was of a detailed and technical nature, about things like performance enhancements, features and semantic analysis approaches that could make the tools useful in more use-cases - very different from the rather general and personal criticisms I see elsewhere in the nearby comment tree.

It's the suggested technical and product enhancements that I felt were potentially useful feedback, rather than any of the criticism (I can understand why those are deleted).

First of all, Daniel Berlin is pretty senior at a reasonably large tech company a lot of us here have heard of.

Secondly, I know Microsoft loves it, which is presumably where your telemetry comes from, and I know a lot of security people on Twitter are fans of the technology, but I've been asking around and "love it" is not the signal I'm getting from software security blue team people. "I installed it, I guess it does some stuff, we never think about it" is the modal feedback I've seen.

I'm very interested in hearing success stories about this; the problem Semmle addresses is a huge part of the cost basis for my practice, and I'd love to hear that someone has gotten it working well.

It’s absolutely useless in the wrong hands, so I don’t think you’d necessarily get a good signal by asking your average blue teamer. It’s a godsend for someone who spends a lot of time auditing code and has some experience writing code analysis tools.

I mean think about it, if you wanted to write a query against the AST of a target, would you find that useful? Or in a given codebase, if you find one bug, would you like the ability to capture that in a query that can tell you if a similar mistake was made elsewhere?
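
As a rough illustration of that second question (variant analysis), here is a sketch using Python's standard `ast` module rather than Semmle QL. The sample source and the choice of `shell=True` as the "bug pattern" are assumptions made up for the example; the idea is that once one bad call is found, a small query over the syntax tree finds its siblings.

```python
import ast

# Hypothetical code under review. Suppose the first shell=True call was found by hand.
SOURCE = '''
import subprocess

def backup(path):
    subprocess.call("tar czf backup.tgz " + path, shell=True)

def list_dir(path):
    subprocess.run(["ls", path])

def cleanup(pattern):
    subprocess.run("rm -f " + pattern, shell=True)
'''


def find_shell_true_calls(source):
    """Return the line numbers of every call that passes the keyword shell=True."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            if (kw.arg == "shell"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value is True):
                hits.append(node.lineno)
    return sorted(hits)


hits = find_shell_true_calls(SOURCE)
print(hits)  # [5, 11]
```

This only matches syntax; it has no type or data-flow information, which is the part a tool like Semmle adds on top.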

Out of the box, it isn’t going to give you much value. It’s the power of the query language, if it’s your job to do that, where you’ll see the benefits.

But don’t take my word for it, just try it out.

Their licensing model may be problematic for your use case though. I only vaguely understand what you do, but last I asked them about it, it’s not possible to get a personal license that a security person can use for multiple projects, and my read was that they had no interest in selling to individuals anytime soon.

I mean, queries against an AST is sort of standard security tooling; the difference appears to be that Semmle (1) properly assigns types in C/C++ and (2) exports that query language. (1) makes sense to me; (2) I don't know how much better I'd get than just hand-writing tree walkers.

Anecdata, I have also successfully used it to unearth complicated bugs that I already knew must exist, but didn’t have a good way to find.

Think of a query like: Find all calls to function A that have an output pointer-pointer of type B in the last argument position and also have a boolean return type, verify the input dereferenced B is NULL, then verify that iff A assigns output to B that the return type is false and that the calling frame also includes a later call to function C to clean up B. This type of thing run on millions of lines of code.

This can get as crazy as you want it to be and did work, but there is a “but.” The query language is so verbose and powerful, there were many, many ways to represent the same query that all had drastically different performance profiles. The docs were woefully underwhelming in that regard — they simply stated what a member did, but not anything at all related to memory or performance implications. That, coupled with the fact that nearly every complex query during dev hit the wall of the JVM killing it for taking too much time/mem, it became apparent that they must have tooling for the tooling to analyze performance and memory profiles of queries (not unlike all the effort put into SQL over the years). Also, no debugger; resorted to “datalog printf debugging” (ie, including internal clauses as outputs and chopping off later parts of the query...)

I spent a few weeks only writing queries and in every non-trivial one I needed a senior person there to review/rearrange a few clauses to go from e.g. 30 minute runtime to 15 seconds. That left me with a feeling that I was constantly fighting the docs and lack of tooling and would constantly need their help to tune things. Nothing was fundamentally wrong with the queries — it was just not documented anywhere that some filtering should be done caller->down vs. other kinds of checks in the same query should be callee->up.

I had questions regarding scaling up to 1B+ LoC like others in the thread but didn’t really get that far.

It did successfully find the C/C++ bugs I was looking for once the queries had hours of investment from both sides, and we were also able to find bugs in a large custom JS codebase by mocking all the things it would need to understand to eval the code. Whether or not that investment makes sense is an individual/team/org question, but if they ever seriously invest in a query debugger and profiling, they’ll be pretty hard to beat.

This is exactly the kind of thing I'm interested in knowing about Semmle. Thank you!

I’d presume the tool would be used by a whole team, not just Thomas alone, and hence be licensed to a company… Are you saying that they don’t sell the tool to infosec businesses working with multiple end customers?

That’s correct. They sell to whoever owns the code being looked at, and charge accordingly, based on how many developers the codebase has. The incremental cost for people on security teams to use it is actually $0, no matter how many of them are working with it. If you have 500 developers checking in code, that’s what they charge you for, and read/query access to the results is “free”.

The funny thing is I was working on a pitch to get Atlassian to buy them so they don't end up in Microsoft's hands. I thought integration with a repo company would be good since they could cross-sell it for code understanding and maintenance. Then I see this article. (sighs) At least I got it right on the type of company that would grab them.

I'd push something else for security, though, to complement it. RV-Match is my favorite commercial one because they built on a formal semantics for C, it's set for low false positives, and they open source a lot of stuff. They have something for Java and smart contracts, too. Past that, what's good depends on what language you use.

Just a small fraction of the industry's ongoing software development is done in C. Obviously, Google, Microsoft, Apple, and Mozilla still write a lot of it, but you don't acquire a whole company to address 4-and-change customers.

They'd have them retarget to what's popular post-acquisition in my concept. Especially among their paying customers. Keep adding languages or just useful things for it to look for.

Security problems aren't interchangeable between languages, and C has a very particular set of concerns that don't translate.

Why would Atlassian be better?

They were kind of a default since there's only a few huge ones. Main win: it's not Microsoft. MS already has lots of internal tools from MS Research they were wasting. They probably could've built Semmle themselves. They're also a known patent troll. I don't know if Semmle's methods are patented, though.

I'd rather a company like them not acquire them in favor of one that is constantly developing new services with no attempts to financially drain 3rd parties. Not to say good won't come out of Github integration given huge number of projects in it.

Am I reading that right that it's only on public repositories though? For private repositories I guess you have to buy through Semmle directly (via call us pricing)?

This would make sense since Semmle would likely need access.

I'd be ready to put money on the fact that GitHub has access to all repositories, even private ones!

Of course they do, but they also have safeguards in place that prevent access without alerting auditors and eventually the repo owner.

So I'm guessing they'll be merging what they have now with Semmle's tool? Because they've had the free vulnerability check for a while now.

Different things. Github has features that scan repos for "known-vulnerable" dependencies. They do not have features that scan for new vulnerabilities.

Yes and no. LGTM is about known vulnerabilities. It doesn't (currently) use artificial intelligence to discover new vulnerabilities, but it allows writing complicated yet efficient patterns for vulnerabilities found by human intelligence.

So it's more advanced than simple "known bad dependencies", but it's also not quite "new vulnerabilities".

Thank you for the zero-indexed reference list. Brought a smile.

I hate that these kinds of Orwellian phrases "Welcoming X to the Y Family" have now become idiomatic of corporate English. Ugh, no. There is no "family" involved here, not by any stretch of the word.

To be fair, if a “parent corporation” is a thing, then logically it has children and can be a corporate family.

Intent matters. The phrase "parent corporation" has no PR or emotional intent. "Welcoming X to Y family" has a clear emotive intent.

Maybe the intent of the writer was to make the new hires feel welcome aboard to their new company.

That's also not mutually exclusive of the emotive intent you are describing. What makes that Orwellian though?

The parent comment seems to be criticizing the higher-level corporate trend to use this lingo, and isn't talking about Friedman or Github specifically.

Github has a cartoon cat-octopus all over the site that I would expect to see as a baby toy. It is clearly linked to "emotional intent": "Github is so fun, guys!" I think they've outgrown that style (the black and white one in the header is ok).

Welp, guess it's time to bust out the "I'm offended and we need to change this lingo" card because corporations aren't people and we should stop referring to them that way

I mean, in all seriousness, aren't they a little bit like people? https://www.npr.org/2014/07/28/335288388/when-did-companies-...

It actually took me a while to understand that the announcement likely meant that they had acquired Semmle...

They have to say that anyway, because Microsoft is acquiring Semmle, not GitHub. It is joining the GitHub "product family".

If they had said "Welcoming Semmle to the Github Family of Products" instead, that would've been much more tolerable.

When I read the headline I assumed it meant that GitHub was hiring an employee named Semmle, which confused me until I realized Semmle was a business.

Free hint for GitLab: they can integrate a similar but open source tool, Infer[1]. Essentially it provides similar features, it just lacks a good interface for them. They also have a query language, called AL[2]. It is way less polished than Semmle, but open source and with good potential.

[1] https://github.com/facebook/infer

[2] https://fbinfer.com/docs/linters.html

Interesting to see the differences between Github and Gitlab's strategy in this arena.

Github appears to be going the acqui-hire route with Semmle, dependabot, pullpanda etc, whereas I don't think Gitlab's made an acquisition for a year or two.

GitLab published what they're interested in: https://about.gitlab.com/handbook/acquisitions/. It's an amazing, one-of-a-kind doc. One of their constraints (https://about.gitlab.com/handbook/acquisitions/#what-we-offe...) is quite limiting, though:

> The total purchase price of the deal, paid in cash, will not exceed $1M and will be the total and only compensation for the entire deal.

They are looking at companies that: "Raised under $10M total investment funds, last round being over 3 years ago"

This implies that in addition to self-funded ventures, they are looking for fire sales from failed start-ups.

It looks like they're buying (big) features, not complete solutions or companies. That's actually an interesting approach; I'm sure others do that, but maybe not as explicitly.

It would allow a small team of hackers to have a decent exit without having to go through the whole startup road.

Gitlab hasn't generally seemed interested in these sorts of free scanning tools. I wonder if that's because their users are much more weighted towards private/self-hosted than Github's are? Because so little open source happens on Gitlab, they can't buy good PR through this kind of strategy like Github can.

I've been looking quite a bit into this recently, and even though they might not be screaming it from the rooftops, Gitlab offers quite a few security-related features. There are code scanning, dependency tracking, etc. features at various levels of readiness.

https://about.gitlab.com/devops-tools/ https://about.gitlab.com/stages-devops-lifecycle/secure/

They’ve had SAST tools for a few releases, but high up in the paid license types. With GitHub providing for free, they may need to move them into CE.

Their scanning tools are "source available", but they're definitely not open-source. The license is gonna be a non-starter, but how they built their SAST tool [0] is actually quite interesting.

It just uses existing open-source analysis tools, but orchestrates them all into a single tool by coordinating a bunch of docker images.

[0] https://gitlab.com/gitlab-org/security-products/sast

Microsoft has $130bn cash-on-hand.

The surprise is really that they're not being more aggressive in their acquisitions.

They should be like Yahoo was and buy everything they see? (for billions, only to sell it at a massive loss later)

I've not heard from Yahoo in a year at least, do they still exist ...

I just got an email from Yahoo about a settlement in a class action lawsuit over a massive data breach. It said something about Yahoo paying for 2 years of credit monitoring service to anyone affected by the breach.

Maybe that's not exactly what you were looking to "hear from" Yahoo about, though...

Github has been really working on their source code analysis toolkit recently & this acquisition makes perfect sense as part of that strategy. Congratulations to Oege & the team.

First project I look up on lgtm.com is rust.. Second alert I find is this:


exist_ok is available from Python 3.2, so this doesn't make a good impression.


Huge congrats to Oege and the team at Semmle - couldn't be happier for a hugely passionate and smart individual (and a previous professor of mine!)

Am sure this will bring some amazing advances to Github and thus a huge % of the developer community.

"Human progress depends on the open source community."

What a way to begin an article.

Especially considering M$ owns GitHub

I've just tested their lgtm.com on our codebase:

1) identified str.replace('[ABC]+', '') correctly as a bug (looks like a regex but is string literal)

2) identified various unnecessary code that TypeScript overlooked

3) identified double-unescaping of html (this one would have probably gone unnoticed for years)

And a bunch of other stuff. No actual vulnerability in our case, but still very useful. I'm enabling their checks on every future PR.
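
For readers who haven't hit bug (1) before, here is a Python analogue (the original codebase was TypeScript, so this is a translation of the idea, not the actual code). The `looks_like_regex` heuristic is my own hypothetical stand-in and is not the rule lgtm.com uses.

```python
import re

text = "xAABCx"

# str.replace treats its first argument as a literal substring, not a pattern,
# so nothing matches and the text comes back unchanged.
unchanged = text.replace("[ABC]+", "")

# What was presumably intended: an actual regular expression substitution.
fixed = re.sub(r"[ABC]+", "", text)

print(unchanged, fixed)  # xAABCx xx

# Hypothetical heuristic: flag string literals that contain typical regex syntax.
REGEX_HINT = re.compile(r"\[[^\]]+\][+*?]|\\[dws]|\.\*")


def looks_like_regex(literal):
    """Return True if a plain string literal looks like it was meant as a regex."""
    return bool(REGEX_HINT.search(literal))


print(looks_like_regex("[ABC]+"), looks_like_regex("ABC"))  # True False
```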

This was TypeScript but they support the rest of our stack too (Python, Java). I wonder if this includes Kotlin - will try.

Tested, Kotlin is not supported, nor is Swift.

> Human progress depends on the open source community.

(Non native speaker here). Am I misunderstanding something, or is the author explaining that humanity can not progress without the open source community?

That's right. That's called hyperbole (with an 'e').

As the other comments point out, it's hyperbole. It's also an aspirational statement.

Aspirational, as in, they wish that github would be the key factor in human progress. Maybe I can say that even plainer: to the leadership at github, the progress of the human race depends on github.

The open source community depends on github.

That's what it's implying.

(The reader is left to identify that as wishful thinking.)

That's the meaning I took from it (USA native).

I've used semmle's tools at Google, they seemed pretty powerful.

I spent way too long thinking that Semmie was just a badass programmer

Would be cool if the tools would be made open source in order for everyone to get more security.

So this is the excuse they're using to build infrastructure to scan through everyone's code to find whatever they want.
