
Semantic: Parsing, analyzing, and comparing source code across many languages - lelf
https://github.com/github/semantic
======
stcredzero
The late 20th century, early 00's version:

[http://www.program-transformation.org/Transform/CodeCrawler](http://www.program-transformation.org/Transform/CodeCrawler)

(And MOOSE)

------
anentropic
Looks very interesting - would benefit from showing some examples and/or use
cases

~~~
bentona
Agreed. Hoping/wondering if this could be used to generate dependency graphs
for Ruby projects?

~~~
sixwing
The combination of the Import and Call Graphs
([https://github.com/github/semantic#language-support](https://github.com/github/semantic#language-support))
should allow for the construction of a view of what symbols are imported,
and where in a given project they are called. Resolving those edges across
repos is also possible, in many cases.
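
A minimal sketch of that join in Python, where the edge tuples and file
names are hypothetical stand-ins for whatever the tool's actual import/call
graph output looks like:

    from collections import defaultdict

    # Hypothetical edges: (importing_file, imported_symbol, source_module)
    imports = [
        ("app/user.rb", "Logger", "logger"),
        ("app/user.rb", "HTTP", "net/http"),
    ]

    # Hypothetical edges: (calling_file, line, called_symbol)
    calls = [
        ("app/user.rb", 12, "Logger"),
        ("app/user.rb", 40, "Logger"),
    ]

    def call_sites_for_imports(imports, calls):
        """Map each imported symbol to the locations where it is called."""
        imported = {(f, sym): mod for f, sym, mod in imports}
        sites = defaultdict(list)
        for f, line, sym in calls:
            if (f, sym) in imported:
                sites[(sym, imported[(f, sym)])].append((f, line))
        return dict(sites)

    print(call_sites_for_imports(imports, calls))
    # {('Logger', 'logger'): [('app/user.rb', 12), ('app/user.rb', 40)]}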

------
beliu
If folks are interested in this sort of thing, either because they want to use
it or because they want to hack on it, I'd humbly suggest checking out
Sourcegraph: [https://sourcegraph.com](https://sourcegraph.com),
[https://github.com/sourcegraph/sourcegraph](https://github.com/sourcegraph/sourcegraph).

We've spent years building semantic code intel and search that's available
to open-source projects and software teams that aren't inside Google, which
has an amazing internal code explorer powered by Kythe (though,
unsurprisingly, a lot of our users and customers are ex-Googlers!).

Many thanks to GitHub for open-sourcing this. Our own open-source code
analysis stack uses tree-sitter (as this does) in addition to other language
analyzers via LSP, and this seems like a great contribution to the open-source
code analysis canon!

~~~
Eli_P
Can these tools reason about events, async, promises, threads, queues, all
these kinds of things? I think a plain graph representation will capture just
a minuscule fraction of a program's semantics. O(n) complexity analysis,
where "O" stands for Order, wouldn't work here either. You have to somehow
map those asynchronous pieces of code onto separate sequential spaces and
only then recognize similarities or do whatever analysis is needed.

I've been thinking that static analysis alone is not enough; gathering
profiling info and data types from the runtime would help a lot, and some
IDEs have been doing this already.

Could you shed some light on what sort of internal representation those
analysis tools use? Is it something like a lambda-calculus abstraction?

~~~
beliu
The internal representation depends on the tool and application. Here are a
few examples:

* For deeper analysis, we often use the same internal data structures that the compiler itself uses. This includes data structures that capture AST and type information. For some languages, the compiler doesn't capture enough information (e.g., dynamically typed languages with no static typing); in those cases, the data structures are often enhanced versions of, or similar-but-not-quite-the-same as, what the interpreter/compiler uses. (See the sketch after this list.)

* In "shallower" analysis, you can use a language-agnostic parser to generate a generic AST. There's various techniques for doing so in a language-independent fashion, depending on what tradeoffs you want to make with speed, accuracy, and generality. The tree-sitter library is one such example. You can also use tools like CTAGS + textual heuristics in the extremely shallow case.

* Then, there's the Kythe approach, which constructs a taxonomy of source code graph nodes and edges that aims to capture every possible relationship in every language. Each language requires an integration to convert from the source code in that language to the Kythe schema; that integration can use either the "shallow" or "deep" approaches described above, and the right answer depends again on your tradeoff of speed vs. accuracy.
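
As a tiny illustration of the first bullet: in Python, the standard ast
module exposes the same tree the CPython parser builds, so a few lines
extract definitions and call sites. This is a minimal sketch of the idea,
not how semantic or Sourcegraph actually does it:

    import ast

    source = """
    def greet(name):
        print("hello", name)

    greet("world")
    """

    tree = ast.parse(source)

    # Collect definitions and call sites from the compiler-grade AST.
    defs = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    calls = [n.func.id for n in ast.walk(tree)
             if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]

    print("definitions:", defs)   # definitions: ['greet']
    print("call sites:", calls)   # call sites: ['greet', 'print']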

As for combining dynamic information like profiling and runtime info, this is
something we've been thinking about for awhile. We think there's plenty of
low-hanging fruit that combines static information (types, AST, source code)
with runtime info (stack traces, profiling info). We've a couple of efforts in
motion toward this end. One is the Sourcegraph extension API
([https://sourcegraph.com/extensions](https://sourcegraph.com/extensions)),
which allows injection of third-party info, often from runtime tools, into
wherever you view code, whether it's in your editor, on GitHub, or on
Sourcegraph. Another is allowing a user to plug in stack-trace data to
heuristically select which jump-to-def/find-references options are most
likely to be useful given what you're currently debugging.
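
As a hypothetical sketch of that second idea (none of these names are
Sourcegraph's actual API), ranking candidate definitions by how often they
show up in collected stack traces could be as simple as:

    from collections import Counter

    # Hypothetical candidates for a jump-to-def on Invoice.total
    candidates = [
        "billing/invoice.py:Invoice.total",
        "reports/invoice.py:Invoice.total",
    ]

    # Hypothetical frames harvested from recent stack traces
    stack_frames = [
        "billing/invoice.py:Invoice.total",
        "billing/invoice.py:Invoice.total",
        "reports/invoice.py:Invoice.total",
    ]

    def rank_by_stack_traces(candidates, stack_frames):
        """Order candidates by how often the runtime actually hit them."""
        freq = Counter(stack_frames)
        return sorted(candidates, key=lambda c: freq[c], reverse=True)

    print(rank_by_stack_traces(candidates, stack_frames))
    # ['billing/invoice.py:Invoice.total', 'reports/invoice.py:Invoice.total']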

I'd love to hear what you're thinking of—care to share your ideas in this
space?

~~~
Eli_P
I've thought of making a tool for language-agnostic verification and/or
refactoring advice. It would be a modular system with a design looking
something like this:

AST -[basis]-> Normalized AST -[to sequence]-> Common Representation [graph;
embeddings] -> [Code Structure DB; Temporal DB (reflects what the code is
actually doing; runtime info involved)] <-[query for anomalies]

The idea is simple: we teach some database by feeding it lots of source
code. We may also want to define some rules, like what a deadlock is, what
an event is, etc. I have no clear idea how to implement the rules, but it
should be possible to make embeddings for the temporal data and map them
onto a space where checking for correctness can be done in linear time,
fingers crossed. But here comes the question: what should the temporal DB
be? Should it be a separate data definition language, or is the code itself
the data?

As for the basis, we're supposed either to use some generic calculus and a
subset of it for the particular programming language, or to somehow make it
possible to aggregate different bases into the same database.

Obviously this approach asks for machine learning instead of traversing the
graphs. I think there's a good reason for that: for example, if I've written
many chunks of code which actually do the same thing, it would take enormous
time to prove that using a combinatoric approach (graph self-similarities, NP).
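
To pin down the shape of what I mean, here's a toy, self-contained pass over
that pipeline using Python's ast module; every modeling choice here
(bag-of-node-types embedding, L1 distance, "anomaly" as the farthest
snippet) is a placeholder, not a proposal:

    import ast
    from collections import Counter

    def to_sequence(source):
        """Normalized AST -> sequence: node-type names, identifiers dropped."""
        return [type(n).__name__ for n in ast.walk(ast.parse(source))]

    def embed(seq):
        """Sequence -> common representation: a bag-of-node-types vector."""
        return Counter(seq)

    def distance(a, b):
        return sum(abs(a[k] - b[k]) for k in set(a) | set(b))

    corpus = ["x = 1 + 2", "y = 3 + 4", "while True:\n    pass"]
    vectors = [embed(to_sequence(s)) for s in corpus]

    # "Query for anomalies": the snippet farthest from all the others.
    scores = [sum(distance(v, w) for w in vectors) for v in vectors]
    print(corpus[scores.index(max(scores))])  # the while loop stands out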

------
cryptonector
This could be the basis for better indexing and search.

If you're working with a monorepo with hundreds of millions of lines of Java
and such, and you're looking for "all the places where getName() gets called
for <some interface>", today you might as well give up.
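
A minimal sketch of why a semantic index beats text search here: key call
sites by (receiver type, method) at index time, so the query becomes a
dictionary lookup instead of a scan over hundreds of millions of lines. All
the data below is made up:

    # Hypothetical precomputed index: (receiver type, method) -> call sites
    index = {
        ("com.acme.Named", "getName"): [
            ("billing/Invoice.java", 88),
            ("users/Profile.java", 12),
        ],
        ("java.io.File", "getName"): [
            ("tools/Walker.java", 301),
        ],
    }

    # "All the places where getName() gets called for com.acme.Named":
    for path, line in index[("com.acme.Named", "getName")]:
        print(f"{path}:{line}")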

~~~
saagarjha
Don't a lot of IDEs offer this functionality?

~~~
cryptonector
Yes... and no. See the thing about hundreds of millions of loc. You're not
going to load the entire monorepo into the IDE -- it's not practical. But
having an OpenGrok or whatever index of the monorepo is reasonable.

------
kqr
Something related I've been thinking about lately is combining something like
this with plagiarism detection tools to weed out code duplication. I have no
idea yet quite how, but it would be interesting to explore.

~~~
Darmani
That's called MOSS ("Measure of Software Similarity"), and it's been used by
universities for over a decade.

~~~
kqr
My goal is slightly different, though -- I'm more interested in semantic
similarity than source similarity. I saw one article suggesting running the
detection algorithm on the decompiled output of the compiler to get better
results, which seems like a good idea as well.
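
The decompile-then-compare trick is basically a stronger version of
normalizing away surface differences. A minimal sketch of that normalization
idea in Python: fingerprint each function's AST with identifiers and
literals erased, so renamed copies collide (real semantic similarity goes
well beyond this purely structural trick):

    import ast

    def fingerprint(source):
        """Dump of an AST with names and literal values erased."""
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                node.id = "_"           # erase variable names
            elif isinstance(node, ast.arg):
                node.arg = "_"          # erase parameter names
            elif isinstance(node, ast.FunctionDef):
                node.name = "_"         # erase function names
            elif isinstance(node, ast.Constant):
                node.value = 0          # erase literal values
        return ast.dump(tree)

    a = "def add(x, y):\n    return x + y"
    b = "def plus(p, q):\n    return p + q"
    print(fingerprint(a) == fingerprint(b))  # True: renamed copies collide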

~~~
Darmani
Oh, I misunderstood. That's called "semantic clone detection," and it's been a
breeding ground for second-tier papers for a long time.

------
freddref
What alternatives exist, and how do they compare? (For example, jsinspect
finds structurally similar code.)

~~~
enriched
I think that Kythe ([https://kythe.io/](https://kythe.io/)) is somewhat
similar. But I think it has more of an emphasis on the graph schema.

------
axiom92
Read the title, thought it was about natural languages, and got VERY excited ;(

~~~
wry_discontent
Would it be feasible to build something like that?

~~~
lozenge
Trivia: UN and EU proceedings are the perfect translation dataset as they are
already translated paragraph-by-paragraph into all their official languages.
However, they are all highly formal and have a legalistic vocabulary that
won't be useful for most translation users.

