This is exactly right. You either end up with something very low-level (on the level of LLVM IR, for instance)—which means you aren't constructing and analyzing high-level language constructs anymore—or with something high-level but with many language-specific special cases grafted on.
Where we've found success is in stepping back and creating formalisms that are language-agnostic to begin with, and then using tree-sitter to manage the language-specific translations into that formalism. A good example is how we use stack graphs for precise code navigation [1], which uses a graph structure to encode a language's name binding semantics, and which we build up for several languages from tree-sitter parse trees [2].
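To make that a bit more concrete, here's a deliberately tiny sketch of the shape of the idea: analyze each file independently into a language-agnostic structure of definitions and references, then stitch the per-file results together afterwards. (This is not the actual stack graphs data model; it hand-waves away scoping, shadowing, and how imports are discovered, and the file names are made up.)

```python
# Toy sketch: each file is analyzed on its own into a language-agnostic
# def/ref structure; resolution crosses file boundaries afterwards.
from dataclasses import dataclass, field

@dataclass
class FileGraph:
    path: str
    defs: dict[str, int] = field(default_factory=dict)         # name -> line of definition
    refs: list[tuple[str, int]] = field(default_factory=list)  # (name, line of reference)
    imports: list[str] = field(default_factory=list)           # paths of upstream files

def resolve(name: str, graph: FileGraph, index: dict[str, FileGraph]):
    """Resolve a reference: check local definitions, then imported files."""
    if name in graph.defs:
        return graph.path, graph.defs[name]
    for upstream in graph.imports:
        up = index[upstream]
        if name in up.defs:
            return up.path, up.defs[name]
    return None

# Two files analyzed independently; a reference in app.py lands on lib.py.
lib = FileGraph("lib.py", defs={"render": 10})
app = FileGraph("app.py", refs=[("render", 3)], imports=["lib.py"])
index = {g.path: g for g in (lib, app)}
print(resolve("render", app, index))   # ('lib.py', 10)
```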
This is very cool, well done! I love seeing people get interested in other programming paradigms, like concatenative languages. I gave a talk about them at Strange Loop / PWL last year: https://dcreager.net/talks/concatenative-languages/
For those who haven't seen it, CCAN [1] is a great collection of reusable C code, much of which specifically exists to work around the kinds of issues mentioned in OP. It turns “oh crap I should probably roll my own” into “wait I bet someone has already re-rolled this”.
Please note that the above link has been claimed by squatters and isn’t the right link for CCAN anymore! The maintainer suggests [1] just using the GitHub repo [2] instead.
There are many libraries available that you can use as a libc replacement instead of CCAN, if that’s what you prefer [1-3]. Taking on a beefy dependency like that can be overkill, though, if all you need is a linked list or dynamic array implementation.
For sure, there's no legal expectation of support. If you don't support your open-source project, I can't sue you.
But there is a social expectation of support. If the issues page is a ghost town, proposed patches just sit, and there hasn't been a release in years, we call the project "unmaintained" and discourage its use. At that point, if there's interest, somebody can fork it and support the fork, but forking is a serious act: it strongly signals that you don't like the direction that the current maintainer is heading. If that project is truly unmaintained, that's fine, but all too many open-source projects end up with rival forks when the original maintainer comes back.
So yeah. There's zero legal expectation of support. But the social expectations are real, and providing a snippet (at least in my mind) has less of an expectation than a library.
This sounds like Cross-Repo Code Navigation [1], which we do support, though only for Python at the moment. And you're right that it does not use the same search index under the covers — OP is specifically describing the index for "search box"-style code search across a large set of repositories (org-scoped, global-scoped, me-scoped, etc). For Code Navigation, we have a _different_ set of interesting and bespoke indexers and storage formats. I've posted some links here if you want to learn more: [2]
It sounds like you are navigating in a language where we currently only support "fuzzy" or "search-based" Code Navigation, which only uses the _unqualified_ symbol name as the search target. As you point out, this can be very imprecise when there are particular symbol names that are used a lot. (Think `render` in a Rails codebase, for instance.)
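To give a feel for what that means in practice, search-based navigation is roughly equivalent to maintaining an index keyed only by the bare symbol name. (An illustrative sketch, not our actual index format; the paths are made up.)

```python
# Illustrative sketch of why search-based ("ctags-like") navigation is
# imprecise: the index is keyed only by the unqualified name, so a common
# name like `render` maps to every definition that happens to use it.
from collections import defaultdict

index: dict[str, list[tuple[str, int]]] = defaultdict(list)

def add_definition(name: str, path: str, line: int) -> None:
    index[name].append((path, line))

# Hypothetical paths, just for illustration.
add_definition("render", "app/controllers/posts_controller.rb", 12)
add_definition("render", "app/helpers/markdown_helper.rb", 88)
add_definition("render", "lib/widgets/chart.rb", 5)

# Jump-to-definition on `render` can only offer all of the candidates:
print(index["render"])
```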
We also have Precise Code Navigation — still currently only for Python [1], but we're working on other languages as fast as we can. As mentioned down-thread, that is built on our new stack graphs framework [2]. We're often asked why we're not leaning on something LSP-shaped for this. I go into a bit of detail about that in a FOSDEM talk from a couple years back [3], and also in the Related Work section of a paper that we just published [4].
> But this could be tied into CI, especially for projects utilizing runners for building the code.
One often overlooked drawback of generating this data during CI is that you, the project owner, are now paying for the compute. Of course, maybe you qualify for a free tier. And if not, if you're already running a CI job for testing/linting/etc, the marginal cost of also generating code nav symbols might not be too bad. But one benefit of the stack graphs approach mentioned up-thread (and the code search indexing described in OP) is that all of the analysis and extraction work is done in dedicated server-side jobs that GitHub is footing the bill for.
By necessity I already do it many times a day on my workstation. It would be neat for the cloud to start utilizing the endless compute available on client machines. A back-channel when doing a git push to publish some related artifacts?
> having "face to face" chitchat, builds up a small buffer of rapport that significantly improves empathy and perceived intent
On my team we have a weekly 30-minute meeting for non-work-related updates. A chance to talk about whatever you did in the past week outside of work, or about anything else that currently interests you. It typically prompts a lot of follow-on conversations in Slack afterwards.
It might be nicer if this happened organically, and didn't require an explicitly scheduled meeting. But we consider it important enough that we carve out the time to make it clear that it's a team norm.
Parsing is definitely a big part of it, and it's a fair point that for search-based Code Navigation, we don't have to do any real heavy lifting on the analysis side. That said, I think the article describes our non-functional requirements well (zero-config, incremental, language-agnostic). It's those non-functional requirements which are most important for the “at scale” part. I'd go so far as to suggest that any static analysis implementation that can't meet those requirements would be nigh impossible for us to roll out across the entire GitHub corpus.
Note that this article describes our implementation of “search-based” or “ctags-like” Code Navigation, which definitely has the imprecision that you describe. We've also been working over the previous ~year on a framework called Stack Graphs [1,2,3], which lets us tackle “precise” Code Navigation while still having the zero-config and incremental aspects that are described in the paper.
The build-based approach that you describe is also used by the Language Server Protocol (LSP) ecosystem. You've summarized the tradeoffs quite well! I've described a bit more about why we decided against a build-based/LSP approach here [4]. One of the biggest deciding factors is that at our scale, incremental processing is an absolute necessity, not a nice-to-have.
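To illustrate why, the property we need is that the unit of work is the file blob, not the repository or a full build: content we've already analyzed never gets re-analyzed. A minimal sketch of that property (not our real pipeline; the analysis step here is a trivial stand-in):

```python
# Minimal sketch of blob-level incremental indexing: results are cached by
# content hash, so a push only costs analysis for files that actually changed,
# and identical blobs shared across commits or forks are analyzed once.
import hashlib
import re

cache: dict[str, list[str]] = {}   # content hash -> extracted symbol names

def analyze(source: str) -> list[str]:
    # Trivial stand-in for the real per-file parsing and symbol extraction.
    return re.findall(r"^def\s+(\w+)", source, flags=re.MULTILINE)

def index_blob(source: str) -> list[str]:
    key = hashlib.sha1(source.encode()).hexdigest()
    if key not in cache:           # only new content pays for analysis
        cache[key] = analyze(source)
    return cache[key]

index_blob("def render():\n    pass\n")   # analyzed
index_blob("def render():\n    pass\n")   # cache hit: same blob, no re-analysis
```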
I read about stack graphs before, it sounds interesting!
I think they help, but ultimately I expect you need a compiler to solve the absolute madness of the totality of C++. For example I think getting argument-dependent lookup right in the presence of 'auto' requires type information? And there are other categories of things (like header search paths) where I think you are forced to involve the build system too.
Yup, it is probably fair to say that C++ accounts for like 50% of the complexity of Kythe at Google. Or certainly it feels like it.
And it is also worth noting that Kythe goes a bit deeper than what LSP can accomplish. In particular Kythe is built around a sort of two-layer graph, where it separates the physical code/line representation from a more abstract semantic graph. This allows us to accomplish some things that are very difficult to do in LSP.
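As a rough picture of what "two layers" means (a toy illustration, not Kythe's actual schema): anchor nodes pin down physical spans in files, separate nodes represent the semantic entities, and labeled edges tie the layers together.

```python
# Toy two-layer graph: "anchors" are physical spans in files, entities are
# the abstract semantic nodes, and labeled edges connect the layers.
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:            # physical layer: a byte span in a file
    path: str
    start: int
    end: int

@dataclass(frozen=True)
class Entity:            # semantic layer: an abstract entity
    kind: str            # e.g. "function"
    name: str

edges: list[tuple[Anchor, str, Entity]] = []

func = Entity("function", "render")
edges.append((Anchor("lib.py", 120, 126), "defines", func))     # definition site
edges.append((Anchor("app.py", 40, 46), "references", func))    # a call site

# Going span -> entity -> all related spans is natural here, but is awkward
# to phrase as LSP-style "location in, locations out" requests.
print([a for (a, label, e) in edges if e == func and label == "references"])
```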
Finally, Kythe at least internally has a big reliance on a unified build system (Blaze, or Bazel). It becomes rapidly more difficult to do when you have to hook in N different build systems up front, which is why search-based references are so appealing. Build integration is hard.
Has Tree Sitter been useful to projects like this? Does it have promise to be useful in the future? It seems to be gaining a lot of adoption among Neovim users and plugin developers, but not really anywhere else. I'm curious if that's because of lack of familiarity, or because it's technically deficient somehow.
In short, yes, very much so! Tree-sitter is what we're using under the covers to parse all of the languages that we support. The ctags-like symbol extraction described in the paper comes straight from tree-sitter, too. [1]
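If you're curious what the "ctags-like" extraction boils down to: parse each file, walk the tree, and record the name and location of each definition node. Here's a minimal stand-in that uses Python's built-in ast module in place of a tree-sitter grammar (the real pipeline does the equivalent walk per language via tree-sitter, as mentioned above, but the output has the same shape):

```python
# Minimal ctags-like extraction, using Python's stdlib `ast` module as a
# stand-in for a tree-sitter grammar: parse, walk, and emit
# (name, kind, line) tuples for each definition node found.
import ast

def extract_symbols(source: str) -> list[tuple[str, str, int]]:
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append((node.name, "function", node.lineno))
        elif isinstance(node, ast.ClassDef):
            symbols.append((node.name, "class", node.lineno))
    return symbols

print(extract_symbols("class Post:\n    def render(self):\n        return ''\n"))
# [('Post', 'class', 1), ('render', 'function', 2)]
```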
We had conflated JS and TS support into a single release issue. JS support never landed, but TS did: https://github.blog/changelog/2024-03-14-precise-code-naviga...
When we first created the release issue, the thinking was that we wanted to launch support for those two languages at the same time, and for that support to be compatible with each other. (So we could correctly follow e.g. a TS library referencing a definition in an upstream JS library.)
Unfortunately we never got JS support to the point where we could GA it. Largely for scaling reasons — there is a HUGE amount of JS on GitHub, and JS is a dynamic enough language that we have to do more per-file processing on it than for TS or Python. We decided not to wait for JS to be GA-ready before releasing TS, but then never corrected the release issue to account for that.