I'm an engineer on the code intelligence team at Sourcegraph.
We've been busy building out true precise code intelligence/navigation support, but we also have a mode for zero-configuration code navigation based on text search, universal-ctags, and hand-rolled regular expressions (which works surprisingly well!). Tree-sitter would definitely give better results than our current ctags-based approach. It's been catching our attention more and more lately, and we have plans to use it to upgrade our out-of-the-box, instant code navigation experience.
It's not the exact right fit for our primary goals though, since it's designed around being extremely fast while editing and robust against errors. Sourcegraph is only used for navigating committed code, so we're leveraging formats like LSIF to generate complete semantic graphs of codebases and their entire dependency tree. That'll enable a lot of features that are out of reach for tree-sitter, but it's a lot harder to get working out of the box and a much bigger technical investment.
It's very interesting to see the topological space that houses these solutions fill out. Every tool has its own set of unique trade-offs and falls somewhere on these spectrums:
- fast vs slow
- precise vs imprecise
- zero-configuration vs configuration required
We've visited a few islands in this space but we're still very curious to see what other islands can be discovered. We're especially excited about tools and formats like tree-sitter and LSIF, around which a large and supportive community can grow so that the products we love and rely on as developers can all make forward progress.
What are those features out of reach of tree-sitter? I can see that you theoretically want something that's optimized for parsing well-formed code all at once, rather than potentially malformed code incrementally, but what trade-offs does tree-sitter make in practice that limit its potential for your use case? On the face of it, it seems to me like tree-sitter could serve as a perfectly fine building block for generating LSIF or whatever from a code file.
I wish there was a more universal format for parsers, but I just don't think there are enough people who know their stuff.
Take PHP, a language that a lot of people use: the tree-sitter-php extension doesn't support features added in 2019, let alone features added towards the end of 2020.
If you want an up-to-date PHP parser, there's really only one open-source parser[0] that's accurate enough to be used on PHP codebases old and new, and it's written in PHP. Then if you want to parse in a robust fashion you have to adopt a number of hacks to get everything working.
I hadn't encountered LSIF before – can GitHub be configured to use those maps?
We've looked at LSIF before, and decided against it for a few reasons, mostly around COGS, operational overhead, and indexing latency. I gave a talk at last year's FOSDEM [1] going into some of the details. (Caveat that that talk was from when we were using a different open-source library, Semantic, to power fuzzy Code Nav. It's much easier to support new languages using the now-current tree-sitter query approach!)