Google’s internal code search is far ahead of GitHub, Gitlab, and even sourcegra...

beyang · on Nov 26, 2020

Author of the post here. I would say that technically Kythe is the open-source version of part of Google's internal code search (specifically, the component that provides precise code navigation), but it doesn't include the search index or the UI. So it's not on its own an end-user product.

That having been said, I love Kythe, and we've actually considered using it as a semantic backend for Sourcegraph (and still might in the future). For the time being, we're using indexers that emit LSIF (https://lsif.dev). This allows us to build on top of the substantial body of work provided by the many open-source language servers (https://microsoft.github.io/language-server-protocol). But Kythe has a far richer schema that can capture all sorts of useful relationships in code. It's awesome and I wish more people were building indexers for it.

jeffbee · on Nov 26, 2020

Maybe sourcegraph just doesn't exploit the real power of language server indexes, but the C++ language server seems pretty impoverished compared to Kythe. If I ask sourcegraph to find all references to absl::string_view::string_view(const char* str) it instead finds the substring `string_view` in any context, which is quite a useless result. Kythe gives me the actual call sites of that function signature and not the other forms, and Kythe knows the difference between absl::string_view and std::string_view.

Is it just a case of the visible implementation being a bit behind the ultimate capability of the system?

jaysachs · on Nov 26, 2020

(kythe googler here)

Like beliu said, the Kythe schema is far richer; it has a fully abstract semantic layer in the graph, and is a superset of what can be represented with LSIF. It's not tied to specific text regions: there are abstract representations of symbols/functions/classes/variables/types, and those have pointers to/from text regions.
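To make the "semantic layer vs. text regions" distinction concrete, here is a toy sketch in Python. It is not the Kythe schema or API: `Anchor`, `ToyGraph`, `add_ref`, and `references` are hypothetical names, and the file paths and offsets are made up. The only point is that references hang off a semantic node (here identified by a full signature string), so two symbols that share the substring `string_view` stay separate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """A text region: file path plus offsets (hypothetical shape)."""
    path: str
    start: int
    end: int


class ToyGraph:
    """Maps abstract semantic nodes to the text regions that reference them."""

    def __init__(self):
        # semantic node id -> set of anchors that reference it
        self._refs = defaultdict(set)

    def add_ref(self, anchor, node_id):
        """Record that the text at `anchor` refers to semantic node `node_id`."""
        self._refs[node_id].add(anchor)

    def references(self, node_id):
        """Return anchors referencing `node_id`, ordered by path and offset."""
        return sorted(self._refs.get(node_id, ()), key=lambda a: (a.path, a.start))


# Two distinct semantic nodes whose names share the substring "string_view".
ABSL_CTOR = "absl::string_view::string_view(const char*)"
STD_CTOR = "std::string_view::string_view(const char*)"

graph = ToyGraph()
graph.add_ref(Anchor("a.cc", 10, 21), ABSL_CTOR)
graph.add_ref(Anchor("b.cc", 40, 51), STD_CTOR)
graph.add_ref(Anchor("c.cc", 5, 16), ABSL_CTOR)

# Querying by semantic node returns only the call sites of that exact function.
absl_refs = graph.references(ABSL_CTOR)
print([a.path for a in absl_refs])
```

A plain substring search for `string_view` would hit all three files; the lookup keyed on the semantic node returns only the anchors attached to that one constructor.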

Note that because of the richness and abstractness, it's theoretically feasible to drive much more than code navigation from the Kythe graph.

And yes, the open-source release is just one part. The large-scale pieces are basically (1) do an instrumented build, (2) run it through the Kythe indexers, and (3) post-process the output for serving.

The Kythe OSS project offers solutions for (2) for C++/Java/Go/TypeScript/protobuf (and early Rust support). We do have plans to open source support for at least some other languages at some point in the future. (Hedging as best I can here.) Note that the best candidates for Kythe indexing are those languages that admit solid static analysis.

(1) is inextricably tied to the build system. Bazel support should be nearly turnkey; other systems require more (maybe significantly more) work.

There's support for (3) available, though not at full scale. (Clearly we use something far more sophisticated internally.) While we'd like to see this fleshed out, expansion of it will depend on non-trivial community contributions.