I need to get around to playing with tree-sitter. The approach in this article is neat.
Here's another approach. The AST of a .proto file is itself a protobuf. That's how the codegen plugins work. Protobuf also has a canonical mapping to JSON, so...
What you can do is use protoc to parse the .proto file, spit it out as JSON, and then process that data using your favorite pattern matching language. I wrote a [tool][1] that helps with that. For example, here's some [js code][2] that translates protobuf message definitions into "types" for use in an ORM.
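In case it helps, here's a minimal sketch of that pipeline in Go (this is not the linked tool, just the general idea; file names are placeholders):

```go
// Sketch: dump a .proto file's structure as JSON via protoc's descriptor output.
// Step 1: have protoc emit a serialized FileDescriptorSet (binary protobuf):
//
//   protoc --include_source_info --descriptor_set_out=out.pb example.proto
//
// Step 2: decode that descriptor set and re-emit it as JSON.
package main

import (
	"fmt"
	"os"

	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/descriptorpb"
)

func main() {
	raw, err := os.ReadFile("out.pb") // output of the protoc command above
	if err != nil {
		panic(err)
	}

	var set descriptorpb.FileDescriptorSet
	if err := proto.Unmarshal(raw, &set); err != nil {
		panic(err)
	}

	// The descriptor set is itself a protobuf message, so the canonical
	// proto<->JSON mapping gives a JSON view of every message, field, enum,
	// and service defined in example.proto, ready for ordinary JSON tooling.
	out, err := protojson.Marshal(&set)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```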
I found that some parts of a protobuf aren't captured well by protoc; specifically, annotations were not well exposed to the Go libraries for writing protoc plugins back in 2016. I ended up having to write my own basic protobuf parser to reliably extract annotations and comments for code and documentation generation:
> I ended up having to write my own basic protobuf parser (...)
Wouldn't it have been far more efficient to contribute back to protoc? Certainly posting a patch to a parser takes far less work than writing an alternative implementation of the same parser.
Not speaking for Leland, and this is not a universal principle, but if you know what you're doing, you can write a solution and have it out of your way faster than going through the bureaucracy of trying to get some code merged upstream.
Your comment makes no sense at all. You do not need to get a PR merged before you can use your changes. You work out a patch, use it locally to meet your needs, post a PR, and you move on with your life regardless of whether your PR is merged or not.
In fact, that's the whole premise of GitHub: you fork a repo, you work on a feature branch, you post a PR, and your fork stays there.
To repeat my point: it's easier to add small features to a parser than it is to reimplement everything, and posting a PR with your small change takes no work at all.
I had a similar thought, but I think that doesn't work well in practice:
1. If I have to get back to this project several months in the future, it might be a lot of trouble to bring the fork up-to-date. The fork might not have new features, security updates, or compatibility upgrades that I want. The upstream may have rewritten the code, making my fork redundant, or refactored it so that I have to rework my changes again.
2. I think it is in bad taste to submit a PR without any intention of getting it merged - it creates an open ticket that maintainers may never be able to close and still have to sort through.
> 1. If I have to get back to this project several months in the future, it might be a lot of trouble to bring the fork up-to-date.
What's the trouble? What compels you to "bring the fork up-to-date"?
> The fork might not have new features, security updates, or compatibility upgrades that I want.
I really don't understand what point you think you're making. The scenario you're discussing is either a) contribute a patch to an existing project, or b) write your own from scratch. Does scenario b) somehow not introduce feature and maintenance work on your alternative implementation?
> 2. I think it is in bad taste to submit a PR without any intention of getting it merged (...)
> Your comment makes no sense at all. You do not need to get a PR merged before you can use your changes.
What you said was:
> Wouldn't it have been far more efficient to contribute back to protoc?
My point was about the bureaucracy of contributing back upstream. Nonetheless, your new point, that patching and using your own fork is preferable, makes sense on the surface given the ease of forking in Go projects. I'd be interested to hear the author's reasoning.
> My point was about the bureaucracy of contributing back upstream.
There is no bureaucracy. Once you have your patch, you can post a PR. Click on a button, write a description, and you're done. That's it. Posting a PR is hardly the hard part of the process.
On the other hand, maintaining a custom tool vs a patched fork is not that dissimilar, is it? If patching the prevalent tool makes sense otherwise, I wouldn’t worry too much about upstreaming.
> On the other hand, maintaining a custom tool vs a patched fork is not that dissimilar, is it?
You're somehow ignoring all the development work you need to invest to go from zero to alternative implementation, not to mention the maintenance work you need to do to come close in stability.
The whole point of FLOSS is that everyone, including you, can build upon all the work invested by everyone else. Otherwise there would be a whole lot of people reinventing the wheel.
Trying to understand a big existing codebase might be more difficult than rewriting a small subset for your own use case, especially for something as trivial as a parser for a relatively simple language like protobuf.
Gaining enough understanding of that existing codebase to know how one would go about adding that small feature, on the other hand, can easily require more time than it would take to build what you need from scratch.
Protoc is written in C++ and is quite large, while at the time I was writing Go and had 8 years less experience. It was easier to write and test a small but perfectly functional recursive descent parser by myself than to try to tackle:
- learning a new language
- learning a new codebase
- modifying that codebase
- distributing that modified fork of a C++ project alongside what I was building in Go
It was much easier in my case to build it as a small part of the tool I was already building and move on with my life.
I highly recommend it. The docs describe it as a zen-like experience, and I fully agree. Once you get the hang of it, it makes it so easy to tweak the syntax of whatever language you’re building. I love it.
Huh. tree-sitter seems neat, but I don’t really get why the author thinks processing the descriptor set is so hard. Seems equally difficult to learn a bunch of new abstractions in the form of tree-sitter vs just learning protobuf’s own ones.
Also, if you’re parsing .proto files directly, you have to deal with a bunch of annoying issues like include paths, how you package sets of them to move around, etc. Descriptor sets seem like a better solution to me.
No, they’re not inventing the tree-sitter grammar for protobuf here, they’re using the existing grammar provided by someone else. So it’s not like there’s additional reusable work being done here.
If you absolutely do need it, then in addition to "name" you can also add a field "hasName". If you feel fancy, you can even do that in a code generator.
You do need it for e.g. database records that are nullable, or anywhere else that null/nil is a valid state distinct from the zero value. Your alternative approach requires manual work on your part rather than being built into the bindings, and allows for invalid states, as pointed out by another commenter.
That’s an extra field, with no consistency validation, and does not play well with JSON. A field can be either "" or null, and that’s two different cases.
Also, the wrappers are known to most protobuf code generators, which can then use language features for better ergonomics. E.g. in Rust, all wrappers that allow nullability of values are turned into Option; StringValue, for example, turns into Option<String>.
As for the need, yes, it is needed. This carries intent, which is a good thing in customer-facing APIs. An update endpoint will have all fields optional to distinguish between a field that needs to be overridden and one that doesn't, without re-sending values that don't need to change (à la PUT). Having additional boolean fields gets unwieldy quickly.
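For illustration, here's roughly how that wrapper distinction surfaces in Go (a sketch, not generated code; the UserUpdate message and its field are made up):

```go
// Sketch: how google.protobuf.StringValue distinguishes "unset" from "empty"
// in Go. The UserUpdate message is hypothetical:
//
//   message UserUpdate {
//     google.protobuf.StringValue name = 1;
//   }
//
// protoc-gen-go maps wrapper fields to pointers, so "not sent" and "sent as
// empty" are different states.
package main

import (
	"fmt"

	"google.golang.org/protobuf/types/known/wrapperspb"
)

// UserUpdate stands in for the generated struct for the message above.
type UserUpdate struct {
	Name *wrapperspb.StringValue
}

func apply(u *UserUpdate) {
	switch {
	case u.Name == nil:
		fmt.Println("name not provided: leave it unchanged")
	case u.Name.GetValue() == "":
		fmt.Println("name explicitly set to empty: clear it")
	default:
		fmt.Println("set name to", u.Name.GetValue())
	}
}

func main() {
	apply(&UserUpdate{})                             // field omitted
	apply(&UserUpdate{Name: wrapperspb.String("")})  // sent as empty
	apply(&UserUpdate{Name: wrapperspb.String("x")}) // sent with a value
}
```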
I think the other commenter’s point is you can use 2 fields to distinguish between the first field being specified as empty vs absent (or whatever terms you prefer).
E.g.
- type.specified => “”
- type.unspecified => empty
The same technique can be used to disambiguate between 0 and empty.
Worth noting that it’s not quite equivalent due to allowing for a malformed message that includes foo = value and hasFoo = false, opening the door to varied client interpretation.
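To make that concrete, a sketch in Go with made-up names, standing in for generated code:

```go
// Sketch (made-up field names): the two-field convention permits a
// contradictory state that a wrapper (or proto3 `optional`) cannot express.
package main

import "fmt"

// Thing stands in for generated code for a message with `string foo` and
// `bool has_foo`; the pairing is maintained by convention only.
type Thing struct {
	Foo    string
	HasFoo bool
}

func fooOrUnset(t Thing) (string, bool) {
	if !t.HasFoo {
		return "", false // treated as unset, whatever Foo holds
	}
	return t.Foo, true
}

func main() {
	// Nothing validates the pair, so a sender can produce foo = "value"
	// with has_foo = false; receivers may disagree about what that means.
	malformed := Thing{Foo: "value", HasFoo: false}
	v, ok := fooOrUnset(malformed)
	fmt.Println(v, ok) // "" false: the payload "value" is silently dropped
}
```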
tree-sitter is an incredible tool. I wonder if there's been a dedicated discussion for it on the HN front page at some point—will have to check on the HN algolia search.
I'm using it for syntax checking across 30+ languages in Plandex[1], an LLM coding tool. tree-sitter runs in single digit milliseconds on typical files and is highly accurate. When it encounters a syntax problem, it can pinpoint the exact location/expression in the file, and it's fault-tolerant so it can keep going and identify multiple issues rather than stopping on the first one. These results can be sent back to the LLM, which can then often fix its own errors. I was able to reduce syntax issues by roughly 90% with gpt-4o using this approach.
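For anyone curious what that looks like, here's a rough sketch of the general idea (not Plandex's actual code) using the smacker/go-tree-sitter bindings, written from memory so details may differ slightly:

```go
// Rough sketch: parse a source file with tree-sitter and collect every ERROR
// node with its position, instead of stopping at the first problem.
package main

import (
	"context"
	"fmt"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/golang"
)

func syntaxErrors(src []byte) ([]string, error) {
	parser := sitter.NewParser()
	parser.SetLanguage(golang.GetLanguage()) // one grammar per language

	tree, err := parser.ParseCtx(context.Background(), nil, src)
	if err != nil {
		return nil, err
	}

	var errs []string
	var walk func(n *sitter.Node)
	walk = func(n *sitter.Node) {
		if n.Type() == "ERROR" {
			p := n.StartPoint()
			errs = append(errs, fmt.Sprintf("line %d, col %d: %q",
				p.Row+1, p.Column+1, n.Content(src)))
		}
		// Keep walking: the parse is fault-tolerant, so independent errors
		// further down the file still show up as separate ERROR nodes.
		for i := 0; i < int(n.ChildCount()); i++ {
			walk(n.Child(i))
		}
	}
	walk(tree.RootNode())
	return errs, nil
}

func main() {
	src := []byte("package main\n\nfunc main() {\n\tprintln(\"hi\"\n}\n") // missing ')'
	errs, _ := syntaxErrors(src)
	for _, e := range errs {
		fmt.Println(e) // feed these back to the LLM to fix its own output
	}
}
```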
Afaik, there's no other viable option for a use case like this. You'd need a menagerie of language specific linters, compilers, and/or language servers to get anywhere close, and many of those are way too slow to run inline.
I wrote a proto parser using ragel <https://www.colm.net/open-source/ragel/> for work way back as well; it was surprisingly painless. I think this was back when protobuf was transitioning to proto3.
[1]: https://github.com/dgoffredo/protojson
[2]: https://github.com/dgoffredo/okra/blob/master/lib/proto2type...