
First of all, I love the idea of semgrep, but can't use it since we're using C++. Is there any chance for C++ support in the future?



The good news is that we’ve replaced almost all the homegrown parsers that were written while the tool was at Facebook, and we’re now using the tree-sitter project, which already has parsers for 40+ languages. There is a tree-sitter-cpp project we can and will eventually integrate! The bad news is that this requires the code not to use macros heavily if it is to be parseable as-is. So really the difficulty is not C++ but rather the preprocessor.


I guess you could run the preprocessor and then run tree-sitter-cpp / semgrep on the preprocessed output, but the problem would then be trying to tie any findings from that to the original source.

Do gcc/clang/any other preprocessor create "source maps" that could facilitate that? GCC looks like it has a `-fdebug-cpp` that "[...] dumps debugging information about location maps. Every token in the output is preceded by the dump of the map its location belongs to."


Right; but this turns out to be pretty tricky in practice. I've attempted to do this for even relatively straightforward code (libsodium--complex in implementation, though not in API) with libclang and it was not particularly pleasant.

Some prior art for reference: https://github.com/bytedeco/javacpp/issues/51


The preprocessed output of GCC and Clang usually contains the file names.


Better than that, compilers can be told to include line numbers too.
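
To make the file name and line number point concrete: the output of `gcc -E` or `clang -E` is interleaved with linemarkers of the form `# <line> "<file>" <flags>`. The sketch below is only an illustration and assumes that format; `build_line_map` and `SAMPLE` are hypothetical names, and the sample text is hand-written rather than real compiler output. It maps each line of preprocessed output back to an original file and line.

```python
import re

# Linemarker as emitted by GCC/Clang with -E: '# <line> "<file>" <flags...>'
LINEMARKER = re.compile(r'^#\s*(\d+)\s+"((?:[^"\\]|\\.)*)"')


def build_line_map(preprocessed_text):
    """Map each output line to (file, line) in the original sources.

    Entry i describes output line i + 1. Linemarker lines themselves, and
    any lines seen before the first linemarker, map to None.
    """
    mapping = []
    current_file = None
    next_line = None
    for raw in preprocessed_text.splitlines():
        match = LINEMARKER.match(raw)
        if match:
            # The marker says: the NEXT output line came from this file/line.
            next_line = int(match.group(1))
            current_file = match.group(2)
            mapping.append(None)
        elif current_file is None:
            mapping.append(None)
        else:
            mapping.append((current_file, next_line))
            next_line += 1
    return mapping


# Hand-written sample in the linemarker format (not real compiler output).
SAMPLE = '''# 1 "main.c"
# 1 "util.h" 1
int helper(void);
# 2 "main.c" 2

int main(void) { return helper(); }
'''

if __name__ == "__main__":
    for out_line, origin in enumerate(build_line_map(SAMPLE), start=1):
        print(out_line, origin)
```

This only recovers lines. Column information is lost, and text produced by a macro expansion is attributed to the line where the macro was used, not to the macro's definition, which is part of why mapping findings back is harder than it first looks.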

This thread would love to learn about Compiler Explorer https://gcc.godbolt.org/ which works for C++ and many other languages.


Line numbers are a good start, but JS source maps go from output source byte ranges to input source byte ranges.
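
For contrast with line-only mapping, here is a rough sketch of what a range-based map for preprocessed C/C++ could look like. Everything in it is hypothetical (the `Segment` type, the `lookup` helper, and the offsets); it is not an existing GCC or Clang output format, just an illustration of mapping output ranges to input ranges.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Segment(NamedTuple):
    """One hypothetical mapping entry; end offsets are exclusive."""
    out_start: int
    out_end: int
    src_file: str
    src_start: int
    src_end: int


def lookup(segments: Sequence[Segment], out_offset: int) -> Optional[Segment]:
    """Find the segment covering out_offset.

    Assumes segments are sorted by out_start and do not overlap.
    """
    starts = [seg.out_start for seg in segments]
    index = bisect_right(starts, out_offset) - 1
    if index < 0:
        return None
    seg = segments[index]
    return seg if out_offset < seg.out_end else None


# Made-up offsets: a 9-byte macro call at main.c[40:49] expands to 25 bytes
# of output, so every output offset after it is shifted by 16.
SEGMENTS = [
    Segment(0, 40, "main.c", 0, 40),
    Segment(40, 65, "main.c", 40, 49),   # the macro expansion
    Segment(65, 80, "main.c", 49, 64),
]

if __name__ == "__main__":
    hit = lookup(SEGMENTS, 50)
    print(hit.src_file, hit.src_start, hit.src_end)
```

With something like this, a finding anywhere inside the expanded text resolves to the whole macro call in the original file, which is about the best a tool could report without understanding the macro itself.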

I don't write or even read a lot of C++ these days but I recall from when I did that a major pain point was deciphering compiler warnings/errors when there are a lot of templates, macros, or both. Seems like the problem has been around forever.


That sounds really cool. Would it be possible to ignore the macros, or can the code not be parsed at all then?


Not GP, but on the one hand semgrep has a real honest-to-goodness parser at its core; on the other hand I'd expect C++ to have sufficiently complicated semantics that it needs some understanding of C++-specific mechanics to be useful. Furthermore, you'd need preprocessor and template expansion magic to really get to the bottom of it. Effectively this is the same problem e.g. javacpp has.



