The good news is that we’ve replaced almost all the homegrown parsers that were written while the tool was at Facebook, and we’re now using the tree-sitter project, which already has parsers for 40+ languages. There is a tree-sitter-cpp project we can and will eventually integrate! The bad news is that this requires the code not to rely heavily on macros if it is to be parseable as-is. So really the difficulty is not C++ but rather the pre-processor.
I guess you could run the preprocessor and then run tree-sitter-cpp / semgrep on the preprocessed output, but the problem would then be trying to tie any findings from that to the original source.
Do gcc/clang/any other preprocessor create "source maps" that could facilitate that? GCC looks like it has a `-fdebug-cpp` that "[...] dumps debugging information about location maps. Every token in the output is preceded by the dump of the map its location belongs to."
Right; but this turns out to be pretty tricky in practice. I've attempted to do this for even relatively straightforward code (libsodium--complex in implementation, though not in API) with libclang and it was not particularly pleasant.
Line numbers are a good start, but JS source maps go from output source byte ranges to input source byte ranges.
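To make the line-number point concrete, here is a minimal sketch that assumes the input is the output of `gcc -E` or `clang -E`, which by default contains linemarkers of the form `# <line> "<file>" <flags>`. The function name `map_preprocessed_lines` is made up for illustration. It recovers only a line-level mapping: once a macro expands within a line, column and byte-range information is gone, which is exactly the gap compared with JS source maps.

```python
import re

# Linemarker as emitted by gcc -E / clang -E:  # <line> "<file>" [flags...]
# Lines such as "#pragma ..." do not match, because a number must follow "#".
LINEMARKER = re.compile(r'^#\s+(\d+)\s+"((?:[^"\\]|\\.)*)"((?:\s+\d+)*)\s*$')


def map_preprocessed_lines(preprocessed_text):
    """Map each line of preprocessed output back to (file, line).

    Hypothetical helper. Entry i of the returned list describes output
    line i + 1; it is None for linemarker lines themselves and for any
    text that appears before the first linemarker.
    """
    mapping = []
    current_file = None
    next_line = None
    for raw in preprocessed_text.splitlines():
        marker = LINEMARKER.match(raw)
        if marker:
            # The marker states where the *following* line came from.
            next_line = int(marker.group(1))
            current_file = marker.group(2)
            mapping.append(None)
            continue
        if current_file is None:
            mapping.append(None)
        else:
            mapping.append((current_file, next_line))
            next_line += 1
    return mapping


if __name__ == "__main__":
    sample = (
        '# 1 "demo.c"\n'
        'int a;\n'
        '# 10 "demo.h" 1\n'
        'int b;\n'
        'int c;\n'
        '# 3 "demo.c" 2\n'
        'int d;\n'
    )
    for out_line, origin in enumerate(map_preprocessed_lines(sample), start=1):
        print(out_line, origin)
```

A finding reported on preprocessed line N could then be looked up as `mapping[N - 1]`, but only to the granularity of a whole original line.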
I don't write or even read a lot of C++ these days but I recall from when I did that a major pain point was deciphering compiler warnings/errors when there are a lot of templates, macros, or both. Seems like the problem has been around forever.
Not GP, but on the one hand semgrep has a real honest to goodness parser at its core; on the other hand I'd expect C++ to have sufficiently complicated semantics that it needs some understanding of C++ specific mechanics to be useful. Furthermore, you'd need preprocessor and template expansion magic to really get to the bottom of it. Effectively this is the same problem e.g. javacpp has.