Preprocessing is the easiest part, unless you are working on IDE functionality and need to map things back to the exact position in the original source code at the character rather than the line level. Then it gets confusing, because a single expression can have a position at the macro call site, inside the macro definition, and inside the preprocessed buffer.
It does mean that you have to know the exact compiler invocation before you can start parsing. You can even invoke the original compiler to preprocess the file and treat preprocessing as a completely separate phase that can be mostly ignored after it's done.
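A minimal sketch of the three-positions problem, using a hypothetical snippet (`-E` on gcc/clang is the "preprocess only, as a separate phase" invocation mentioned above):

    // The division textually lives inside the macro definition, is logically
    // used at the call site, and after `cc -E` it appears at yet another
    // offset in the preprocessed buffer. An IDE diagnostic on it has to be
    // mapped back through all three positions.
    #define HALF(x) ((x) / 2)      // position 1: the macro definition

    int half_of(int n) {
        return HALF(n);            // position 2: the invocation site
    }
    // position 3: the expansion `((n) / 2)` in the output of `cc -E`

    int main() { return half_of(10); }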
There is plenty of complexity that makes extracting an AST difficult even after preprocessing. You could argue about how much of the analysis belongs to syntax and how much is already semantic analysis, but in practice, and possibly at least partially due to the nature of the C++ language, the parsers that build the AST also do a lot of type analysis. Due to features like constexpr they even execute the logic of the code.
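A small illustration of that last point (hypothetical snippet): whether this translation unit is even well-formed depends on evaluating fib at compile time, so a tool that mirrors the compiler ends up executing it too.

    // Deciding whether this compiles requires evaluating fib(10) at
    // compile time -- the "parser" is effectively running the program.
    constexpr int fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

    static_assert(fib(10) == 55, "evaluated during compilation");

    int main() { return 0; }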
But you aren't parsing the program. You're parsing a particular instantiation of the program for a particular set of -D defines and include files. Whole chunks of the program that were excluded by #ifdefs don't get parsed at all.
This is fine if you just want to analyze that particular instance of what the program can become, but it's not fine if you're trying to do something to the program as a whole.
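A minimal sketch with a hypothetical platform.cc: each choice of -D flags produces a different translation unit, and the branch that was preprocessed away never reaches the parser at all.

    // Preprocess with `cc -E -DUSE_EPOLL platform.cc` and only the first
    // branch survives; the other backend is discarded before parsing, so an
    // AST-based tool run on this instantiation cannot see it at all.
    #include <cstdio>

    #ifdef USE_EPOLL
    const char *backend() { return "epoll"; }
    #else
    const char *backend() { return "kqueue"; }
    #endif

    int main() {
        std::printf("selected backend: %s\n", backend());
        return 0;
    }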
> Extracting ASTs from C (or C++) is difficult in general, due to the preprocessor.
Years back doing code analysis tooling (and struggling with, and being discouraged by, the TFA issue), I downloaded "a lot" (in an old pre-github sense) of code, and IIRC, found preprocessor use was so stereotyped that almost all could be easily parsed and included in the AST. Using a backtracking parser though, rather than the usual "having chosen a cripplingly inexpressive parser tech, difficulties with parsing came as a surprise". ;)
Despite having been burned by gcc's "hide the AST" policy, it's not clear to me when, if ever, it became the wrong thing. It's so easy to underappreciate the contingency of history. We're on a timeline where open source, and the internet, more or less succeeded - many perhaps aren't so lucky. On the other hand, emacs, at least with Lisp on a Lisp machine, could be an immensely powerful integrated experience that perhaps needn't have been quite so stagnant and lost.
That's moving the goal posts. The important thing is that macro invocations can be found "statically" (without knowing the implementations of the macros at all, let alone evaluating them).
One way to look at this is that partial evaluation of Rust or Scheme macros is very tractable, because there are very few side effects / side channels. But if you lack hygiene, or you have the C preprocessor, it's very difficult: almost everything becomes a "stuck term" whose evaluation is contingent on earlier evaluation.
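A small sketch of the "stuck term" point, with a hypothetical LOG macro: its meaning cannot be worked out from the call site, or even from its own definition, only by replaying every definition that came before it.

    // What LOG("starting") means is contingent on whether some earlier line
    // (or header, or -D flag) defined VERBOSE -- the call site is "stuck"
    // until that whole history has been replayed.
    #include <cstdio>

    #define VERBOSE                 // comment out, or let a header decide

    #ifdef VERBOSE
    #  define LOG(msg) std::printf("log: %s\n", msg)
    #else
    #  define LOG(msg) ((void)0)
    #endif

    int main() {
        LOG("starting");
        return 0;
    }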
That's true about the pure act of parsing and generating an AST. But once we want to do semantic analysis, this is no longer true.
Because of procedural macros, it becomes practically impossible to find all occurrences of a particular symbol or rename it, regardless of #ifdef or #[cfg] or whatever.
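The C preprocessor has the same failure mode via ## token pasting, which makes for a compact sketch of why synthesized identifiers defeat rename tools (a procedural macro can manufacture identifiers in just the same way):

    // get_width only exists after expansion, so renaming `width` (or
    // searching for get_width's definition) misses this occurrence entirely.
    #include <cstdio>

    struct rect { int width; int height; };

    #define GETTER(field) \
        static int get_##field(const rect *r) { return r->field; }

    GETTER(width)    // defines get_width
    GETTER(height)   // defines get_height

    int main() {
        rect r{3, 4};
        std::printf("%d %d\n", get_width(&r), get_height(&r));
        return 0;
    }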
So with truly hygienic macros, this is not the case. Even with procedural macros, quoted identifiers are resolved where they are quoted, not where they are expanded. Identifiers that are not in scope at the macro definition site would have to be parameters, and those are caught at the invocation site just like any other identifier (the quote and splice cancel out).
I am not sure whether Rust procedural macros are always that hygienic, so fair point if they aren't.
I really hate the C/C++ preprocessor.