Hacker News new | past | comments | ask | show | jobs | submit login

> c preprocessor macros are complex

At its core the C preprocessor language is a simple token substitution scheme (meaning it strictly operates on the token level, not on syntax trees). I couldn't write one without looking at the spec because it has some ugly edges. But fundamentally it's a simple technology. It's flexible, but that also means it must be used carefully.
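A tiny illustration of what pure token substitution means in practice (SQUARE is an invented example, not from the thread):

```c
#include <assert.h>

/* Naive version: the macro body is pasted in as raw tokens. */
#define SQUARE_NAIVE(x) x * x

/* Conventional fix: parenthesize each parameter and the whole body. */
#define SQUARE(x) ((x) * (x))

/* SQUARE_NAIVE(1 + 2) expands to 1 + 2 * 1 + 2, which is 5, not 9. */
```

The precedence surprise is exactly the kind of "ugly edge" mentioned above; the fully parenthesized form behaves as expected.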




It's not complex by itself, but it has some emergent complexity. (Not unlike some other "simple systems": https://en.wikipedia.org/wiki/Rule_110)

I'd say the main cause of complexity with the C preprocessor is that it indiscriminately feeds its output to a complex system. This means that relatively small perturbations, such as the presence or lack of parentheses, can have large and dramatic effects. One could say that a system that works around these by adding safeguards such as hygiene-by-default is more complex by itself, but it definitely begets more controlled behaviour.


Indeed, complexity of the implementation leads to a simpler abstraction/conceptual model/end user experience. This is, almost undoubtedly, a good thing.


I want to disagree. Complexity of implementation is never a good thing. If you cannot truly hide it, it's a bad thing (for the user of the API/language/etc.). If you can, it's unwarranted complexity in the first place.


This doesn't make sense at all.

Abstraction is the act of hiding necessary, but irrelevant, complexity.

Timsort is much more complex than quicksort or merge sort. A basic user doesn't need to know that they're using timsort instead of mergesort. An advanced user may prefer knowing that timsort will always be faster, but not knowing this doesn't harm them.

The same goes for many (most?) performance optimizations. A more complex implementation is often wholly transparent to the API (although I guess you could argue that the abstraction is leaky since the system runs faster), but simultaneously often warranted when applied across all users of a system.

The same thing is true with macros that act on an AST instead of raw text. You use a wholly different API (acting on a tree structure instead of a string), but this makes many changes easier, and makes them all safer, at the cost of the API having to handle parsing the language for you.

There's no downside to the end user. Only a higher cost of implementation. You're free to disagree, but you should be able to demonstrate specific examples of costs to the end user in those two cases, as well as any others.


I like the way you put it, but nevertheless some thoughts.

There's a nice quote, "Being abstract is something profoundly different from being vague. The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise."

If the user really does not care what implementation is used, you could use any. You could use a less sophisticated one. In any case I do not think I had a sort implementation in mind, since one might argue it's not really complex. It doesn't interact with the rest of the system in complex ways. If that makes sense.

UPDATE: Now I know what bugged me in the way you put it initially -- it's in "complexity of the implementation leads to a simpler abstraction/conceptual model/end user experience". Note that this does not apply to Timsort (or any other sort). The "complex" implementation does not make for a simpler conceptual end user experience. A sort is a sort.

I don't have anything against AST transformations. They are a good idea to implement if one can figure out relevant use cases and a usable API. But in most cases I guess personally I'm likely to prefer either that 1-line macro, or not to add tricky AST manipulation code to create heavy syntax magic for a one-off thing.


>There's a nice quote, "Being abstract is something profoundly different from being vague. The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise."

And under this definition, AST-level macros are vastly superior to text-transforming ones. Text-transforming macros don't allow the user to be precise. Or at least, to be precise, the user must know a number of rules that are arcane and not obvious. `#define min(X, Y) (X < Y ? X : Y)` feels like it should work, but it doesn't. Or rather, it works sometimes, but not always.
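Concretely, even a fully parenthesized version of that macro still double-evaluates its arguments (a minimal sketch; MIN and the demo function are invented names):

```c
#include <assert.h>

/* Parenthesized, so precedence is safe... */
#define MIN(X, Y) ((X) < (Y) ? (X) : (Y))

/* ...but each argument is substituted as text, so a side-effecting
 * argument is evaluated twice. */
int min_side_effect_demo(void) {
    int i = 0;
    int m = MIN(i++, 10);  /* expands with i++ appearing twice */
    assert(m == 1);        /* the winning branch re-evaluates i++ */
    return i;              /* i was incremented twice, not once */
}
```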

I'll come back to this in a second, but let's talk about sorting. You claim that if the user doesn't really care, you can use a less sophisticated implementation. But that's not at all true! In fact, it's a violation of the Liskov Substitution Principle[0]. If I have an interface `Sort`, I shouldn't care whether it is of type Bubble, Tim, or Quick. But if I have a TimSort, I certainly may care if you swap it out with BubbleSort. In this case, the desirable property is speed. One way of putting this is that for any Sorts S and T, S can only be a Subtype of T if S is faster than T. This allows you to replace a T with an S, and you won't lose any desirable properties[1].

Another way to put this rule would be that when modifying an API, your changes shouldn't cost the user anything. Replacing Timsort with bubblesort costs the user speed. That's bad. It violates an assumption the user may have had.

And while you say that this doesn't make for a simpler conceptual end user experience, I disagree. If the system just is fast, you don't need to worry about performance. If `sort` is fast, you won't need to implement your own faster sorting function (or even have to try and figure out that that's what you need). Not having to write an efficient sorting algorithm certainly sounds simple to me!

Similarly, there is a cognitive cost to an API. And working with an API that is astonishing[2] has a cost. It's harder to reason about how it will interact with the rest of the system. An API that can guarantee that all macro results will be syntactically valid is simpler than one that requires that you manually do that bookkeeping. Same with a language that can guarantee that all memory accesses will be valid. There may be associated costs with these things (you can't string-concat syntactically invalid strings into valid ones, or you can't implement certain graph structures without jumping through hoops), so it's not quite as clear cut as with sorts, but I think it's pretty clear that AST-based macros are simpler to interact with than C's.

In fact, C's macros being so simple makes them more dangerous. It's easy to understand how they work internally, so a novice may feel like they won't be surprising (until...). But the simplicity of the implementation leads to footguns when they interact with other things. The abstraction C macros provide is conceptually simple but leaky, or to use your words, imprecise. You have to understand them more than you think to be able to interact with them safely.

AST based macros on the other hand aren't leaky. They're more difficult to conceptualize, but you don't really need to fully conceptualize, because they won't surprise you. You take in some expressions and modify them, and you'll get out what you expect.

Doing AST based transforms and substitutions instead of text-based ones significantly reduces the cognitive overhead. You stop having to worry about the edge cases where some transformation might happen in the wrong context, or not happen in the right one (as a simple example, applying substitutions via regex vs. via AST means that you no longer have to worry about where there were line breaks).

>I don't have anything against AST transformations. They are a good idea to implement if one can figure out relevant use cases and a usable API. But in most cases I guess personally I'm likely to prefer either that 1-line macro, or not to add tricky AST manipulation code to create heavy syntax magic for a one-off thing.

I'm a bit confused here. I'm not saying that the macros are harder to implement for the end user; in fact, the opposite. But that they're more work for the language to implement. A text-based macro like

    #define DEFINE_HANDLER(type, function_body) \
        void handle_##type(type input) { function_body }
isn't significantly easier to define than something like

    DEFINE_HANDLER(type t, AST function_body) {
        f = ast.Function()
        f.name = "handle_" + t.name
        f.body = function_body
        return f
    }
In fact, it's arguably clearer what's going on in the second example.

[0]: https://en.wikipedia.org/wiki/Liskov_substitution_principle

[1]: Yes I realize there are other desirable properties of sorts, such as stability and inplaceness but I'm simplifying.

[2]: https://en.wikipedia.org/wiki/Principle_of_least_astonishmen...


I never wanted to argue as much. I'm certainly not saying macros solve all the world's problems. Far from it, and as I said from the beginning, you need to use them carefully. Here are examples of my typical uses:

    #define LENGTH(a) ((int) (sizeof (a) / sizeof (a)[0]))
    #define SORT(a, n, cmp) sort_array((a), (n), sizeof *(a), (cmp))
    #define CLEAR(x) clear_mem(&(x), sizeof (x))

    #define BUF_INIT(buf, alloc) \
        _buf_init((void**)(buf), (alloc), sizeof **(buf), __FILE__, __LINE__)

    #define BUF_EXIT(buf, alloc) \
        _buf_exit((void**)(buf), (alloc), sizeof **(buf), __FILE__, __LINE__)

    #define BUF_RESERVE(buf, alloc, cnt) \
        _buf_reserve((void**)(buf), (alloc), (cnt), sizeof **(buf), 0, \
                     __FILE__, __LINE__)

    #define RESIZE_GLOBAL_BUFFER(bufname, nelems) \
        _resize_global_buffer(BUFFER_##bufname, (nelems), 0)

    #define MSG(lvl, fmt, ...) _msg(__FILE__, __LINE__, (lvl), (fmt), ##__VA_ARGS__)
    #define FATAL(fmt, ...) _fatal(__FILE__, __LINE__, (fmt), ##__VA_ARGS__)
    #define UNHANDLED_CASE() FATAL("Unhandled case!\n")
    #define ABORT() _abort()
    #define DEBUG(...) do { \
        if (doDebug) \
                _msg(__FILE__, __LINE__, "DEBUG", __VA_ARGS__); \
    } while (0)
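As a self-contained sketch, here's how the array-length macro above is typically used (sum_demo is an invented name; note LENGTH only works on true arrays, not on pointers they decay to):

```c
#include <assert.h>

#define LENGTH(a) ((int) (sizeof (a) / sizeof (a)[0]))

int sum_demo(void) {
    int xs[] = {1, 2, 3, 4};
    int total = 0;
    /* sizeof sees the whole array here, so LENGTH(xs) is 4 */
    for (int i = 0; i < LENGTH(xs); i++)
        total += xs[i];
    return total;
}
```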

And a clever one, saving a lot of typing (which many will argue only fixes a problem of C itself. But still).

    #ifdef DATA_IMPL
    #define DATA
    #else
    #define DATA extern
    #endif

    DATA char *lexbuf;
    DATA char *strbuf;
    DATA struct StringInfo *stringInfo;
    ...
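For readers unfamiliar with the trick: the header declares the variables as extern everywhere except in the single translation unit that defines DATA_IMPL before including it, where they become actual definitions. A sketch, with the two-file layout simulated in one snippet:

```c
/* In exactly one .c file:                                  */
/*     #define DATA_IMPL                                    */
/*     #include "data.h"                                    */
/* Everywhere else, just #include "data.h".                 */

#define DATA_IMPL   /* pretend this snippet is the implementing file */

#ifdef DATA_IMPL
#define DATA        /* expands to nothing: actual definitions */
#else
#define DATA extern /* declarations only */
#endif

DATA char *lexbuf;  /* tentative definition here, zero-initialized */
DATA char *strbuf;
```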
None of these is a maintenance burden, and each makes my life significantly easier. I don't believe there's a different scheme that is a better fit here.


Sure, but all of those solve problems that only really exist in C. If you're arguing that C, as it currently exists, needs text-macros, then maybe. But most of those macros are just dealing with flaws in C (lack of compile-time array-size tracking, no logging built in, etc.).

In many languages, those macros aren't things you'd ever need to do. You're just being forced to make up for a flaw in the platform.


I disagree. If it isn't a macro then you cannot write your own kind of macro instead. Those things aren't really flaws in C (although there still are flaws in C, such as that macro expansions cannot include preprocessor directives). You need all of these kinds of macros.


>If it isn't a macro then you cannot write your own kind of macro instead

I'm not sure what you mean here. ast-macros can still wrap ast-macros.

And yes, I'd absolutely claim that not tracking array size at compile time is a flaw in C (Rust fixes this: you can pass `&[i32; N]` to a function (a reference to a compile-time-fixed-size array) and call `.len()` on the argument. This has no runtime cost in either space or speed).

In the same way that I talk about cognitive overhead above, the requirement that a user manually pass around compile time information is dumb. Note that in C this wouldn't have prevented you from down-casting an array to a pointer; it's just that the language wouldn't have forced this on you at every function boundary.
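The decay the parent alludes to is easy to demonstrate in C: inside a function, an "array" parameter is really a pointer, so sizeof can no longer recover the element count (the function names here are invented):

```c
#include <assert.h>

/* The [8] is decorative: the parameter's type is really int *. */
static int decayed_count(int a[8]) {
    return (int) (sizeof a / sizeof a[0]);  /* sizeof(int *) / sizeof(int) */
}

static int real_count(void) {
    int a[8];
    return (int) (sizeof a / sizeof a[0]);  /* 8: a is still an array here */
}
```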

The only reason C didn't do this is because the implementation was costly to the authors. It doesn't have any negative impacts to the end user (well, there's an argument to be made that there was a cost to the end user at the time, but I'm not sure how much I believe that).


C is minimal and orthogonal. It doesn't have bounds information because it's not clear how to do that. If you look in other languages, you can find these alternatives:

- OOP with garbage collection: Need to pass around containers and iterators by references. Bad for modularity (the container type is a hell of a lot more of a dependency than a pointer to the element type). And not everybody wants GC to begin with, not in the space where C is an interesting option.

- Passing around slices / fat pointers with size information. Not as bad for modularity, but breaks if the underlying range changes.

- Passing around non-GCed pointers to containers (say std::vector<int>& vec): Again, more dependencies (C++ compilation times...). And it still breaks if the container is itself part of a container that might be moved (say std::vector<std::vector<int>>).

- With Rust there is now a variation which brings more safety to the third option (borrow checker). I don't have experience with it, but as I gather it's not a perfect solution since people are still trying to improve on the scheme (because too many good programs are rejected, in other words maybe it's not yet flexible enough?). So it's still unclear to me if that's a good tradeoff.

None of these options are orthogonal language features, and #2 and #3 easily break, while the first one is often not an option for performance reasons. All are significantly worse where modularity is important (!!!).

I personally prefer to pass size information manually, and a few little macros can make my life easier. It causes almost no problems, and I can develop software so much more easily this way. I have grown used to a particular style where I use lots of global data and make judicious use of void pointers. It's very flexible and modular and I have only few problems with it. YMMV.


The borrow checker isn't a solution to all problems, but afaik, doesn't fail for arrays/slices ever.

The situations where the borrow checker can't work are different than those that involve arrays. You don't lose anything.

There doesn't always need to be a trade off.


>I'm not sure what you mean here. ast-macros can still wrap ast-macros.

Yes. Although text macros are sometimes useful, AST macros are generally much better.

>In the same way that I talk about cognitive overhead above, the requirement that a user manually pass around compile time information is dumb.

If the macro facility is sufficient, it would be implemented by the use of a macro; you do not then need to write it manually each time. In C, you can use sizeof. Also, sometimes you want to pass the array with a smaller length than its actual length (possibly at an offset, too).


Rust tracks this use case, as you can slice the array smaller and it will track that it was sliced off further. No need to remember this kind of thing.


>[...] isn't significantly easier to define than something like [...]

Of course it is. The first is accessible to everybody who knows the host language's syntax; the second requires an understanding of how the syntax is mapped to the AST, which may not be 1:1 in certain edge cases.


In C, the macro language is different from the host language, and they don't interact with clear semantics, so this is not at all obviously the case.


Huh? Your parent obviously means that the C preprocessor is a macro language to generate host language syntax. The preprocessor language itself is pretty simple -- next to invisible in many cases.


That still doesn't address the interaction with unclear semantics.

Getting a struct pointer and modifying it would be much more familiar to anyone who hadn't yet written macros than jamming together tokens with ##.


I don't think pasting tokens with ## is such a common practice at all, and it certainly must be used prudently in order to avoid producing unreadable code. Myself, I used it for one little thing (I posted it above) to get nicer code.

I needed it because I had pointers of various types and wanted to associate data of a fixed type with each of the pointers. So I simply made an enum of BUFFER_pointername symbols which I can then use to associate arbitrary data with them. I cannot go the other way (i.e. making a link from the extra data to the pointer) because then the pointer type information gets lost - I would need to go with a void pointer and add the actual type in lots of places.
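A hedged sketch of the BUFFER_##name idea, restated here as an X-macro so the name list is written only once (all names below are invented, not the poster's actual code):

```c
/* Single list of buffer names, expanded with different X macros. */
#define BUFFER_LIST(X) \
    X(lexbuf)          \
    X(strbuf)

/* Expansion: the enum of BUFFER_* tags via token pasting. */
#define MAKE_TAG(name) BUFFER_##name,
enum BufferId { BUFFER_LIST(MAKE_TAG) NUM_BUFFERS };
#undef MAKE_TAG
```

A second expansion of BUFFER_LIST could then declare the pointers themselves, keeping the enum and the pointer declarations in sync automatically.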

I also don't like putting the pointer together with the extra data because that means dependencies, and also I want a simple bare pointer to the array data (instead of a struct containing such pointer) that I can index in a simple way.

I also don't like the stretchy_buffer approach where you put the metadata in front of the actual buffer data. Again, because of dependencies.

The alternative would have been to go for C++ and make a complicated templated class. I don't use C++ and templates are a rathole on their own. So a single ## in my code solves this issue. I'm happy with it for now.


You're comparing against a different thing.

I'm suggesting that in Lisp or Rust macro land, your macro is a Lisp or Rust function. So in sane C macro land, your macro is a C function.

    macro macrofun(AST* node) {
        node->name = strcat("BUFFER_", node->name);
    }
Literally just use a subset of C syntax on a C AST represented as a tree of node structs. It need not be blazing fast; it runs at compile time.

C++ constexpr is close, although decidedly less powerful, but is still a huge win over C macros.

You keep making excuses for why you're doing these things, and I don't really care why you're doing them. I'm saying you shouldn't need to, and that the interface with which you would solve them should be less awful and inherently error prone.

But making a less error-prone interface takes more up-front complexity. You've elsewhere claimed that this has positive externalities ('it forces you to understand C syntax better'), but I'd reverse that and claim that

1. It violates a common api ("languages are expression oriented"), and is therefore both astonishing and leaky

2. It doesn't force you to understand the language better; it prevents you from being productive until you understand a set of arbitrary rules (you list these elsewhere) that aren't necessary for normal use. Token-based macros require you, the user, to understand how C is parsed; AST macros don't, because they do it for you.


No need to get all worked up. As I said, I'm not ideologically against AST manipulation at all. It's just that I don't have a big issue with the C preprocessor, which is a good solution for the simple cases. It doesn't improve the situation if you replace ## with programmatic access -- it's a little more involved, but on the upside you can maybe expect slightly better error messages in case of wrong usage.

In the end, there are zero problems with my approach here, either. And the CPP doesn't encourage you to get too fancy with metaprogramming. Metaprogramming is a problem in its own right because it's hard to debug. I've heard more than one horror story about unintelligible LISP macros...

Note that I am going to experiment with direct access to the internal compiler data structures for my own experimental programming language as well. But you need to realize that this approach has an awful lot more complexity. You need to offer a stable API which is a serious dependency. You need to offer a different API for each little use case (instead of just a token stream processor). If you're serious like Rust you also need to make sure that the user doesn't go all crazy using the API. Finally, it's simply not true that you need to understand less about parsing (the syntax) with an AST macro approach. The AST is all about the syntax, after all.


people who believe this are why we can't have nice things.


You're welcome. Have a good day :-)


The core idea of the C preprocessor is relatively simple. But it has very complex effects on the surrounding language.


The C preprocessor is one of the worst design decisions in C. They could have made the directive system use the brace-and-semicolon grammar instead of introducing a second syntax.


You can use another preprocessor implementation for C/C++ with more features, e.g. DMS[0], which works on the AST, or even PHP or Perl, but you will need to parse the code twice, so compilation will be about 2x slower.

[0]: http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html


In C(++), preprocessing is the only way to portably declare directives. All others are vendor-specific. They are working on a common directive and module system now, but it will take at least two more years to be fully standardised.


So just ship your preprocessor with your application/library. Look at cvstrac, for example.


They're complex in the sense that they do not follow the rules of the surrounding system. They can produce mismatching parentheses, breaking the pretty basic notion that expressions can be "inlined" into other expressions.

Not knowing whether `f()` is a syntax error or not removes a huge part of the foundation that people rely on to read code they do not know 100%.


> They're complex in the sense that they do not follow the rules of the surrounding system.

And this is exactly why they are useful. You can do things with it that you cannot do in any other way. Practical things, I should mention.

Some examples: conditional compilation based on feature set. Automatic insertion of additional arguments (like __FILE__, __LINE__, or the sizeof of arguments). Conversion of identifiers to C strings. Computations such as getting the offset of a member in a struct. Ad-hoc code and data structure generation.
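Two of those, sketched (NAME_OF and struct Point are invented names here; offsetof itself is standard, and was classically implemented as a preprocessor macro):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Conversion of an identifier to a C string via the # operator. */
#define NAME_OF(x) #x

/* Offset-of-member computation: <stddef.h> standardized offsetof,
 * which earlier C code wrote by hand as a macro. */
struct Point { int x; int y; };
```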

Many of these could be individually replaced with complex features built into the core of the language, by making arbitrary ad-hoc decisions that would be hard to standardize and would probably kill the language.

> Not knowing whether `f()` is a syntax error or not removes a huge part of the foundation that people use to read code they do not know 100%

It's your responsibility to match parentheses when required (which is almost always). That's easy for a macro that is usually 1, or at most 5, lines. And if something goes wrong, it's usually not that hard to find the error and fix it. You need to be aware of some gotchas, though (wrap expressions in parens; don't write naked if statements or blocks, but wrap them in do..while(0)). That means you will get to learn the C syntax more intimately.
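The do..while(0) gotcha mentioned here is worth spelling out: a brace-only multi-statement macro breaks under if/else, because the semicolon at the call site terminates the if (SWAP and the demo function are invented names):

```c
#include <assert.h>

/* Brace-only version: `if (c) SWAP_BAD(a, b); else ...` is a syntax
 * error, since the `;` after the braces already ends the if. */
#define SWAP_BAD(a, b) { int t_ = (a); (a) = (b); (b) = t_; }

/* do..while(0) makes the expansion a single statement that
 * swallows the trailing semicolon cleanly. */
#define SWAP(a, b) do { int t_ = (a); (a) = (b); (b) = t_; } while (0)

int swap_demo(void) {
    int x = 1, y = 2;
    if (x < y)
        SWAP(x, y);   /* fine even with the else below */
    else
        return -1;
    return x;         /* now 2 */
}
```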



