Dear sir, you have built a compiler (rachitnigam.com)
355 points by rachitnigam 9 days ago | 175 comments

I feel like you can apply the same sentiment to many of the "big scary things" in programming.

Things that you don't want to build (if "common engineering wisdom" is to be believed):

- a compiler

- a programming language (not sure there's a difference from a compiler, as stated in the article)

- a database (query engine)

- a CMS

- an ERP

But sometimes you actually _do_ want to build that (even if every alarm bell in your engineering lizard brain goes off), and then it's probably better to commit and learn the ins and outs of that specific problem domain.

I think the recommendation to commit to it, especially, is something that's missing from the article. It's "easy" to point at people and say "look, you did the thing you said you wouldn't do", but far harder to suggest a course of action.

I personally hit the barrier of "things you shouldn't build" in the past, and I've always been happiest (and professional outcomes have been the best) when I decided to break through the barrier and prepare for what lies on the other side, instead of trying to dance around it for ages.

> and then it's probably better to commit and learn the ins and outs of that specific problem domain.

Even more, it's probably better to start off with a full understanding of these systems, so you can recognize when you're Greenspunning one of these abstractions, and just switch to using a robust off-the-shelf implementation of the same abstraction. (E.g. switch from writing a hand-rolled parser, to using a real parser-combinator library, or writing a formal grammar for a PEG compiler.)
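The difference can be sketched in a few lines. This is an illustrative toy, not any particular library's API: each "parser" is just a function from input text to a (value, remainder) pair (or None on failure), and combinators like `many` and `alt` build bigger parsers out of smaller ones.

```python
# A minimal parser-combinator sketch (illustrative, not a real library).
# A parser: str -> (value, rest) on success, or None on failure.

def char(c):
    def parse(s):
        return (c, s[1:]) if s.startswith(c) else None
    return parse

def many(p):
    def parse(s):
        out = []
        while (r := p(s)) is not None:
            v, s = r
            out.append(v)
        return (out, s)
    return parse

def alt(*ps):
    def parse(s):
        for p in ps:
            if (r := p(s)) is not None:
                return r
        return None
    return parse

# "digits": zero-or-more of '0'..'9', composed from tiny pieces
digit = alt(*[char(d) for d in "0123456789"])
digits = many(digit)

value, rest = digits("42abc")
print("".join(value), rest)  # → 42 abc
```

The hand-rolled equivalent tends to start as an ad-hoc loop and grow error handling, backtracking, and lookahead case by case; a real combinator or PEG library gives you those up front.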

Emphasis on "off-the-shelf": the most important thing to get from experience with these abstractions is the understanding that it will be less work to use the abstraction that has more power; and, in fact, that much of the tooling around these abstractions is built to assume that you want the full power of the tool by default, such that you have to add configuration to get less. So the workflow is actually "start off ordering 'one with everything', and then streamline your system over time, removing bits and pieces as you notice properties of your particular situation that mean you can do with something less powerful at particular stages."

> more power

eg: log4j -- so much power you can enable random bitcoin miners with your infrastructure (tongue in cheek)

ie: Sometimes it's not about more power but about the smallest/least powerful implementation that accomplishes the job.

I agree that using off-the-shelf components is often important, but it also comes with the need to vet the thing in the first place. That's not always trivial if you care about the licenses of bundled dependencies, the absence of hidden features, active development, or any of a plethora of other dimensions to what the "shelf" provides.

I think "tools that build tools" is a very different ecosystem from most others. When you pull in something like e.g. LLVM, you're not really having to deal with any of that. It's a clean bolt-in API, freely-licensed, no weird integration points, with many active users who've sanded off any weird expectations.

I think the difference is that something like log4j has no pressure to be "the least power it can while doing the biggest job it can" — it has mostly enterprise customers who want the power, but don't care about the scope creep. Something like LLVM, on the other hand, has both big enterprise use-cases, and tiny embedded use-cases, with big players invested in both. So its size and scope have been much more well-considered, with any components that do something other than "compile these data structures representing inputs, into these other data structures representing outputs" having been long factored out into additional "plugin" sort of downstream-dependency libs.

In the specific case of LLVM, I mostly agree but for different reasons. However, Node would be a good counterpoint to “tools that build tools”. There are plenty of large interests there, though maybe not so much emphasis on the embedded space. More importantly, a bug or vulnerability in Node is generally far less impactful than one in LLVM. Owning a compiler that compiles everything in an OS >> owning Node.

The problem, of course, is that it's impossible to know everything :)

IME the best compromise I've found is to learn enough to recognize the problems, so that when I implement a bastardized compiler, I'm aware that that's what I'm doing and that at some point in the future I'll need to decide how deep the implementation should go or switch to actually using some library.

I somewhat disagree — there really is a finite amount of Computer Science to know. The more you dive deep into different domains, the more you find yourself ending up in the same places, learning the same things with different names. What is a SQL query planner but a compiler? Is an OS scheduler really that different from an HPC workload manager, or your GPU's compute-shader arbitrator? Is RDMA against an NVMe storage pool fundamentally new to you if you've already written hierarchical chunk persistence+caching for a distributed key-value store? Etc.

At some point, it's all just Computer Science. And not too much of it, either! An amount that could easily be learned in a (good) 4-year undergraduate program.

Yes, but also no. I see what you're saying at a high level, but my point is that the devil's in the details, and that you can't know all the details of each domain; the best you can do is be aware that those details are there. Just being exposed to even a subset of the high level is pushing the limits of a 4-year undergrad curriculum.

> What is a SQL query planner but a compiler?

Kind of? Lexing, sure, and to some extent IR, but an execution plan is not built from an instruction set, especially when you start talking distributed. Even if you ignore the distributed scenario, there's still a huge difference between understanding indices/columnar-storage/row-storage and implementing PGO or LTO.

Plus, there's also a lot of domain complexity here- are you aiming for just ANSI compat, or are you going deeper and trying to be compatible with specific SQL Server / Oracle / Postgres features?

> Is an OS scheduler really that different from an HPC workload manager, or your GPU's compute-shader arbitrator?

Knowing little about HPC or GPUs, all I can say is that yes, there's absolutely a difference between the kinds of issues a kernel, hypervisor, and distributed scheduler have to deal with. Hypervisors need to integrate with higher-level RBACs, for one; distributed schedulers also need to be able to evict bad machines from the same system.

Telemetry also changes massively: if you're dealing with a kernel/hypervisor, stats from things like strace etc are useful. If you're distributed? You now have multiple cardinality problems (hostname, network devices, microservice release ID).

> Is RDMA against an NVMe storage pool fundamentally new to you if you've already written hierarchical chunk persistence+caching for a distributed key-value store?

Again, yes (at least I assume so; I'm not super sure exactly what an "NVMe storage pool" is, but I assume you're talking about the protocol that SSDs speak). E.g., what's your resilience story against a network failure (someone cutting a fiber line, or losing a subgraph of a Clos fabric)?

> but my point is that the devil's in the details and that you can't know all the details of each domain, the best you can do is be aware that those details are there.

IMHO you can do slightly better — you can know enough about isomorphic decisions you've made in other domains to drive design in a new domain, without necessarily understanding implementation.

(Especially "design" in the sense of "where does the shear-layer exist where I would expect to find libraries implementing the logic below that layer" — allowing you to understand that there should exist a library to solve constrained problem X, and so you should go looking for such a library as a starting point to solving the problem; rather than solving half of problem X, and then looking for a library to solve overly-constrained sub-problem Y.)

Or, to put it another way: knowledge of strategy is portable, even if knowledge of tactics is not. An Army General would — in terms of pure skill — pick up the job of being a Naval Admiral pretty quickly, because there's a lot about high-level strategy that isn't domain specific. And that experience — combined with hard-won domain experience in other domains — could guide them on absorbing the new domain's lower-level tactics very quickly.

Or, another analogy: linguists learn languages more easily than non-linguists, because 1. it turns out that, past a certain point, languages are more similar than they are different; but more importantly, 2. there are portable high-level abstractions that you begin to pick up from experience that can accelerate your understanding of a new language, but which you can also be taught academically, skipping the need to derive those concepts yourself. (You still need practical domain knowledge in at least some domains to root those concepts to, but you can do that in your own head when studying the concepts in a book, rather than "in anger" in the field.)

This belief — that there exists portable high-level strategic knowledge to software engineering — is exactly why some companies promote programmers into product managers. A product manager will be tasked to drive the strategy for writing a new piece of software, potentially in domains they've never worked in. They won't be implementing this software themselves — so they don't actually need to understand the domain with high-enough precision to implement it. But they'll nevertheless be relied upon to make architectural decisions, buy-vs-build decisions, etc. — precisely because these decisions can be made using the portable high-level strategic knowledge of software engineering.

(Though, once again, this knowledge must be rooted in something; which is why I don't really believe in CS / SWEng as programs people should be taking fresh out of university. The best time to learn them is when you already have 10 years' programming experience under your belt. They should be treated as "officer's finishing school", to turn an experienced apprentice 'programmer' into a professional Software Engineer.)

On to specific hand-wavy rebuttals, since my original rhetorical questioning didn't do the job of being particularly clear what I meant:

> but an execution plan is not built from an instruction set

What is a logical-level write-ahead log, but a long bytecode program that replicas execute? What does executing SQL do, but to compile your intent into a sequence of atomic+linearized commands — an instruction stream — in said write-ahead log?

(Alright, in a multi-master scenario this isn't as true. But you can get back to this property if your DBMS is using a CRDT event-streaming model; and AFAIK all the scalable multi-master DBMSes do something approximating that, or a limited subset of that.)
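To make the "WAL as instruction stream" framing concrete, here's a minimal sketch; the (opcode, key, value) record format is invented for illustration, and real logical WALs carry far more (LSNs, transaction boundaries, etc.):

```python
# Sketch: a logical write-ahead log as an "instruction stream" that a
# replica executes in order to reconstruct state. Record format is
# hypothetical.

def apply_wal(log, state=None):
    """Replay a sequence of (opcode, key, value) records in order."""
    state = {} if state is None else state
    for op, key, value in log:
        if op == "PUT":
            state[key] = value
        elif op == "DEL":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return state

wal = [
    ("PUT", "user:1", "alice"),
    ("PUT", "user:2", "bob"),
    ("DEL", "user:1", None),
]
print(apply_wal(wal))  # → {'user:2': 'bob'}
```

Any replica that executes the same record stream in the same order converges on the same state — which is exactly the property an instruction stream gives a CPU.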

> Hypervisors need to integrate with higher-level RBACs, for one; distributed schedulers also need to be able to evict bad machines from the same system.

Perhaps you haven't dealt with machines of sufficient scale—the abstractions really do come back together! The IBM z/OS kernel knows about draining+evicting hotplug NUMA nodes as schedulable resources (and doing health-checks to automatically enable that), in exactly the same way that Mosix or TORQUE or kube-scheduler knows about draining+evicting machines as schedulable resources (and doing health-checks to automatically enable that.)

Even regular microcomputer OS kernels have some level of support for this, too — though you might be surprised why it's there. It doesn't exist for the sake of CPU hotplug+healthcheck; but rather for the sake of power efficiency. An overheating CPU core is, at least temporarily, a "bad" CPU core, and on CPU paradigms like big.LITTLE, the entire "big" core-complex might be drained+evicted as a response! That code, although different in purpose, shares a lot algorithmically with what the distributed schedulers are doing — to the point that one could easily be repurposed to drive the other. (IIRC, the Xeon Phi did exactly this.)

> e.g. what's your resilience story against a network failure (someone cutting a fiber line, or losing a subgraph of a Clos fabric)?

My point with mentioning hierarchical persistence+caching for a distributed key-value store, is that these same problems also occur there. From the perspective of a client downstream of these systems, a network partition in tiered storage, is a network partition in tiered storage, and you have to make the same decisions — often involving (differently-tuned implementations of) the same algorithms! — in the solution domain for both; and, perhaps more importantly, expose the same information to the client in the solution domain for both, to allow the client to make decisions (where needing to capture and make this information available will heavily constrain your design in both cases, forcing some of the similarity.)

> you can know enough about isomorphic decisions you've made in other domains to drive design in a new domain, without necessarily understanding implementation.

That's certainly true!

I'm not at 5 years yet in the industry myself, but I've definitely started to notice these patterns - the strategy vs tactics is a nice way to describe it :)

> Perhaps you haven't dealt with machines of sufficient scale—the abstractions really do come back together! The IBM z/OS kernel knows about draining+evicting hotplug NUMA nodes as schedulable resources (and doing health-checks to automatically enable that), in exactly the same way that Mosix or TORQUE or kube-scheduler knows about draining+evicting machines as schedulable resources (and doing health-checks to automatically enable that.)

Oh, this is interesting. My experience has been as a Google-internal user of Borg, where although the fleet is large, most of the machines are individually small enough that it's easier for us to simply evict a machine from the fleet rather than block a core. (The whole "commodity" server strategy, whereas z/OS follows a very different architecture, but I see exactly what you mean there.)

> An amount that could easily be learned in a (good) 4-year undergraduate program.

i think any kind of undergrad program can only ever scratch the surface of some of these topics. people spend decades becoming proficient in just a single vertical (microchip architecture, distributed systems, asics/fpgas, signal processing, gpu shaders, network engineering, jit compilers, language design, color spaces, database engines, etc).

sure, it's all "just" CS/EE :D

look at how much knowledge/skill was required to make an even faster fizbuzz: https://news.ycombinator.com/item?id=29031488

IMHO "programming language" would refer specifically to the grammar, type system, semantics, keywords, etc. and "compiler" would refer to the implementation that takes something written in that representation and outputs some type of executable. A pedantic distinction but probably a useful one, especially given how some languages have multiple implementations of their standard.

The Programming Languages course I took in college had us implement a toy subset of Ruby targeting the Lua VM. Easily the most interesting and useful class that I had the pleasure of experiencing; it completely changed how I think about languages. I think everybody should have exposure to the basic principles of compilers like syntax trees, flow analysis, calling conventions, and so forth (if not necessarily the more complicated stuff like operational semantics).

"And so Forth"

Nice unintentional pun.

There are two neat things about compilers and interpreters:

1. They are one of the most researched areas of CS with a large amount of educational materials, tools, and community knowledge to help.

2. They map to any problem involving transforming an input into an output (hell, any problem transforming an input to an output is a compiler or interpreter).
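As a toy illustration of point 2, even a trivial text transformation decomposes into the classic front-end / middle-end / back-end shape (the `*bold*` markup format here is made up):

```python
import re

# A toy "compiler" pipeline: tokenize -> transform -> emit.
# Input: a line with *bold* markers; output: HTML. Purely illustrative.

def tokenize(text):
    return re.split(r"(\*)", text)           # front end: split on markers

def transform(tokens):                        # middle end: pair up markers
    out, open_ = [], False
    for t in tokens:
        if t == "*":
            out.append("<b>" if not open_ else "</b>")
            open_ = not open_
        else:
            out.append(t)
    return out

def emit(parts):                              # back end: serialize
    return "".join(parts)

print(emit(transform(tokenize("a *bold* word"))))  # → a <b>bold</b> word
```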

does it affect how you code other kinds of apps too?

knowing how to think about a program interpretation space gives quite a broader view IMO

I'd argue the idea of an "interpreter" is one of the most foundational concepts of CS.

It sounds really basic, but the idea of being handed a description, and doing/creating something based on that is everywhere. It is really quite beautiful.

It comes up often too. A few months ago I wanted to make a part of the code I was working on more declarative, so I had it take a JavaScript object that described a layout, and perform actions depending on that object. That's a form of interpreter, in a way.
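A minimal sketch (in Python rather than JavaScript) of that "data describes what, a small walker decides how" idea — the node schema is invented for illustration:

```python
# Sketch of the declarative-object + interpreter idea: the data
# structure describes *what* to render, a small recursive walker
# decides *how*. The "kind"/"children" schema is made up here.

def render(node, indent=0):
    pad = "  " * indent
    kind = node["kind"]
    if kind == "text":
        return pad + node["value"]
    if kind == "box":
        children = "\n".join(render(c, indent + 1) for c in node["children"])
        return f"{pad}[{node['name']}]\n{children}"
    raise ValueError(f"unknown node kind: {kind}")

layout = {"kind": "box", "name": "root", "children": [
    {"kind": "text", "value": "hello"},
    {"kind": "text", "value": "world"},
]}
print(render(layout))
```

Swap the `render` walker for a different one (hit-testing, serialization, diffing) and the same layout object drives it — that's the interpreter shape.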

To me it was the recursive mapping over a closed set of types: whatever you do, you should end up with an element of your type/grammar. That, and the graph-walk space of the AST.

I'm currently writing a compiler so I'm not new to this, but I've no idea what you just said. Have you any links to explain it? TIA

The closed set of types describes how each pass in a compiler pipeline tends to build an intermediate representation of some sort, analyze/modify it, and then output it to the next stage. The IR is closed in that it is not open for everyone to extend willy-nilly. It is this property that makes it even possible to analyze and manipulate in a sound manner.

In a sense, compilers literally build a bunch of linguistic 'worlds' within their pipeline, each one emphasizing something different.
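A sketch of what a closed IR buys you: with a fixed set of node kinds (just two here — the shapes are invented for illustration), a pass can exhaustively and soundly handle every case.

```python
# Sketch: a closed IR (only Const and Add exist) and one pass over it.
# Because the node set is fixed, the pass can cover every case soundly.
from dataclasses import dataclass

@dataclass
class Const:
    value: int

@dataclass
class Add:
    left: object
    right: object

def fold(node):
    """Constant-folding pass: rewrite Add(Const, Const) into Const."""
    if isinstance(node, Const):
        return node
    if isinstance(node, Add):
        l, r = fold(node.left), fold(node.right)
        if isinstance(l, Const) and isinstance(r, Const):
            return Const(l.value + r.value)
        return Add(l, r)
    raise TypeError(f"not an IR node: {node!r}")

print(fold(Add(Const(1), Add(Const(2), Const(3)))))  # → Const(value=6)
```

If users could add arbitrary new node kinds, `fold` could no longer promise anything about its output — which is exactly why the set stays closed.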

IIRC it's even in the "Design Patterns" book -- the Interpreter pattern. It really does come up a lot.

The GoF book? I don't remember seeing it there, but I have to admit I skimmed it with very negative eyes :)

Yep, the interpreter pattern is definitely in GoF. Good way to learn a basic stack interpreter.
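For reference, a basic stack interpreter really is tiny — this sketch uses made-up opcodes:

```python
# A minimal stack interpreter, the kind of exercise the Interpreter
# pattern discussion usually leads to. The opcodes are invented here.

def run(program):
    stack = []
    for instr in program:
        op = instr[0]
        if op == "PUSH":
            stack.append(instr[1])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack[-1]

# (2 + 3) * 4
print(run([("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]))  # → 20
```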

I would also add a build system to that list, a tarpit many engineers have fallen into.

> not sure that there is a difference to compiler as stated in the article

It's hugely different.

If you're building a compiler for an existing language, you inadvertently signed yourself up for implementing and supporting the entire language, which is a pretty large engineering task.

But if you design a new language, you signed up for not just implementing the language, but also:

- Actually designing it, which is a very difficult task with hard trade-offs all over the place.

- Documenting it so that others can actually learn it. Technical writing is another hard task that takes years to master.

- Syntax highlighting and editor support so that it's as easy to work with as other languages. You will discover that users in the wild apparently use thousands of different editors, all of which you are expected to support.

- Static analysis tooling so that code navigation, automated refactoring, linting, etc. all work. Users today expect a rich IDE experience for every language they use. Also, you'll discover the hard way that the architecture of an interactive static analyzer is very different from your batch mode compiler, so now you're likely writing two front ends.

- Thousands of answers on StackOverflow, blog posts, introductory videos, etc. Users learn in many different ways and they expect to find education material in the form they prefer.

Note that this also happens pretty much whenever anybody has any sort of template system that they want to support substitution into. You decide that you want to support substitution and suddenly you are building static analysis and documenting and writing syntax highlighting and all this mess.

So for example, you want to create a validation system for something, and then you realize that you want generics/type-variables/polymorphism, so you want validation containers. Bam, the complexity suddenly shoots up! Or maybe you just want to let someone configure things with YAML, and then you find yourself with substitutions, and substitutions in substitutions, and all that mess. Why?

The abstract reason is that the underlying mathematics of template substitution is lambda calculus, which is Turing complete. It's hard to learn lambda calculus because it's very abstract, it takes place in a mathematical ideal realm where nothing happens, but this also means that the same pattern reappears in a whole bunch of different contexts, and it is best if we just regard it as a design pattern and conscientiously adopt it. “Oh, this is a Lisp, I have seen the 20 line lisp-in-lisp evaluator, let's just code that in whatever host language and someone else will have thought of the details.”
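The classic evaluator really is small. Here's a sketch in that spirit — in Python rather than Lisp, with an improvised tuple representation instead of real Lisp syntax:

```python
# A tiny lambda-calculus-ish evaluator in the spirit of the classic
# "lisp in lisp" exercise: variables, lambdas, application, plus an
# "add" primitive, all as nested tuples. Representation is improvised.

def ev(expr, env):
    if isinstance(expr, str):                 # variable reference
        return env[expr]
    if isinstance(expr, int):                 # literal
        return expr
    head = expr[0]
    if head == "lambda":                      # ("lambda", param, body)
        _, param, body = expr
        return lambda arg: ev(body, {**env, param: arg})
    if head == "add":                         # ("add", a, b)
        return ev(expr[1], env) + ev(expr[2], env)
    fn, arg = expr                            # application: (fn, arg)
    return ev(fn, env)(ev(arg, env))

# ((lambda x. x + 1) 41)
prog = (("lambda", "x", ("add", "x", 1)), 41)
print(ev(prog, {}))  # → 42
```

Once your template system supports parameterized substitution, it's already most of the way to the `lambda` and application cases above, whether you meant it to be or not.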

"Documenting it so that others can actually learn it. Technical writing is another hard task that takes years to master."

Some of us might argue that you have to do that anyway, for any software you write. In fact, a similar argument might apply to your first point.

On the other hand, the last three points are really not requirements unless you are creating a general purpose language and have a large audience.

> Some of us might argue that you have to do that anyway, for any software you write.

Yes, but if your software is another implementation of an existing language, the burden is an order of magnitude smaller because almost all of the existing docs for that language apply to your implementation too.

> On the other hand, the last three points are really not requirements unless you are creating a general purpose language and have a large audience.

Even if you have a small audience, if you want them to be productive, they need good tools and resources. Part of the reason it's hard for DSLs to be effective is that these resources normally get amortized across all of a language's users. When you have a small audience, you can't justify building all these tools and docs, but the users still need them, and their productivity suffers in the absence.

If your audience is too small, I think it often just doesn't make sense in terms of cost to make a language for them. You're usually better off using an existing language that already has those resources you can leverage.

My first job out of university, I ended up doing one of those by mistake. My second job after university, I ended up doing another three of those, but at least we confronted the thing head on and deliberately, knowing that we were aiming to create them. Trust me, it's better to do it deliberately than by mistake.

Yes. Another way to put it is: engineering should be focused on what's needed more than what the Right People On The Internet Say.

Additionally, the dirty secret of all those things is that they make orders of magnitude more sense to dig into and learn than the innumerable arcane errors that require Googling to fix (obscure Webpack errors, CSS hackery, etc.). Your learning accumulates over time.

In situations of mutual dependence between software components in a big project, implementing a quick "scaffolding" solution for one of them can get development moving on all components independently. A long time ago, I was working on a new incompatible instruction set architecture at a specialized computer systems vendor, and had written a simulator and assembler for it. The O/S group was ready to start porting our UNIX variant, but there was no C compiler. So I wrote one; ANS C89 is a pretty small language. That got everybody unstuck, and the O/S guys could do their port while the compiler group could move at their own pace (and had a partial O/S on which to run compiler test suites). I guess my point is that there can be times when quick "scaffolding" components are reasonable.

Don't forget the summit of the Mountain of Shit:

- a billing system

There's a reason why there are b2b companies that specialize in billing. When people branch out it's often billing and X, not X and billing. And even then sometimes they pull back. ADP bought a bunch of companies outside their wheelhouse and then spun them back out.

I mean I actually really want to build many of those things (the first 3 anyways), and I have. I just don't generally show the results to people or make them use them.

I can't vouch for the quality of what I've built, but taking time to attempt to build those kinds of things has always paid off in terms of overall betterment as a programmer.

Then I can go back to a day job of mundanely moving protobufs from one place to another.

Lol, I’ve built all those things.

I think the main thing is having respect for the mountain of engineering that went into existing solutions and not rolling your own unless you really really have to.

I think an embedded DSL with post-compilation, constraint-solver-based analyzers could take the place of restricted mini-languages while taking advantage of the existing engineering and tooling available for modern languages.

So like a restricted subset? That's a really neat idea; you should develop it into a paper or something.

This restricted subset concept has the huge advantage that you can slip out of the harness in a pinch. So you can reuse a relatively developed Float-less Python program in Full Python and later walk back removing floats.
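One way to sketch such a "Float-less Python" check, using the standard `ast` module (the rule and function name are invented for illustration):

```python
# Sketch: enforcing a "no floats" subset of Python with the ast module.
# The program stays valid full Python, so you can "slip out of the
# harness" later by simply dropping the check.
import ast

def check_no_floats(source):
    """Return a list of (line, message) violations for float literals."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, float):
            problems.append((node.lineno, f"float literal {node.value!r}"))
    return problems

print(check_no_floats("total = 1 + 2"))   # → []
print(check_no_floats("ratio = 1.5 * n"))  # → [(1, 'float literal 1.5')]
```

Because the checker is a separate pass over an unmodified host language, "walking back" the restriction later is a tooling change, not a rewrite.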

That's really clever. It essentially obviates the whole configuration DSL tango; it also gets people started where the rubber hits the road -- not with syntax and parsing, but with why the DSL exists and what it's supposed to do and not do.

You can restrict all sort of things, including of course a restricted subset of the language. Such solutions already have headcount at MS which is about as good as it gets for bringing a new thing like this to masses.

When it is released and as it matures we can extend it to do a bunch of other cool things like configuration DSL which would really help clean up Linux. Maybe even build scripts.

You could do stuff like this before with LISP, or Scala compiler plugins. But the tooling and accessibility will make a big difference.

Is there a name for this type of technique I can google?

For C# there is Roslyn Analyzers which was a foot in the door for a lot of things. F# has LiveCheck which AFAIK is a lot more powerful. I don't think there is much public info out there; Don Syme demos the use of it to add shape checking for Deep Learning models with full IDE interaction, but any arbitrary constraints can be added and constraint solvers are probably the easiest way to manage and compose those constraints. https://youtu.be/0DNWLorVIh4?t=3410

I had a very different reading of the article, colored by my own bad experience.

Sometimes you start off with this cool idea/hack to solve a problem that could have been solved some other way. But your cool hack solves the problem and more! Then your fellow team members all want it to be even more powerful, and your cool hack slowly grows into a monstrosity.

In these situations, often your manager did not particularly want you to do this, but he let you do what you want. He does want you to be productive, and this doesn't count for that much. But since the team is now dependent on you, he does expect you to maintain it. You now end up with two jobs: The one your manager hired you for, and maintaining this thing.

I would love to write my own DSL to solve a problem at work, and this post read to me as a cautionary tale on pursuing that goal - unless management is aligned with your vision. If they're not, solve it the boring (and perhaps nonideal) way.

I'm hit by this at the moment. I need to build something so that users can revert to a previous version of a record, but it's constrained to sqlite. I'm (re) building versioned document stores or temporal tables in various POCs now and I'm scared.

I would add encryption/authentication protocols to the list.

Make sure you don't roll your own crypto on your way, though.

Unless it somehow becomes an absolute necessity... or it's for your dissertation.

You wanna get crazy?

When you hit:

- an ERP

You start to build all the others!

(This is where I'm now!)

I work on a commercial ERP and this is 100% accurate. It features:

* A proprietary language

* A compiler for said proprietary language

* A way to define tables and views and turn these definitions into real tables and views that live in an SQL database. There is also a custom SQL dialect that gets translated to real SQL on the fly. I'm counting this as having its own database even though it offloads the actual database work to a real database.

* One of the products in this ERP suite includes a CMS built on all of the above.
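As a sketch of what such a table-definition layer looks like at its very core — the schema format here is invented, and a real ERP's version is vastly more involved (views, dialects, migrations):

```python
# Sketch: turning a declarative table definition into SQL DDL, the
# kind of layer such an ERP builds in-house. Schema format is made up.

def to_create_table(defn):
    cols = ",\n  ".join(
        f"{name} {sqltype}" for name, sqltype in defn["columns"].items()
    )
    return f"CREATE TABLE {defn['name']} (\n  {cols}\n);"

orders = {"name": "orders",
          "columns": {"id": "INTEGER PRIMARY KEY",
                      "customer": "TEXT NOT NULL",
                      "total": "NUMERIC"}}
print(to_create_table(orders))
```

From here it's a short slide into generating views, translating a custom SQL dialect, and eventually owning a whole database abstraction — which is the point.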

Look at the bright side. At least Hell is always warm, and plenty of interesting people to talk to.

Implementing a small domain-specific language is fairly easy. Why avoid it?

Because if it succeeds, it needs to be supported long term. If you're doing that in-house, chances are slim that the original designers will stay 10-20 years at the same company.

Then what you have on your hands is a legacy monolith, neglected over its life, which might double as the core of your stack, and it's awful to move away from down the line.

Been there, don't recommend it.

"Oh, my god! If it succeeds you'll have to support it forever!"

Unfortunately, that's rather true of all software.

Indeed, but maintaining software that is a core part of your business is a very different game than supporting (complex) internal tooling. They don't bring revenue, are a time/money sink and not many companies can spare the resources to do it properly.

Anecdotal experience: a company I used to work for had its entire core stack developed in-house: programming language (yeah), compiler, database, etc. No one worked on it apart from the founder and a couple of other engineers who were no longer there.

A few years down the line a big client asked if we were able to integrate our software with their PLC and this had to be developed from scratch, as it wasn't in the scope of the programming language. The quote sent to them was in the six figures and up to 2 years for it to be ready. Every major programming language has a library that does this. Eventually the company lost the client.

In my experience it's fine if you keep its scope limited. The scripting language for an MMO I worked on for example is simple and effective.

Writing your own scripting language has really not been something game developers should be doing for the past decade or so. There are so many good off-the-shelf options, ranging from Lua to C#, plus a litany of obscure (but good!) ones.

For performance-critical code, even a tiny DSL is not a solution I like unless you're transpiling to C++ or something.

Hooking in Lua or similar can introduce a lot of unwanted complexity, be harder for an untrained implementation team member to understand, and be less of an exact fit for the purpose. There are other cases where it makes sense, though.

Mind sharing which one?

Legends of Equestria

Because the complexity of the problem tends to scale to your willingness to address said complexity. In other words, it likely won't stay small for long if it gets any users, and now you're the maintainer for a tool used by others.

This applies to all software in general. You'll get eaten alive by your success if you don't manage expectations and tightly scope things.

At least with a DSL, you can often define a small core language and de-sugar down to that.
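A small sketch of that desugaring idea: the surface form `inc` is rewritten into the core's `add` before evaluation, so the evaluator only ever sees the small core (node shapes invented for illustration).

```python
# Sketch of "small core + sugar": the surface language has `inc`,
# but the core only knows `lit` and `add`; a desugar pass rewrites
# one into the other before anything else runs. Shapes are made up.

def desugar(expr):
    if isinstance(expr, tuple) and expr[0] == "inc":    # sugar: ("inc", e)
        return ("add", desugar(expr[1]), ("lit", 1))
    if isinstance(expr, tuple):
        return tuple(desugar(e) if isinstance(e, tuple) else e for e in expr)
    return expr

def ev(expr):
    """Evaluate the core language only: ("lit", n) and ("add", a, b)."""
    if expr[0] == "lit":
        return expr[1]
    if expr[0] == "add":
        return ev(expr[1]) + ev(expr[2])
    raise ValueError(f"not core syntax: {expr!r}")

print(ev(desugar(("inc", ("inc", ("lit", 40))))))  # → 42
```

Every downstream pass (analysis, optimization, codegen) only has to handle the core, which keeps the maintenance surface from growing with each new convenience form.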

That principle is not specific to a DSL, so that hardly seems like a reason to avoid a DSL.

I'd say it's more of a risk with languages and libraries (things that get deeply embedded into other people's stuff), but I get what you're saying.

These aren't "big scary things", they are "technical debt you didn't need to take on".

...and the alternate course of action is clear: use the tools you already have.

Any tutorials on building the last one? (Beyond basic Rails style CRUD) How do you clone SAP in a weekend?

It heavily depends on which specific incarnation of SAP you mean. If you want a SAP replacement for your specific use case, Rails CRUD is the only technological component you need. The much more important part of the recipe is knowledge of the domain model.

That's also where SAP's moat lies. Not with its technological underpinning, but with all the different industries and their processes which they've transferred into lines of code (and the ability to extend them with further LOC with the help of a "SAP consultant").

- a filesystem

> …But sometimes you actually _do_ want to build that…

That’s not true for everyone. I’ve been a dev for 20 years and have never done any of those, and I have no desire to at all. That is ok! Programming isn’t some kind of progression to run from “n00b hello world” to “1337 compiler hax0r”, it’s a tool to solve problems with. Many of us will never have to solve these problems or have no interest in these problems.

So, if you’re like me, don’t worry because you’ve never made a brainfuck interpreter or a database engine. It doesn’t mean you’re a bad dev. You can have a successful career without building these! And if you need them some time, learn them at that time… you may not ever need to.

I don't think the parent was suggesting that the individual developer has a personal desire to build a compiler for the fun of it.

It's much wiser and far easier (assuming your goal is to have a successful career and run a profitable business) if you Keep It Simple, Stupid! Building a compiler/database/CMS/ERP instead of using an off-the-shelf one is, for well-selected business problems, a bad decision, increasing complexity, adding risk, and allowing scope creep.

But sometimes, in spite of your efforts to adhere to the KISS principle, the real world is hairy and some complicated new problems need complicated innovative solutions, and instead of turning a crank to glue together some off-the-shelf libraries, you'll find yourself, in my particular example, way off in the deep end getting PCBs manufactured or, in the article, writing your own compiler. It's not bad, it just means you've got a niche problem to solve! Or that sales did a terrible job of understanding what they were offering, and something basically equivalent to the customer is just an off-the-shelf Wordpress plugin, but instead you're stuck writing a custom CMS.

For some (many?) problems, there is a point beyond which it is more difficult and painful not to build a compiler, interpreter, or other "I have no desire to" piece of software.

The trick is to recognize that point before you have a giant, unmanageable ball of ....

I think I might have used the expression "do want" too vaguely here. Same as you, I have very little desire to write any of those systems, and still try to avoid them as much as possible (as those endeavors usually require more time, energy, etc. and come with higher risks).

Maybe it's better put as "But sometimes all your product requirements strongly suggest that you have to build X and there isn't really a way to avoid it" (and what's available off-the-shelf doesn't fit)?

Thanks for the clarification (even for some rando on the internet), that makes a lot more sense... I think I'll implement a rule for myself where I won't comment on HN posts after first waking up anymore. Agreed that could totally happen!

'Put self into read only mode until after the first pint of coffee' is a policy that has significantly reduced the number of times I look back at a comment I've made somewhere and judge myself for its quality.

I mean, even fully caffeinated, I still manage to beclown myself on a semi-regular basis, but the 'semi' part is still a net win.

I think you should almost always build a compiler when presented with a parse-then-execute problem. It is so much faster to create a small machine that is fed data to specify the program, and a test harness that also drives that machine, than it is to hand-code every little case repetitively.
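A hedged Python illustration of that idea (the ops and table are invented): the "machine" interprets a data table, so each new case is a row of data rather than more hand-written branching, and a test harness can drive the same machine with tables of inputs and expected outputs.

```python
# Hypothetical sketch: a tiny "machine" driven by a data table, so each
# new case is a row of data rather than hand-coded logic.

OPS = {
    "strip":  str.strip,
    "lower":  str.lower,
    "prefix": lambda s, p: p + s,
}

def run(program, value):
    """Apply each (op, *args) step in the program table to the value."""
    for op, *args in program:
        value = OPS[op](value, *args)
    return value

# The "program" is just data; a test harness can iterate over tables of
# (program, input, expected) triples instead of duplicating logic.
normalize = [("strip",), ("lower",), ("prefix", "user:")]
```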

Or maybe I like building little compilers.

Parser yes, compiler no.

It occurred to me at several points in my career that the tools I was using were deeply hamstrung by mixing interpretation of complex data with the use of that data, making it difficult for anyone to effectively add new behavior in the system. Most vividly, I ran into this with first Trac (which supports plugins, but at the time each one had to reimplement scanning of wiki words and all handled a different 80% of all scenarios) and Angular 1.0.

A variant on this, I don't know that I've ever heard a coined phrase for but I just call Plan and Execute. In the spirit of dynamic programming, you visit a graph of data once in order to enumerate all of the tasks you will need to do, deduplicate the tasks, and then fire them off. This works much better than caching because under load a single unit of work can potentially self-evict. It's also a lot easier to test. It does however delay the start of work, so either you need to make that up in reduced runtime (by not doing things twice) or have a mixed workload. You may also have to throttle outbound requests that had previously been staggered by alternating CPU and IO bound phases.
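A rough Python sketch of that plan-and-execute shape (the graph and task names are invented): traverse the data once to enumerate the tasks, deduplicate them, then execute as a separate phase.

```python
# Hedged sketch: walk a dependency graph once to *plan* the work,
# dedupe the tasks, then *execute* them in a separate phase.

def plan(graph, roots):
    """Enumerate every task reachable from the roots, each exactly once."""
    seen, order = set(), []
    stack = list(roots)
    while stack:
        task = stack.pop()
        if task in seen:
            continue  # deduplication: a shared dependency is planned once
        seen.add(task)
        order.append(task)
        stack.extend(graph.get(task, ()))
    return order

def execute(tasks, do):
    """The separate execution phase; `do` performs one unit of work."""
    return [do(t) for t in tasks]

# "fetch_a" is needed by both roots but appears in the plan only once.
graph = {"report": ["fetch_a", "fetch_b"], "audit": ["fetch_a"]}
tasks = plan(graph, ["report", "audit"])
```

Because planning and execution are separate, the plan itself is easy to assert on in tests, which is part of the testability benefit described above.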

I have one bit of production code where this is working well, and another that desperately needs it but is challenging to reach feature parity due to the way it is invoked and limitations of a library that it uses. I'm going to have to change one or both.

When you describe it that way it reminds me of SAX [1] – I always hated SAX, but eventually realized it was kind of a tokenizer that left it up to the developer to figure out how to turn that into a compiler, though in this case compiling XML input into some internal data structure or action.

[1] https://en.wikipedia.org/wiki/Simple_API_for_XML

Weirdly, but sometimes ideas churn around HN, I just replied to another thread about this. See my reply on this thread https://news.ycombinator.com/item?id=29917060

At BitFlash, one of the things we had to build was a SAX parser for the SVG DOM. I used the DSL of the W3C spec to compile the SAX parser.

One of the more strange things I did was in some contexts (think old school Blackberry) we had a server to pre-parse the SVG so we "knew" it was clean (I'm still sceptical 20 years later this is ever a good strategy, but take it as a given). Because we knew the SVG was clean, there was a faster way to parse the XML than reading the tokens.

I used my magic Perl script transformer to compute the lowest entropy decision tree to identify a token with the fewest comparisons, which was surprisingly way more efficient than a trie.

The real annoyance with SAX is its push model. Similar pull APIs (e.g. XmlReader in .NET: https://docs.microsoft.com/en-us/dotnet/api/system.xml.xmlre...) let you parse XML in much the same way as a recursive-descent parser parses text, with the reader working much like a lexer.
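Python's standard library offers a comparable pull-style API; a small sketch (the element names are invented) showing the caller asking for the next event, much like a recursive-descent parser pulling tokens from a lexer:

```python
# Hedged sketch: pull-style XML parsing, where the caller requests the
# next event instead of receiving push callbacks as in SAX.
from xml.etree.ElementTree import XMLPullParser

parser = XMLPullParser(events=("start", "end"))
parser.feed("<items><item>a</item><item>b</item></items>")
parser.close()

texts = []
for event, elem in parser.read_events():
    if event == "end" and elem.tag == "item":
        texts.append(elem.text)  # element is complete at its "end" event
```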

I completely agree with the sentiment but also feel like it is just my bias because I enjoy writing compilers.

This reminds me of a great post about the typical software evolution going from hard coded values, through configuration and rules engines to a DSL, only to be back where you started.

I have seen it happen on multiple projects, it’s hard to spot it at the time as all the changes seem sensible but hindsight is a wonderful thing


This line really struck me: “They will have to change at some point, and you don’t want to recompile and redeploy your application just to change the VAT tax rate.”

Continuous deployment has changed this for me. I'm still not going to leave tunable constants in the middle of random if statements. But nowadays instead of trying to avoid recompiling and redeploying, I'd rather invest that time into making that very easy.

Yeah, CD is a big game changer here. Managing config separately from code is a big pain (as that piece points out in colorful detail), and as the countervailing painfulness of a deploy drops toward zero, it becomes better and better to just hardcode some consts. "Business rules" just become "code you write". Easier to change, test, and reason about. At my last job we moved to CD and it was fun to see everyone (haltingly, sometimes unwillingly) change their approach to this.

There's still a place for configs and even DSLs (multiple deployments, on-prem software, biz logic written by non-programmers, etc) but CD eliminates the huge class of "this is too expensive to deploy changes to" cases.

Pulling things out into a configuration file (or environment variable) is valuable when either (a) the value needs to be set per-environment and you have a variable number of environments or (b) the value needs to be changeable by somebody other than the dev team.

Also, it's worth considering for constants whether what you actually want is a 'resource file' (which could easily just be a file containing all your constants written in your programming language; no need to use a different syntax if that doesn't gain you anything). Learning and internalising (and also learning how to explain to other developers) that a resource file is -not- the same as a configuration file has been a significant quality-of-life improvement for me.

Did you happen to see the recent discussion of the maintainer-hoarks-up-a-dependency problem? How does your CD system handle testing?

IMHO an embedded DSL (eDSL) helps to close the loop somewhat. At least you're writing your "configuration code" in a real programming language, while still keeping the configuration separate from the main application. It addresses most of the issues the article points out regarding custom DSLs: no/limited debug capability, lack of tooling and IDE support, unable to write the same style of code used in the application. Of course, it helps if your main language has good support for eDSLs, since not all do.
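As a hedged illustration (Python, with invented names): even basic operator overloading gives you an eDSL whose "programs" are inspectable host-language values, with the host's debugger and tooling available for free.

```python
# Hypothetical sketch: a minimal embedded DSL via operator overloading.
# "Configuration code" builds an inspectable rule object while remaining
# ordinary, debuggable Python.

class Field:
    def __init__(self, name):
        self.name = name
    def __gt__(self, other):
        # Comparing a Field builds a Rule instead of returning a bool.
        return Rule(self.name, ">", other)

class Rule:
    def __init__(self, field, op, value):
        self.field, self.op, self.value = field, op, value
    def matches(self, record):
        if self.op == ">":
            return record[self.field] > self.value

# The "configuration" is host-language code with full IDE support:
price = Field("price")
rule = price > 100
```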

I'm always surprised by the lack of "easy" parser/compiler toolkits for more or less complete DSLs... It looks like there's only

- either full blown programming language for IT guys

- "natural language" AI toolkits (not really usable yet for just simple automation by users)

- graphical language like State Machine or "no code", missing loops and requiring mouse and boxes

However, most of the time, users simply need some kind of BASIC (like Visual Basic, used a lot in "shadow IT" by users)... but there are really few libraries providing that kind of functionality

Racket is often heralded as the "programming language programming language" and comes with tons of features to build full blown languages: https://racket-lang.org/

The most complete language/ecosystem that really showcases Racket's capabilities is Rosette IMO: https://github.com/emina/rosette/

Also worth noting that the language that runs this very website was implemented in Racket.

I think it was implemented in PLT Scheme before it stopped being a Scheme implementation and became Racket.

Hacker News runs on a language that runs on Racket? Which language is that?

Arc Lisp[0]. This forum was conceived as an MVP for the language.

Anarki, a divergent open source fork, can be found here[1].

[0] http://arclanguage.org/

[1] https://github.com/arclanguage/anarki

I thought that was Arc?

It is. Arc itself is written in Racket.

Ok, so pretty good proof that racket is a working programming language programming language then :-)

Most people who want this end up just selecting a regular programming language with a powerful-enough macro system + flexible-enough compile-time parser; and then using said language to define a "more or less complete DSL", by redefining all the primitive syntax features of the language to instead be primitive syntax features of the DSL.

For example, the "numerical definition" DSL within Elixir's Nx library (https://github.com/elixir-nx/nx/tree/main/nx#numerical-defin...).

Users using this "full" DSL are still using the host language — they're invoking its compiler/runtime and so forth — but they don't necessarily need to care about that. They can still technically access the power of a full-blown programming language, "tucked in" around the edges of the DSL — but you as the DSL's creator, don't have to document that power as being available. As long as users of the DSL can do everything they need to do without any need for "breaking out" of the DSL's little world, they don't need to be aware that they could e.g. make fully-namespaced calls to regular stdlib functions of your runtime.

The intermediate solution is a programming language and/or runtime specifically designed as an extension system. I've had positive experiences embedding Lua. GNU Guile was specifically designed for this too, if you think your users can deal with the parens.

I think there's a tipping point. You start with a configuration or markup language. Then you add and add and add and next thing you know, you've built a very simple scripting language.

It's kind of like those people who cobble together a Turing-complete machine out of HTML+CSS or a Magic: the Gathering game. You can kind of stumble into completeness.

> However, most of the time, user simply need some kind of BASIC

"most of the time" is THE problem here. Once users approach a barrier (missing language feature) they will complain and push you until you implement that feature. After several such features some developer in basement of your building will say "ok, fine, I'll do it" and now you accidentally have a compiler.

Edit: I know because I've recently made a "almost universal data watcher" which can react to combination of measured values changed in different devices and execute actions. Simple solution, but I fear the day when someone needs to execute other recipes in response to data and creates loops.

That's a problem, but before that what's the good answer to "how can I take advantage of existing libraries to make a quality but simple BASIC-like DSL?" And in particular, where's a well-thought-out BASIC-like template I can customize?

There's Forth. Many would argue it's not a full blown programming language, it's more like a toolkit to make DSLs.

There's also Eric Lippert's blog series on OOP and business rules [1] that culminates [2] in suggesting a DSL with Inform7 as a possible candidate. The comments on the final post are great as people rail against the use of a DSL.

[1] https://ericlippert.com/2015/04/27/wizards-and-warriors-part...

[2] https://ericlippert.com/2015/05/11/wizards-and-warriors-part...

I started working with C# about 6 months ago. I'd read that post before, but I'm only just now understanding it. I think I really missed the boat when it comes to things like OO programming patterns. I'm still kinda fuzzy on the visitor pattern he mentioned. I think I get it?

The abstraction he offered, of having a tree and then visiting every node in that tree, made the most sense.

> One of the most common usages of this pattern is for the scenario where you’ve got a complex tree where tree nodes can be of many types, and you wish to traverse the tree — that is, visit every node in the tree — taking some actions or performing some computations as you go. The “double dispatch” part of it is that the actions you take as you visit nodes in the tree can depend on the runtime types of both the nodes in the tree, and the code performing the action.

That sounds like a parser. It accepts an AST and evaluates productions on its way through. Saying "actions you take ... depend on the runtime types of ... nodes ... and the code performing the action" sounds similar to sorting out how to mash tree-traversal and productions together.

In my compilers class we first wrote a code linter for our "Wombo" language (a simplified Java). Then we wrote something that built an AST. The core tree traversal code was the same, but the "productions" paired with nodes were different.
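For what it's worth, a minimal Python sketch of the visitor pattern as described in the quote (the class names are invented): the first dispatch picks the node type via `accept`, and the second picks the visitor type, so the same tree traversal supports different "productions".

```python
# Hedged sketch of the visitor pattern: the action taken at each node
# depends on both the node's runtime type and the visitor's runtime type.

class Num:
    def __init__(self, value):
        self.value = value
    def accept(self, visitor):
        return visitor.visit_num(self)      # first dispatch: node type

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def accept(self, visitor):
        return visitor.visit_add(self)

class Evaluator:                            # second dispatch: visitor type
    def visit_num(self, node):
        return node.value
    def visit_add(self, node):
        return node.left.accept(self) + node.right.accept(self)

class Printer:                              # same tree, different "production"
    def visit_num(self, node):
        return str(node.value)
    def visit_add(self, node):
        return f"({node.left.accept(self)} + {node.right.accept(self)})"

tree = Add(Num(1), Add(Num(2), Num(3)))
```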

It’s great when you don’t need to have a huge codebase. IME Forth gets out of control quickly from a maintainability standpoint as you go big, but it is excellent at small DSL-type tasks. The maintainability issues with Forth might actually be solvable if the right IDE paradigm could be found.

I don't think so. Let me try to explain: regular languages are only viable as long as someone is willing to speak them and learn them. Latin being the obvious exception, but that's mostly because of medical science and biology, two very large branches of the scientific tree.

A DSL is a language with only a handful of users, each of whom is usually in the normal path of employment (obvious exception: Chuck Moore). As soon as someone in a team that relies on a DSL leaves, the loss of knowledge is immediate. Unlike any other language that you may have picked up in some other organization, the DSL has instantly lost a sizeable fraction of those versed in it. It doesn't take many losses like that to kill such a project.

Another problem is that even if the original authors hang around, the chances that they will stay fluent in their own DSL diminishes with increasing intervals between work on a particular project. If the code would be written in a 'normal' language then they would stay fluent.

So DSLs are not a good match for how we normally think about software, even though there is a 1:1 correspondence between the words in your typical DSL (say a FORTH word) and a function call. Functions tend to be written in terms of other functions, just like FORTH words are written in terms of other words, and yet, the standard library is never far away whereas with FORTH words it could easily be 30 words down before you hit the actual language.

I'd love to see something like a DSL based language but with longer term prospects, for a long time I hoped that Smalltalk would be that language but I have mostly given up on that.

I think it depends what we mean by out of control.

Forth does less compile-time checking than any other language I could name. It doesn't have a type system of any sort, not even a dynamic one. This makes it very flexible, but ties its hands when it comes to easy/automatic correctness-checking of DSLs. You'd need to implement your own checker.

There are also valid criticisms of Forth's stack-oriented nature regarding readability and maintainability:

Given a word defined in terms of three other words, such as the made-up ": process parse validate emit ;", you have no idea at a glance of the stack effects of those three words. This simply isn't an issue in conventional languages.

(For context, I rather like Forth as an interesting and very unusual kind of language, but I don't think it's the future of mainstream software development.)

> the standard library is never far away whereas with FORTH words it could easily be 30 words down before you hit the actual language

Forth famously encourages writing short words, but 30 still sounds extremely deep.

The way I see it, Forth is the architecture-independent assembly that is optimized for ease of bootstrapping on new hardware at the cost of efficiency.

I agree that assembly is a good comparison.

With a threaded-code interpreter, yes, Forth's overhead is considerable, but with a native-code Forth compiler, performance is generally only slightly worse than C.

This is presumably because modern optimising C compilers are incredibly sophisticated, whereas modern Forth engines tend to be essentially one-man projects, even the commercial ones.

> performance is generally only slightly worse than C

I've built a couple of projects both in C and Forth, the latter usually to prototype and once I had a good understanding of the problem space the final version in C. Typically the C version outperformed the Forth version by 3:1 or better, and I would not have known how to bridge that gap.

Well written C is quite performant, and Forth's overhead (which is relatively small, but it is there) translates into Forth's threaded interpreter executing a JMP instruction at least once per Forth word, and that alone is probably enough to explain a good chunk of that gap. FWIW, my experience with Forth is limited to one implementation and a whole bunch of applications (after being introduced to the language by a veritable Forth wizard in 1985 or so. Hi, LS). Even the code of experienced Forth programmers was no match for my then six years of experience with C. Nowadays, with far larger caches, Forth might do better; I haven't really worked with it for years.

With respect you've ignored the point I was making. There exist several Forth engines with native code-compilation, for instance VFX Forth, SwiftForth, and iForth.

> Typically the C version outperformed the Forth version by 3:1 or better, and I would not have known how to bridge that gap.

With a threaded-code Forth interpreter I'd expect the C version to outperform it by something closer to 5:1, so 3:1 doesn't sound too bad. The only way you can close the gap is with good quality native-code compilation.

> Nowadays with far larger caches Forth might do better, I haven't really worked with it for years.

It's interesting how advances in CPU architecture change the relative performance of the different threading strategies. This has been nicely studied by the gforth folks. [0][1] Threaded-code interpreters still easily lose to optimising native-code compilers though, [2] and I expect they always will.

More on how Forth collides with low-level CPU matters: [3][4][5]

[0] https://www.complang.tuwien.ac.at/forth/threading/

[1] https://www.complang.tuwien.ac.at/forth/threaded-code.html

[2] https://github.com/ForthHub/discussion/issues/88#issuecommen...

[3] The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures - https://citeseerx.ist.psu.edu/viewdoc/download?doi=

[4] Branch Prediction and the Performance of Interpreters - Don't Trust Folklore https://hal.inria.fr/hal-01100647/document

[5] Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters - https://www.scss.tcd.ie/David.Gregg/papers/toplas05.pdf

There's Prolog and its Definite Clause Grammars (DCG) formalism. Here's a DCG for a tiny subset of natural English (copied from wikipedia [1]):

  sentence --> noun_phrase, verb_phrase.
  noun_phrase --> det, noun.
  verb_phrase --> verb, noun_phrase.
  det --> [the].
  det --> [a].
  noun --> [cat].
  noun --> [bat].
  verb --> [eats].
The syntax is just like BNF. "-->" is "::=", terms in []'s are terminals and the rest are nonterminals. Nonterminals can have arguments and there's special syntax to include entire Prolog programs in the body of rules.

I discovered DCGs at the end of my CS degree when I was working on my graduation project, a Magic: the Gathering expert system written in Prolog. I needed a way to script cards for my rules engine, to translate expressions in the M:tG ability text language to function calls in the engine. I really wanted to write a parser using the standard parser generator tools, but I thought it would be too much work for a graduate project. Then I found out about DCGs. With DCGs I could write grammar rules like this:

  tap(X,Y) --> (['Tap'];[tap]), target(X,Y), (condition ; []), {permanent_Type(Y)}.
  tap(X,Y) --> (['Tap'];[tap]), all(X,Y), (condition ; []).
And that would not only parse, but also _generate_ ability text like "Tap target Kavu" or "Tap all Goblins target player controls" etc. So as a side effect of adding a bunch of abilities to my engine, I also had a generator for new M:tG cards.

[Edit: to be clear, the grammar rules above are already, by themselves, a parser for abilities that start with "tap". The Prolog interpreter executes those grammar rules as a program and instantiates their variables in the process. To integrate them with the engine all I had to do was write engine functions that called the nonterminals in the grammar and read the values of variables like X and Y in tap(X,Y).]

Admittedly, my grammar was limited because I didn't have the time to hand-code a grammar for the entirety of M:tG ability text, as it was at the time (I graduated 2011). In any case my rules engine was also limited (I had no planeswalkers and static effects were only partially implemented, if I remember correctly; because the rules for static effects are a bit arcane). But I managed to cover a good subset of the rules and cards, so it was possible to play a basic game with creatures with activated and triggered abilities and with non-creature spells [2].

The important detail is that DCGs are syntactic sugar for ordinary Prolog, so I could code my parser in the same language as my engine and my player. My biggest problem was really to draw an imaginary line between the parser and the other two components, because the easiest thing would be to allow them to know about each other and be infinitely entangled in a big ball of mess.

Anyway, yeah, there's something out there that is perfectly suited for creating DSLs, if you have the patience [3]. The tradeoff is that you have to learn Prolog. I suspect this is a cost too high for many who just want an easy-to-use tool.

Btw, you can't easily run my degree project unfortunately. Like an idiot I wrote it for a proprietary Prolog system and although I tried to port it to the free and open-source SWI-Prolog, it's been a nightmare of incompatibilities. The ported project is here but it breaks unexpectedly all the time:



[1] https://en.wikipedia.org/wiki/Definite_clause_grammar#Exampl...

[2] I also had an alpha-beta minimax AI player. Not that good though. I wanted to program a player with background knowledge of M:tG but I didn't have the time.

[3] Or you can learn them from examples... See Experiment 3 on page 19 of the pdf, here:



Do you have any recommendations for good articles about Prolog? I really like the idea. I'm particularly interested in solving assignment problems using constraint logic programming.

To be honest I don't know any short articles that do a good job of describing Prolog.

For a practically-minded programmer, I'd recommend this longer tutorial that guides you through the creation of an old-school text-based adventure game:


As far as I remember, the tutorial should run on most Prologs without (significant?) modification.

For a more in-depth, high-level, more computer-sciency view try Marcus Triska's pages:


Marcus Triska is also the author of several constraint logic programming libraries, for example:


Then, there's a number of textbooks. The classics are Clocksin and Mellish, and Bratko:

Programming in Prolog (Fifth Edition):


Prolog Programming for AI


I personally really enjoyed this book by George Luger:

AI Algorithms, Data Structures and Idioms in Prolog, Lisp and Java


For the theory behind DCGs, there's Pereira and Shieber:

Prolog and Natural Language Analysis


And for logic programming theory in general, the seminal source is J. W. Lloyd:

Foundations of Logic Programming


Unfortunately, I can't recommend any newer textbooks.

JetBrains has MPS (Meta Programming System) [1], which I believe is for knocking up DSLs. At a lower level, I've used ANTLR [2] a bunch of times, and its StringTemplate even more often.

[1] https://www.jetbrains.com/mps/

[2] https://www.antlr.org

Sounds like a gap Tcl could fill. The syntax is simple and consistent, a developer could create packages that a non-developer end user can use as commands, and it could be wrapped with GUI code that runs on Windows, Mac, or Linux.

Linked article appears to be describing the exact original motivation for writing Tcl:

"My students and I had written several interactive tools for IC design, such as Magic and Crystal. Each tool needed to have a command language (in those days people tended to invoke tools by typing commands; graphical user interfaces weren't yet in widespread use). However, our primary interest was in the tools, not their command languages. Thus we didn't invest much effort in the command languages and the languages ended up being weak and quirky. Furthermore, the language for one tool couldn't be carried over to the next, so each tool ended up with a different bad command language. After a while this became tiresome and embarrassing."


Tcl is good, but non-developers could really benefit from more write-time checks. Something that can tell them "there's no command called moveRight, the nearest match is move-right" before their script is executed.
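A hedged sketch of such a write-time check in Python, using the stdlib's difflib (the command names are invented):

```python
# Hedged sketch: a write-time check that flags unknown commands and
# suggests the closest known one before the script is ever executed.
from difflib import get_close_matches

KNOWN_COMMANDS = {"move-right", "move-left", "rotate"}

def check_command(name):
    """Return None if the command is known, else a helpful error message."""
    if name in KNOWN_COMMANDS:
        return None
    suggestions = get_close_matches(name, KNOWN_COMMANDS, n=1)
    hint = f", the nearest match is {suggestions[0]}" if suggestions else ""
    return f"there's no command called {name}{hint}"
```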

I agree. I would really like the same. Komodo IDE may help with this, but I haven't tried it. I usually end up in vim due to habit.

Just use flex / bison?

I've always found ANTLR to be more intuitive and maintainable than the flex/bison combination. And the learning curve is gentler.

And the tooling is better. I always found it hard to find working examples with bison.

Looks like these are grossly underappreciated these days. They are wonderful, easy to use tools.

I think "easy to use" might be an overstatement. I spent a fair bit of time trying to get in to the flex/bison workflow and it never seemed to click. I've found it much easier to use PEG parsers, parser combinators, or a hand-written recursive descent parser. Based on their ubiquity and the high reviews, I'm sure they're great tools once the initial hump is passed but that initial hump is quite large.

I think it's because they're less easy to use than writing your own lexer/parser, which takes about a day and is fully introspectable and debuggable. And if you need decent error reporting at the input, it's much easier to thread that through.

Plus almost every language has some kind of PEG or parser combinator library these days.

The easy options:

As others have said, embed something existing. I think this is especially true for programming languages. Even if Lua isn't quite what you want, it's probably going to deliver a lot more value than anything you could bash together on your own. And there are a few other options as well. There are several of these "complete toolkits for DSLs", it's just they do take a bit of work to correctly insert into your code base and then expose enough of what you're doing to make it useful. There isn't and as far as I can see can't be something you just install and it magically exposes everything and it's just awesome and takes no work, because even if you're in something like Python where you can just open an interpreter shell in the current context, thereby technically giving the user the power to do anything they want, you still need to create a safe interface for them, document it, test it, etc. (You may not do a lot of technical work to keep them out of where you haven't tested but you will need to provide them a supported area to get anywhere.)

For building your own: For small tasks, a parser is just a specialized sort of deserializer. Yea verily these 30 years ago, there weren't many generic deserializers available, and the ones that existed, like ASN.1, were completely unsuitable for human use, so "classic" compiler literature spends a lot of time on building your own parser. Now, if you don't have a specialized need, you should use a generic deserializer. I have quite a few JSON-based syntaxes lying around. (They feed "interpreters", but same principle.) Lisp S-expressions, if suitable for your environment, can be adapted. XML could be useful in some specialized circumstances where your input looks more like a document than conventional code. Etc. You should only build a custom parser if it's going to give you more bang for your buck across the whole usage cycle than a generic parser can give you. The reason why I have several things that input JSON is precisely that the amount of usage across all time was just going to be too low to justify doing anything specialized, because it was always a strictly internal tool.

(Hint: Just because your compiler accepts JSON as input doesn't mean that that's what the humans have to directly output or edit. My biggest interpreter took in JSON but I provided abundant function helpers that abstracted that away. It was a thin enough abstraction that it could be penetrated if need be, but it avoided most of the verbosity issues as far as the humans were concerned, and gave them the full power of the programming language to generate loops and such. If you squint, I basically borrowed an existing general-purpose programming language that everyone involved already knew to use as a macro language for my language, basically for free. This means the "core" language that I was implementing didn't have to do anything that the already-existing general-purpose language could do in macros, meaning the layer the humans used offered a lot of power for not much implementation cost on my side.)
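A hedged Python sketch of that "host language as macro layer over JSON" idea (all names invented): thin helper functions build the JSON "program", so loops and abstraction come from the host language for free, and the DSL's implementation only ever sees plain JSON.

```python
# Hypothetical sketch: helper functions in the host language act as a
# free macro layer; the DSL only ever ingests the resulting JSON.
import json

def rule(name, *steps):
    return {"rule": name, "steps": list(steps)}

def set_field(field, value):
    return {"op": "set", "field": field, "value": value}

# Loops and abstraction come from the host language, not the DSL:
program = [rule(f"default-{f}", set_field(f, 0)) for f in ("count", "total")]
doc = json.dumps(program)  # what the interpreter actually consumes
```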

For AST transforms, this is always where your secret sauce is and there's little help to give, beyond a suggestion that this is arguably the canonical case for a functional programming style, even if you are in an otherwise OO language.

For the final output, I would suggest that in modern times the easy answer is: don't bother. You can probably interpret fast enough to solve your problems. If not, there are options like LLVM that help compared to what you used to have to do. But I would definitely consider just interpreting the final AST directly. If you intend to do that, your AST transforms can be built with interpretation in mind.
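For example, a tree-walking interpreter over the final AST can be tiny. This sketch (node shapes invented) dispatches on a tag per node and never generates code at all:

```python
# Minimal tree-walking interpreter: each AST node is a (tag, *children)
# tuple, and eval_node dispatches on the tag. No codegen step at all.
def eval_node(node, env):
    tag = node[0]
    if tag == "num":
        return node[1]
    if tag == "var":
        return env[node[1]]
    if tag == "mul":
        return eval_node(node[1], env) * eval_node(node[2], env)
    if tag == "let":                      # ("let", name, value, body)
        _, name, value, body = node
        return eval_node(body, {**env, name: eval_node(value, env)})
    raise ValueError(f"unknown node: {tag}")

# let x = 6 in x * 7
ast = ("let", "x", ("num", 6), ("mul", ("var", "x"), ("num", 7)))
print(eval_node(ast, {}))  # -> 42
```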

The real value of a compiler, and what distinguishes it from any ol' normal code that ingests something and emits something, is the AST in the middle: the complexity of that AST (probably through significant recursion), the transforms you run on it, and the correspondingly complex behaviors the input can invoke through them. Anything you can do to brush away the incidental details of a compiler, by using existing serializations and, if at all possible, simply skipping the output step, will hugely increase the bang for the buck you get. You might say the goal is "using compiler techniques" rather than "writing a compiler", though not in the sense the article was talking about.

For the specific case mentioned there, grab an existing parser for your language and get the full AST. You can write your transform in terms of what you do understand and just pass through anything you don't. It's slightly harder, but will be fairly effective.
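As an illustration, Python's standard `ast` module supports exactly this style: transform the nodes you care about, and `generic_visit` passes everything else through untouched. (The constant-folding transform here is just for demonstration.)

```python
import ast

class ConstantFolder(ast.NodeTransformer):
    """Fold `<int> + <int>` into a constant; pass every other node through."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # transform children first
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)):
            folded = ast.Constant(node.left.value + node.right.value)
            return ast.copy_location(folded, node)
        return node  # anything we don't understand is left alone

tree = ast.parse("x = 1 + 2 + y")
tree = ast.fix_missing_locations(ConstantFolder().visit(tree))
print(ast.unparse(tree))  # -> x = 3 + y
```

Note how `1 + 2` gets folded while the `+ y` part, which the pass doesn't understand, survives untouched.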

It is possible even today, of course, that you will be backed against a wall and forced to go through the full classical compiler cycle. A programming language, for instance, just can't afford to reuse an existing parser; the syntax is too vital a part of the experience. Even a new Lisp will probably want its own spin on S-expressions enough that it can't just pick up an existing one unchanged. And maybe you've got something that you expect will be big enough that it just requires a new syntax, and you can afford all the support that goes with that. But the vast majority of the time, with the right toolset, you can obtain most of the value of compiler-like techniques without the cost of the full classical compiler technology stack. Even if it were hypothetically possible to build a "real compiler" that had its own custom syntax rather than some slightly dodgy JSON, and emitted LLVM code that resulted in an executable that runs all its tasks in .0003 seconds, the engineering effort for that is so much greater than for the JSON input and interpreter that runs in .003 seconds that in most cases it's not worth it.

>(Hint: Just because your compiler accepts JSON as input doesn't mean that that's what the humans have to directly output or edit.

Ah, the JSON_IR!

What about ANTLR?

Actually, I was thinking about having an "already built" BASIC language... not creating a new one each time.

As a programmer, you would include the library, provide the bindings to your program... and done! In a way, it's a bit like what's possible with java.scripting, but they provide a JavaScript engine and no BASIC for end users.

At my company (JITX) we've fully committed to compiler architectures in our stack, which makes sense since we're developing an embedded DSL for circuit boards.

It's remarkable how many problems are easier when you just accept it's some kind of compiler problem that needs parsing into a tree you can walk with a pass to spit out the required data. For example in audio networks, you can write a buffer allocation and latency compensation solver as a compiler pass over an AST that represents the network topology, collected by walking the network graph objects. It's way easier to write and test than using the same objects as the network itself.
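A rough sketch of what such a pass can look like (graph and latency numbers invented): latency compensation reduces to a longest-path computation over the topology, run as a pass over the collected graph rather than threaded through the live audio objects.

```python
# Hypothetical audio-graph "compiler pass": each node reports its own
# processing latency; the pass computes, per node, the delay needed on
# each input so all signal paths arrive time-aligned (longest path in a DAG).
from functools import lru_cache

graph = {                      # node -> (own_latency_samples, inputs)
    "mic":    (0,  []),
    "reverb": (64, ["mic"]),
    "dry":    (0,  ["mic"]),
    "mix":    (0,  ["reverb", "dry"]),
}

@lru_cache(maxsize=None)
def total_latency(node):
    own, inputs = graph[node]
    return own + max((total_latency(i) for i in inputs), default=0)

def compensation(node):
    """Extra delay to insert on each input edge of `node`."""
    _, inputs = graph[node]
    longest = max(total_latency(i) for i in inputs)
    return {i: longest - total_latency(i) for i in inputs}

print(compensation("mix"))  # -> {'reverb': 0, 'dry': 64}
```

Because it's a pass over plain data, it's trivial to unit-test with synthetic topologies, which is much harder when the computation lives inside the network objects themselves.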

I will say one of the downsides (if not tackled early) is incremental and streaming data through the architecture. It's a lot easier to write a batch parser than an interactive one, which can hurt if you need partial compilation in the future.

Something I've found useful when perpetrating custom languages is to, as early as possible, bring up a REPL that handles multiline expressions sensibly.

This (a) requires me to build stream/incremental parsing (b) gives me a tangible feature out of doing so that makes my life better (c) means from then on even if the production uses of the language are all batch shaped to begin with I'm regularly dogfooding the non-batch code paths.

(because yeah, it -so- does hurt trying to retrofit that later on, so 'finding a way to trick myself into -wanting- to build it in early' was rather a win for me ;)
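For a custom language you'd implement the "incomplete vs. error" distinction in your own parser, but the loop shape is the same as in this Python sketch, where `codeop.compile_command` from the standard library returns None while the buffered input is still incomplete:

```python
import codeop

def repl(lines):
    """Buffer input lines until the parser says the chunk is complete.
    codeop.compile_command returns None while input is still incomplete."""
    namespace = {}                # persistent, like a real REPL session
    buffer = []
    for line in lines:
        buffer.append(line)
        code = codeop.compile_command("\n".join(buffer))
        if code is not None:      # complete statement: run it, reset buffer
            exec(code, namespace)
            buffer = []

repl([
    "def greet(name):",
    "    print(f'hello, {name}')",
    "",                           # blank line ends the block, as in a real REPL
    "greet('world')",
])                                # prints: hello, world
```

The important design point is that "incomplete" is a distinct parser outcome from "syntax error"; bolting that distinction onto a batch-only parser later is the painful part.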

Lucky for us, the language we embed the DSL(1) within has a REPL!

(1) http://lbstanza.org/

The article is “addressed to those who did not want to build a compiler”: what's the point of this article then? It seems to implore to not build compilers, but those who did not want to in the first place, probably actually did not build compilers.

The article does point out numerous challenges with building compilers, and, by extension, with software language engineering, which concerns itself not just with building compilers but with the design of languages, and implementation of non-compiler tools around it. It merely points out the existence, and likely occurrence, of those challenges, and it can be surmised that the author is frustrated by frequent experience in his own daily life.

A bespoke software language -or DSL, if you will- is not always the best solution. It really depends on what you carve out as the domain, how fast that domain changes, what costs and risks are associated with those changes, and with which people you have to implement those changes. No amount of tooling is going to help you out if the domain you chose to recognize as such isn't changing fast enough, isn't costly or risky enough to change (quickly), or the domain stakeholders are anyway not helping out.

But in case you do want to build a DSL, have a look at a language workbench like JetBrains' MPS, or this book I happen to be writing: https://www.manning.com/books/domain-specific-languages-made...

It's not a warning not to build a compiler; it's more of a story about how feature bloat might cause one to inadvertently build a compiler without realizing that's what they were doing. The last line reveals it by telling us that each of those "features" was actually a standard compiler component:

> Done at last, you say to yourself, without having to build a compiler.

> A parser, an intermediate representation, transformation passes, and a code generator. Dear Sir, you have built a compiler.

[OP] Yup, that was the intent. I've seen a lot of academic and industrial hand-wringing about not wanting to build a compiler because it's overkill for the solution, only to have the people come back 6 months later having built a compiler.

I'm not saying everyone should build a compiler for everything–but when you should build a compiler, you really should bite the bullet and build a compiler.

As I interpret it, it is addressed to those who have a problem which needs a compiler, but don't want to build one because it sounds scary and difficult.

This is a great article, and if I were writing path-critical code for the situation the author describes I'd use an AST too, but I still munge me some text and most of the time it's good enough.

The problem with using AST libraries is that they're hard to understand unless you're used to them. So if you have to choose between a couple hundred lines of Python or Ruby and a giant AST, I think it's smarter to go with the Ruby. That way, when something like a new emoji comes out, or what have you, the junior dev can handle the minor bug fix. Your AST is still going to choke on it; no way around it, the world has changed.

That said, there is an art to writing text processors and keeping things sane. The most important bit I've learned is to avoid transformations during extractions even at the cost of performance. At least until a fully exhaustive test suite is up. It makes it easier to test and it's a clearer separation of responsibilities.
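A toy illustration of that separation (format invented): the extraction stage captures fields verbatim, and a separate transformation stage normalizes them, so each can be tested on its own.

```python
import re

# Stage 1: extraction. No cleanup here -- just capture, even if messy.
def extract(line):
    m = re.match(r"(?P<date>\S+)\s+(?P<amount>\S+)", line)
    return m.groupdict() if m else None

# Stage 2: transformation. Normalization happens only after extraction,
# so a failing test immediately tells you which stage broke.
def transform(raw):
    return {
        "date": raw["date"].replace("/", "-"),
        "amount": float(raw["amount"].replace(",", "")),
    }

raw = extract("2024/01/15 1,234.50")
print(raw)             # -> {'date': '2024/01/15', 'amount': '1,234.50'}
print(transform(raw))  # -> {'date': '2024-01-15', 'amount': 1234.5}
```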

I've written a lot of these parsers. One for a custom format for the Nasdaq on a project I was the lead on! I shudder to think of how many bits that thing is processing since it pipes data to and from thousands of data brokers and hedge funds.

Ways you can avoid building a compiler:

1. Lower the scope of the solution. If your solution is designed to address 100% of the users' problems, it's probably too grand of a solution. Start with 80% of the problems. Identify all the problems, rank them by priority and difficulty, and leave the most difficult and least priority problems out.

2. Lower your expectations. Imagine your great idea. Now imagine how you will implement it. Now imagine it is 100x more difficult than you imagine. Woof, that's hard! Strip down the implementation to the essentials needed to solve the immediate problems.

3. Give the user the minimum possible functionality to address their needs.

4. Go find an existing solution that provides these requirements.

If you did it right, you will probably find a solution that you really don't like but already exists and solves the problems you need to solve right now.

And in six months, the users will be hacking around the parts you left missing, probably by emailing Excel spreadsheets around.

If you did it right, you will be off to your next place of employment before the wheels completely come off.

"Select the pistol, and then, select your compiler."



"Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp."

(Same applies to popular languages after the original quote.)

What often goes unsaid is the corollary: "Any Common Lisp implementation people use contains an ad-hoc, informally-specified, bug-ridden optimizing compiler written in C or Fortran."

That makes no sense. Which Common Lisp implementations contain a compiler written in C or Fortran?

I should have been clearer: written partially in C or Fortran.

SBCL, for example, has C bindings in the /runtime subdirectory.

> SBCL, for example, has C bindings in the /runtime subdirectory.

So nothing to do with SBCL's compiler (which is somewhat confusingly named 'Python' and written entirely in Lisp), then.

Well, handwave handwave. Nothing to do with the compiler except that I'd assume if you rip out the /runtime directory you'll get a compiler that can't output working code.

The point is that these domains have a tendency to eventually overlap each other.

The Python compiler can output working code just fine. It takes either a Lisp structure (COMPILE) and generates a code object in memory, or takes a text file (COMPILE-FILE) and generates a FASL file. This has absolutely nothing to do with the runtime, except perhaps for COMPILE-FILE's need to read and write a file which ultimately needs to go through a syscall in the runtime. But the compiler is not even partially written in C. (And considering that Python originated as part of the CMUCL project in the mid-80's era of Lisp machines which didn't even include any C code whatsoever, writing even a part of Python in C would have been a curious design decision back then.)

shameless plug: I have done a part 1 and part 2 presentations on this very topic

and Your Program as a Transpiler: Improving Application Performance by Applying Compiler Design https://www.youtube.com/watch?v=TWfigR9wGsA

Your Program as a Transpiler: Applying Compiler Design to Everyday Programming https://www.youtube.com/watch?v=BUrY6On1SxM

the second in particular is about reasoning in terms of compiler phases when you have to process something that apparently may not immediately look like a programming language.

Seems like the idea of exposing a programming language to the end-user should have better support so that it's more routine to create a restricted subset of something existing for users to put arbitrary code in. Something with both standardized textual interface and standardized structured editor interface would be great. I guess this was always the dream for LISP but it never got there.

The language pattern occurs quite frequently in many domains. Unfortunately, I have so far only seen half-assed interpreters. At best, people did a shallow embedding (interpreter) of their DSL. No one ever did a true compiler.

I think the reason for this can be found in the ideal encoding of the input language: in its most concise formulation, a language is a recursive algebraic datatype. Handling that input requires a recursive algorithm that deals with all corner cases. Engineers I worked with tend not to see their input language as an ADT, though. They focus on some particular use cases, and if there is a specification it is often too large and yet incomplete. But on top of that comes the hesitation to implement a complete algorithm. Nearly every time, some "corner case" is ignored because "nobody uses it", and we end up with an implementation that is factually incorrect and incomplete.

You don't even have to squint to see that BNF description is a sum of products data type. With that, functions on those types write themselves and you're a type-check away from consistency.

On the flip side, if you start out with "ah shoot I have to write a compiler", that can be paralyzing unless you happen to know how to write a compiler. Sometimes it's best to just write code, do it the wrong way, and then learn the compiler stuff on the fly.

It's a hilarious post but I'm scared and curious to ask what this is in response to?

Yes, I agree. It feels like some context is missing here.

[OP] Ah, I've just seen a bunch of people showcase "amazing new tools" that are just bad compilers in disguise. State-of-the-art database query optimizers are good examples of this.

YAGNI is a good principle here. Whenever I've found myself thinking about reaching for a parser library, I was over-complicating or over-generalizing the problem. Write the code you need to solve the problem you actually have.

YAGNI works well as long as you are willing to re-write your code in the rare cases when you do "need it".

It's fine to start without a real parser, and you often will end up not ever needing a real one, but if the complexity of your ad hoc solution starts getting anywhere close to the complexity of a real parser, you are better off re-writing it as a real parser because parsing is an extremely well-studied area of CS while your ad-hoc "not parsing" is not.

I agree completely. Anytime I’m looking at a parser library I just shake my head and close that browser tab. I’m invariably going to want a hand-rolled recursive descent parser two weeks later, so let’s just get on it.

So much this. I've used a parser generator successfully exactly once, and even then I simply didn't care about proper error reporting.

I wouldn't mind something reusable that generated red-green trees for me a la Roslyn. I can bang out a recursive descent parser in a day, but not one that can be run after every keystroke.

There's a middle ground with parser combinators.

Once upon a time, a project I was working on had a particular sub-problem: we sucked in a data file from an external source that consisted of a collection of records that had to be read, validated, and either diddled around in our DB or, in the case of errors, passed off to a human for pondering.

The guy who was originally assigned was quite skilled, but apparently he'd never written an interpreter before. He started with the typical Spring Beans Uber Alles infrastructure and processing for the records, but then somehow managed to get pulled off on some other aspect of the project and this part was passed on to me.

After my first look at what he had, I realized that error handling, which ranged from misformatted records to nonsensical values, would be a giant ball of razor wire and lemon juice if I continued with his design. So I tossed out much of his infrastructure---keeping his good-record handling---and wrote a simple recursive descent parser that recorded errors (including their locations in the data file) for human processing. The result was simple, clean (if you've seen an interpreter before), and I haven't heard of any bugs reported in it.
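The shape of that design, greatly simplified (record format invented; the real thing was recursive descent over a richer grammar): accumulate errors with their locations instead of throwing, so the good records flow through and the humans get a precise report.

```python
# Toy record format: "NAME;QTY" per line. Instead of raising on the first
# bad record, the parser records (line_number, message) and keeps going.
def parse_records(text):
    good, errors = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        parts = line.split(";")
        if len(parts) != 2:
            errors.append((lineno, f"expected NAME;QTY, got {line!r}"))
            continue
        name, qty = parts
        if not qty.strip().isdigit():
            errors.append((lineno, f"quantity not a number: {qty!r}"))
            continue
        good.append((name.strip(), int(qty)))
    return good, errors

good, errors = parse_records("widgets;12\nbolts;many\nnuts;7\nbroken line\n")
print(good)    # -> [('widgets', 12), ('nuts', 7)]
print(errors)  # -> line 2: bad quantity; line 4: malformed record
```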

Turns out, IDNI.

The article makes a good argument against YAGNI, at least YAGNI applied in its naive form. The point is that if you're going to make an embedded language, just do it the right way from the start instead of trying to cobble it together with YAGNI-inspired half-assed implementations that break in weird cases.

Overall decent take on why you should go for a compilers course, but it has a very weird section advocating for non-deterministic type checking. You want probabilistic AI methods to guess that this string is more of an integer? Buddy, knowing what kind of data we have for certain is the only reason we built type checkers in the first place. Take your 90% chance int but 10% chance string and get out of here.

Maybe he means “error messages are bad, we should guess what people mean and make better suggestions on type errors” but the case was not argued well. It also stuck with an article-long joke of accusing compiler people of being boring at parties. This is 2007 after all, an unenlightened time.

You need a compiler when you need language, that is, when you need to interface directly between brain and computer. Everything else can be less messily-specified. A good middle ground is the DSL, domain-specific language. You use the syntax and semantics of the language being used to code in, and change the nouns and verbs. All the nouns and verbs are given meaning as values or functions. Execute in a sandbox so that your DSL can't find the rest of the language. Ruby makes this all trivial.

Of course, if you're trapped in a static language with no way to execute significant logic at runtime without a ton of heavy lifting, then yeah, you'll end up maintaining an interpreter / compiler if you're not careful. Really do yourself a favor and see if you can't introduce Ruby, you can adapt it to your needs. You can inherit from BasicObject, BasicObjects can't see anything you don't want them to see.

I feel like this easily could have been addressed to a much younger version of myself on a particular project more than a decade ago (~12 years). My story happened in the early days of SPARQL, when SPARUL (the SPARQL Update Language) was very new, a draft, and definitely not merged into SPARQL yet. Oh, and dinosaurs roamed the earth. That last is from my daughter.

My project involved semi-autonomous software agents that each maintained a model of one aspect of a thing being monitored by different types of sensors. Agents were divided into nerves, and brains. The model maintained by a nerve was of a limited aspect, containing explicit, semantic representations of sensor reported values. The models maintained by brain agents were implicit, derived from one or both of two stages of constrained semantic reasoning (closed world model, and some other constraints).

Some nerve agents did the first stage of semantic reasoning, more simplistic using description logic programming rules encoded into the model. The results were stored into the model so that they, and parts of the more complex aggregate models derived from them up the chain, could be removed on the next pass.

Brain agents did a second pass of semantic reasoning. This pass is where the 'You have implemented a compiler' comes in.

That pass used rules encoded in human readable form. We looked at CWM for our rules, but that didn't have enough expressivity and didn't mesh well with some other tools we had.

We encoded the rules in a format we created, called Sparql++. It had SPARQL, SPARQL Update Language, for loops, while loops, variables, and macros. Our code took the rules and put them into an intermediate form, then translated that intermediate form to EulerSharp rules and ran them, then injected the results into our model. The intermediate form and EulerSharp rules were saved so parts that hadn't changed didn't generate new EulerSharp rules.

It was ungainly. It worked well. And it was effectively a compiler.

The nerve and brain concept sounds fascinating, but I don't have enough background in sensors to quite grasp what you mean. Could you please provide a specific example?

One example is using it for network monitoring. Imagine an SNMP agent as a sensor, managed by a nerve, which gathers information via SNMP Gets and other techniques to build a semantic model from one or more SNMP Management Information Blocks (MIBs). This then is reasoned over, aggregated up the chain, further reasoned over, and fed into rules that trigger model changes that get propagated to the nerves, who translate those changes into SNMP management actions.

This happens to so many people, it’s a tale as old as programming languages. If you ever find yourself saying “wouldn’t it be neat…” or “if I could just only…”, stop yourself immediately — you may be about to accidentally spend the next 3-5 years writing a compiler. So many compilers started out looking to implement small feature X, and then ended up spending ungodly amounts of time writing an entire compiler instead.

The scariest thing about compilers is by writing one, it induces a kind of amnesia where you can’t remember why you started writing it in the first place! Soon enough the point of the compiler is to write the compiler, and feature X goes unimplemented. Eventually it will be possible once the compiler is done. But the compiler is never done… never… done…

I actually find myself needing something which is close to the front-end of a compiler:

I get a file in a C-like language (say it's C for the sake of discussion and to make life easy), and I want to figure out the names of the top-level functions defined in this file. I am willing to assume that there are no "Gotcha" macros used, which would redefine keywords or types, or otherwise mess up the syntax. The caveat is that I don't want to include any files - even though this file has some include directives; and not including them would mean some types are not defined etc.

What would I do in such a case? Should I take the "not a compiler" approach and start matching regex'es?

If the syntax for function definitions is relatively fixed, I'd say use regular expressions. Others have mentioned recursive descent parsers, and you'd be implementing the "base case" portion of one of those.

Not including "includes" is easy. Just don't go looking for them.
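A hedged sketch of the regex approach (it will miss exotic cases like function-pointer return types, K&R definitions, or macro-generated functions, which is the usual trade-off of the "not a compiler" route):

```python
import re

# A regex for the common shape of top-level C function definitions:
# return type, name, parameter list, then an opening brace.
FN_DEF = re.compile(
    r"^[A-Za-z_][\w\s\*]*?"        # return type (possibly with pointers)
    r"\b([A-Za-z_]\w*)\s*"         # capture: the function name
    r"\([^;{}]*\)\s*\{",           # parameter list, then the body starts
    re.MULTILINE,
)

source = """
#include <stdio.h>
static int helper(int x) { return x + 1; }
int main(void)
{
    return helper(41);
}
"""

print(FN_DEF.findall(source))  # -> ['helper', 'main']
```

Anchoring on `^` with MULTILINE is what skips nested calls like `helper(41)`, since those are indented; declarations ending in `;` are skipped because the parameter-list class excludes `;`.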


Edit: the fun part will be trying to implement doc-strings. :D

You wind up with a RDP that has a grammar like

    fn_def := documented
        | un_documented

    documented := doc_string def

    un_documented := def

    doc_string := <your docstring format regex here>

    def := <your function definition regex here>
Note, for the last two, it will be easier to break those up into pieces. If I were writing a RDP for C# method declarations, I'd have something like


    ACCESS_SCOPE := "private" <-- these would be grammar "terminals", the literal strings we are looking for
        | "public"
        | <etc>

    MUTABILITY := static
        | <others?>

    RETURN_TYPE := "bool" <-- note that this should actually be much more complex
        | "int"               because you can (almost) arbitrarily specify types
        | <etc>               based on C#'s internal types

    FN_NAME := <some regular expression for allowed function names in C#>

    L_PAREN := "("

    R_PAREN := ")"

Something like that. You'll need to test as you implement, though.

Each of the above grammar definitions should correspond 1 to 1 with a function that you implement. If you need/want help with implementation, my email is in my profile. I'd be more than happy to walk you through this (I've done it before :D specifically this use-case, too, where I was trying to auto-document a language that doesn't have any development/documentation tools).

I wrote an inspection tracking system in Turbo Pascal/MS-DOS with some Norand hand-held computers running PL/N back in the late 1980s. As we modified the system to handle new classes of inspections, I ended up having a little configuration file that sat in the folder, that looked just like Pascal.

I encrypted that file by XORing it with a random string in a little command line program for the purpose. It also decrypted the file so you could work on it, of course.

Those were fun days, except for the lack of GIT, and the resultant stack of floppy disks I kept the source backed up on.

Oh this brings a lot of memories. I remember writing a DSL for visualization of logs that we collected when we were working in a hyper local geographical search engine. I was fresh out of college and this was when Google maps had yet to penetrate my country.

I started out with pylex, yacc and then what was supposed to be a quite simple DSL with a few keywords and a few nodes, very quickly exploded to a very complicated endeavor. Alas, I'll never be able to prove that it was Turing Complete. But that doesn't stop me from believing it.

From painful experience this is true, especially the parts Rachit wrote to personally call me out.

i have never personally called out anyone anywhere ever

Well, a lot of guys doing algebra without realizing it.

Dear Sir,

I have built, not just a small handful of compilers, but a compiler for the parsers for those compilers.

Signed, - One who did in fact want to write compilers

What's the problem with building, aaaaah, a COMPILER? I've baked a couple, feel good so far.

Who said 'every problem is a parsing problem'?

Ah, again? Happens to me all the time.

Alternative headline: Ivy League PhD candidate attempts some gate-keeping for his future field of employment.


A lot of guys have a wife and don't know it yet, but she does and is waiting for the ring

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact