> YAML isn't a configuration language or a configuration language format, it's a serialization format
I think YAML is even worse as a serialization format than a configuration format. It has too many ambiguities, implementations are too inconsistent, and many implementations are insecure by default with untrusted input.
> Using a good custom designed configuration file format instead of trying to shove things through the narrow pipe of YAML means that you have one integrated syntax that can be designed to be more readable, more expressive, and much easier to write.
I won't say a custom configuration is never the right thing to do, but I do think it rarely is. Writing a good custom configuration language is very difficult to get right, not to mention you also need syntax highlighting and other editor support for the editors your users use, libraries for popular languages so that other programs can read and write the configuration, etc. If you need something this complicated, I'd reach for using an embedded scripting language like lua (or possibly JavaScript or python) before writing a new format from scratch.
> I think YAML is even worse as a serialization format than a configuration format.
This. I find YAML to be the least offensive option for configuration and one of the worst for serialization.
I might be misinformed, but I find it absurd that in 2021 we still don't have a default, universally available tool that supports the basic table stakes without headache:
1. core data types (number, string, etc)
2. lists, maps, and arbitrary nesting
3. comments
4. multiline strings
5. is acceptably readable
6. Just Works everywhere
YAML does get 1-5 right (specifically 3 and 4, which JSON lacks, and IMO it does better on 5). But then it adds a ton of complexity that has left us without a standard, safe, and sane parser implementation: anchors and references (& and *), casting (via !!), custom data types (via !), and loads of other things I don't understand.
Also, you can have multiple YAML documents in one file, each one separated with ---. I think this makes it very confusing if used as a configuration language (they like to use YAML for configuration in Kubernetes).
I mean sure but I’ve never seen any software actually use that feature in the wild except for “you’re allowed to concat your YAML files instead of separating them if you want.” Like I’ve never seen software require a certain number of documents in a file with different schemas.
It's a real shame JSON doesn't have 3 and 4, because it would be so easy and it's otherwise pretty much perfect imo. No ambiguity, every value type can be identified by its first character, clean and reasonably minimal syntax.
JSON5 is nice. I use it for all our configuration files at work after evaluating a large list of configuration file formats. I've never really run into any frustration using it, whereas YAML, TOML, and others drive me crazy when I need to represent nested structures or arrays.
You could allow Python-like comments and strip them with a regular expression substitution before parsing. Something like this.
(it strips every line that consists of optional whitespace, followed by #, followed by anything until the end of the line)
sed -e 's/^\s*#.*$//' cfg.json | jq .
Allowing multiple lines is more complicated, can't do that with regex alone.
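For what it's worth, here is the same idea as a minimal Python sketch, assuming only whole-line # comments are allowed (the function name load_commented_json and the sample data are made up for illustration). Anchoring the pattern to the start of the line keeps a # inside a string value intact.

```python
import json
import re

# Matches only lines whose first non-blank character is '#'.
# Anchoring with ^ means a '#' inside a string value is left alone.
_COMMENT_LINE = re.compile(r"^[ \t]*#.*$", re.MULTILINE)


def load_commented_json(text: str):
    """Parse JSON after blanking out whole-line '#' comments (hypothetical helper)."""
    return json.loads(_COMMENT_LINE.sub("", text))


sample = """
# database settings
{
    "url": "http://example.com/#anchor",
    # comments at the end of a data line are NOT handled, only whole lines
    "retries": 3
}
"""

config = load_commented_json(sample)
```

This works because a JSON string cannot contain a raw newline, so no line inside a string can start with #. Multiline strings would still need a real parser.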
There are lots of ways to implement it. The problem is that one of JSON's strengths is its ubiquity: every language under the sun has half a dozen different battle-tested parsers for it. Clients and servers and everything in-between have first-class support out of the box. You can even paste it directly into JavaScript as valid code.
If anybody short of a standards body tries to expand the spec, you lose out on most of that.
I think you can define how you want to use JSON for a configuration file; JSON by itself is not much more than JavaScript objects/maps, defined as a data format. I frankly don't think that you need to be too pious about standard compliance when dealing with a configuration format for your application.
My stance is that YAML is a good format for configuration management and generation -- it's wonderful at filling gaps as your deployment model increases in complexity to provide a mechanism to "render" your configuration -- much like Skylark [1] does (derived from Google's internal GCL).
YAML ends up being a powerfully declarative model [2] for the state of a data structure, rather than a straight representation, ironically often enough being used in turn for an imperative model like in Ansible [3]. Definitely friendlier than JSON. But personally, I really like YAML because it lets me compose using a traits/mixins-like model using & and *, which allows for verbose, structured configuration inputs but concise configuration files.
docker-compose YAML files extension fields [4], imo, are a great example of this type of model in action. When you leave this much pre-deserialization flexibility in your configuration representation, it makes building cool stuff like docker-compose ECS support x-aws-* extension keys [5] and other plugin system-type capabilities much more straightforward than, for example, adding a new language feature to HCL.
I can see where the article is coming from. Some configuration files grow beyond classical configuration and end up being more like programming. With configuration being "Put the right connection / path strings into the program, enable some subsystems/feature toggles" and programming being along the lines of, e.g., arbitrary metric transformations in a metric collector, or programmable ACLs. With some systems, I very much end up wondering why I couldn't just load lua into the system and do the transformations with that.
However, configs in json/toml/yaml come with a lot of tooling. With a custom configuration language, you suddenly need to roll your own syntax highlighting in many editors, maybe linters, pretty-printers, ... Your config can't just be computed by a configuration management system and deployed via `computedConfig | toJSON`, instead it'll be necessary to wrangle templating. And users generally know those languages.
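As a rough sketch of what I mean by `computedConfig | toJSON`, assuming a made-up render_config function and made-up keys and hostnames: the config is computed as plain data and serialized in one step, with no templating involved.

```python
import json


def render_config(environment: str) -> str:
    """Compute a config as plain data, then serialize it in one step."""
    is_prod = environment == "prod"
    config = {
        "database": {
            # Hypothetical hostname scheme, purely for illustration.
            "host": f"db.{environment}.internal",
            "pool_size": 20 if is_prod else 5,
        },
        "features": {
            "metrics": True,
            "debug_endpoints": not is_prod,
        },
    }
    # The serializer handles quoting and escaping; no string templating needed.
    return json.dumps(config, indent=2, sort_keys=True)


rendered = render_config("staging")
```

With a custom format, that last json.dumps call turns into a hand-written emitter or a template.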
And writing parsers is a pain. And generating parsers is also a pain. And using bad parsers is also a pain. I've worked too much on parsers.
Those are some pretty big hurdles to overcome and if you don't, the user experience just starts off worse.
yaml (and json) gives configuration files much-needed hierarchy and structure, which is unbounded (within reason), not limited syntactically or inherently like simpler configuration formats including toml.
yaml adds to that comments and a pleasant readable layout, even if it requires a little bit more from the editor to make it pretty.
With that, I don't think yaml has much competition.
Hm, again. I don't think I agree it's that clearcut.
I've found that if you mostly want a tree of simple key-value assignments, such as database.postgres.user or database.mysql.timezone, TOML is actually very hard to beat. It's extremely simple to teach to less technical coworkers. TOML mostly becomes somewhat strange if you need lists, or maps with free-form keys.
Also, YAML is horrible if you have to generate it. Pretty much everyone automating stuff I've met will prefer JSON over YAML once you need to automatically generate the configuration. Pushing some data structure through an as-yaml pretty printer usually works, until it breaks in really weird incompatibilities between YAML parsers. And that happens way too much to be a fun distraction.
There are also many formats that are "better JSON", with comments, multiline strings, trailing commas, etc., such as hjson, json5 (and variants such as json6, jsonX, etc.), and hocon, as well as more sophisticated languages like jsonnet, dhall, and cue.
Many of these can be "compiled" to JSON.
There is plenty of competition, but for some reason, YAML seems to be the most popular.
Nesting multiple levels of tables is pretty simple: you just have sections that look like [a.b.c]. It's different from JSON, but IMO easier to read and navigate.
Unfortunately it doesn't handle arrays of arrays of tables very well.
My text editor, KeenWrite[0], integrates YAML using a GUI, which hides the underlying data format[1]. This makes the format irrelevant from the end user's perspective. Hardly anybody edits .odf files or .png files directly, yet configuration files are often updated in a plain text editor rather than a GUI. Structured data formats are amenable to autogenerating GUIs (see [2] and [3]).
After resisting it for many years I've finally settled on YAML as my default configuration format.
I need a way of expressing the core data types of JSON (key/value mappings aka "objects", arrays, strings, integers, floating points and booleans) - plus comments, and multi-line strings that insulate me from complex escaping rules.
YAML does that. It's not perfect, but it's good enough. And in Python I can use yaml.safe_load() to avoid some of the more troublesome corners of the spec.
YAML seems to be one of the most polarizing things in the software industry. I'm personally a big fan of it, though I agree with the author that it becomes a pain once you start trying to use it like a DSL or templating language.
I use it in pretty much every project, but I've worked with people who've said they hate it with a burning passion. And it's kind of a trope to see anti-YAML manifestos on HN and /r/programming fairly regularly, both as article submissions and comments.
As you mentioned, one of the most useful features for me is the ability to write raw strings with single quotes (or by omitting quotes). I've been in many situations where I and others have had to maintain regular expressions in config files, and it's so much more of a hassle with JSON due to the escaping requirements.
The opening of the article states "Hot take: YAML isn't a configuration language or a configuration language format, it's a serialization format."; I have the exact same opinion, but with "YAML" swapped for "JSON". (And maybe the author would agree with that, too.)
> it becomes a pain once you start trying to use it like a DSL or templating language.
I like YAML and use it a lot. But I'd argue the fact that it even allows/encourages you to do such things with it is a design flaw that ultimately has kept it from Do One Thing Well.
No coercion of y/n/yes/no/on/off to booleans (these were also removed in the official YAML 1.2 spec, thankfully), no direct object representations, no anchors or references, etc.
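To make the coercion problem concrete, here is a small sketch that compares the plain scalars listed for the YAML 1.1 bool type with the ones in the YAML 1.2 core schema. The set names and the helper are made up, and this is not a parser. Real implementations differ; some never implemented the single-letter forms.

```python
# Plain scalars that the YAML 1.1 bool type resolves to booleans.
YAML_1_1_BOOLEANS = frozenset(
    """
    y Y yes Yes YES n N no No NO
    true True TRUE false False FALSE
    on On ON off Off OFF
    """.split()
)

# Plain scalars that the YAML 1.2 core schema resolves to booleans.
YAML_1_2_BOOLEANS = frozenset("true True TRUE false False FALSE".split())


def coerced_only_in_1_1(scalar: str) -> bool:
    """True if an unquoted scalar is a boolean under 1.1 rules but a string under 1.2."""
    return scalar in YAML_1_1_BOOLEANS and scalar not in YAML_1_2_BOOLEANS


# An unquoted list of country codes is the classic way to get bitten.
country_codes = ["DE", "NO", "SE"]
surprises = [code for code in country_codes if coerced_only_in_1_1(code)]
```

Here surprises ends up containing only "NO", the country code for Norway, which a 1.1-style resolver reads as false unless it is quoted.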
> I'm personally a big fan of it, though I agree with the author that it becomes a pain once you start trying to use it like a DSL or templating language
If only the people behind Ansible agreed with this. Jinja+YAML templating is just utterly terrible.
Yeah, I completely get the widespread hatred of YAML due to things like that. It regularly gets abused to extreme lengths. But it also regularly gets used for relatively flat, straightforward, easily maintainable config/data files.
The ideal configuration format is really powerful and expressive for the author but resolves to a dead simple data structure for the consumer. If you need to template your config files for anything other than supplying external data it’s not expressive enough. I think YAML is closer than many of its contemporaries to this.
YAML (and JSON of course) has the big advantage that you can use JSON Schema to validate it and to provide realtime code intelligence in many editors (VS Code, IntelliJ, Monaco Editor on the web, many web IDEs), although you might need a plugin for YAML, where JSON works out of the box.
So it's easy to hate on the feature overload of YAML or on the lack of features of JSON, but it's hard to throw out the rich ecosystem along with them they already have established.
One format I personally like is JSON5 (https://json5.org/) as it's very much just JSON, but with some more modern JavaScript (ES5) syntax allowed, including comments. Looks like a parser / serializer for it is also still rather concise.
Although, I'm always wondering which features of YAML to best not use / touch. My personal approach would have been to leave some advanced features to optional pre-processors or real programming / template languages.
I find YAML an incredibly difficult configuration format to write by hand for anything but the simplest of files. Agree it’s a serialisation format and not a great one.
Likewise for JSON. No one can say that setting up a webpack config, which uses JSON, is simple.
It’s a recent trend. I hate working with the Kubernetes ecosystem as it’s lines of YAML config.
At the same time I love working with OpenBSD / Linux. OpenBSD especially, the config files are simple and concise.
TOML is my preference if I need a config file, it’s easy to read/write and reasonably expressive.
I think JSON is an easier configuration format than YAML. YAML is so complicated when you include all the edge cases. For example I still have no idea when you need a leading `- ` and when it's not needed.
I'm not saying YAML can't be learned, but why learn all the quirks of a new format when we can just use JSON which everyone already knows? It does the same job, especially when you use JSON5 or HJSON. It's still intuitive, also supports comments and multiline strings etc, and you don't have to deal with the mess that is YAML syntax.
Specifically, neither YAML nor JSON are configuration formats (despite everyone's desire to use them as such).
TOML is a very good generalization of a simple, straightforward configuration file format which was the "best" until TOML came around. If a service you are using needs JSON/YAML, you can generate it. But, stick with a configuration file format that is suitable for humans for actual configuration.
Also, PSA, neither JSON nor YAML files are good places to embed logic.
Also, another PSA, the fact that Pipfile is just a TOML file is just silly. Another point in Ruby's favor.
My primary gripe with YAML is whitespace sensitivity: it makes editing it in a hurry in a CLI environment needlessly user-unfriendly. Virtually any other option is better for the person who actually has to modify these configurations later.
If your configuration is expected to be hand modified directly by a user, it shouldn't use YAML.
The solution for defining a more complicated config is to write your own config file format? I would think a custom format wouldn't necessarily be easier for others to read and write unless it came with a guide/readme, but then that's just one more thing to learn. Admittedly, I've never had to write a super complicated config file, but can anyone tell me why I shouldn't continue to use something like JSON for all of my config files?
Compared to yaml or ini formats, JSON is arguably harder to read and definitely harder to write due to having to keep track of brackets, commas, and/or quotes.
Ideally, a configuration file should be easy to read and write/modify by a user of the application. A lot of applications just stick with ini like formats because it meets both requirements.
> I really struggle to write yaml. I end up writing it as json and converting it.
Since valid JSON is also valid YAML with the same semantics, it is impossible for it to be harder to write YAML than JSON, and no conversion is necessary.
Two big reasons you shouldn't use json for config: comments and multi-line strings. json supports neither, although there are many other simple formats that do.
* C# - builtin via JsonCommentHandling, JsonSerializerOptions.AllowTrailingCommas
- Newtonsoft via CommentHandling, trailing out of the box
* C++ - Boost.JSON/RapidJSON/DAW JSONLink have options for both, nlohmann has an option to allow comments
* Python, I only know doing it via jsmin first
* Javascript, use jsmin also
These are the ones I have used or seen. It comes up so often that parser writers get issues over and over until they support them. nlohmann did not for years. But people want them for config files and such. Them being an extension is nice in that they are not officially supported and how they are handled isn't specified.
All these goofy syntaxes encode groves, meaning trees annotated with key/value pairs. XML, JSON, YAML, etc. Pretty good for serializing object graphs and so forth.
What syntax (sugar) is good for tabular data too?
My personal grove syntax is (secretly) awesome. Think JSON superset, like HOCON, HJSON, and SCON, but with more scalar data types, tweaked ergonomics, and conveniences.
But for the life of me, I still don't have an obvious syntax extension for tables. Like inlining CSV. Or how markdown does tables. (Nested arrays for representing a matrix sucks.)
The One True Syntax to unify groves and tables could really benefit all these big data (NumPy) and note keeping (Notion, Roam) projects.
I know that this likely won’t be a popular opinion, but I really like the MSBuild configuration way with XML and interpolation. I have found I can even reason about fairly complex configurations without too much hassle. You can also debug them, which is great. It’s not that it doesn’t have its warts; it’s that it really does a good job for what it’s meant to be: semi-programmable configuration. It’s also extensible.
Here we go. The never-ending story of inventing new syntax to configure software. Because that somehow brings value to what you are trying to accomplish with the software? Really? But why stop there? Why not invent a whole new programming language to make writing your “To Do” single-user web application easier? That makes as much sense to me as spending time thinking about your configuration file syntax.
Currently, my main concern with YAML is that, by the spec, comments are not attached to a particular node (see https://yaml.org/spec/1.2/spec.html#id2767100). As a result, a lot of YAML parsers (like https://github.com/yaml/libyaml and https://github.com/chyh1990/yaml-rust) only filter out the comments during the parsing phase. This makes it less than ideal for a use-case where the configuration file is expected to be modified by both programs and humans.
TOML makes it more trivial to associate comments with a node. This is mainly because the language is simpler though, as the spec is not explicit about that (https://toml.io/en/v1.0.0#comment).
I always wondered, why XML is not used by developers. I use it for my programs. The combination of attributes and data within nodes, to me, has been very useful.
It's a slog to write, harder to parse, and you have to make more decisions (do I put this in the node's attribute, or as a child node?).
Maven had pom.xml files for declaring dependencies. Very few developers enjoyed editing those, and you don't see many XML-based config files these days for a reason.
There are a variety of generic DSLs for encoding configuration concisely with composability, functions, custom validation, etc. Some examples would be jsonnet, dhall, and cue.
If the configuration needs to be transformed into a more computer-friendly format like json or yaml for later loading by a binary, that's easy enough to do at build time in a modern-ish build system like bazel.
Complex configurations should simply use configuration programming languages like Lua. They can start as simple as JSON/YAML but stay flexible as the complexity grows thanks to functions, references, etc. Parsing them should also be easy (configuration languages are meant to be embedded in others, just like JSON/YAML).
How about not using configuration files in the first place? Instead expose a library in some programming language (preferably typed) to assist with writing config, then serialize the output in whatever format is easier to parse.
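A minimal sketch of what that could look like in Python, assuming made-up Listener and ServiceConfig types and JSON as the output format: the author writes ordinary typed code, validation happens when the objects are built, and the consumer only ever sees the serialized result.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class Listener:
    """Hypothetical config type; validates itself on construction."""

    host: str
    port: int

    def __post_init__(self) -> None:
        if not 0 < self.port < 65536:
            raise ValueError(f"port out of range: {self.port}")


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical top-level config object."""

    name: str
    listeners: List[Listener] = field(default_factory=list)


def emit(config: ServiceConfig) -> str:
    """Serialize the typed config to JSON for the consuming program."""
    return json.dumps(asdict(config), indent=2)


output = emit(ServiceConfig("api", [Listener("0.0.0.0", 8080)]))
```

A bad port fails at construction time with a ValueError, before anything is written out.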
A while back it took myself, a data center engineer, and two engineers from the appliance vendor a solid 20 minutes to figure out the problems with a ~12 line yaml file that needed two or three lines added (network config file). Between syntax, indentation, etc it required a few tries. Plus there was a copy and paste in there via SSH.
I don't know what a better alternative is but yaml can be incredibly frustrating.
I think I get what the article is saying. I don't think they expressed it very well.
Here's a concrete example of what I think they're complaining about. From a Kubernetes Deployment manifest:
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: db-conn
          key: uri
That valueFrom.secretKeyRef.{name, key} structure is an "exploded AST" — something that in any programming language, you'd be chastised for writing out as a structural-initialization literal, because there's so much stuff there that it's easy to screw it up and forget something.
When attempting to express such a literal in most-any programming language, you'd expect to either have:
• the Java "literal factory function" approach — some static function to call, that takes either other literals, or a single literal string encoding an expression in a DSL, as arguments, and then produces an initialized value-object; or
• the Elixir "custom sigil" approach (there's probably a more popular language with this feature, but Elixir is what I know) — a macro or operator that takes in a raw string or lexemes encoding an expression in a DSL, and then codegens out to the appropriate structural-initialization literal. (Macros are better here, as incorrect DSL syntax can be caught at parse/compile time, just like incorrect syntax of the surrounding language.)
-----
To be honest, I don't think I would want YAML to support either of these. YAML shouldn't be arbitrarily extendable; a YAML document should be a YAML document, parsed by any compliant parser. (There could be some Avro-like "embedded schema" format for YAML that enables this, but that format would be better as a layer on top of YAML, rather than being part of YAML itself.)
What I would like in YAML, is support for built-in literal sugar for certain specific YAML structures.
This is different in concept from YAML having support for a type, where that type must have some 1:1 native type on the host language side for it to encode/decode to.
Instead, a "sugared structure" would involve
1. a YAML parser having a sub-parser for certain specific DSLs, where the output of this parser is a small, standardized-in-shape container structure, made out of regular YAML arrays, dicts, and scalars;
2. a YAML generator offering a configuration option to pattern-match on these standardized-in-shape structures, where recognized subtrees are swapped out in the emitted YAML for the appropriate sugared-literal representation.
Examples of where this would be helpful include: DateTimes, URIs, UUIDs, Intervals... and that's basically it.
There really aren't too many of these. The short list above represents basically the entirety of the set of types I've ever had YAML fall down on in the 10 years I've been generating/parsing YAML documents. For everything else, it works fine.
Right. Elixir's sigils (https://elixir-lang.org/getting-started/sigils.html) are just macros in the current namespace with the name pattern sigil_[letter], where writing ~x/foo/ or ~x(foo) etc. evaluates the macro sigil_x("foo") at compile-time.
The JavaScript equivalent with tagged template literals isn't the same, because it isn't a macro, and so the template function has to run at runtime. As such, these are no longer literals (i.e. pure data) per se.
The C++ equivalent is closer, if-and-only-if the user-defined literal operator function call is a constexpr that gets pre-evaluated.
Structural initialization is when you define a struct (as opposed to a scalar) by directly declaring the values of its (potentially-private!) named fields.
For example, in Elixir, this is a structurally-initialized MapSet literal:
%MapSet{map: %{}, version: 2}
This is as opposed to a factory-method-initialized literal, going through a functional API that can hide the inner workings of the ADT produced:
MapSet.new
Note that in the first case (structural initialization), if there were any other complex non-scalar objects nested within the main one, you'd have to define those too. It's a forced encoding of an Abstract Syntax Tree representation of the structure of the data; but it's an AST that's "exploded" or "cross-sectional": one that has no functions, no ability to encapsulate/abstract.
Or, compare and contrast a structurally-initialized literal with a DSL-expressed literal:
    !perl/regexp:
      REGEXP: "R[Uu][Bb][Yy]$"
      MODIFIERS: i
Wouldn't it be less annoying to both read and write that in a YAML document if it were expressed the way you'd expect — as:
/R[Uu][Bb][Yy]$/i
...where YAML itself would know to parse the latter as if it were the former, and to generate the latter in place of the former (for this, and an exclusive few other common structurally-initialized types that constantly get represented in YAML documents)?
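To sketch the two halves of that idea (the sub-parser and the pattern-matching generator) outside of any real YAML library: the Python below is purely illustrative, the names desugar and sugar are made up, and the REGEXP/MODIFIERS shape is taken from the example above. It does not handle escaped slashes.

```python
import re

# Hypothetical sugar: /pattern/flags  <->  {"REGEXP": pattern, "MODIFIERS": flags}
_SUGARED_REGEX = re.compile(r"^/(?P<body>.*)/(?P<flags>[a-z]*)$")


def desugar(literal: str) -> dict:
    """Sub-parser: expand the compact literal into the standard container shape."""
    match = _SUGARED_REGEX.match(literal)
    if match is None:
        raise ValueError(f"not a regex literal: {literal!r}")
    return {"REGEXP": match.group("body"), "MODIFIERS": match.group("flags")}


def sugar(node: dict) -> str:
    """Generator side: recognize the container shape and emit the compact literal."""
    if set(node) != {"REGEXP", "MODIFIERS"}:
        raise ValueError("node does not have the regex container shape")
    return f"/{node['REGEXP']}/{node['MODIFIERS']}"


exploded = desugar("/R[Uu][Bb][Yy]$/i")
compact = sugar(exploded)
```

The important property is that the exploded form is made of ordinary mappings and scalars, so a consumer that knows nothing about the sugar still sees plain data.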
Thanks for explaining. I'm not sure that's laughed at in every language. It's standard in Clojure to just define things as maps, lists, etc. I actually think it's a pretty good idea.
Your examples, datetimes and regex, make the DSL option seem nice. In picking good examples you picked ones I'd be familiar with. But that's sort of the trick. If something is completely new to me, I'd much rather have it blown up.
The thing with all of the “common” structural data types, though — datetimes, regular expressions, UUIDs, URLs — is that they have either a conventional or separately-standardized syntax, separate from the syntax of any particular programming language they’re hosted in. If you know what these things are, and what they’re for, then it’s impossible to have not encountered the basically-universal notation for expressing them as well.
And my thinking is that, if you don’t know what they are, then you’ll need to look up what they are, in order to understand the semantics at play. And doing that will force you through learning the notation as well. There’s never really a point at which a (responsible) programmer will be trying to deal with modifying the fields inside e.g. a URL, while having no understanding of what a URL is (and so seeing any familiarity advantage in the exploded-field syntax over the DSL syntax.) You’ll learn the syntax on your way to understanding the semantics, and so will end up preferring the compact DSL notation, just like everyone else.
A somewhat analogous example: there’s no common method of teaching elementary arithmetic that doesn’t pass through binary-operator expression syntax with binding affinity (i.e. “order of operations.”) In theory, you could learn elementary arithmetic entirely in the form of functional application trees (i.e. arithmetic in Lisp), or entirely in stack-machine/RPN notation (i.e. arithmetic in Forth); but no elementary-school teacher actually teaches arithmetic this way, and there are no materials aimed at children that try to do this. So, by learning arithmetic, people get this additional bit of enculturation of learning to deal with parsing out the meaning of mixed binary-operator expressions using a precedence ladder; and end up preferring the “convenience” of the compact-but-complex binary-operator notation, over the exploded-but-simple AST notation.
And also, to be clear, I’m not suggesting YAML would be better off if it did this for any arbitrary structural pattern that happens to have a formal notation for it somewhere in the world. Just the ones that most-every programmer will inevitably run into, because every programming language modern enough to support YAML, also supports the expression of those types in the form of those literals. (For example, effectively every language that has a native URL type, supports expressing URLs through factory-method literals by calling `URL.parse` on the string representation; and everyone who writes URL-handling code in a given language, when defining a constant URL, would automatically reach for “write the URL in RFC1738 URI notation in a string and pass it into URL.parse” over “structurally initialize a URL struct.”)
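In Python terms, as a small illustration using the standard library's urllib.parse (the example URL is made up): both forms below produce the same value, but hardly anyone would choose to write the first one by hand.

```python
from urllib.parse import SplitResult, urlsplit

# "Exploded" form: every field of the structure spelled out by hand.
exploded = SplitResult(
    scheme="https",
    netloc="example.com",
    path="/docs",
    query="page=2",
    fragment="intro",
)

# Compact form: the conventional notation, passed through a factory function.
compact = urlsplit("https://example.com/docs?page=2#intro")

# Both routes yield the same structure, and the compact text can be recovered.
same = exploded == compact
round_trip = exploded.geturl()
```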