YAML and Configuration Files

thayne · on Aug 14, 2021

> YAML isn't a configuration language or a configuration language format, it's a serialization format

I think YAML is even worse as a serialization format than a configuration format. It has too many ambiguities, implementations are too inconsistent, and many implementations are insecure by default with untrusted input.

> Using a good custom designed configuration file format instead of trying to shove things through the narrow pipe of YAML means that you have one integrated syntax that can be designed to be more readable, more expressive, and much easier to write.

I won't say a custom configuration is never the right thing to do, but I do think it rarely is. Writing a good custom configuration language is very difficult to get right, not to mention you also need syntax highlighting and other editor support for the editors your users use, libraries for popular languages so that other programs can read and write the configuration, etc. If you need something this complicated, I'd reach for using an embedded scripting language like lua (or possibly JavaScript or python) before writing a new format from scratch.

amirkdv · on Aug 14, 2021

> I think YAML is even worse as a serialization format than a configuration format.

This. I find YAML to be the least offensive option for configuration and one of the worst for serialization.

I might be misinformed, but I find it absurd that in 2021 we still don't have a default, universally available tool that supports the basic table stakes without headache:

1. core data types (number, string, etc)

2. lists, maps, and arbitrary nesting

3. comments

4. multiline strings

5. is acceptably readable

6. Just Works everywhere

YAML does get 1-5 right (specifically 3 and 4 that JSON doesn't and IMO better in 5). But then it adds a ton of complexity that has left us without a standard, safe, and sane parser implementation: anchors and references (& and *), casting (via !!), custom data types (via !), loads of other things I don't understand.

MichaelMoser123 · on Aug 15, 2021

>loads of other things I don't understand.

also you can have multiple yaml documents in one file, each one separated with ---. I think this makes it very confusing if used as a configuration language (they like to use yaml for configuration in kubernetes)

Spivak · on Aug 15, 2021

I mean sure but I’ve never seen any software actually use that feature in the wild except for “you’re allowed to concat your YAML files instead of separating them if you want.” Like I’ve never seen software require a certain number of documents in a file with different schemas.

MichaelMoser123 · on Aug 15, 2021

i saw it being used in some CI system; it was very confusing.

brundolf · on Aug 14, 2021

It's a real shame JSON doesn't have 3 and 4, because it would be so easy and it's otherwise pretty much perfect imo. No ambiguity, every value type can be identified by its first character, clean and reasonably minimal syntax.

bschwindHN · on Aug 15, 2021

JSON5 is nice. I use it for all our configuration files at work after evaluating a large list of configuration file formats. I've never really run into any frustration using it, whereas YAML, TOML, and others drive me crazy when I need to represent nested structures or arrays.

https://json5.org/

plusmax1 · on Aug 14, 2021

well there are json alternatives which fit this bill, such as HJSON.

https://hjson.github.io/

might not be as "common" but it has good implementations for many languages.

MichaelMoser123 · on Aug 15, 2021

you could allow python like comments and strip them with a regular expression substitution before parsing. Something like this. (it strips all the lines that start with whitespace, followed by #, followed by anything until the end of the line)

   cat cfg.json | sed -e 's/^\s*#.*$//' | jq .

Allowing multiple lines is more complicated, can't do that with regex alone.
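A regex cannot tell a `#` inside a string value from a comment marker, so a small character scanner is one way around that. The following is a minimal sketch, assuming `#` comments and standard double-quoted JSON strings; the function names and the sample config are made up for illustration.

```python
import json


def strip_hash_comments(text: str) -> str:
    """Remove '#' comments that sit outside JSON string literals."""
    out = []
    in_string = False   # currently inside a "..." literal
    escaped = False     # previous character was a backslash inside a string
    in_comment = False  # currently skipping until end of line
    for ch in text:
        if in_comment:
            if ch == "\n":
                in_comment = False
                out.append(ch)  # keep the newline so line numbers stay stable
            continue
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
            out.append(ch)
        elif ch == "#":
            in_comment = True
        else:
            out.append(ch)
    return "".join(out)


def load_commented_json(text: str):
    """Parse JSON text that may contain '#' comments."""
    return json.loads(strip_hash_comments(text))


# Hypothetical sample: the '#' inside "color" must survive.
EXAMPLE = '''
# whole-line comment
{
  "color": "#ff0000",   # trailing comment
  "name": "demo"
}
'''

config = load_commented_json(EXAMPLE)
```

Because newlines are preserved, parse errors reported by `json.loads` still point at the right line of the original file.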

brundolf · on Aug 15, 2021

There are lots of ways to implement it. The problem is that one of JSON's strengths is its ubiquity: every language under the sun has half a dozen different battle-tested parsers for it. Clients and servers and everything in-between have first-class support out of the box. You can even paste it directly into JavaScript as valid code.

If anybody short of a standards body tries to expand the spec, you lose out on most of that.

MichaelMoser123 · on Aug 15, 2021

i think you can possibly define how you want to use json for a configuration file; json by itself is not much more than javascript objects/maps, defined as a data format. I frankly don't think that you need to be too pious about standard compliance if dealing with a configuration format for your application.

SigmundA · on Aug 16, 2021

>json by itself is not much more than javascript objects/maps

And without comments, optionally quoted keys, single quotes, or trailing commas, which makes editing by hand more work and IMO less readable.

brundolf · on Aug 15, 2021

Editors, for one, will be an issue

FrancoisBosun · on Aug 15, 2021

What about a key named “_comment”, or something similar? Of course, the underlying software must ignore unknown keys, so it’s not a full win anyway.
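One way to live with software that does not ignore unknown keys is to strip the comment keys before handing the data over. This is a minimal sketch, assuming the key is literally `_comment`; the helper name and the sample config are hypothetical.

```python
import json


def drop_comment_keys(node, key="_comment"):
    """Recursively remove the comment key from every dict in parsed JSON."""
    if isinstance(node, dict):
        return {k: drop_comment_keys(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [drop_comment_keys(item, key) for item in node]
    return node


# Hypothetical config using "_comment" keys as stand-ins for real comments.
RAW = """
{
  "_comment": "connection settings for the staging database",
  "host": "db.example.internal",
  "pool": {"_comment": "keep small on staging", "size": 5}
}
"""

config = drop_comment_keys(json.loads(RAW))
```

Note that a JSON object can hold only one `_comment` per level, so this still cannot annotate individual fields.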

nicklarsennz · on Aug 15, 2021

That might help for a general comment, but not for a comment on a specific part of the structure.

MaxGabriel · on Aug 15, 2021

If Dhall was more popular, would it meet your criteria?

sofixa · on Aug 14, 2021

HCL does all of those.

TOML is pretty close as well.

scrollaway · on Aug 15, 2021

HCL is neither acceptably readable nor does it just work everywhere. TOML is getting there though yeah.

petre · on Aug 15, 2021

I used to think that before I had to edit the ejabberd.yml config file. Now I think it's only remotely useful as a subset to use in Jekyll headers.

Just use TOML for configuration instead.

account42 · on Aug 19, 2021

> I used to think that before I had to edit the ejabberd.yml config file.

Hey, it's still better than having to write the configuration in Erlang like you had to before.

dheera · on Aug 15, 2021

Besides the above (especially comments), YAML also has one huge advantage for configuration files and that is clean diffs.

JSON's lack of support for trailing commas messes up diffs.
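To make the diff argument concrete, here is a rough sketch that appends one item to a list and counts the lines a unified diff reports as changed. The YAML side is plain hand-written text (no YAML library involved), and the `plugins` data is invented for the example.

```python
import difflib
import json


def changed_lines(before: str, after: str) -> int:
    """Count added and removed lines in a unified diff of two texts."""
    lines = list(
        difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="")
    )
    body = lines[2:]  # the first two lines are the ---/+++ file headers
    return sum(1 for line in body if line.startswith(("+", "-")))


# JSON: the old last element gains a comma, so it counts as changed too.
json_before = json.dumps({"plugins": ["auth", "cache"]}, indent=2)
json_after = json.dumps({"plugins": ["auth", "cache", "metrics"]}, indent=2)

# YAML-style block list: appending touches only the new line.
yaml_before = "plugins:\n  - auth\n  - cache\n"
yaml_after = "plugins:\n  - auth\n  - cache\n  - metrics\n"

json_changed = changed_lines(json_before, json_after)
yaml_changed = changed_lines(yaml_before, yaml_after)
```

Here `json_changed` comes out as 3 (one removal, two additions) and `yaml_changed` as 1.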

inkeddeveloper · on Aug 15, 2021

If you come up with a format, call it IJW. It just works.

alephu5 · on Aug 15, 2021

Use dhall and serialise to JSON/YAML!

nanoscopic · on Aug 14, 2021

My JSON parser ( github.com/nanoscopic/ujsonin ) has these things:

0. Looks essentially the same as JSON

1. Core data types, and customizable data types can be added easily.

2. Arrays, Objects, and arbitrary nesting.

3. Comments ( both /* */ and // format )

4. Multiline strings ( by default; carriage returns are no problem within strings )

5. It is JSON with relaxed restrictions and slight addition for actual named types.

6. I've written C, Perl, and Golang implementations so far.

jasfi · on Aug 15, 2021

I like the idea of what you're doing. May I suggest a Nim port? Nim outputs C anyway, but is safe, so you'd worry less about bugs.

mh0pe · on Aug 15, 2021

My stance is that YAML is a good format for configuration management and generation -- it's wonderful at filling gaps as your deployment model increases in complexity to provide a mechanism to "render" your configuration -- much like Skylark [1] does (derived from Google's internal GCL).

YAML ends up being a powerfully declarative model [2] for the state of a data structure, rather than a straight representation, ironically often enough being used in turn for an imperative model like in Ansible [3]. Definitely friendlier than JSON. But personally, I really like YAML because it lets me compose using a traits/mixins-like model using & and *, which allows for verbose, structured configuration inputs but concise configuration files.

docker-compose YAML files extension fields [4], imo, are a great example of this type of model in action. When you leave this much pre-deserialization flexibility in your configuration representation, it makes building cool stuff like docker-compose ECS support x-aws-* extension keys [5] and other plugin system-type capabilities much more straightforward than, for example, adding a new language feature to HCL.

[1]: https://github.com/google/skylark

[2]: https://en.wikipedia.org/wiki/YAML#Advanced_components

[3]: https://docs.ansible.com/ansible/latest/user_guide/playbooks...

[4]: https://docs.docker.com/compose/compose-file/compose-file-v3...

[5]: https://docs.docker.com/cloud/ecs-integration/#rolling-updat...

tetha · on Aug 14, 2021

Hm.

I can see where the article is coming from. Some configuration files grow beyond classical configuration and end up being more like programming. With configuration being "Put the right connection / path strings into the program, enable some subsystems/feature toggles" and programming being along the lines of, e.g., arbitrary metric transformations in a metric collector, or programmable ACLs. With some systems, I very much end up wondering why I couldn't just load lua into the system and do the transformations with that.

However, configs in json/toml/yaml come with a lot of tooling. With a custom configuration language, you suddenly need to roll your own syntax highlighting in many editors, maybe linters, pretty-printers, ... Your config can't just be computed by a configuration management system and deployed via `computedConfig | toJSON`, instead it'll be necessary to wrangle templating. And users generally know those languages.

And writing parsers is a pain. And generating parsers is also a pain. And using bad parsers is also a pain. I've worked too much on parsers.

Those are some pretty big hurdles to overcome and if you don't, the user experience just starts off worse.

kzrdude · on Aug 14, 2021

yaml (and json) gives configuration files much-needed hierarchy and structure, which is unbounded (within reason), not limited syntactically or inherently like simpler configuration formats including toml.

yaml adds to that comments and a pleasant readable layout, even if it requires a little bit more from the editor to make it pretty.

With that, I don't think yaml has much competition.

tetha · on Aug 14, 2021

Hm, again. I don't think I agree it's that clearcut.

I've found that if you mostly want a tree of simple key-value assignments, such as database.postgres.user or database.mysql.timezone, TOML is actually very hard to beat. It's extremely simple to teach to less technical coworkers. TOML mostly becomes somewhat strange if you need lists, or maps with free-form keys.
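As a rough illustration of that "tree of key-value assignments" view, dotted paths such as database.postgres.user map directly onto nested tables. The sketch below is not a TOML parser; it only expands dotted keys into a nested dict, and the values are made up.

```python
def expand_dotted(flat: dict) -> dict:
    """Turn {'a.b.c': 1} into {'a': {'b': {'c': 1}}}."""
    tree: dict = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = tree
        for part in parents:
            # Create intermediate tables on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


# Hypothetical settings; only the key paths come from the discussion.
settings = expand_dotted(
    {
        "database.postgres.user": "app",
        "database.mysql.timezone": "UTC",
    }
)
```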

Also, YAML is horrible if you have to generate it. Pretty much everyone automating stuff I've met will prefer JSON over YAML once you need to automatically generate the configuration. Pushing some data structure through an as-yaml pretty printer usually works, until it breaks in really weird incompatibilities between YAML parsers. And that happens way too much to be a fun distraction.

thayne · on Aug 14, 2021

Toml has unlimited nesting as well.

There are also many formats that are "better json" that have comments, multiline strings, trailing commas etc. Such as hjson, json5 (and variants such as json6, jsonX, etc.) and hocon. As well as more sophisticated languages like jsonnet, dhall, and cue.

Many of these can be "compiled" to JSON.

There is plenty of competition, but for some reason, YAML seems to be the most popular.

fanf2 · on Aug 14, 2021

TOML gets very weird as soon as you have more than one level of nesting.

thayne · on Aug 15, 2021

Nesting multiple levels of tables is pretty simple, you just have sections that look like [a.b.c]. It's different than Json, but IMO easier to read and navigate.

Unfortunately it doesn't handle arrays of arrays of tables very well.

dathinab · on Aug 15, 2021

It depends on the kind of nesting, cargo has multiple levels of nesting all the time and it's working well.

Though if you have things like object->list->object->list, things can get weird without question.

Still, most times I had such complex configs I also often (not always) realized that I had done something wrong and made it unnecessarily complex...

thangalin · on Aug 14, 2021

My text editor, KeenWrite[0], integrates YAML using a GUI, which hides the underlying data format[1]. This makes the format irrelevant from the end user's perspective. Hardly anybody edits .odf files or .png files directly, yet configuration files are often updated in a plain text editor rather than a GUI. Structured data formats are amenable to autogenerating GUIs (see [2] and [3]).

[0]: https://github.com/DaveJarvis/keenwrite/blob/master/docs/scr...

[1]: https://youtu.be/u_dFd6UhdV8?t=160

[2]: https://www.jeremydorn.com/json-editor

[3]: http://mb21.github.io/JSONedit/

simonw · on Aug 14, 2021

After resisting it for many years I've finally settled on YAML as my default configuration format.

I need a way of expressing the core data types of JSON (key/value mappings aka "objects", arrays, strings, integers, floating points and booleans) - plus comments, and multi-line strings that insulate me from complex escaping rules.

YAML does that. It's not perfect, but it's good enough. And in Python I can use yaml.safe_load() to avoid some of the more troublesome corners of the spec.

meowface · on Aug 14, 2021

YAML seems to be one of the most polarizing things in the software industry. I'm personally a big fan of it, though I agree with the author that it becomes a pain once you start trying to use it like a DSL or templating language.

I use it in pretty much every project, but I've worked with people who've said they hate it with a burning passion. And it's kind of a trope to see anti-YAML manifestos on HN and /r/programming fairly regularly, both as article submissions and comments.

As you mentioned, one of the most useful features for me is the ability to write raw strings with single quotes (or by omitting quotes). I've been in many situations where I and others have had to maintain regular expressions in config files, and it's so much more of a hassle with JSON due to the escaping requirements.

The opening of the article states "Hot take: YAML isn't a configuration language or a configuration language format, it's a serialization format."; I have the exact same opinion, but with "YAML" swapped for "JSON". (And maybe the author would agree with that, too.)

amirkdv · on Aug 14, 2021

> it becomes a pain once you start trying to use it like a DSL or templating language.

I like YAML and use it a lot. But I'd argue the fact that it even allows/encourages you to do such things with it is a design flaw that ultimately has kept it from Do One Thing Well.

meowface · on Aug 14, 2021

I do think some of the more "advanced" features are definitely a mistake. StrictYAML (https://github.com/crdoconnor/strictyaml) is a limited, much saner subset of YAML that I wish people would use more: https://hitchdev.com/strictyaml/features-removed/

No coercion of y/n/yes/no/on/off to booleans (these were also removed in the official YAML 1.2 spec, thankfully), no direct object representations, no anchors or references, etc.
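For anyone who hasn't been bitten by this, here is a small sketch of the coercion in question. It is not a YAML parser: it only checks unquoted scalars against the boolean spellings listed in the YAML 1.1 bool type and in the YAML 1.2 core schema. Real parsers vary in which spellings they accept, so treat the exact lists as an approximation; the sample values are made up.

```python
import re

# Boolean spellings from the YAML 1.1 "bool" type definition.
YAML11_BOOL = re.compile(
    r"(?:y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF)"
)

# The YAML 1.2 core schema keeps only the true/false spellings.
YAML12_BOOL = re.compile(r"(?:true|True|TRUE|false|False|FALSE)")


def coerced(values, pattern):
    """Return the plain scalars that would be read as booleans."""
    return [v for v in values if pattern.fullmatch(v)]


# Hypothetical unquoted values from a config file.
values = ["NO", "off", "true", "SE"]

surprises_11 = coerced(values, YAML11_BOOL)  # ["NO", "off", "true"]
surprises_12 = coerced(values, YAML12_BOOL)  # ["true"]
```

Under the 1.1 rules a country code like `NO` silently becomes `false` unless it is quoted.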

sofixa · on Aug 14, 2021

> I'm personally a big fan of it, though I agree with the author that it becomes a pain once you start trying to use it like a DSL or templating language

If only the people behind Ansible agreed with this. Jinja+YAML templating is just utterly terrible.

meowface · on Aug 14, 2021

Yeah, I completely get the widespread hatred of YAML due to things like that. It regularly gets abused to extreme lengths. But it also regularly gets used for relatively flat, straightforward, easily maintainable config/data files.

thayne · on Aug 14, 2021

Why not something like json5 or hjson? They give you the features you listed, but aren't as complex and don't have as many weird edge cases as yaml.

dangoor · on Aug 14, 2021

I think Cue[1] is a much more powerful and useful config format that also has the ability to generate validated JSON or YAML if needed.

[1]: https://cuelang.org

contravariant · on Aug 15, 2021

Not sure if 'much more powerful' is a desirable feature for a configuration file language.

Spivak · on Aug 15, 2021

The ideal configuration format is really powerful and expressive for the author but resolves to a dead simple data structure for the consumer. If you need to template your config files for anything other than supplying external data it’s not expressive enough. I think YAML is closer than many of its contemporaries to this.

fanf2 · on Aug 14, 2021

CUE was recently discussed at https://news.ycombinator.com/item?id=28127951

Fannon · on Aug 14, 2021

YAML (and JSON of course) has the big advantage that you can use JSON Schema to validate it and to provide realtime code intelligence in many editors (VS Code, IntelliJ, Monaco Editor in the web, many web IDEs), although you might need a plugin for YAML, where JSON works out of the box.

So it's easy to hate on the feature overload of YAML or on the lack of features of JSON, but it's hard to throw out the rich ecosystem they have already established along with them.

One format I personally like is JSON5 (https://json5.org/) as it's very much just JSON, but with some more modern JavaScript (ES5) syntax allowed, including comments. Looks like a parser / serializer for it is also still rather concise.

Although, I'm always wondering which features of YAML to best not use / touch. My personal approach would have been to leave some advanced features to optional pre-processors or real programming / template languages.

BFLpL0QNek · on Aug 15, 2021

I find YAML an incredibly difficult configuration format to write by hand for anything but the simplest of files. Agree it’s a serialisation format and not a great one.

Likewise for JSON. No one can say setting up a webpack config, which uses JSON, is simple.

It’s a recent trend. I hate working with the Kubernetes ecosystem as it’s lines of YAML config.

At the same time I love working with OpenBSD / Linux. OpenBSD especially, the config files are simple and concise.

TOML is my preference if I need a config file, it’s easy to read/write and reasonably expressive.

kangalioo · on Aug 15, 2021

I think JSON is an easier configuration format than YAML. YAML is so complicated when you include all the edge cases. For example I still have no idea when you need a leading `- ` and when it's not needed.

I'm not saying YAML can't be learned, but why learn all the quirks of a new format when we can just use JSON which everyone already knows? It does the same job, especially when you use JSON5 or HJSON. It's still intuitive, also supports comments and multiline strings etc, and you don't have to deal with the mess that is YAML syntax.

If only...

nanis · on Aug 15, 2021

Specifically, neither YAML nor JSON are configuration formats (despite everyone's desire to use them as such).

TOML is a very good generalization of a simple, straightforward configuration file format which was the "best" until TOML came around. If a service you are using needs JSON/YAML, you can generate it. But, stick with a configuration file format that is suitable for humans for actual configuration.

Also, PSA, neither JSON nor YAML files are good places to embed logic.

Also, another PSA, the fact that Pipfile is just a TOML file is just silly. Another point in Ruby's favor.

Aloha · on Aug 14, 2021

My primary gripe with YAML is white space sensitivity, it makes editing it in a hurry in a CLI environment needlessly user unfriendly. Virtually any other option is better for the person who actually has to modify these configurations later.

If your configuration is expected to be hand modified directly by a user, it shouldn't use YAML.

Vitamin_Sushi · on Aug 14, 2021

The solution for defining a more complicated config is to write your own config file format? I would think a custom format wouldn't necessarily be easier for others to read and write unless it came with a guide/readme, but then that's just one more thing to learn. Admittedly, I've never had to write a super complicated config file, but can anyone tell me why I shouldn't continue to use something like JSON for all of my config files?

u801e · on Aug 14, 2021

Compared to yaml or ini formats, JSON is arguably harder to read and definitely harder to write due to having to keep track of brackets, commas, and/or quotes.

Ideally, a configuration file should be easy to read and write/modify by a user of the application. A lot of applications just stick with ini like formats because it meets both requirements.

benjarrell · on Aug 14, 2021

For me, the brackets, commas, and quotes are what make json readable (and writable).

I really struggle to write yaml. I end up writing it as json and converting it. I agree about ini though.

dragonwriter · on Aug 15, 2021

> I really struggle to write yaml. I end up writing it as json and converting it.

Since valid JSON is also valid YAML with the same semantics, it is impossible for it to be harder to write YAML than JSON, and no conversion is necessary.

HelloNurse · on Aug 14, 2021

JSON is unpleasant (for me, more unpleasant than XML), but at least it doesn't attempt to be clever like TOML and YAML.

bwship · on Aug 14, 2021

Not to mention being able to have comments in YAML. That helps make the file much clearer as well.

deterministic · on Aug 15, 2021

I prefer JSON to YAML. There is no objective way to measure whether one is better than the other.

secondcoming · on Aug 14, 2021

Maybe it's because I've written C++ for years, but give me brackets over significant whitespace any day.

If anything, YAML is harder to read because the indentation is important, yet deliberately invisible.

Aloha · on Aug 15, 2021

Agree.

Whitespace with inconsistent rules is awful.

thayne · on Aug 14, 2021

Two big reasons you shouldn't use json for config: comments and multi-line strings. json supports neither, although there are many other simple formats that do.

beached_whale · on Aug 14, 2021

Let’s not forget that almost every JSON parser optionally supports comments and trailing commas when parsing.

simonw · on Aug 14, 2021

Which ones do that? I've not encountered many myself.

beached_whale · on Aug 14, 2021

* C# - builtin via JsonCommentHandling, JsonSerializerOptions.AllowTrailingCommas - Newtonsoft via CommentHandling, trailing out of the box

* C++ - Boost.JSON/RapidJSON/DAW JSONLink have options for both, nlohmann has an option to allow comments

* Python, I only know doing it via jsmin first

* Javascript, use jsmin also

These are the ones I have used or seen. It comes up so often that parser writers get issues over and over until they support them. nlohmann did not for years. But people want them for config files and such. Them being an extension is nice in that they are not officially supported and how they are handled isn't specified.

Zamicol · on Aug 14, 2021

And JSON5.

https://github.com/json5/json5-go

simonw · on Aug 14, 2021

What a delightfully readable example of a hand-rolled lexer and parser that is.

jimmygrapes · on Aug 14, 2021

I thought you were being sarcastic, but I think I just finally learned how lexers and parsers work by reading the code, and I don't even know Go.

simonw · on Aug 15, 2021

Hah yeah no sarcasm intended at all! It's really neat code.

thayne · on Aug 14, 2021

Ruby's JSON parser supports comments, but that's the only one I know of.

specialist · on Aug 15, 2021

What about tabular data?

All these goofy syntaxes encode groves, meaning trees annotated with key/value pairs. XML, JSON, YAML, etc. Pretty good for serializing object graphs and so forth.

What syntax (sugar) is good for tabular data too?

My personal grove syntax is (secretly) awesome. Think JSON superset, like HOCON, HJSON, and SCON, but with more scalar data types, tweaked ergonomics, and conveniences.

But for the life of me, I still don't have an obvious syntax extension for tables. Like inlining CSV. Or how markdown does tables. (Nested arrays for representing a matrix sucks.)

The One True Syntax to unify groves and tables could really benefit all these big data (NumPy) and note keeping (Notion, Roam) projects.

no_wizard · on Aug 15, 2021

I know that this won’t likely be a popular opinion but I really like the MSBuild configuration way with xml and interpolation. I have found I can even reason about fairly complex configurations without too much hassle. You can also debug them, which is great. It’s not that it doesn’t have its warts; it’s that it really does a good job at what it’s meant to do: semi-programmable configuration. It’s also extensible.

However, it is of course extremely proprietary

mixmastamyk · on Aug 14, 2021

Look into strict yaml, a simpler, safer subset:

https://hitchdev.com/strictyaml/

deterministic · on Aug 15, 2021

Here we go. The never ending story of inventing new syntax to configure software. Because that somehow brings value to what you are trying to accomplish with the software? Really? But why stop there? Why not invent a whole new programming language to make writing your “To Do” single user web application easier? That makes as much sense to me as spending time thinking about your configuration file syntax.

louib · on Aug 14, 2021

Currently, my main concern with YAML is that, by the spec, comments are not attached to a particular node (see https://yaml.org/spec/1.2/spec.html#id2767100). As a result, a lot of YAML parsers (like https://github.com/yaml/libyaml and https://github.com/chyh1990/yaml-rust) only filter out the comments during the parsing phase. This makes it less than ideal for a use-case where the configuration file is expected to be modified by both programs and humans.

TOML makes it more trivial to associate comments with a node. This is mainly because the language is simpler though, as the spec is not explicit about that (https://toml.io/en/v1.0.0#comment).

kumarvvr · on Aug 15, 2021

I always wondered, why XML is not used by developers. I use it for my programs. The combination of attributes and data within nodes, to me, has been very useful.

bschwindHN · on Aug 15, 2021

It's a slog to write, harder to parse, and you have to make more decisions (do I put this in the node's attribute, or as a child node?).

Maven had pom.xml files for declaring dependencies. Very few developers enjoyed editing those, and you don't see many XML-based config files these days for a reason.

Bilal_io · on Aug 15, 2021

I've seen devs that came from the XML era treat json the same way, and I hate it. Example:

{ "key": "the key", "Value": "the value" }

dub · on Aug 14, 2021

There are a variety of generic DSLs for encoding configuration concisely with composability, functions, custom validation, etc. Some examples would be jsonnet, dhall, and cue.

If the configuration needs to be transformed into a more computer-friendly format like json or yaml for later loading by a binary, that's easy enough to do at build time in a modern-ish build system like bazel

jedimastert · on Aug 14, 2021

If you're gonna be generating at build like that, you might as well be using a binary encoding like protos (this is for devs, btw)

janjones · on Aug 14, 2021

Complex configurations should simply use configuration programming languages like Lua. They can start as simple as JSON/YAML but stay flexible as the complexity grows thanks to functions, references, etc. Parsing them should be also easy (configuration languages are meant to be embedded in others, just like JSON/YAML).

mrloba · on Aug 14, 2021

How about not using configuration files in the first place? Instead expose a library in some programming language (preferably typed) to assist with writing config, then serialize the output in whatever format is easier to parse.
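A minimal sketch of that idea, using Python dataclasses as the "typed library" and JSON as the output format. The class names, fields, and values are all hypothetical; a real tool would likely add validation on top.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Database:
    host: str
    port: int = 5432


@dataclass
class AppConfig:
    name: str
    database: Database
    features: List[str] = field(default_factory=list)


def render(config: AppConfig) -> str:
    """Serialize the typed config into plain JSON for the consuming program."""
    return json.dumps(asdict(config), indent=2, sort_keys=True)


# The "config file" is ordinary code, so editors and type checkers can help.
config = AppConfig(
    name="demo",
    database=Database(host="db.example.internal"),
    features=["metrics"],
)

rendered = render(config)
```

The consumer only ever sees the rendered JSON, so it needs nothing beyond a standard JSON parser.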

seized · on Aug 14, 2021

A while back it took myself, a data center engineer, and two engineers from the appliance vendor a solid 20 minutes to figure out the problems with a ~12 line yaml file that needed two or three lines added (network config file). Between syntax, indentation, etc it required a few tries. Plus there was a copy and paste in there via SSH.

I don't know what a better alternative is but yaml can be incredibly frustrating.

oweiler · on Aug 14, 2021

The problems you describe wouldn't exist when using XML with a schema.

thayne · on Aug 14, 2021

Maybe, but XML has a whole host of other problems.

derefr · on Aug 14, 2021

I think I get what the article is saying. I don't think they expressed it very well.

Here's a concrete example of what I think they're complaining about. From a Kubernetes Deployment manifest:

    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: db-conn
          key: uri

That valueFrom.secretKeyRef.{name, key} structure is an "exploded AST" — something that in any programming language, you'd be chastised for writing out as a structural-initialization literal, because there's so much stuff there that it's easy to screw it up and forget something.

When attempting to express such a literal in most-any programming language, you'd expect to either have:

• the Java "literal factory function" approach — some static function to call, that takes either other literals, or a single literal string encoding an expression in a DSL, as arguments, and then produces an initialized value-object; or

• the Elixir "custom sigil" approach (there's probably a more popular language with this feature, but Elixir is what I know) — a macro or operator that takes in a raw string or lexemes encoding an expression in a DSL, and then codegens out to the appropriate structural-initialization literal. (Macros are better here, as incorrect DSL syntax can be caught at parse/compile time, just like incorrect syntax of the surrounding language.)

-----

To be honest, I don't think I would want YAML to support either of these. YAML shouldn't be arbitrarily extendable; a YAML document should be a YAML document, parsed by any compliant parser. (There could be some Avro-like "embedded schema" format for YAML that enables this, but that format would be better as a layer on top of YAML, rather than being part of YAML itself.)

What I would like in YAML, is support for built-in literal sugar for certain specific YAML structures.

This is different in concept from YAML having support for a type, where that type must have some 1:1 native type on the host language side for it to encode/decode to.

Instead, a "sugared structure" would involve

1. a YAML parser having a sub-parser for certain specific DSLs, where the output of this parser is a small, standardized-in-shape container structure, made out of regular YAML arrays, dicts, and scalars;

2. a YAML generator offering a configuration option to pattern-match on these standardized-in-shape structures, where recognized subtrees are swapped out in the emitted YAML for the appropriate sugared-literal representation.

Examples of where this would be helpful include: DateTimes, URIs, UUIDs, Intervals... and that's basically it.
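The two halves of that proposal can be sketched for the URI case: a sub-parser that expands a literal into a fixed-shape container of plain dicts and scalars, and a generator-side step that collapses the container back into the literal. This is only an illustration of the shape of the idea, not anything YAML defines; the `$uri` marker key, the field names, and the function names are all assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

SUGAR_KEY = "$uri"  # hypothetical marker identifying the container shape


def expand_uri(literal: str) -> dict:
    """Parser side: turn a URI literal into a standardized plain structure."""
    parts = urlsplit(literal)
    return {
        SUGAR_KEY: {
            "scheme": parts.scheme,
            "host": parts.hostname,
            "port": parts.port,
            "path": parts.path,
        }
    }


def is_sugared_uri(node) -> bool:
    """Generator side: recognize the standardized shape."""
    return isinstance(node, dict) and set(node) == {SUGAR_KEY}


def collapse_uri(node: dict) -> str:
    """Generator side: emit the sugared literal for a recognized structure."""
    inner = node[SUGAR_KEY]
    netloc = inner["host"]
    if inner["port"] is not None:
        netloc = f"{netloc}:{inner['port']}"
    return urlunsplit((inner["scheme"], netloc, inner["path"], "", ""))


# Hypothetical connection string used only for the round trip.
literal = "postgres://db.example.internal:5432/app"
structure = expand_uri(literal)
round_trip = collapse_uri(structure)
```

Consumers that know nothing about the sugar still receive ordinary dicts and scalars, which is the point of keeping it a layer over plain structures.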

There really aren't too many of these. The short list above represents basically the entirety of the set of types I've ever had YAML fall down on in the 10 years I've been generating/parsing YAML documents. For everything else, it works fine.

fanf2 · on Aug 14, 2021

I guess the Elixir "custom sigil" is a generalized version of the Ruby or Perl special quoting styles for things like regexes or shell commands?

There are at least a couple of prominent versions of that: a great dynamic language version is JavaScript template literals <https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...>; or in the static language world there's C++ user-defined literals <https://en.cppreference.com/w/cpp/language/user_literal> (tho if I remember correctly they are horrible for strings unless you are using a recent version of C++)

derefr · on Aug 14, 2021

Right. Elixir's sigils (https://elixir-lang.org/getting-started/sigils.html) are just macros in the current namespace whose names match the pattern sigil_[letter], where writing ~x/foo/ or ~x(foo), etc., evaluates the macro sigil_x("foo") at compile time.

The JavaScript equivalent with tagged template literals isn't the same, because it isn't a macro, and so the template function has to run at runtime. As such, these are no longer literals (i.e. pure data) per se.

The C++ equivalent is closer, if-and-only-if the user-defined literal operator function call is a constexpr that gets pre-evaluated.

travisjungroth · on Aug 14, 2021

What do you mean by “exploded AST”? What’s a “structural-initialization literal”?

derefr · on Aug 15, 2021

Structural initialization is when you define a struct (as opposed to a scalar) by directly declaring the values of its (potentially-private!) named fields.

For example, in Elixir, this is a structurally-initialized MapSet literal:

    %MapSet{map: %{}, version: 2}

This is as opposed to a factory-method-initialized literal, going through a functional API that can hide the inner workings of the ADT produced:

    MapSet.new

Or, compare and contrast a structurally-initialized literal with a DSL-expressed literal:

    %DateTime{ calendar: Calendar.ISO, day: 14, hour: 23, microsecond: {121432, 6}, minute: 57, month: 8, second: 22, std_offset: 0, time_zone: "Etc/UTC", utc_offset: 0, year: 2021, zone_abbr: "UTC" }

vs.

    ~U[2021-08-14 23:57:22.121432Z]

Note that in the first case (structural initialization), if there were any other complex non-scalar objects nested within the main one, you'd have to define those too. It's a forced encoding of an Abstract Syntax Tree representation of the structure of the data; but it's an AST that's "exploded" or "cross-sectional" — one that has no functions, no ability to encapsulate/abstract.

Here's a snippet from https://yaml.org/YAML_for_ruby.html#perl_regexps:

    !perl/regexp: 
      REGEXP: "R[Uu][Bb][Yy]$" 
      MODIFIERS: i

Wouldn't it be less annoying to both read and write that in a YAML document if it were expressed the way you'd expect — as:

    /R[Uu][Bb][Yy]$/i

...where YAML itself would know to parse the latter as if it were the former, and to generate the latter in place of the former (for this, and an exclusive few other common structurally-initialized types that constantly get represented in YAML documents)?
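As a rough sketch of that round trip, the following Python converts between the slash-delimited literal and the two-key mapping from the snippet above. The function names, the accepted modifier letters, and the escaping rules are assumptions for illustration only; a real design would have to specify how a `/` inside the pattern is escaped.

```python
import re

# Assumed literal shape: /pattern/modifiers. The pattern may contain
# backslash escapes (such as \/), and modifiers come from a small fixed set.
REGEX_LITERAL = re.compile(r"/(?P<body>(?:[^/\\]|\\.)+)/(?P<mods>[imsx]*)")


def regex_literal_to_mapping(text):
    """Parse side: expand /pattern/mods into the exploded mapping."""
    match = REGEX_LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a regex literal: {text!r}")
    return {"REGEXP": match.group("body"), "MODIFIERS": match.group("mods")}


def mapping_to_regex_literal(node):
    """Emit side: collapse the exploded mapping back into /pattern/mods."""
    return "/{REGEXP}/{MODIFIERS}".format(**node)


exploded = regex_literal_to_mapping("/R[Uu][Bb][Yy]$/i")
compact = mapping_to_regex_literal(exploded)
```

Here `exploded` has the same two keys as the YAML snippet, and `compact` reproduces the original literal, so the sugar loses no information in either direction for this input.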

travisjungroth · on Aug 15, 2021

Thanks for explaining. I'm not sure that's laughed at in every language. It's standard in Clojure to just define things as maps, lists, etc. I actually think it's a pretty good idea.

Your examples, datetimes and regex, make the DSL option seem nice. In picking good examples you picked ones I'd be familiar with. But that's sort of the trick. If something is completely new to me, I'd much rather have it blown up.

derefr · on Aug 15, 2021

The thing with all of the “common” structural data types, though — datetimes, regular expressions, UUIDs, URLs — is that they have either a conventional or separately-standardized syntax, separate from the syntax of any particular programming language they’re hosted in. If you know what these things are, and what they’re for, then it’s impossible to have not encountered the basically-universal notation for expressing them as well.

And my thinking is that, if you don’t know what they are, then you’ll need to look up what they are, in order to understand the semantics at play. And doing that will force you through learning the notation as well. There’s never really a point at which a (responsible) programmer will be trying to deal with modifying the fields inside e.g. a URL, while having no understanding of what a URL is (and so seeing any familiarity advantage in the exploded-field syntax over the DSL syntax.) You’ll learn the syntax on your way to understanding the semantics, and so will end up preferring the compact DSL notation, just like everyone else.

A somewhat analogous example: there’s no common method of teaching elementary arithmetic that doesn’t pass through binary-operator expression syntax with binding affinity (i.e. “order of operations.”) In theory, you could learn elementary arithmetic entirely in the form of functional application trees (i.e. arithmetic in Lisp), or entirely in stack-machine/RPN notation (i.e. arithmetic in Forth); but no elementary-school teacher actually teaches arithmetic this way, and there are no materials aimed at children that try to do this. So, by learning arithmetic, people get this additional bit of enculturation of learning to deal with parsing out the meaning of mixed binary-operator expressions using a precedence ladder; and end up preferring the “convenience” of the compact-but-complex binary-operator notation, over the exploded-but-simple AST notation.

And also, to be clear, I’m not suggesting YAML would be better off if it did this for any arbitrary structural pattern that happens to have a formal notation for it somewhere in the world. Just the ones that most-every programmer will inevitably run into, because every programming language modern enough to support YAML also supports the expression of those types in the form of those literals. (For example, effectively every language that has a native URL type supports expressing URLs through factory-method literals by calling `URL.parse` on the string representation; and everyone who writes URL-handling code in a given language, when defining a constant URL, would automatically reach for “write the URL in RFC 1738 URL notation in a string and pass it into URL.parse” over “structurally initialize a URL struct.”)
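For a concrete comparison in one language, here is the same contrast in Python, where the standard library's `urllib.parse.urlsplit` plays the role of `URL.parse` and `SplitResult` is the underlying struct. The example URL is made up.

```python
from urllib.parse import SplitResult, urlsplit

# Factory-method style: write the URL in its usual notation and parse it.
parsed = urlsplit("https://example.com:8443/docs/index.html?lang=en#intro")

# Structural-initialization style: spell out every field by hand.
exploded = SplitResult(
    scheme="https",
    netloc="example.com:8443",
    path="/docs/index.html",
    query="lang=en",
    fragment="intro",
)

# Both routes produce the same value; only the notation differs.
same_value = parsed == exploded
```

Both forms build an equal value, but most people defining a constant URL would write the first one.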

travisjungroth · on Aug 15, 2021

I totally agree on the common data types. Thanks for clarifying for me.

toiletaccount · on Aug 14, 2021

when i see yaml, i use other software

a-b · on Aug 14, 2021

This is why you should consider https://carvel.dev/ytt/