I get the argument that YAML is here to stay and we need to live with it as a configuration language. I can buy an argument that environment variables are best for configuration with less than 10 knobs, and YAML is fine when there's a simple hierarchy, let's say within 250 knobs.
Schema validation, IMO, only really becomes necessary for much larger configurations, the kinds that are the size of Kubernetes manifests or CI/CD pipelines. And we do have less powerful languages, like CUE, that prove that one doesn't need a Turing-complete language to have expressive schemas.
If you have to support YAML, that's one thing. But ideally, if you're at a scale where it really matters, you should be looking at a more modern configuration language.
The difference is because full validation/parsing is a task that can rarely be fully accomplished with JUST a non-Turing-complete schema. Every time I use JSON Schema I have to add additional validation on top, written in Turing-complete code.
This happened to me literally just an hour ago when I wanted to put a DSL in a field in a config file. json-schema (the "config" schema) doesn't let me write code to validate this and reject it. It's a string or it's not. With StrictYAML schemas written in code it's pretty straightforward to create a parser/validator that rejects invalid DSL with a meaningful error.
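A minimal sketch of what "schema written in code" can look like, in plain Python rather than StrictYAML's actual API. The `rate_limit` field, its `<count>/<unit>` mini-DSL, and every name below are made up for illustration:

```python
import re
from dataclasses import dataclass


class ConfigError(ValueError):
    """Validation failure that carries the path of the offending field."""

    def __init__(self, path: str, message: str):
        super().__init__(f"{path}: {message}")
        self.path = path


@dataclass(frozen=True)
class RateLimit:
    count: int
    unit: str


# Hypothetical DSL: "<count>/<unit>", e.g. "100/minute".
_RATE_RE = re.compile(r"(\d+)/(\w+)")
_UNITS = {"second", "minute", "hour"}


def parse_rate_limit(value: object, path: str) -> RateLimit:
    """Parse the DSL string, rejecting anything malformed with a readable error."""
    if not isinstance(value, str):
        raise ConfigError(path, f"expected a string, found {type(value).__name__}")
    match = _RATE_RE.fullmatch(value.strip())
    if match is None:
        raise ConfigError(path, f"expected '<count>/<unit>', found {value!r}")
    count, unit = int(match.group(1)), match.group(2)
    if unit not in _UNITS:
        raise ConfigError(path, f"unknown unit {unit!r}, expected one of {sorted(_UNITS)}")
    return RateLimit(count, unit)


def validate_config(config: dict) -> dict:
    """Structural check and DSL check happen in one pass, with one error type."""
    if "rate_limit" not in config:
        raise ConfigError("rate_limit", "required key is missing")
    return {"rate_limit": parse_rate_limit(config["rate_limit"], "rate_limit")}
```

The point is only that the "is it a string?" check and the "is it valid DSL?" check fail the same way, so whoever edits the config sees one kind of error message.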
You might argue that "these rules bolted on top aren't part of the schema" or "this is validation that you can do after the json schema validates" but there is benefit to combining them - namely, code coherence and validation error consistency.
(There are also downsides, namely that JSON Schema can be used from multiple languages. Strictness comes at the expense of reusability.)
In practice, for almost every schema I build, I want stricter validation rules than something like json-schema alone can enforce.
These are both instances of the law of least power. There are plenty of languages which are too powerful for the task at hand and plenty which are not powerful enough and people hack around and even rage against both. There are other "goldilocks" languages that are just right for the task at hand.
> This happened to me literally just an hour ago when I wanted to put a DSL in a field in a config file. json-schema (the "config" schema) doesn't let me write code to validate this and reject it.
You can embed DSLs in CUE. It's a bit unwieldy because you have to essentially reproduce the DSL grammar in CUE, and it may not be performant, but yeah, it's doable. Can you provide more details?
> You might argue that "these rules bolted on top aren't part of the schema" or "this is validation that you can do after the json schema validates" but there is benefit to combining them - namely, code coherence and validation error consistency.
I would argue that it's a slippery slope. Consider v1 where an enum is statically defined as Employee or Manager. Then in v2 we add VP and CEO. Then in v3 actually the list of permitted titles needs to be fetched from a database populated by HR. Is it still correct to put this in configuration validation? What if the person writing the configuration doesn't have permissions to read from HR's database? So nothing should work?
CUE lets you embed functions too. It looks like it's almost a programming language itself.
The closer a configuration language gets to a programming language the less of a reason I see for it to exist.
> I would argue that it's a slippery slope. Consider v1 where an enum is statically defined as Employee or Manager. Then in v2 we add VP and CEO. Then in v3 actually the list of permitted titles needs to be fetched from a database populated by HR. Is it still correct to put this in configuration validation?
No, coupling to a database would be bad design IMO, but grabbing those enums from other config files in the same folder that are parsed earlier I have done a lot.
I've also used libraries that provide lists of timezones and country codes as enums and plugged them into the parser so you couldn't invent your own country code.
And I've written validators that reference other bits of the config (e.g. the list of permitted titles is in another part of the config).
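As a rough illustration of that last case (plain Python, not any particular library's API; the `titles` and `staff` keys are invented for the example), a validator that checks one part of the config against another might look like this:

```python
def find_title_errors(config: dict) -> list[str]:
    """Check every staff title against the list declared elsewhere in the same config."""
    permitted = set(config["titles"])
    errors = []
    for index, person in enumerate(config["staff"]):
        title = person["title"]
        if title not in permitted:
            errors.append(
                f"staff[{index}].title: {title!r} is not one of {sorted(permitted)}"
            )
    return errors


# Hypothetical config, already parsed into plain dicts and lists.
example = {
    "titles": ["Employee", "Manager"],
    "staff": [
        {"name": "Ada", "title": "Manager"},
        {"name": "Bob", "title": "Overlord"},
    ],
}
title_errors = find_title_errors(example)
```

Nothing here leaves the config itself, which is the difference from the "fetch it from HR's database" scenario.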
All of these things I would argue are good and useful and not worth sacrificing in exchange for preventing possible misuse (like coupling parsers to a DB).
I actually wrote this parser in the first place because I wanted to create a good metalanguage for tersely defining strongly typed executable specifications in YAML (i.e. Gherkin done right). Tons of stuff I wanted to strictly validate wouldn't have been possible with config-based schema validation, and with YAML's weak, implicit typing it was a fucking mess.
My problem with libraries like this in Python is that, because nobody wants to have to hand write semantic validation code, they end up mashing semantics into the parser. What I want is for all the different file formats to be different syntaxes for Data, defined recursively as
Data = str | list[Data] | dict[str, Data]
Leave validation to other libraries. You can get very far simply by reflecting on type annotations.
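Here is a sketch of what I mean, under the assumption that the parser hands back nothing but strings, lists and dicts. The `convert` helper and the two dataclasses are hypothetical, and it only handles `str`, `int`, `List[...]` and dataclasses without defaults:

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Dict, List, Union, get_args, get_origin, get_type_hints

# The recursive Data type from above, spelled so it also evaluates at runtime.
Data = Union[str, List["Data"], Dict[str, "Data"]]


def convert(data, target):
    """Convert parsed Data into `target`, guided only by type annotations."""
    if get_origin(target) is list:
        if not isinstance(data, list):
            raise TypeError(f"expected a list, found {type(data).__name__}")
        (item_type,) = get_args(target)
        return [convert(item, item_type) for item in data]
    if is_dataclass(target):
        if not isinstance(data, dict):
            raise TypeError(f"expected a mapping for {target.__name__}")
        hints = get_type_hints(target)
        names = {field.name for field in fields(target)}
        unknown, missing = set(data) - names, names - set(data)
        if unknown or missing:
            raise TypeError(
                f"{target.__name__}: unknown keys {sorted(unknown)}, "
                f"missing keys {sorted(missing)}"
            )
        return target(**{name: convert(data[name], hints[name]) for name in names})
    if target in (str, int):
        if not isinstance(data, str):
            raise TypeError(f"expected a scalar, found {type(data).__name__}")
        return target(data)  # int("abc") raises ValueError
    raise TypeError(f"unsupported annotation {target!r}")


@dataclass
class Server:
    host: str
    port: int


@dataclass
class Config:
    name: str
    servers: List[Server]


parsed = {"name": "prod", "servers": [{"host": "a.example", "port": "8080"}]}
config = convert(parsed, Config)
```

The parser never learns what a port is; the type annotations carry the semantics, and the same `convert` would work on Data coming from JSON, TOML or anything else.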
I wonder how it compares to YAML 1.2, YAML 1.1 (which are not compatible with each other), and the weird mix of 1.2 and 1.1 that go-yaml/yaml (the one used in k8s, helm, docker) uses.
That’s relatively new (introduced in Python 3.6). You wouldn’t believe how many production code bases are still in 3.5 or lower.
The change log for 3.6 states:
> The order-preserving aspect of this new implementation is considered an implementation detail and should not be relied upon (this may change in the future, but it is desired to have this new dict implementation in the language for a few releases before changing the language spec to mandate order-preserving semantics for all current and future Python implementations; this also helps preserve backwards-compatibility with older versions of the language where random iteration order is still in effect, e.g. Python 3.5).
Therefore it’s advisable to use OrderedDict if there’s even a chance this code might be used with older versions.
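For what it's worth, a small sketch of the defensive version (the `load_settings` helper and its keys are invented for the example). Plain `dict` ordering only became a language guarantee in Python 3.7, whereas `OrderedDict` keeps insertion order on every Python 3 release:

```python
from collections import OrderedDict


def load_settings(pairs):
    """Build a mapping whose iteration order does not depend on the Python version."""
    return OrderedDict(pairs)


settings = load_settings([("zeta", 1), ("alpha", 2), ("mid", 3)])
ordered_keys = list(settings)  # insertion order: zeta, alpha, mid

# OrderedDict also treats order as part of equality, unlike a plain dict.
same_as_reversed = OrderedDict([("a", 1), ("b", 2)]) == OrderedDict([("b", 2), ("a", 1)])
```

That order-sensitive equality is worth knowing about if tests compare loaded configs against expected values.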
Yeah, I recently learned about this after giving another engineer PR feedback that dicts are not ordered and we cannot expect the intended behavior. Joke's on me. Old habits die hard.
Okay, permit me a curmudgeonly rant. As someone who has implemented a YAML parser and spent way too much time dissecting the spec along with analyzing various implementations, StrictYAML smacks of simple ignorance.
The page describing the project's raison d'être [0] is mostly a collection of incredulous statements, sprinkled with lovely factual errors. Heck, the point about implicit typing even links to the YAML 1.2 spec, claiming that implicit types are intended behavior, while the spec explicitly makes the opposite clear.
That said, a lot of the common complaints about YAML are rooted in the fact that almost all end user libraries are stuck at YAML 1.1. This is mostly because everyone (including PyYAML) relies on libyaml, the primary culprit. IMHO, YAML 1.2 is quite nice, and I wish we could fix libyaml instead of everyone and their dog inventing their own half-baked language to scratch an itch.
Or perhaps even better, the primary inventor of YAML is currently avidly working on YAMLScript[1], which is a much more radical idea in programming and config language design, all while remaining backwards-compatible with YAML.
> YAML 1.2 is quite nice, and I wish we could fix libyaml instead of everyone and their dog inventing their own half-baked language to scratch an itch.
YAML 1.2 is categorically better than YAML 1.1.
That said, it still suffers from many of the things that suck about YAML. From a user's perspective, it suffers from anchors being a thing (billion laughs attack), duplicate keys, complex objects as keys (WTF is this feature even), and loading YAML to objects.
From an implementation perspective, the quoted scalars, several string forms, and huge number of corner cases really make parsers difficult to write.
I totally feel you on the implementation side. The formalization used in the spec leaves a lot to be desired, and the spec devs seem to recognize this well. One of the aims of my dayaml[0] project is to explore a completely different formalization that rationalizes out the non-fundamental sharp edges.
That said, there really aren't that many special cases. A lot of them are UI features, arguably endemic to the problem space in one form or another.
The deficiencies you see in the language, however, don't ring true with me. They are mostly implementation deficiencies:
- Anchors are just pointers. YAML can efficiently represent generic object graphs, but if a loader copies those pointers into a billion laughs, that's an issue with the implementation decision. Usually, it comes down to assuming that YAML graphs are always trees, which will always turn cyclic graphs into infinite tree unfoldings.
- Keys are explicitly specced as unique [1]. Not really sure why libyaml and friends get this wrong.
- Loading YAML to objects is explicitly designed to represent the native data structures of your language. It's a serialization format by design. That's why it loads into objects. That's why it has complex keys.
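To make the anchor point concrete, here is a toy model in Python, not tied to any real YAML loader. A billion-laughs-shaped document is just a handful of nodes with shared references; it only blows up if something unfolds it into a tree. The function names are made up, and the counter assumes an acyclic graph:

```python
def build_laughs(depth: int, width: int) -> list:
    """Each level holds `width` references to the single node one level down."""
    node = ["lol"]
    for _ in range(depth):
        node = [node] * width  # shared references, like aliases to one anchor
    return node


def count_unique_lists(root: list) -> int:
    """Number of distinct list objects actually held in memory."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if not isinstance(node, list) or id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node)
    return len(seen)


def unfolded_leaf_count(node, cache=None) -> int:
    """Leaves a loader would create if it copied every alias (acyclic input only)."""
    if cache is None:
        cache = {}
    if not isinstance(node, list):
        return 1
    if id(node) not in cache:
        cache[id(node)] = sum(unfolded_leaf_count(child, cache) for child in node)
    return cache[id(node)]


graph = build_laughs(depth=9, width=10)
in_memory = count_unique_lists(graph)    # 10 list objects
if_copied = unfolded_leaf_count(graph)   # 10**9 leaves if every alias were copied
```

Ten objects as a graph, a billion leaves as a tree: the size difference comes entirely from the decision to copy.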
We can certainly discuss whether YAML is appropriate as a configuration language or not, but YAML is first and foremost designed as a language-agnostic textual representation of object graphs. The spec goes well out of its way to make this clear, and viewing YAML for what it is, instead of a configuration language, really makes the apparent oddities disappear IMHO.
Yes. And it leads to exploits or errors. The dumber the serialization format, the better. JSON has thrived without these mis-features. And most stuff I've seen in the wild doesn't use the exotic features anyway.
The format user should worry about converting nodes to references, and complex keys are something few users ask for.
You are blaming implementation failures on the language design. That's kind of my whole gripe. Instead of wasting cycles on inventing new languages, I wish we'd pool resources into fixing libyaml.
> most stuff I've seen in the wild doesn't use the exotic features anyway
Mostly due to lack of awareness. Ruby, Python, JavaScript, Rust, Java, etc. all allow mostly arbitrary objects for keys. It's only confusing if you conflate dictionaries/maps with objects.
In the scientific computing community, it's not unheard of to see lists as keys in YAML documents, which is really convenient if that serializes to exactly the data model you're working with.
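A quick Python illustration of the data model I have in mind. The lattice example and the `freeze` helper are hypothetical; Python lists aren't hashable, so this sketch assumes sequence keys get turned into tuples on the way in:

```python
def freeze(value):
    """Recursively turn lists into tuples so the value can be used as a dict key."""
    if isinstance(value, list):
        return tuple(freeze(item) for item in value)
    return value


def build_mapping(pairs):
    """Build a dict from (key, value) pairs whose keys may be nested lists."""
    return {freeze(key): value for key, value in pairs}


# Hypothetical data: site energies keyed by lattice coordinates.
energies = build_mapping([([0, 0], -1.5), ([0, 1], -0.7), ([1, 1], 0.2)])
corner = energies[(0, 1)]
```

The key is the coordinate itself, so nobody has to invent a string encoding like "0,1" and parse it back.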
I'll just come out and say that I hate every single configuration language. All of them suck in their own unique way and every time a new one comes out it fixes some issues of the language it's supposed to supersede but never without introducing new problems. And eventually you're left thinking that you should've just used a .ini file.
"Configuration" languages are fundamentally necessary and it's tragic that people don't understand why.
There are declarative and imperative paradigms. Software engineering layers them on top of each other. Frontend devs write imperative TypeScript to manipulate declarative JSX, which instructs a React library to imperatively decide how to layout declarative HTML, which instructs a web browser to imperatively decide how to render, and so on. The frontend sends an imperative API call to a declaratively-specified API gateway, which imperatively forwards a declarative request body to a backend service, which imperatively goes through validation, authorization, etc. before submitting a declarative SQL SELECT to a database, which imperatively plans out a query over declarative representations of data on the disk, sending imperative system calls to the kernel/disk controller, etc.
Python, JavaScript, Rust, Go.... these are all fine programming languages that allow expressing an imperative paradigm.
But we have fewer languages for declarative paradigms. So-called "configuration" languages are attempts to build higher-level declarative paradigms. Nothing more, nothing less. We need higher-level declarative paradigms to build on top of the current imperative paradigms. It is the next step in the march towards more power and expressiveness, and therefore more productivity and ease of maintenance.
Agreed and they don't have good escape hatches. What was supposed to be declarative always ends up requiring some logic here and there and then you're stuck with a terrible language like YAML and you have to turn to templating, references to anchors and whatnot.
What I don't like is when you need to use a configuration language like Bicep or Terraform when the underlying architecture cannot be represented declaratively. You can create resources and provision them, that's fine. But any time you need forking paths, specific conditions, iterations over some resources, etc., you're done for, unless the configuration language has built a command or keyword for your specific use case. You can always tell me that I'm holding it wrong, but when the platform requires me to use those config files, or the SDKs for whatever languages are useless, it's infuriating.
Side note, not about configuration languages themselves but about how they're used on $cloudProvider: when you declare resources or operations that are legal in the language but invalid on the platform, I die a little bit inside. The platform has all the knowledge about the existence of resources, policies, behavior; there's a whole class of problems that should be caught before you even try to run a pipeline!
I like how you shifted the goalposts from “I” to “you” to justify your point of view. I don’t care, give me yaml, toml, json, jsonnet, ansible, who cares. It’s a tool. I’m not married to it.
I'll use what I'll have to use, it's a tool like you said. But I don't have to love it. Configuration is a necessary evil and whatever I end up using, I'm never fully satisfied with the end result.
Don’t comment then. You may not agree with me but I feel it is important to comment because it’s important to share an opinion shaping the path of new engineers joining the field. Tools are tools. Some are better than others. There’s no reason to have an emotional connection to them. Five years from now there will be new tools we never imagined we need. We are paid for getting the job done, not for an emotional opinion.