(I am an absolutist on this matter. For YAML to be a superset, all, and that's A L L, valid JSON strings must also be valid YAML. A single failure makes it not a superset. At scale, any difference will eventually occur, which is why even small deviations matter.)
I’ve often heard this (YAML is a superset of JSON) but never looked into the details.
According to https://yaml.org/spec/1.2.2/, YAML 1.2 (from 2009) is a strict superset of JSON. Earlier versions were an _almost_ superset. Hence the confusion in this thread. It depends on the version…
CPAN link provided by the parent says 1.2 still isn't a superset:
> Addendum/2009: the YAML 1.2 spec is still incompatible with JSON, even though the incompatibilities have been documented (and are known to Brian) for many years and the spec makes explicit claims that YAML is a superset of JSON. It would be so easy to fix, but apparently, bullying people and corrupting userdata is so much easier.
"Please note that YAML has hardcoded limits on (simple) object key lengths that JSON doesn't have and also has different and incompatible unicode character escape syntax... YAML also does not allow \/ sequences in strings"
I just checked YAML 1.2 and it seems that the 1024-character limit on key length is still in the spec (https://yaml.org/spec/1.2.2/, ctrl+f, 1024). So any JSON with long keys is not compatible with YAML.
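For what it's worth, here is a minimal sketch of how one might flag JSON documents that would trip over that limit. The helper name `keys_over_limit` and the constant are made up for illustration, and it counts decoded characters rather than the serialized form, so treat it as an approximation of the spec's rule, not the rule itself.

```python
import json

# Assumption for illustration: the 1024-character figure quoted from the YAML spec.
YAML_SIMPLE_KEY_LIMIT = 1024


def keys_over_limit(value, limit=YAML_SIMPLE_KEY_LIMIT):
    """Yield every object key in a decoded JSON value that is longer than `limit`."""
    if isinstance(value, dict):
        for key, child in value.items():
            if len(key) > limit:
                yield key
            yield from keys_over_limit(child, limit)
    elif isinstance(value, list):
        for child in value:
            yield from keys_over_limit(child, limit)


# Build a perfectly valid JSON document with one key just past the limit.
long_key = "k" * (YAML_SIMPLE_KEY_LIMIT + 1)
document = json.dumps({"short": 1, "nested": [{long_key: 2}]})

decoded = json.loads(document)  # the JSON parser has no objection to the long key
offenders = list(keys_over_limit(decoded))
```

Whether a given YAML parser actually rejects such a document is a separate question; this only finds the keys the spec text would call out.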
Another reason to have a limit well below the computer's memory capacity is that one could find ill-formed documents in the wild, e.g., an unclosed quotation mark, causing the "rest" of a potentially large file to be read as a key, which can quickly snowball (imagine if you need to store the keys in a database, in a log, if your algorithms need to copy the keys, etc.).
I assume JSON implementations have some limit on the key size (or on the whole document, which limits the key size), hopefully far below the available memory.
I assume and hope that they do not, if there is no rule stating that they are invalid. There are valid reasons for JSON to have massive keys. A simple one: depending on the programming language and libraries used, an unordered array ["a","b","c"] might be better mapped as a dictionary {"a":1,"b":1,"c":1}. Now all of your keys are semantically values, and any limit imposed on keys only makes sense if the same limit is also imposed on values.
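A quick sketch of that array-as-dictionary mapping, in case it isn't obvious why keys end up carrying the data. The function name `list_to_key_set` is just something I made up for the example.

```python
import json


def list_to_key_set(items):
    """Map an unordered list of strings to a dict whose keys are the items."""
    return {item: 1 for item in items}


# The values from the list are now object keys in the serialized JSON.
payload = json.dumps(list_to_key_set(["a", "b", "c"]))

# On the receiving side, membership tests become plain key lookups.
decoded = json.loads(payload)
has_b = "b" in decoded
has_z = "z" in decoded
```

If one of those items happened to be a very long string, it would now be a very long key, which is the whole point.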
Yes absolutely, in practice the limit seems to be on the document size rather than on keys specifically. That said, it still sets a limit on the key size (to something a bit less than the max full size), and some JSON documents valid for a given JSON implementation might not be parsable by others, in which case the YAML parsers are no exception ;)
I'm not even sure why I'm playing the devil's advocate, I hate Yaml actually :D
> Then we said it's too verbose. We named some subsets XML, HTML, XSLX
If anything, XML as an SGML subset is more verbose than SGML proper; in fact, getting rid of markup declarations to yield canonical markup without omitted/inferred tags, shortforms, etc. was the entire point of XML. Of course, XML suffered as an authoring format due to verbosity, which led to the Cambrian explosion of Wiki languages (MediaWiki, Markdown, etc.).
Also, HTML was conceived as an SGML vocabulary/application [1], and for the most part still is [2] (save for mechanisms to smuggle CSS and JavaScript into HTML without the installed base of browsers displaying these as content at the time, plus HTML5's ad-hoc error recovery).
While indeed neither markdown, much less JSON syntax has been intended as an SGML app, that doesn't stop SGML from parsing JSON, markdown, and other custom Wiki syntax using SHORTREF [1] ;) In fact, the original markdown language is specified as a mapping to HTML angle-bracket markup (with HTML also an SGML vocabulary), and thus it's quite natural to express that mapping using SGML SHORTREF, even though only a subset can be expressed.
I think you'll find that in the beginning were M-expressions, but they were evil, and were followed by S-expressions, which were and are and ever will be good.
SGML and its descendants are okay for document markup.
XML for data (as opposed to markup) is either evil or clown-shoes-for-a-hat insane — I can’t figure out which.
JSON is simultaneously under- and over-specified, leading to systems where everything works right up until it doesn't. It shares a lot with C and Unix in this respect.
If XML for data is bad, check out XML as a programming language. I think this has cropped up a few times, one that stuck with me was as templating structures in the FutureTense app server, before being acquired by OpenMarket and they switched to JSPs or something.
Lots of <for something> <other stuff> </for> sorts of evil.
Python's .netrc library also hasn't supported comments correctly for like 5 years. The bug was reported, it was never fixed. If I want to use my .netrc file with Python programs, I have to remove all comments (that work with every other .netrc-using program).
It's 2022 and we can't even get a plaintext configuration format from 1980 right.
> It's 2022 and we can't even get a plaintext configuration format from 1980 right.
To me, it's more depressing that we've been at this for 50-60 years and still seemingly don't have an unambiguously good plaintext configuration format at all.
I've been a Professional Config File Wrangler for two decades, and I can tell you that it's always nicer to have a config file that's built to task rather than being forced to tie yourself into knots when somebody didn't want to write a parser.
The difference between a data format and a configuration file is the use case. JSON and YAML were invented to serialize data. They only make sense if they're only ever written programmatically and expressing very specific data, as they're full of features specific to loading and transforming data types, and aren't designed to make it easy for humans to express application-specific logic. Editing them by hand is like walking a gauntlet blindfolded, and then there's the implementation differences due to all the subtle complexity.
Apache, Nginx, X11, RPM, SSHD, Terraform, and other programs have configuration files designed by humans for humans. They make it easy to accomplish tasks specific to those programs. You wouldn't use an INI file to configure Apache, and you wouldn't use an Apache config to build an RPM package. Terraform may need a ton of custom logic and functions, but X11 doesn't (Terraform actually has 2 configuration formats and a data serialization format, and Packer HCL is different than Terraform HCL). Config formats minimize footguns by being intuitive, matching application use case, and avoiding problematic syntax (if designed well). And you'd never use any of them to serialize data. Their design makes the programs more or less complex; they can avoid complexity by supporting totally random syntax for one weird edge case. Design decisions are just as important in keeping complexity down as in keeping good UX.
Somebody could take an inventory of every configuration format in existence, matrix their properties, come up with a couple categories of config files, and then plop down 3 or 4 standards. My guess is there's multiple levels of configuration complexity (INI -> "Unixy" (sudoers, logrotate) -> Apache -> HCL) depending on the app's uses. But that's a lot of work, and I'm not volunteering...
I quite like CUELang (https://cuelang.org/), although it is not yet widely supported.
It strikes a good balance between expressivity and readability: it has enough logic to be useful, but not so much that it begs for abuse. It can import from and export to YAML and JSON, and it features an elegant type system which lets you define both the schema and the data itself.
Although I do feel like there is a case to be made that if you need a Turing complete configuration language then in most cases you failed your users by pushing too many decisions on to them instead of deciding on sensible defaults.
And if you are dealing with one of the rare cases where Turing complete configuration is desirable then maybe use Lua or something like that instead.
I'm not defending YAML. YAML is terrible. It's even worse with logic and/or templates (looking at you, Ansible). Toml is certainly better but I'm still baffled as to why we don't have a "better YAML". YAML could almost be okay.
Followup to my own post: don't forget about Scheme! Same nice properties as Lua, but you get some extra conveniences from using s-expressions (which can represent objects somewhat more flexibly, like XML, than Lua, which is more or less 1:1 with JSON).
There's StrictYAML[1][2]. Can't say I've used it as let's face it, most projects bind themselves to a config language - whether that be YAML, JSON, HCL or whatever - but I'd like to.
Yeah, I think it's because nobody sat down and methodically created it.
People create config languages that work for their use case and then it is just a happy accident if it works for other things.
I don't think anyone has put serious effort into designing a configuration language. And by that I mean collect use cases, study how other config languages do things, make drafts, test them, etc.
I know a lot of people hate it but I find it to be the only configuration language that makes any sense for moderately large configs.
It’s short, readable, and unambiguous, with great IDE support. It has built-in logic, variables, templates, functions and references to other resources, without being a Turing-complete imperative language and without becoming an XML monstrosity.
Seriously there is nothing even close to it. Tell me one reasonable alternative in wide use that’s not just some preprocessor bolted onto yaml, like Helm charts or Ansible jinja templates.
There's a world of difference between "simple configuration needs" and "complex configuration needs".
I will take a Kubernetes deployment manifest as an example of something you would want to express in a hypothetically perfect configuration language. Now, eventually, you end up in the "containers" bit of the pod template inside the deployment spec.
And in that, you can (and arguably should) set resources. But, in an ideal world, when you set a CPU request (or, possibly, limit, but I will go with request for now) for an image that has a Go binary in it, you probably also want to have a "GOMAXPROCS" environment variable added that is the ceiling of your CPU allocation. And if you add a memory limit, and the image has a Java binary in it, you probably want to add a few of the Java memory-tuning flags.
And it is actually REALLY important that you don't repeat yourself here. In the small, it's fine, but if you end up in a position where you need to provide more, or less, RAM or CPU, on short notice (because after all, configuration files drive what you have in production, and mutating configuration is how you solve problems at speed, when you have an outage), any "you have to carefully put the identical thing in multiple places" is exactly how you end up with shit not fixing themselves.
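To make the "say it once" point concrete, here is a rough sketch in Python rather than any real config language. Everything in it is an assumption for illustration: the `container_spec` helper, the `runtime` parameter, and especially the 75% heap rule, which is a placeholder and not a recommendation. The idea is only that the CPU and memory numbers are written in one place and the dependent settings are derived from them.

```python
import math


def parse_cpu(quantity):
    """Parse a Kubernetes-style CPU quantity ("2", "2.5", "2500m") into cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def container_spec(name, image, cpu_request, memory_limit_mib, runtime):
    """Build a container entry whose env vars are derived from its resources."""
    env = []
    if runtime == "go":
        # GOMAXPROCS is the ceiling of the CPU request, never less than 1.
        procs = max(1, math.ceil(parse_cpu(cpu_request)))
        env.append({"name": "GOMAXPROCS", "value": str(procs)})
    elif runtime == "java":
        # Hypothetical tuning rule: give the heap 75% of the memory limit.
        heap_mib = memory_limit_mib * 3 // 4
        env.append({"name": "JAVA_TOOL_OPTIONS", "value": f"-Xmx{heap_mib}m"})
    return {
        "name": name,
        "image": image,
        "resources": {
            "requests": {"cpu": cpu_request},
            "limits": {"memory": f"{memory_limit_mib}Mi"},
        },
        "env": env,
    }


go_container = container_spec("api", "example/api:1", "2500m", 512, "go")
java_container = container_spec("batch", "example/batch:1", "1", 512, "java")
```

Bump the CPU request during an outage and the environment variable follows along without anyone having to remember it exists.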
So, yeah, as much hate as it gets, BCL may genuinely be better than every other configuration language I have had the misfortune to work with. And one of the things I looked forward to, when I left the G, was to never ever in my life have to see or think about BCL ever again. And then I saw what the world at large is content with. It is bloody depressing is what it is.
Yeah absolutely. I think there are four corners to the square: "meant to be written by humans/meant to be written by computers" and "meant to be read by humans/not meant to be read by humans". JSON is the king of meant to be written by computers read by humans, grpc and swift and protobuf and arrow can duke it out the written by computer/not read corner. We are missing good options in written by humans half.
And the sysadmin in me developed a dislike of both within 1 minute of looking at them.
Honestly, I think a good configuration library should be more than a spec, it should come with a library that handles parsing/validation.
See, there are two sides to configuration, the user and the program. Knowledge about the values, defaults and types should live on the program side and should be documented. Then the user side of configuration can be clean and easy to read/write and most important of all, allow the user to accomplish the most common configuration without having to learn a new config language on top of learning the application.
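Roughly what I have in mind, sketched with a Python dataclass. The `ServerConfig` fields and the `load_config` helper are invented for the example. The program owns the names, types and defaults, and the user only supplies what differs.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ServerConfig:
    """Program-side schema: names, types and defaults live here."""
    host: str = "127.0.0.1"
    port: int = 8080
    debug: bool = False


def load_config(user_values):
    """Validate user-supplied settings against the schema and fill in defaults."""
    schema = {f.name: f.type for f in fields(ServerConfig)}
    for key, value in user_values.items():
        if key not in schema:
            raise ValueError(f"unknown setting: {key!r}")
        # Exact type match, so that True is not silently accepted as an int.
        if type(value) is not schema[key]:
            raise ValueError(
                f"{key!r} must be {schema[key].__name__}, got {type(value).__name__}"
            )
    return ServerConfig(**user_values)


# The user side stays tiny: only the setting that differs from the default.
config = load_config({"port": 9000})
```

The user-facing file can then be almost anything, since the checking no longer depends on the file format.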
> Honestly, I think a good configuration library should be more than a spec, it should come with a library that handles parsing/validation
You just described CUELang.
The type system lets you define a schema as well as the data, in the same file or in two separate ones. Then you can call either a CLI tool (that works on Linux, Windows or Mac) or use the Go lib (or bind to it).
For compat, cue can import and export to yaml, json and protobuf, as well as validate them.
Exactly. So if I'm going to learn/use one of them, there's no clear winner, really. Both also seem to have about the same amount of adoption (zero?).
About Ansible, I think it gained its success partially due to YAML.
Ansible is worse than Puppet and CFEngine in many ways, but it is superior in the user interface.
It managed to not only be a config management solution, but provide a universal config language that most apps could be configured with. So for a lot of use cases, if you know Ansible/YAML then you don't have to learn a new configuration language on top of learning a new application.
The problem with Ansible is it's not universal, because most app playbooks are configured in the worst possible way. In my experience typically you get handed an Ansible script, something which you'd hoped was declarative but isn't (like a version that apt-get grabs isn't fixed, or even, gets patched) then suddenly a downstream templated command fucks up, and the person who wrote the script isn't around anymore (or you don't trust their chops because they are a blowhard that worked at Google/Facebook and had a coddling ops team behind them in the past) or worse it's from "community" and has a billion hidden settings that you can't be bothered to grok - and so you have to dig so many layers down that you are better off just fucking rewriting the Ansible script to do the one thing which probably should have been four lines.
In any case, I found Ansible scripts to have like a 3 month half life. If we were lucky. I'm not bitter.
haha, I can go on lengthy rants about every single configuration management system that I have used.
My dream configuration system should revert to default when the config is removed (keeping data). Have a simple/easy user interface. Have maintained modules with sane defaults for the 500 most common server software. I would rather there be no module than an abandoned one with unsafe defaults, that way it is clear that I would have to maintain my own if I want to use that particular piece of software. Performant, it really shouldn't take more than a few minutes to apply a config change. No more than 30 min for initial run.
Ansible was primarily agent-less from the start, which made it ridiculously easy to sneak into existing infrastructure and manual workflows. I probably would not have been able to stand up Puppet or Salt or whatever but I could run Ansible all by myself with no one to stop me :).
I'm curious what your thoughts are on a config language I'm working on.
GitHub.com/vitiral/zoa
It has both binary and textual representation (with the first byte being able to distinguish them), and the syntax is clean enough I'm planning on extending it into a markup language as well.
This is why I like INI. It doesn't have these problems, because it doesn't try to wrangle the notion of nested objects (or lists) in the first place. The lack of a formal spec is a problem, sure, but it is such a basic format that it's kind of self-explanatory.
When the problem is TOML not supporting easy nesting, a solution of "Don't nest." works just as well in TOML as it does in ini. It's not really an advantage of ini. Especially when a big factor in TOML not making it easy is that TOML uses the same kind of [section]\nkey=value formatting that ini does!
I wrote an INI parser that has numerical, boolean, timestamp, MAC address, and IP address types ;) "advantages" of not having a spec!
Seriously: for application-specific config files, the lack of a formal spec can be kind of a nice thing. You can design your parser to the exact needs of your program, with data types that make sense for your use case. Throw together a formal grammar for use in regression testing, and you're all set.
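Not my actual parser, but a minimal sketch of the same idea using Python's standard `configparser` and its `converters` argument. The section name, keys, and the `parse_mac` helper are invented for the example, and the MAC check is deliberately simplistic.

```python
import configparser
import ipaddress
import re

MAC_RE = re.compile(r"^(?:[0-9a-f]{2}:){5}[0-9a-f]{2}$")


def parse_mac(text):
    """Normalize a MAC address to lowercase colon-separated form, or raise."""
    normalized = text.strip().lower().replace("-", ":")
    if not MAC_RE.match(normalized):
        raise ValueError(f"not a MAC address: {text!r}")
    return normalized


# Each converter name becomes a get<name>() method on the parser and its sections.
parser = configparser.ConfigParser(
    converters={"ip": ipaddress.ip_address, "mac": parse_mac}
)
parser.read_string("""
[uplink]
address = 192.0.2.10
hardware = 00-1B-44-11-3A-B7
enabled = yes
""")

uplink = parser["uplink"]
address = uplink.getip("address")      # an ipaddress.IPv4Address
hardware = uplink.getmac("hardware")   # a normalized string
enabled = uplink.getboolean("enabled") # a real bool
```

The types are exactly the ones the application cares about, and nothing else has to understand them.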
Obviously a formal spec is essential for data interchange, but that's why JSON exists. To me, YAML is in a gray area that doesn't need to exist. The same thing goes for TOML, but to a far lesser extent.
Everything gets serialized to a string of bytes. The point is that you can fail at parsing when the value doesn't make sense, rather than failing at some point in the future when you decide to use the value and it doesn't make sense. And if you have a defined schema, you can have your editor validate it against the schema when saving, so you don't accidentally have "FILENOTFOUND" in a Boolean.
TOML sucks for list of tables simply because they intentionally crippled inline tables to only be able to occupy one line. For ideology reasons ("we don't need to add a pseudo-JSON"). Unless your table is small, it's going to look absolutely terrible being all crammed into one line.
I would still reach for TOML first if I only needed simple key-value configuration (never YAML), but for anything requiring list-of-tables I would seriously consider JSON with trailing commas instead.
I see the point and this is certainly a drawback of TOML but for me this is something of a boundary case between configuration and data.
When configuration gets so complicated that the configuration starts to resemble structured data I tend to prefer to switch to a real scripting language and generate JSON instead.
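Something like the following is what I mean, as a bare-bones sketch. The `service` helper, the field names, and the health-check URL pattern are all hypothetical. The script holds the repetition and the program only ever sees plain JSON.

```python
import json


def service(name, port, replicas=1):
    """Build one table of the list; derived fields are computed, not hand-typed."""
    return {
        "name": name,
        "port": port,
        "replicas": replicas,
        "health_check": f"http://localhost:{port}/healthz",
    }


config = {
    "services": [
        service("api", 8000, replicas=3),
        service("worker", 8001),
    ]
}

# Stable, diff-friendly output for whatever consumes the config.
rendered = json.dumps(config, indent=2, sort_keys=True)
```

The generated file can be checked in or piped straight to the application, whichever fits.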
This is why there should be a way to automatically install software into a sandboxed location, e.g. a virtualenv.
Considering we are having software drive cars today it should be trivial and I would say even arguably expected that software should be able to autonomously "figure out" how to run itself and avoid conflicts with other software since that's a trivial task in comparison to navigating city streets.
Tested on python what? I was curious to see what error that produced, figuring it would be some whitespace due to the difference between the list items, but using the yamlized python that I had lying around, it did the sane thing:
PATH=$HOMEBREW_PREFIX/opt/ansible/libexec/bin:$PATH
pip list | grep -i yaml
python -V
python <<'DOIT'
from io import StringIO
import yaml
print(yaml.safe_load(StringIO(
'''
{
"list": [
{},
{}
]
}
''')))
DOIT
$ sed 's/\t/--->/g' break-yaml.json
--->{
--->--->"list": [
--->--->--->{},
--->--->--->{}
--->--->]
--->}
$ jq -c . break-yaml.json
{"list":[{},{}]}
$ yaml-to-json.py break-yaml.json
ERROR: break-yaml.json could not be parsed
while scanning for the next token
found character '\t' that cannot start any token
in "break-yaml.json", line 1, column 1
$ sed 's/\t/ /g' break-yaml.json | yaml-to-json.py
{"list": [{}, {}]}
It would be great if instead of the histrionic message on CPAN (which amusingly accuses others of "mass hysteria"), the author would just say "YAML documents can't start with a tab while JSON documents can, making JSON not a strict subset of YAML".
The YAML spec should be updated to reflect this, but I wonder if a simple practical workaround in YAML parsers (like replacing each tab at the beginning of the document with two spaces before feeding it to the tokenizer) would be sufficient in the short term.
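A minimal sketch of that workaround as a preprocessing step, in case anyone wants to try it. The function name `detab_indentation` and the two-spaces-per-tab choice are mine. Since JSON strings cannot contain literal tabs or newlines, whitespace at the start of a line is always outside any string, so rewriting it should not change the data.

```python
import json
import re

# Runs of spaces and tabs at the start of any line.
_LEADING_WS = re.compile(r"^[ \t]+", re.MULTILINE)


def detab_indentation(text, spaces_per_tab=2):
    """Replace tabs in leading whitespace with spaces, leaving the rest untouched."""
    return _LEADING_WS.sub(
        lambda match: match.group(0).replace("\t", " " * spaces_per_tab),
        text,
    )


# The same tab-indented document as in the shell session.
tabbed = '\t{\n\t\t"list": [\n\t\t\t{},\n\t\t\t{}\n\t\t]\n\t}\n'
cleaned = detab_indentation(tabbed)

# The JSON meaning is unchanged; only the indentation characters differ.
same_data = json.loads(cleaned) == json.loads(tabbed)
```

The cleaned text is what you would hand to the YAML parser instead of the original.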
> "YAML documents can't start with a tab while JSON documents can, making JSON not a strict subset of YAML"
But YAML can start with tabs. Tabs are allowed as separating whitespace in most of the spec productions but are not allowed as indentation. Even though those tabs look like indentation, the spec productions don't interpret them as such.
Note: the YAML spec maintainers (I am one) have identified many issues with YAML which we are actively working on, but (somewhat surprisingly) we have yet to find a case where valid JSON is invalid YAML 1.2.
Thanks for the clarification. Let's fix it in PyYAML then :)
Speaking of PyYAML, I recently ran into an issue where I had to heavily patch PyYAML to prevent its parse result from being susceptible to entity expansion attacks. It would be nice to at least have a PyYAML mode to completely ignore anchors and aliases (as well as tags) using simple keyword arguments. Protection against entity expansion abuse would be nice too.
They should remove the phrase "every JSON file is also a valid YAML file" from the YAML spec. 1) it isn't true, and 2) it seems like it goes against the implication made here:
> This makes it easy to migrate from JSON to YAML if/when the additional features are required.
If JSON interop is provided solely as a short-term solution that eases the transition to YAML, then I applaud the YAML designers for making a great choice.
I'm not a fan of YAML either, but I think you should not generate YAML files if you can avoid it. All YAML you encounter should be hand-written, so this problem should not occur.
I read "YAML is a superset of JSON" not as a logical statement, but as instructions to humans writing YAML. If you know JSON, you can use that syntax to write YAML. Just like, if you know JavaScript or Python (or to some extent PHP) object syntax, you can write JSON.
If you get a parse error, no biggie, you Alt+Tab to the editor where you are editing the config file and correct it. It is not like you are serving this over the net to some other program.
As long as you tell the typescript compiler not to stop when it finds type problems, all JavaScript works and compiles, right? That sounds like a superset to me. Syntactically there are no problems, and the error messages are just messages.
> As long as you tell the typescript compiler not to stop when it finds type problems, all JavaScript works and compiles, right?
Does such code count as valid TypeScript though? It sounds more as if the compiler has an option to accept certain invalid programs.
You could build a C++ compiler with a flag to warn, rather than error, on encountering implicit conversions that are forbidden by the C++ standard. The language the compiler is accepting would then no longer be standard C++, but a superset. (Same for all compiler-specific extensions of course.)
Personally I'm inclined to agree with this StackOverflow comment. [0] It's an interesting edge-case though.
It's syntactically and functionally correct, so despite the error messages I think 'valid' is a better label.
> You could build a C++ compiler with a flag to warn, rather than error, on encountering implicit conversions that are forbidden by the C++ standard. The language the compiler is accepting would then no longer be standard C++, but a superset. (Same for all compiler-specific extensions of course.)
The way I see it, these errors are already on par with C++ warnings. C++ won't stop you if you make a pointer null or use the wrong string as a map key.