
YAML was intended to be a superset, but it isn't quite, which is about the worst case scenario. See https://metacpan.org/pod/JSON::XS#JSON-and-YAML , for instance.

(I am an absolutist on this matter. To be a superset, all, that's A L L, valid JSON strings must also be valid YAML. A single failure makes it not a superset. At scale, any difference will eventually occur, which is why even small deviations matter.)




I’ve often heard this (YAML is a superset of JSON) but never looked into the details.

According to https://yaml.org/spec/1.2.2/, YAML 1.2 (from 2009) is a strict superset of JSON. Earlier versions were an _almost_ superset. Hence the confusion in this thread. It depends on the version…


The CPAN link provided by the parent says 1.2 still isn't a superset:

> Addendum/2009: the YAML 1.2 spec is still incompatible with JSON, even though the incompatibilities have been documented (and are known to Brian) for many years and the spec makes explicit claims that YAML is a superset of JSON. It would be so easy to fix, but apparently, bullying people and corrupting userdata is so much easier.


Are these documented YAML 1.2 JSON incompatibilities listed / linked to somewhere?

I assume these are something related to non-ascii string encoding / escapes?


They are listed in that same CPAN link:

"Please note that YAML has hardcoded limits on (simple) object key lengths that JSON doesn't have and also has different and incompatible unicode character escape syntax... YAML also does not allow \/ sequences in strings"


The JSON::XS documentation linked above reports that YAML 1.2 is not a strict superset of JSON:

> Addendum/2009: the YAML 1.2 spec is still incompatible with JSON

The author also details their issues in, ah, getting some of the authors of the YAML specification to agree.


I just checked YAML 1.2 and it seems the 1024-character limit on key length is still in the spec (https://yaml.org/spec/1.2.2/, ctrl+f, 1024). So any JSON with long keys is not compatible with YAML.


The JSON specification [1] states:

> An implementation may set limits on the length and character contents of strings.

So this length limit is not a source of incompatibility with JSON.

[1] https://datatracker.ietf.org/doc/html/rfc7159#section-9


Wow! That makes it pretty hard to know you've generated useful JSON, especially if your goal is cross-ecosystem communication.


To be fair, any JSON implementation is going to have a practical limit on the key size, it's just a bit more random and harder to figure out :)


If you mean limited by available memory, then sure but that does not apply just to key size. If you mean something else, could you elaborate?


Another reason to have a limit well below the computer's memory capacity is that one could find ill-formed documents in the wild, e.g., an unclosed quotation mark, causing the "rest" of a potentially large file to be read as a key, which can quickly snowball (imagine if you need to store the keys in a database, in a log, if your algorithms need to copy the keys, etc.)


I assume JSON implementations have some limit on the key size (or on the whole document, which limits the key size), hopefully far below the available memory.


I assume and hope that they do not, if there is no rule stating that they are invalid. There are valid reasons for JSON to have massive keys. A simple one: depending on the programming language and libraries used, an unordered array ["a","b","c"] might be better mapped as a dictionary {"a":1,"b":1,"c":1}. Now all of your keys are semantically values, and any limit imposed on keys only makes sense if the same limit is also imposed on values.
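A minimal sketch of that pattern (illustrative only):

    import json

    # Represent an unordered collection as object keys, as described above.
    items = ["a", "b", "c"]
    as_set = {item: 1 for item in items}
    print(json.dumps(as_set))   # {"a": 1, "b": 1, "c": 1}

    # The values are now keys, so any key-length limit is effectively
    # a value-length limit for this encoding.
    print("b" in as_set)        # membership is a dict lookup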


Yes absolutely, in practice the limit seems to be on the document size rather than on keys specifically. That said, it still sets a limit on the key size (to something a bit less than the max full size), and some JSON documents valid for a given JSON implementation might not be parsable by others, in which case the YAML parsers are no exception ;)

I'm not even sure why I'm playing the devil's advocate, I hate Yaml actually :D


I guess it is about different implementations of some not properly formalized parts of the JSON spec.

There was also an article here some time ago but I cannot find it right now.


The 1024 limit is for unquoted keys, which do not occur in JSON.


Have a closer look. The 1024 limit in version 1.2 is only for implicit block mapping keys, not for flow style `{"foo": "bar"}`
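A small probe of the difference (hedged: the result depends on the parser and how strictly it applies the simple-key rule):

    import yaml  # PyYAML, assumed installed

    long_key = "k" * 2000

    flow = '{"%s": 1}' % long_key   # flow style, i.e. the JSON-like form
    block = '"%s": 1' % long_key    # block style implicit key, where the 1024 limit applies

    for label, doc in (("flow", flow), ("block", block)):
        try:
            yaml.safe_load(doc)
            print(label, "parsed")
        except yaml.YAMLError as e:
            print(label, "rejected:", type(e).__name__)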


In the beginning was the SGML.

Then we said it's too verbose. We named some subsets XML, HTML, XLSX.

Then we said it's still too long. So we named some subsets Markdown, and YML.

Then we said it's still too long, and made JSON.

What's wrong with subsets? Ambiguity in naming things.

https://martinfowler.com/bliki/TwoHardThings.html

Is JSON the same as YML?

NO.

Norwegian?

https://news.ycombinator.com/item?id=26671136


> Then we said it's too verbose. We named some subsets XML, HTML, XLSX

If anything, XML as an SGML subset is more verbose than SGML proper; in fact, getting rid of markup declarations to yield canonical markup without omitted/inferred tags, shortforms, etc. was the entire point of XML. Of course, XML suffered as an authoring format due to verbosity, which led to the Cambrian explosion of Wiki languages (MediaWiki, Markdown, etc.).

Also, HTML was conceived as an SGML vocabulary/application [1], and for the most part still is [2] (save for mechanisms to smuggle CSS and JavaScript into HTML without the installed base of browsers displaying these as content at the time, plus HTML5's ad-hoc error recovery).

[1]: http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html

[2]: http://sgmljs.net/docs/html5.html


Well, Markdown and YML and JSON are not subsets of SGML, nobody claims they are, and nobody intended them as such. So there's that.


While neither markdown nor JSON syntax was ever intended as an SGML app, that doesn't stop SGML from parsing JSON, markdown, and other custom Wiki syntax using SHORTREF [1] ;) In fact, the original markdown language [2] is specified as a mapping to HTML angle-bracket markup (with HTML also an SGML vocabulary), and thus it's quite natural to express that mapping using SGML SHORTREF, even though only a subset can be expressed.

[1]: https://www.balisage.net/Proceedings/vol17/html/Walsh01/Bali...

[2]: https://daringfireball.net/projects/markdown/


First they came for the angle brackets. And I did not speak out. Because I did not use XML...


You didn't use XML? But We use XML to read the comments here on this HTML web page.

But I came for the angle brackets. Because I < We, eternally.


> Then we said it's still too long. So we named some subsets Markdown, and YML.

> Then we said it's still too long, and made JSON.

JSON is older than markdown and yaml.


Thank you for correcting history! I'd forgotten >_<


I think you'll find that in the beginning were M-expressions, but they were evil, and were followed by S-expressions, which were and are and ever will be good.

SGML and its descendants are okay for document markup.

XML for data (as opposed to markup) is either evil or clown-shoes-for-a-hat insane — I can’t figure out which.

JSON is simultaneously under- and over-specified, leading to systems where everything works right up until it doesn't. It shares a lot with C and Unix in this respect.


If XML for data is bad, check out XML as a programming language. I think this has cropped up a few times; one that stuck with me was the templating structures in the FutureTense app server, before it was acquired by OpenMarket and they switched to JSPs or something.

Lots of <for something> <other stuff> </for> sorts of evil.


note: HTML5 is not a subset of SGML.


For example, this valid JSON doesn't parse as YAML:

    {
        "list": [
            {},
                {}
        ]
    }
(tested on Python)

edit: whitespace didn't quite make it through HN, here:

    json.loads('{\n  "list": [\n    {},\n\t{}\n    ]\n}')
    yaml.load ('{\n  "list": [\n    {},\n\t{}\n    ]\n}')


Python's netrc library also hasn't supported comments correctly for like 5 years. The bug was reported, it was never fixed. If I want to use my .netrc file with Python programs, I have to remove all comments (that work with every other .netrc-using program).

It's 2022 and we can't even get a plaintext configuration format from 1980 right.


> It's 2022 and we can't even get a plaintext configuration format from 1980 right.

To me, it's more depressing that we've been at this for 50-60 years and still seemingly don't have an unambiguously good plaintext configuration format at all.


I've been a Professional Config File Wrangler for two decades, and I can tell you that it's always nicer to have a config file that's built to task rather than being forced to tie yourself into knots when somebody didn't want to write a parser.

The difference between a data format and a configuration file is the use case. JSON and YAML were invented to serialize data. They only make sense if they're only ever written programmatically and expressing very specific data, as they're full of features specific to loading and transforming data types, and aren't designed to make it easy for humans to express application-specific logic. Editing them by hand is like walking a gauntlet blindfolded, and then there's the implementation differences due to all the subtle complexity.

Apache, Nginx, X11, RPM, SSHD, Terraform, and other programs have configuration files designed by humans for humans. They make it easy to accomplish tasks specific to those programs. You wouldn't use an INI file to configure Apache, and you wouldn't use an Apache config to build an RPM package. Terraform may need a ton of custom logic and functions, but X11 doesn't (Terraform actually has 2 configuration formats and a data serialization format, and Packer HCL is different than Terraform HCL). Config formats minimize footguns by being intuitive, matching application use case, and avoiding problematic syntax (if designed well). And you'd never use any of them to serialize data. Their design makes the programs more or less complex; they can avoid complexity by supporting totally random syntax for one weird edge case. Design decisions are just as important in keeping complexity down as in keeping good UX.

Somebody could take an inventory of every configuration format in existence, matrix their properties, come up with a couple categories of config files, and then plop down 3 or 4 standards. My guess is there's multiple levels of configuration complexity (INI -> "Unixy" (sudoers, logrotate) -> Apache -> HCL) depending on the app's uses. But that's a lot of work, and I'm not volunteering...


I quite like CUELang (https://cuelang.org/), although it's not yet widely supported.

It has a good balance between expressivity and readability: it has enough logic to be useful, but not so much that it begs for abuse. It can import/export YAML and JSON, and features an elegant type system which lets you define both the schema and the data itself.

I hope it gains traction.


toml is pretty much the best one I have seen so far. At least for small to medium size config files.


Toml has some hairy bits. Lists of objects, lists of lists of objects, objects of lists of objects. Complex objects with top level fields...


Yep,

Although I do feel like there is a case to be made that if you need a Turing complete configuration language then in most cases you failed your users by pushing too many decisions on to them instead of deciding on sensible defaults.

And if you are dealing with one of the rare cases where Turing complete configuration is desirable then maybe use Lua or something like that instead.


I'm not defending YAML. YAML is terrible. It's even worse with logic and/or templates (looking at you, Ansible). Toml is certainly better but I'm still baffled as to why we don't have a "better YAML". YAML could almost be okay.


There's also Lua, which is a full Turing complete language but is still pretty nice for writing config files, and is easy to embed.


Followup to my own post: don't forget about Scheme! Same nice properties as Lua, but you get some extra conveniences from using s-expressions (which can represent objects somewhat more flexibly, like XML, than Lua, which is more or less 1:1 with JSON).


There's StrictYAML [1][2]. Can't say I've used it because, let's face it, most projects bind themselves to a config language - whether that be YAML, JSON, HCL or whatever - but I'd like to.

[1] https://hitchdev.com/strictyaml/

[2] https://github.com/crdoconnor/strictyaml


Yeah, I think it's because nobody sat down and methodically created it.

People create config languages that work for their use case and then it is just a happy accident if it works for other things.

I don't think anyone has put serious effort into designing a configuration language. And by that I mean collect use cases, study how other config languages do things, make drafts, test them, etc.


Terraform's HCL is well designed.

I know a lot of people hate it but I find it to be the only configuration language that makes any sense for moderately large configs.

It’s short, readable, unambiguous, with great IDE support. It’s got built-in logic, variables, templates, functions and references to other resources - without being a Turing complete imperative language, and without becoming an XML monstrosity.

Seriously there is nothing even close to it. Tell me one reasonable alternative in wide use that’s not just some preprocessor bolted onto yaml, like Helm charts or Ansible jinja templates.


There's a world of difference between "simple configuration needs" and "complex configuration needs".

I will take a kubernetes deployment manifest as an example that you would want to express in a hypothetically perfect configuration language. Now, eventually, you end up in the "containers" bit of the pod template inside the deployment spec.

And in that, you can (and arguably should) set resources. But, in an ideal world, when you set a CPU request (or, possibly, limit, but I will go with request for now) for an image that has a Go binary in it, you probably also want to have a "GOMAXPROCS" environment variable added that is the ceiling of your CPU allocation. And if you add a memory limit, and the image has a Java binary in it, you probably want to add a few of the Java memory-tuning flags.

And it is actually REALLY important that you don't repeat yourself here. In the small, it's fine, but if you end up in a position where you need to provide more, or less, RAM or CPU, on short notice (because, after all, configuration files drive what you have in production, and mutating configuration is how you solve problems at speed, when you have an outage), any "you have to carefully put the identical thing in multiple places" is exactly how you end up with shit not fixing itself.

So, yeah, as much hate as it gets, BCL may genuinely be better than every other configuration language I have had the misfortune to work with. And one of the things I looked forward to, when I left the G, was to never ever in my life have to see or think about BCL ever again. And then I saw what the world at large are content with. It is bloody depressing is what it is.


Cuelang?


Yeah, absolutely. I think there are four corners to the square: "meant to be written by humans/meant to be written by computers" and "meant to be read by humans/not meant to be read by humans". JSON is the king of the written-by-computers, read-by-humans corner; grpc and swift and protobuf and arrow can duke it out in the written-by-computers, not-read corner. We are missing good options in the written-by-humans half.


Dhall and Cue come to mind as ones that _feel_ more designed

https://github.com/dhall-lang/dhall-lang

https://cuelang.org/docs/usecases/configuration/


Interesting...

Me, the programmer finds those kinda cool.

And the sysadmin in me developed a dislike of both within 1 minute of looking at them.

Honestly, I think a good configuration language should be more than a spec, it should come with a library that handles parsing/validation. See, there are two sides to configuration, the user and the program. Knowledge about the values, defaults and types should live on the program side and should be documented. Then the user side of configuration can be clean and easy to read/write and, most important of all, allow the user to accomplish the most common configuration without having to learn a new config language on top of learning the application.


> Honestly, I think a good configuration language should be more than a spec, it should come with a library that handles parsing/validation

You just described CUELang.

The type system allows you to define a schema as well as the data, in the same file or in 2 separate ones. Then you can call either a CLI tool (that works on Linux, Windows or Mac) or use the Go lib (or bind to it).

For compat, cue can import and export to yaml, json and protobuf, as well as validate them.


Isn't Dhall basically the same (=have the same set of features)?


In the same way python and js are basically the same.


Exactly. So if I'm going to learn/use one of them, there's no clear winner, really. Both also seem to have about the same amount of adoption (zero?).


Ok, you have convinced me to give it a serious look.


> Yeah, I think it's because nobody sat down and methodically created it.

I think it's the opposite. There isn't a single config format that suits all needs.

Especially when you realize config isn't a single thing.

http://mikehadlow.blogspot.com/2012/05/configuration-complex...


I guess that could also be the case.

I haven’t studied it, I am just generally feeling unhappy about most software configuration.


About Ansible, I think it gained its success partially due to YAML.

Ansible is worse than Puppet and CFEngine in many ways, but it is superior in the user interface.

It managed to not only be a config management solution, but provide a universal config language that most apps could be configured with. So for a lot of use cases, if you know Ansible/YAML then you don't have to learn a new configuration language on top of learning a new application.


The problem with Ansible is it's not universal, because most app playbooks are configured in the worst possible way. In my experience you typically get handed an Ansible script, something which you'd hoped was declarative but isn't (like the version that apt-get grabs isn't fixed, or even gets patched), then suddenly a downstream templated command fucks up, and the person who wrote the script isn't around anymore (or you don't trust their chops because they are a blowhard that worked at Google/Facebook and had a coddling ops team behind them in the past), or worse it's from "community" and has a billion hidden settings that you can't be bothered to grok - and so you have to dig so many layers down that you are better off just fucking rewriting the Ansible script to do the one thing which probably should have been four lines.

In any case, I found Ansible scripts to have like a 3 month half life. If we were lucky. I'm not bitter.


haha, I can go on lengthy rants about every single configuration management system that I have used.

My dream configuration system should revert to default when the config is removed (keeping data). Have a simple/easy user interface. Have maintained modules with sane defaults for the 500 most common server software. I would rather there be no module than an abandoned one with unsafe defaults, that way it is clear that I would have to maintain my own if I want to use that particular piece of software. Performant, it really shouldn't take more than a few minutes to apply a config change. No more than 30 min for initial run.


Ansible was agent-less from the start, which made it ridiculously easy to sneak into existing infrastructure and manual workflows. I probably would not have been able to stand up Puppet or Salt or whatever, but I could run Ansible all by myself with no one to stop me :).


I'm curious what your thoughts are on a config language I'm working on.

GitHub.com/vitiral/zoa

It has both binary and textual representation (with the first byte being able to distinguish them), and the syntax is clean enough I'm planning on extending it into a markup language as well.


Even if you have sensible defaults don't you still need to be able to parse configured changes?


Not always, sometimes all other options are just wrong. Or you can auto detect the correct setting.


I understand the pragmatic reasons for it being the way it is, but I still wish TOML didn't require all strings to be quoted.


This is why I like INI. It doesn't have these problems, because it doesn't try to wrangle the notion of nested objects (or lists) in the first place. The lack of a formal spec is a problem, sure, but it's such a basic format that it's kind of self-explanatory.


When the problem is TOML not supporting easy nesting, a solution of "Don't nest." works just as well in TOML as it does in ini. It's not really an advantage of ini. Especially when a big factor in TOML not making it easy is that TOML uses the same kind of [section]\nkey=value formatting that ini does!


You can use TOML as a better INI by limiting yourself to the key/value schema. It's still superior because:

- it has a spec

- it has other types than strings

- you can always decide you actually need nested data, and add them later


I wrote an INI parser that has numerical, boolean, timestamp, MAC address, and IP address types ;) "advantages" of not having a spec!

Seriously: for application-specific config files, the lack of a formal spec can be kind of a nice thing. You can design your parser to the exact needs of your program, with data types that makes sense for your use case. Throw together a formal grammar for use in regression testing, and you're all set.

Obviously a formal spec is essential for data interchange, but that's why JSON exists. To me, YAML is in a gray area that doesn't need to exist. The same thing goes for TOML, but to a far lesser extent.


> it has other types than strings

But isn't the config file just a string?


Everything gets serialized to a string of bytes. The point is that you can fail at parsing when the value doesn't make sense, rather than failing at some point in the future when you decide to use the value and it doesn't make sense. And if you have a defined schema, you can have your editor validate it against the schema when saving, so you don't accidentally have "FILENOTFOUND" in a Boolean.
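For example, with Python 3.11's stdlib tomllib (just a sketch of the fail-at-parse-time point):

    import tomllib  # standard library in Python 3.11+

    doc = tomllib.loads('enabled = true\nretries = 3\n')
    print(type(doc["enabled"]), type(doc["retries"]))  # <class 'bool'> <class 'int'>

    # A nonsense value fails at parse time, not later when the program uses it:
    try:
        tomllib.loads('enabled = FILENOTFOUND\n')
    except tomllib.TOMLDecodeError as e:
        print("rejected at parse time:", e)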


... lack of a hex float representation ...


We do, it’s called TOML. The future is here it’s just not equally distributed.


TOML sucks for lists of tables simply because they intentionally crippled inline tables to only be able to occupy one line, for ideological reasons ("we don't need to add a pseudo-JSON"). Unless your table is small, it's going to look absolutely terrible being all crammed into one line.

https://github.com/toml-lang/toml/issues/516

The official way to do lists of tables is (look at how much duplication there is):

  [[main_app.general_settings.logging.handlers]]
    name = "default"
    output = "stdout"
    level = "info"

  [[main_app.general_settings.logging.handlers]]
    name = "stderr"
    output = "stderr"
    level = "error"

  [[main_app.general_settings.logging.handlers]]
    name = "access"
    output = "/var/log/access.log"
    level = "info"
vs

  handlers = [
    {
      name = "default",
      output =  "stdout",
      level = "info",
    }, {
      name = "stderr",
      output =  "stderr",
      level = "error",
    }, {
      name = "access",
      output =  "/var/log/access.log",
      level = "info",
    },
  ]
I would still reach for TOML first if I only needed simple key-value configuration (never YAML), but for anything requiring list-of-tables I would seriously consider JSON with trailing commas instead.


I see the point and this is certainly a drawback of TOML but for me this is something of a boundary case between configuration and data.

When configuration gets so complicated that the configuration starts to resemble structured data I tend to prefer to switch to a real scripting language and generate JSON instead.


Expression languages like Nix, Jsonnet, Dhall, and Cue are really nice in these situations.


for this reason I couldn't see a CI platform ever seriously consider TOML

(someone may point out to me a CI platform that relies on TOML—which I welcome)


Rust is built on TOML. For better or worse.


Do you mean Cargo? Because Cargo is not a CI system. You never embed shell commands in a Cargo.toml.

If you need to program complex logic to build a crate, you don’t write TOML. You write a build.rs file in actual Rust.


If embedding shell commands in a configuration language is considered a CI system I think we are doomed.


> JSON with trailing commas

JSON5?


It's perfect until you do a lot of nesting..


...or any nesting. TOML sucks for anything non-trivial.


It makes me sad every time I see a newly announced tool that went for YAML instead of TOML.


XML is still good.


Hmm, it looks like it’s handled comments for at least a decade:

https://github.com/python/cpython/blame/d75a51bea3c2442f81d3...

Oh, maybe it’s this issue:

https://bugs.python.org/issue34132

If I’ve read it correctly, there was a regression from Python 2.x to 3.x such that you now need to format comments:

    #like this 
Instead of:

    # like this
(A space after the # isn’t accepted by the parser.)
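If you want to see what your Python version does, here's a rough probe using a throwaway file (behaviour varies across 3.x releases):

    import netrc, os, tempfile

    content = ("# a comment with a space after the hash\n"
               "machine example.com login alice password s3cret\n")
    with tempfile.NamedTemporaryFile("w", suffix="netrc", delete=False) as f:
        f.write(content)
        path = f.name
    try:
        print(netrc.netrc(path).authenticators("example.com"))
    except netrc.NetrcParseError as e:
        print("parse failed:", e)
    finally:
        os.unlink(path)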


    try:
        try:
            import orjson as json
        except:
            try:
                import rapidjson as json
            except:
                try:
                    import fast_json as json
                except:
                    import json
        foo = json.loads(string)
    except:
        try:
            import yaml
        except:
            # try harder
            import os
            try:
                assert(os.system("pip3 install yaml") == 0)
            except:
                # try even harder
                try:
                    assert(os.system("sudo apt install python3-pip && pip3 install yaml") == 0)
                except:
                    assert(os.system("sudo yum install python3-pip && pip3 install yaml") == 0)
            import yaml
        try:
            foo = yaml.load(string)
        except:
            try:
                ....


Great idea.

  pip install --user yaml
increases the chances it will work


A note to readers: it's not always a good idea to put automated software installation in a place that users don't expect it.

I've seen that kind of approach cause a ton of issues the moment that the software was used in a different environment than the author expected.

It's much better IMO to fail with a message about how to install the missing dependency.


This is why there should be a way to automatically install software into a sandboxed location, e.g. a virtualenv.

Considering we have software driving cars today, it should be trivial, and I would say even arguably expected, that software should be able to autonomously "figure out" how to run itself and avoid conflicts with other software, since that's a trivial task in comparison to navigating city streets.


Brilliant! What license is this published under?


Free Art License


Tested on Python what? I was curious to see what error that produced, figuring it would be some whitespace issue due to the difference between the list items, but using the yamlized Python that I had lying around, it did the sane thing:

    PATH=$HOMEBREW_PREFIX/opt/ansible/libexec/bin:$PATH
    pip list | grep -i yaml
    python -V
    python <<'DOIT'
    from io import StringIO
    import yaml
    print(yaml.safe_load(StringIO(
    '''
        {
            "list": [
                {},
                    {}
            ]
        }
    ''')))
    DOIT
produces

    PyYAML                6.0
    Python 3.10.1
    {'list': [{}, {}]}


With leading tabs it does not work.

  $ sed 's/\t/--->/g' break-yaml.json
  --->{
  --->--->"list": [
  --->--->--->{},
  --->--->--->{}
  --->--->]
  --->}
  $ jq -c . break-yaml.json
  {"list":[{},{}]}
  $ yaml-to-json.py break-yaml.json
  ERROR: break-yaml.json could not be parsed
  while scanning for the next token
  found character '\t' that cannot start any token
    in "break-yaml.json", line 1, column 1
  $ sed 's/\t/    /g' break-yaml.json | yaml-to-json.py
  {"list": [{}, {}]}


This is completely valid YAML.

YAML does not allow tabs in indentation, but the tabs in your example are not indentation according to the YAML spec productions.

You can see it clearly here against many YAML parsers: https://play.yaml.io/main/parser?input=CXsKCQkibGlzdCI6IFsKC...

As tinita points out, sadly PyYAML and libyaml implement this wrong.

See https://matrix.yaml.info/


That's because PyYAML doesn't implement the spec correctly.


Tabs are not valid JSON


Do you have a link for that?

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe... says:

> Insignificant whitespace may be present anywhere except within a JSONNumber [forbidden] or JSONString [interpreted as part of the string]

And specifically lists tab as whitespace:

> The tab character (U+0009), carriage return (U+000D), line feed (U+000A), and space (U+0020) characters are the only valid whitespace characters.

More specifically, expanding https://datatracker.ietf.org/doc/html/rfc8259#section-2 gives an array as (roughly)

> ws %x5B ws value (ws %x2C ws value)* ws %x5D ws

Where `ws` explicitly includes `%x09`. Which seems to cover this case?
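A quick check with Python's json module, which accepts tabs as insignificant whitespace:

    import json

    print(json.loads('\t{"list":\t[\t{},\t{}\t]\t}'))   # {'list': [{}, {}]}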


Per RFC 8259:

      ws = *(
              %x20 /              ; Space
              %x09 /              ; Horizontal tab
              %x0A /              ; Line feed or New line
              %x0D )              ; Carriage return


The grammar in https://www.json.org/json-en.html disagrees. It has

  json
    element

  element
    ws value ws

  ws
    ‘0009’ ws


Edited with string escapes, the tab didn't make it through HN.

The error from PyYaml 5.3.1:

    yaml.scanner.ScannerError: while scanning for the next token
    found character '\t' that cannot start any token
      in "<unicode string>", line 4, column 1


If it continues to be hard to share, I suggest encoding it as a base64 string so folks can decode it into a file with exactly the right contents.
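For example, a quick sketch with Python's base64 module:

    import base64, json

    doc = '{\n\t"list": [\n\t\t{},\n\t\t{}\n\t]\n}\n'
    print(base64.b64encode(doc.encode()).decode())
    # recipients recover the exact bytes with base64.b64decode("<that string>")

    print(json.loads(doc))   # the tab-indented document is valid JSON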


This is, unwittingly, the most YAML-relevant comment in this thread.


Not base64, but this should be easy to reproduce:

  $ printf '{\n\t"list": [\n\t\t{},\n\t\t{}\n\t]\n}\n' > test.json

  $ jq < test.json 
  {
    "list": [
      {},
      {}
    ]
  }

  $ yamllint test.json 
  test.json
    2:1       error    syntax error: found character '\t' that cannot start any token (syntax)


Thanks, I'm finally able to reproduce this.

It would be great if instead of the histrionic message on CPAN (which amusingly accuses others of "mass hysteria"), the author would just say "YAML documents can't start with a tab while JSON documents can, making JSON not a strict subset of YAML".

The YAML spec should be updated to reflect this, but I wonder if a simple practical workaround in YAML parsers (like replacing each tab at the beginning of the document with two spaces before feeding it to the tokenizer) would be sufficient in the short term.
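A sketch of that kind of preprocessing (hypothetical helper name; only reasonable for flow-style/JSON-like input, where indentation isn't significant):

    import re
    import yaml  # PyYAML, assumed installed

    def load_yaml_tab_tolerant(text):
        # Replace runs of leading tabs on each line with spaces before
        # handing the text to the parser.
        fixed = re.sub(r"^\t+", lambda m: "  " * len(m.group(0)), text, flags=re.MULTILINE)
        return yaml.safe_load(fixed)

    print(load_yaml_tab_tolerant('{\n\t"list": [\n\t\t{},\n\t\t{}\n\t]\n}'))
    # {'list': [{}, {}]}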


> "YAML documents can't start with a tab while JSON documents can, making JSON not a strict subset of YAML"

But YAML can start with tabs. Tabs are allowed as separating whitespace in most of the spec productions but are not allowed as indentation. Even though those tabs look like indentation, the spec productions don't interpret them as such.

See my comment above and esp see https://play.yaml.io/main/parser?input=CXsKCQkibGlzdCI6IFsKC...

Note: the YAML spec maintainers (I am one) have identified many issues with YAML which we are actively working on, but (somewhat surprisingly) we have yet to find a case where valid JSON is invalid YAML 1.2.


Thanks for the clarification. Let's fix it in PyYAML then :)

Speaking of PyYAML, I recently ran into an issue where I had to heavily patch PyYAML to prevent its parse result from being susceptible to entity expansion attacks. It would be nice to at least have a PyYAML mode to completely ignore anchors and aliases (as well as tags) using simple keyword arguments. Protection against entity expansion abuse would be nice too.
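For anyone unfamiliar, the anchors/aliases problem is the "billion laughs" pattern; here's a deliberately tiny illustration (not the patch described above):

    import yaml

    bomb = (
        'a: &a ["x", "x", "x", "x", "x", "x", "x", "x", "x", "x"]\n'
        'b: &b [*a, *a, *a, *a, *a, *a, *a, *a, *a, *a]\n'
        'c: &c [*b, *b, *b, *b, *b, *b, *b, *b, *b, *b]\n'
    )
    doc = yaml.safe_load(bomb)              # safe_load still expands aliases
    print(len(doc["c"]), len(doc["c"][0]))  # 10 10; "c" already holds 1000 leaf strings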


This parses fine as YAML in all the tools I've tried. Can you provide the specific versions of the libraries you're using?


They should remove the phrase "every JSON file is also a valid YAML file" from the YAML spec. 1) it isn't true, and 2) it seems like it goes against the implication made here:

> This makes it easy to migrate from JSON to YAML if/when the additional features are required.

If JSON interop is provided solely as a short-term solution that eases the transition to YAML, then I applaud the YAML designers for making a great choice.


> YAML was intended to be a superset

My impression was JSON came years after YAML, and it was somehow coincidental that YAML was almost a superset of JSON.

(Shockingly wikipedia tells me they both came out within a month of each other in 2001).


On the upside, if it's almost a superset then a data producer can make sure it is polyglot by sticking to the intersection of the two.

C++ is not a strict superset of C, but the ability to include C headers is very valuable.
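Sticking to the intersection is mostly a matter of how you serialize. For instance, a conservative sketch (not an exhaustive treatment of the differences):

    import json

    data = {"list": [{}, {}], "name": "café"}

    # Spaces (never tabs) for indentation and ASCII-escaped output; this should be
    # acceptable to both JSON and YAML parsers.
    print(json.dumps(data, indent=2, ensure_ascii=True))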


I wasn't able to reproduce any of the issues listed on that page. Does anyone have an example?


I'm not a fan of YAML either, but I think you should not generate YAML files if you can avoid it. All YAML you encounter should be hand-written, so this problem should not occur.

I read "YAML is a superset of JSON" not as a logical statement, but as instructions to humans writing YAML. If you know JSON, you can use that syntax to write YAML. Just like, if you know JavaScript or Python (or to some extent PHP) object syntax, you can write JSON.

If you get a parse error, no biggie, you Alt+Tab to the editor where you are editing the config file and correct it. It is not like you are serving this over the net to some other program.


Same applies to TypeScript. It is not a superset of JavaScript, although many people think it is.

https://stackoverflow.com/a/53698835/


As long as you tell the TypeScript compiler not to stop when it finds type problems, all JavaScript works and compiles, right? That sounds like a superset to me. Syntactically there are no problems, and the error messages are just messages.


> As long as you tell the typescript compiler not to stop when it finds type problems, all JavaScript works and compiles, right?

Does such code count as valid TypeScript though? It sounds more as if the compiler has an option to accept certain invalid programs.

You could build a C++ compiler with a flag to warn, rather than error, on encountering implicit conversions that are forbidden by the C++ standard. The language the compiler is accepting would then no longer be standard C++, but a superset. (Same for all compiler-specific extensions of course.)

Personally I'm inclined to agree with this StackOverflow comment. [0] It's an interesting edge-case though.

[0] https://stackoverflow.com/questions/29918324/is-typescript-r...


It's syntactically and functionally correct, so despite the error messages I think 'valid' is a better label.

> You could build a C++ compiler with a flag to warn, rather than error, on encountering implicit conversions that are forbidden by the C++ standard. The language the compiler is accepting would then no longer be standard C++, but a superset. (Same for all compiler-specific extensions of course.)

The way I see it, these errors are already on par with C++ warnings. C++ won't stop you if you make a pointer null or use the wrong string as a map key.


congrats to all involved for sticking to their guns here. specs exist for a reason :D



