
Bugs in the YAML specification - networked
http://pyyaml.org/wiki/BugsInTheYAMLSpecification
======
eyelidlessness
I have minimal experience with YAML (100% of which is configuring CI
environments, because apparently that's what all the modern ones prefer). I
don't really know the ins and outs of the format, and try to stick to
extremely simple representations.

Today I learned that what I perceived as somewhat complicated syntax is
actually overwhelmingly complex.

I know that my preference for s-expressions is not shared by everyone, but the
more complex a syntax I encounter, the more I wonder if simpler alternatives
were even considered.

Genuine question: apart from inertia, and apart from recursive references
(noted in comments before I wrote this), is there a use case for YAML that
isn't solved by simpler and less ambiguous formats like EDN or Transit?

~~~
fibo
YAML was the first human readable/writable format, so better than XML and
JSON for us humans. It is also a superset of JSON.

~~~
Someone
Human writable? I have little experience with it, but if what Amazon uses for
its CodeDeploy configuration files is representative for the format, I
disagree.

IMO, a human writable format shouldn't need a help page like this:
[http://docs.aws.amazon.com/codedeploy/latest/userguide/app-s...](http://docs.aws.amazon.com/codedeploy/latest/userguide/app-spec-ref-spacing.html)

Yes, part of that is CodeDeploy, but saying "tab characters must not be used
in indentation", for me, disqualifies a standard as "human writable".

CodeDeploy makes that horrendous by requiring five or so clicks to figure out
what causes an error (try forgetting to add an appspec.yml file or saving a
file with a BOM).

~~~
mst
It's an indentation-is-semantic format.

I've had more trouble with things trying to allow both spaces and tabs than
I've had with YAML's refusal to allow tabs.

------
crdoconnor
Despite being a nice readable language if you use it right, the spec is rife
with disgusting features and nasty surprises.

This was my attempt to cut out 90% of the crap, leaving the nice, readable
core (which maps on to JSON pretty neatly):

[https://github.com/crdoconnor/strictyaml](https://github.com/crdoconnor/strictyaml)

~~~
akkartik
I love the example at
[https://github.com/crdoconnor/strictyaml/blob/master/FAQ.rst...](https://github.com/crdoconnor/strictyaml/blob/master/FAQ.rst#what-is-wrong-with-implicit-typing).
Convinced in 2 seconds that full YAML is idiotic.

~~~
guitarbill
this whole idea of strictyaml just makes so much sense to me. plus having
comments is a huge bonus. why am i only hearing about this now?

~~~
scrollaway
You're hearing about it now because 99+% of people who use YAML have no idea
about how horrible the spec is. They just assume that it's like what
strictyaml strives to be and don't understand why "such an elegant, simple and
readable language isn't used more".

In other words, people don't understand the need for it. Those that do use
other formats - json or toml, namely.

------
ThePhysicist
What I find more interesting is that YAML allows you to define self-
referential / circular data structures:

    
    
        foo: &foo
          bar: *foo
    

In PyYAML, this will give you a self-referential dictionary. Powerful but
pretty catastrophic if you use (naive) recursion to analyze a user-submitted
data structure.
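
To make the hazard concrete, here is a minimal sketch in plain Python. It does not call PyYAML; the dictionary is built by hand to have the shape the comment above says the loader produces, and `naive_count` and `count_nodes` are hypothetical helper names. The safe version simply remembers which containers it has already visited.

```python
def naive_count(node):
    """Naive recursion: never terminates on a cyclic structure."""
    if isinstance(node, dict):
        return 1 + sum(naive_count(value) for value in node.values())
    return 1


def count_nodes(node, seen=None):
    """Count reachable nodes, visiting each dict or list at most once."""
    if seen is None:
        seen = set()
    if isinstance(node, (dict, list)):
        if id(node) in seen:
            return 0  # already counted; do not follow the cycle again
        seen.add(id(node))
        children = node.values() if isinstance(node, dict) else node
        return 1 + sum(count_nodes(child, seen) for child in children)
    return 1


# Hand-built equivalent of the YAML snippet: foo's "bar" points back at foo.
inner = {}
inner["bar"] = inner
doc = {"foo": inner}

print(count_nodes(doc))  # the outer dict plus the self-referential one
```

Calling `naive_count(doc)` instead ends in a `RecursionError`, which is the failure mode to expect from any recursive validator that assumes a tree.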

~~~
twelvechairs
Feature not a bug. YAML is what people turn to who need these kinds of advanced
features, which aren't provided in JSON or similar.

~~~
OskarS
I would imagine this feature is primarily useful if you want to serialize a
whole bunch of objects that all reference each other, but it makes me feel a
bit icky. It feels like it breaks an intuitive assumption you have about
hierarchical formats, which is that there should be no cycles.

Maybe that's just my personal bias, but I feel like the relative simplicity of
JSON is a strong feature in its favor. As a developer, I have very clear
understanding of the data I'm reading, and with that I can more easily make
safer and more stable code.

~~~
junke
Yet, a lot of hierarchical structures I see are full of "ref" and "id"
attributes, which effectively encode cyclic graphs.

~~~
OskarS
That's certainly true, but at least that's not going to cause an infinite loop
while wandering down the structure, it's just gonna end at a leaf "ref" or
"id".

~~~
junke
In my view this is more a question of simple interface vs. simple
implementation. Having cross-references at the language level gives a uniform
and simplified interface at the cost of having to develop more careful tools.

Not saying refs/id are bad, but at least with a uniform syntax and language
support, you don't have to reimplement custom cross-reference resolvers.
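
As an illustration of the kind of custom resolver meant here, the following is a rough sketch only. The `"id"` / `{"ref": ...}` convention and the `resolve_refs` name are assumptions made up for this example, not taken from any particular format.

```python
def resolve_refs(records):
    """Replace {"ref": some_id} placeholders with the record that has that id.

    Returns a dict mapping each id to its (now linked) record.
    """
    by_id = {record["id"]: record for record in records}
    for record in records:
        # Copy the items so the dict can be updated while looping.
        for key, value in list(record.items()):
            if isinstance(value, dict) and set(value) == {"ref"}:
                record[key] = by_id[value["ref"]]
    return by_id


# Two records that point at each other through the hypothetical "ref" key.
records = [
    {"id": "a", "next": {"ref": "b"}},
    {"id": "b", "next": {"ref": "a"}},
]
graph = resolve_refs(records)
print(graph["a"]["next"]["id"])  # b
```

Before resolution the data is a plain tree that ends at a leaf, as noted earlier in the thread. After resolution it is a cyclic graph, so the same care is needed when walking it.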

------
jstimpfle
This is an opportunity to plug my private project "WSL" [1] which is a clean
text serialization format for relational databases. The scope is somewhat
different and it's not really released (but beginning to stabilize), but I'll
be happy to hear what you think.

[1]
[http://jstimpfle.de/projects/wsl/main.html](http://jstimpfle.de/projects/wsl/main.html)

~~~
kiwidrew
Interesting -- I like the concept.

But it's completely unusable for the Real World (tm) because strings cannot
contain the '[' and ']' characters AND YET there is no mechanism for escape
sequences. What if my data legitimately contains those two characters???

~~~
jstimpfle
Thanks for reading and the feedback!

That's just the default string type. I started out with escaping but noticed
it's a lot of complexity that is rarely needed (not for my own use case, which
is accounting, inventory, and some web apps which don't need it).

The advantage of not having escaping is easier seds and greps which don't miss
the field boundaries.

The important concept though is that arbitrary datatypes can be added
by the user of an API implementation (the python library already offers that).
The datatypes define their lexical syntax, like in perl6. I will also declare
more "default" datatypes and might include a C-like string after enough
consideration.

~~~
skybrian
This can be useful but to avoid confusion, I wouldn't call it a string type at
all. Maybe it's an identifier or a symbol or a label or something like that?
If you're excluding square brackets, there are probably other special
characters you want to exclude too?

~~~
jstimpfle
Thanks for the feedback, and I'm glad to see other people worry about these
details, too!

ASCII control characters are forbidden in the entire WSL file. Then apart from
[] everything is allowed.

For practical applications, by far the most important requirement of strings
is being able to include space characters to make a short sequence of words,
like [Trinidad and Tobago].

I don't know a better word for "sequence of words" than "String". Maybe
"Words", but technically it really is a string (containing an arbitrary
sequence of the allowed characters). Even approaching the enforcement of more
structure would be a lot of work with little returns. And you can't include a
literal newline in a C string literal, and you can't even have a NUL character
in the _interpretation_ (memory layout) of it, right?

I actually started out with a C-like string as default type (so named it
"String") but noticed

- A big problem with string culture is that "" strings use identical start
and end markers. A problem which Joe Armstrong mentioned as well.

- Escaping means significantly higher complexity of parsing out the
interpretation from the literal, while it's not really needed for most
applications.

Both these problems make data unnecessarily hard to process with dirty one-off
scripts. So after considering some other options I'm now with [this style]
because [] are not too often needed or when needed can often be substituted
with (), and are very pleasing on the eye in most fonts.
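
A sketch of the "dirty one-off script" argument: with distinct open and close markers and no escaping, a single regular expression finds every bracketed field. The row layout below is hypothetical and is not claimed to be valid WSL; only the `[Trinidad and Tobago]` value comes from the comment.

```python
import re

# A field is "[", then any run of characters that are not brackets, then "]".
BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def bracketed_fields(line):
    """Return the contents of every [bracketed] field on one line."""
    return BRACKETED.findall(line)


# Hypothetical row, for illustration only.
line = "Country [Trinidad and Tobago] [Port of Spain]"
print(bracketed_fields(line))
```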

In conclusion I guess it will stay "String", and other less frequently needed
types will be called "CString", "Base64", and "BinaryString". Or optional
parameterization will be created for "String" to declare the escaping style
without needing a separate metatype.

~~~
skybrian
If you're designing a language it's probably okay to have string literals that
can't contain certain characters, so long as there is some other way to do it.
It's a bit different for a serialization format.

The question is what you do when you're converting some data from some other
format (for example, dumping a database) and there are strings that actually
contain these special characters. Even if it's a bit ugly, it's good to have
_some_ way to represent the data so that it can be read back in again without
any loss. In this kind of tool, you can't just say "don't do that" because the
data has already been saved - you're just converting it. (So my idea of not
calling it a string type probably doesn't make sense, on second thought, if
you want to be able to interoperate.)

Square brackets are an interesting choice. If you just want to do simple
lossless escaping, it doesn't seem that hard:

\\ means a literal backslash, and \] means a literal ']'.

Anything else gets written as-is. (But, what if the database actually does
contain control characters in some of its strings?)
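
The two-rule scheme could be sketched like this in Python. The `escape` and `unescape` names are hypothetical; this is not part of WSL or its Python library, and it leaves open the question about control characters raised above.

```python
def escape(text):
    """Escape a string for storage between [ and ]."""
    # Backslashes first, so the ones added for "]" are not doubled again.
    return text.replace("\\", "\\\\").replace("]", "\\]")


def unescape(escaped):
    """Invert escape(): a backslash before a backslash or "]" is dropped."""
    out = []
    i = 0
    while i < len(escaped):
        if (escaped[i] == "\\" and i + 1 < len(escaped)
                and escaped[i + 1] in "\\]"):
            out.append(escaped[i + 1])
            i += 2
        else:
            out.append(escaped[i])  # everything else is written as-is
            i += 1
    return "".join(out)


original = "a[0] = b\\c]"
stored = "[" + escape(original) + "]"
print(stored)
print(unescape(stored[1:-1]) == original)
```

The cost is the one jstimpfle describes: a reader can no longer find the end of a field by searching for the first ']', and has to skip escaped ones.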

------
burke
Piling on to the general theme of the rest of the comments here:

I really wish there was a more popular middle-ground between YAML and JSON.
People use YAML because it's the next step up from JSON if you need comments,
etc., but I think most purposes would be better served by the likes of JSON5,
HJSON, or TOML (for example) if only any of those were as popular as JSON and
YAML.

I implemented a YAML parser from spec last month. It (YAML) goes to great
lengths to provide human-friendly features, trading off computer-friendliness
to a fairly extreme extent.

Eliminating 'plain' scalars (unquoted strings-as-values), folded multiline
literals, tags, anchors/aliases, and possibly directives, as a sort of reduced
yaml would make the language a lot less silly for the kinds of things a lot of
people end up using it for.

~~~
crdoconnor
JSON5, HJSON and TOML are not as readable or editable as YAML.

They may have a slight advantage over the default spec of YAML but they don't
stack up well against YAML with the features you mentioned removed.

I really don't think an entirely new standard is warranted.

~~~
burke
Right. My hypothesis here is that if, instead of just JSON and YAML, there
were three standards to choose from, one of them slightly less human-friendly
but much simpler than YAML, that third one would be widely adopted.

I think YAML is more widely-used than it would otherwise be, in a broader
variety of domains, simply because the only popular alternative doesn't have
comments.

I'm not drawing from this any sort of conclusion that we should try to push
one of these alternatives as viable, just lamenting the state of things.

------
jgalt212
If you stick to the language-agnostic core, yaml is great.

We use it for:

1. test case specifications we use for unit test suites that must run across
multiple languages.

2. DSL for business processes

------
lwis
I've always felt that YAML was incomplete.

~~~
TillE
If you want to use it for complex things, sure, there's no end of potential
features. If you need some logic with your data, I like Lua.

But I think most uses of YAML are basically just a much more flexible JSON.

~~~
andybak
I use YAML as human-editable JSON - especially for non-programmers (although
it's surely more pleasant to write even for programmers).

I would like a clearly documented YAML subset that left out some of the more
complex stuff and avoided a few 'more than one way to do it' features. That
would go a long way to removing some of the criticism of YAML with very little
cost in functionality.

~~~
mjevans
I completely DISAGREE with this.

My experience with YAML is that it's very temperamental with respect to
/whitespace/ and that your editor might try to get too smart and damage the
document.

JSON, if you see a pattern and follow the pattern, is likely to work.

JSON CAN be stored in a 'pretty' way, with extra whitespace, which makes it
even more obvious how to format a document; that's frequently how I write out
small bootstrap config files (IE the database connection string to get the
main config from).
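
For instance, a small bootstrap file of this kind can be written with Python's standard `json` module. The key names and the connection string are placeholders invented for this sketch.

```python
import json

# Hypothetical bootstrap config: just enough to locate the main config.
bootstrap = {
    "database": {
        "dsn": "postgresql://app@db.example.internal/config",
    },
}

# indent=2 adds the extra whitespace; the parsed result is unchanged.
pretty = json.dumps(bootstrap, indent=2, sort_keys=True)
print(pretty)
```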

~~~
andybak
You see, as much as you have strong feelings about whitespace (to the extent
of reaching for your caps-lock key), so do I. I personally feel it's a great
loss to code style and readability that Python became an outlier in terms of
arguing for significant whitespace.

But this isn't the time for that tired old debate. I won't convince you any
more than I'll convince someone about the One True Brace Style or Vim vs EMACS
(actually - hold on. I can't stand either of them).

~~~
mjevans
Python's mistake is not including a default-coding-style formatting system
with the language (or if it does, making it so obscure I don't know about it).

The formatter should ALSO, when cleaning up the code, see if it compiles with
four spaces equal to a tab, and then if not, if eight spaces is a tab.
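
A rough sketch of that check, using only the built-ins `str.expandtabs` and `compile`. `guess_tab_width` is a hypothetical helper, not part of any existing formatter, and it is only a probe: `expandtabs` also expands tabs inside string literals, so the expanded text should not be written back blindly.

```python
def guess_tab_width(source, candidates=(4, 8)):
    """Return the first tab width at which the source compiles, else None."""
    for width in candidates:
        try:
            compile(source.expandtabs(width), "<input>", "exec")
        except SyntaxError:  # IndentationError is a subclass of SyntaxError
            continue
        return width
    return None


# Tab used for one level, eight spaces for two: only width 4 is consistent.
mixed4 = "def f():\n\tif True:\n        return 1\n"
# Tab and eight spaces at the same level: only width 8 is consistent.
mixed8 = "def f():\n        x = 1\n\treturn x\n"

print(guess_tab_width(mixed4), guess_tab_width(mixed8))
```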

I'd prefer that levels of indent be a tab (not space; if you're going to make
presentation of indent important make it something the client CAN modify
without changing the meaning of the code).

