The Norway Problem (hitchdev.com)
656 points by dedalus 9 days ago | 325 comments





This is part of a more general problem: they had to rename a gene to stop Excel auto-converting it into a date.

https://www.theverge.com/2020/8/6/21355674/human-genes-renam...

Edit: Apparently Excel has its own Norway Problem ... https://answers.microsoft.com/en-us/msoffice/forum/msoffice_...


> This is part of a more general problem

The more general problem basically being sentinel values (which these sorts of inferences can be treated as) in stringly-typed contexts: if everything is a string and you match some of those for special consideration, you will eventually match them in a context where that's wholly incorrect, and break something.


edit: fixed formatting problem
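The failure mode is easy to reproduce without any YAML library at all. Here's a minimal, hypothetical sketch of the kind of implicit scalar inference a YAML 1.1-style loader performs (the `infer` helper is invented for illustration, not any real parser's API):

```python
def infer(value: str):
    """Guess a type from a bare string, YAML 1.1-style."""
    if value.lower() in ("yes", "no", "true", "false", "on", "off"):
        return value.lower() in ("yes", "true", "on")
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        pass
    return value  # fall through: it's "just" a string

# Works fine for most data...
print(infer("42"))      # 42
print(infer("Sweden"))  # 'Sweden'
# ...until a string happens to collide with a sentinel:
print(infer("NO"))      # False -- Norway's ISO country code
print(infer("3.11"))    # 3.11  -- but this was a version string!
```

The in-band signal ("some strings secretly mean booleans") works right up until real data occupies the same channel.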

> sentinel values

Using in-band signaling always involves the risk of misinterpreting types.

> This is part of a more general problem

DWIM ("Do What I Mean") was a terrible way to handle typos and spelling errors when Warren Teitelman tried it at Xerox PARC[1] over 50 years ago. From[2]:

>> In one notorious incident, Warren added a DWIM feature to the command interpreter used at Xerox PARC. One day another hacker there typed

    delete *$
>> to free up some disk space. (The editor there named backup files by appending $ to the original file name, so he was trying to delete any backup files left over from old editing sessions.) It happened that there weren't any editor backup files, so DWIM helpfully reported

    *$ not found, assuming you meant 'delete *'
>> [...] The disgruntled victim later said he had been sorely tempted to go to Warren's office, tie Warren down in his chair in front of his workstation, and then type 'delete *$' twice.

Trying to "automagically" interpret or fix input is always a terrible idea because you cannot discover the actual intent of an author from the text they wrote. In literary criticism they call this problem "Death of the Author"[3].

[1] https://en.wikipedia.org/wiki/DWIM

[2] http://www.catb.org/jargon/html/D/DWIM.html

[3] https://tvtropes.org/pmwiki/pmwiki.php/Main/DeathOfTheAuthor


>> [...] The disgruntled victim later said he had been sorely tempted to go to Warren's office, tie Warren down in his chair in front of his workstation, and then type 'delete $' twice.

Ironically, this did not render the way you intended because HN interpreted the asterisk as an emphasis marker in this line.

It works here:

    ... type 'delete *$' twice.
because the line is indented and so renders as code, but not here:

> ... type 'delete $' twice.

because the subsequent line has emphasized text*. So the scoping of the asterisks is all screwed up.


Eh. "Death of the Author" is a reaction to the text not being dispositive as to what the author meant. It's deciding you don't care what the author meant, no longer considering it a problem that the text doesn't reveal that. Instead the text means whatever you can argue it means.

Which can be a fun game, but is ultimately pointless.


It gets more complicated when the author themselves changes their mind about that.

That’s a shrewd observation. Static types help with this somewhat. E.g. in Inflex, if I import some CSV and the string “00.10” comes in as 0.1, then later when you try to do work on it like

x == "00.10"

You’ll get a type error that x is a decimal and the string literal is a string. So then you know you have to reimport it in the right way. So the type system told you that an assumption was violated.

This won’t always happen, though. E.g. “sort by this field” will happily do a decimal sort instead of a string sort on 00.10.

The best approach is to ask the user at import time “here is my guess, feel free to correct me”. Excel/Inflex have this opportunity, but YAML doesn’t.

That is, aside from explicit schemas. Mostly, we don’t have a schema.


If we're talking about general problems, then I don't think we can be satisfied with "sometimes it's a problem with types and sometimes it's a UI bug." That's not general.

> E.g. sort by this field will happily do a decimal sort instead of the string 00.10.

So that system is not consistent with type checking? How is this not considered a bug?


I mean if the value is imported as a decimal, then a sort by that field will sort as decimal. This might not be obvious if a system imports 23.53, 53.98, etc.; a user would think it looks good. It only becomes clear that it was an error to import as a decimal when we consider cases like “00.10”. E.g., package versions: 10.10 is a newer version than 10.1.

Types only help if you pick the right ones.
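The version-sorting trap is easy to demonstrate in a few lines of Python: floats destroy the information, lexicographic string sort gets the order wrong, and only a tuple-of-integers key sorts correctly:

```python
versions = ["10.1", "10.2", "10.10"]

# Imported as decimals: "10.10" collapses to 10.1 and its identity is lost.
as_floats = sorted(float(v) for v in versions)
print(as_floats)  # [10.1, 10.1, 10.2]

# Imported as strings: lexicographic order puts 10.10 before 10.2.
print(sorted(versions))  # ['10.1', '10.10', '10.2']

# The right type for a version: a tuple of integer components.
print(sorted(versions, key=lambda v: tuple(map(int, v.split(".")))))
# ['10.1', '10.2', '10.10']
```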


Sure. In most static type systems though, you would be importing the data into structures that you defined, with defined types. So you wouldn’t suddenly get a Decimal in place of a String just because the data was different. You’d get a type error on import.
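One way to sketch that "import into structures you defined" approach in Python (the `Row` type and `import_row` helper are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Row:
    country: str
    rate: str  # keep "00.10" as text; parse explicitly later if needed

def import_row(record: dict) -> Row:
    # Validate against the declared types instead of guessing from content.
    if not all(isinstance(record[f], str) for f in ("country", "rate")):
        raise TypeError(f"expected strings, got {record!r}")
    return Row(**record)

print(import_row({"country": "NO", "rate": "00.10"}))
# Row(country='NO', rate='00.10') -- nothing silently became False or 0.1
```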

And of course the plague that is CSV when your decimal delimiter is ,

Basically, autoimmune disease, but for software.

I suppose this is a clichéd thought, but the more general problem is kind of emblematic of current "smart" features... and their expected successors.

On one hand, this is a typically human problem. We have a system. It's partly designed, partly evolved^. It's true enough to serve well in the contexts we use it in on most days. There are bugs in places (like norway, lol) that we didn't think of initially, and haven't encountered often enough to evolve around.

In code, we call it bugs. In bureaucracy, we just call it bureaucracy. Agency A needs institution B's document X, in a way that has bugs.

Obviously, it's also a typical machine problem. @hitchdev wants to tell pyyaml that Norway exists, and pyyaml doesn't understand. A user wants to enter "MARCH1" as text (or the name of a gene), and excel doesn't understand.

Even the most rigid bureaucracy is made of people and has fairly advanced comprehension ability though. If Agency A, institution B or document X are so rigid that "NO" or "MARCH1" break them... it probably means that there's a machine bug behind the human one.

Meanwhile... a human reading this blog (even if they don't program) can understand just fine from context and assumptions of intent.

IDK... maybe I'm losing my edge, but natural language programming is starting to seem like a possibility to me.

^I feel like we need a new word for these: versioned, maybe?


"The computer won't let me" is a particularly maddening "excuse" from bureaucrats...

I don't understand why those support agents for Microsoft just threw their hands up in the air and asked customers to go through some special process for reporting the bug in Excel. Why are they not empowered/able to report the issue on behalf of customers? It's so clearly a bug in Excel that even they are able to reproduce with 100% reliability.

It looks like it is intended behavior in Excel.

Yes. Excel cells are set to a "General" format that, by default, tries to guess the type of data the cell should hold from its content. A date-looking entry gets converted to a date type, a number-looking string to a number (so 5.80 --> 5.8, very annoying since I believe in significant digits). When you import CSV data, for example, the default import format is "General", so date-looking strings will be changed to a date format. This can be avoided by importing the file and choosing to import the data as "Text". People having these data corruption problems forgot to do that.

It's "user error" except that there is no way to set the default import to import as "Text" (as far as I know), so one has to remember to do the three step "Text" import every time instead of the default one step "General" import.


Excel doesn't support CSV files. Anyone who believes it does has never really used Excel. [0] You're supposed to use spreadsheets as-is. Programs that have Excel export features should always directly export xlsx files.

[0] The only thing you can safely do with CSV files is to interpret every value as text cell. CSV files always require out of band negotiation on everything, including delimiters, quotation, escape characters, the data type of each column.
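Python's stdlib `csv` module is one example of a reader that behaves this way: every field comes back as a string, and any interpretation is explicitly left to the caller:

```python
import csv
import io

data = "gene,count\nMARCH1,5.80\nNO,12\n"
rows = list(csv.reader(io.StringIO(data)))
print(rows)
# [['gene', 'count'], ['MARCH1', '5.80'], ['NO', '12']]
# Nothing was coerced: 'MARCH1' is not a date, 'NO' is not False,
# and '5.80' keeps its trailing zero.
```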


However....

Users BELIEVE Excel supports CSV files. That's the reality on the ground. Fighting against that is a losing battle.


I'd say the more general problem is a bad type system! In any language with a half decent type system where you can define `type country = Argentina | ... | Zambia` this would be correctly handled at compile-time, instead of having strange dynamic weak typing rules (?) which throw runtime errors in production (???).
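The same idea in Python terms would be a closed `Enum` rather than an ML-style sum type (the country list here is a hypothetical excerpt; a real one would be generated from ISO 3166):

```python
from enum import Enum

class Country(Enum):
    ARGENTINA = "AR"
    NORWAY = "NO"
    ZAMBIA = "ZM"

# Parsing is explicit: "NO" maps to a Country, never to a boolean,
# and an unknown code fails loudly instead of being silently coerced.
print(Country("NO"))  # Country.NORWAY
try:
    Country("XX")
except ValueError as e:
    print(e)  # complains that 'XX' is not a valid Country
```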

I would like to see how your solution handles the case of new countries or countries changing names. Recompile and push an update? If the environment is governmental, this can take a very very very long time.

The proper solution, in my opinion, is a lookup table stored in the database. It can be updated, it can be cached, it can be extended.

And for transfer of data, use formats to which you can attach a schema. This way type data is not lost on export. XML did this but everyone hates XML. And everyone hates XSD (the schema format) even more. However, if you use the proper tools with it, it is just wonderful.


An even more general problem is that we as humans use pattern-matching as a cerebral tool to navigate our environment, and sometimes the patterns aren't what they appear to be. The Norway problem is the programming equivalent of an optical illusion.

Good language design involves deliberately adding redundancy which acts like a parity bit in that errors are more likely to be detected.

That's an interesting statement to apply to natural languages.

Consider this headline in English: "Man attacks boy with knife". This can be read two ways, either the man is using a knife to attack the boy, or the boy had the knife and thus was being attacked.

The same sentence in Polish would make use of either genitive or instrumental case to disambiguate (although barely). However, a naive translation would only differ in the placement of a `z` (with) and so errors could still slip through. At least in this case the error would not introduce ambiguity, simply incorrectness.

Similar to language design we can also consider: does the inclusion/requirement of parity features reduce the expressivity of the language?


> does the inclusion/requirement of parity features reduce the expressivity of the language?

This was a real eye-opener for me when learning Latin in school: stylistic expressions such as meter, juxtaposition, symmetry are so much easier to include when the meaning of a sentence doesn't depend on word order.


> stylistic expressions such as meter, juxtaposition, symmetry are so much easier to include when the meaning of a sentence doesn't depend on word order.

Eh.... some things are easy and some things are hard in any language. The specifics differ, and so do the details of what kinds of things you're looking for in poetry. Traditional Germanic verse focuses on alliteration. Modern English verse focuses on rhyme. Latin verse focuses on neither. [1]

English divides poetically strong syllables from poetically weak syllables according to stress. It also has mechanisms for promoting weak syllables to strong ones if they're surrounded by other weak syllables.

In contrast, Latin divides strong syllables from weak syllables by length. Stress is irrelevant. But while stress can be changed easily, you're much more restricted when it comes to syllable length -- and so Publius Ovidius Naso is invariably referred to by cognomen in verse, because it isn't possible to fit his nomen, Ovidius, into a Latin metrical scheme. That's not a problem English has.

[1] I am aware of one exceptional Latin verse:

> O Tite, tute, Tati, tibi tanta, tyranne, tulisti.


The real problem here is that people use Excel to maintain data. Excel is terrible at that. But the fact that it may change data without the user being aware of it, is absolutely the biggest failing here.

The problem is more that it's insanely overpowered, while aiming for convenience out of the box. An "Excel Pro" version which takes away all the convenience and gives users the power to configure it precisely for their task might be a better solution. Funny part is, most of those things are already configurable now, but users are not educated enough about their tools to actually do it.

Excel allows people to maintain data all over the place. From golf league data to job actual data compared to estimates to so much more. And, excel is accessible enough that tens of millions (or maybe more) of people do it.

The one I’ve seen was a client who wanted to store credit card numbers in an Excel sheet (yes, I know this is a bad idea, but it was 15 years ago and they were a scummy debt collection call center). Excel stores numbers as double-precision floats with about 15 significant digits, which a 16 digit credit card number exceeds.

Now, you and I know this problem is solved by prepending ‘ to the number and it will be treated as a string, but your average Excel user has no understanding of types or why they might matter. Many engineers will also look past this when generating Excel reports.


And CUSIPs, which are strings, get converted to scientific notation.

https://social.msdn.microsoft.com/Forums/vstudio/en-US/92e0a...


Easiest solution is just to rename Norway.

"Renaming it to Xorway resulted in untold damages from computer bugs..." - Narrator

Norway Orway Xorway Nandway Andway

Yes, yes, I see... This could be problematic, indeed. If only there were a logical solution.


Regarding Excel: It also happens with Somalia, which makes this issue even stranger. Apparently because of "SOM".

There’s a really simple solution to this problem, which has been around since the 70’s: schemas.

So basically they renamed a gene because they had employees who were too stupid to use excel?

> they had to rename a gene to stop excel auto-completing

I can just about understand that "No" might cause a problem, but “Membrane Associated Ring-CH-Type Finger 1" being converted to MAR-1 defeats me.


>, but “Membrane Associated Ring-CH-Type Finger 1" being converted to MAR-1 defeats me.

No, that's not what's happening. To clarify...

If you type the 41-character-long string "Membrane Associated Ring-CH-Type Finger 1" into a cell -- Excel will not convert that to a date of MAR-1.

On the other hand, if you type the 6-character abbreviation "MARCH1", which looks like a realistic date -- Excel converts it to MAR-1.


> they had to rename a gene to stop excel auto-completing it into a date.

No one in their right mind uses a spreadsheet for data analysis. Good for working out your ideas, but not in a production environment. I figure Excel was chosen as the utility the scientists were most familiar with.

The proper tool for the job would be a database. I recall reading about a utility, a highly customized database with an interface that looks just like a spreadsheet.


The analysis itself isn’t (usually) happening in Excel.

A lot of tools operate on CSV files. People use Excel to peek at the results or prepare input for other tools, and that’s how the date coercion slips in.

Sometimes, people do use it to collate the results of small manual experiments, where a database might be overkill. Even so, the data is usually analyzed elsewhere (R, graphPad, etc).


>A lot of tools operate on CSV files.

The mistake was to believe that Excel can operate on CSV files. It doesn't support them in any meaningful way. It supports them in a "I can sort of pretend that I support CSV files" way.


What is a good alternative to Excel for working with CSV files? Excel sure isn't ideal, but it's always there as part of the MS Office suite, so I've never looked for anything else.

And yet, we are still being taught to use an Excel (2003) spreadsheet for data analysis... (Because that's what most businesses are still using!)

The world desperately needs a replacement for YAML.

TOML is fine for configuration, but not an adequate solution for representing arbitrary data.

JSON is a fine data exchange format, but is not particularly human-friendly, and is especially poor for editable content: Lacks comments, multi-line strings, is far too strict about unimportant syntax, etc.

Jsonnet (a derivative of Google's internal configuration language) is very good, but has failed to reach widespread adoption.

Cue is a newer Jsonnet-inspired language that ticks a lot of boxes for me (strict, schema support, human-readable, compact), but has not seen wide adoption.

Protobuf has a JSON-like text format that's friendlier, but I don't think it's widely adopted, and as I recall, it inherits a lot of Protobufisms.

Dhall is interesting, but a bit too complex to replace YAML.

Starlark is a neat language, but has the same problem as Dhall. It's essentially a stripped-down Python.

Amazon Ion [1] is neat, but I've not seen any adoption outside of AWS.

NestedText [2] looks promising, but it's just a Python library.

StrictYAML [3] is a nice attempt at cleaning up YAML. But we need a new language with wide adoption across many popular languages, and this is Python only.

Any others?

[1] https://amzn.github.io/ion-docs/

[2] https://nestedtext.org/

[3] https://github.com/crdoconnor/strictyaml/


Seems you're missing my personal favorite, extensible data notation - EDN (https://github.com/edn-format/edn). Probably I'm a bit biased coming from Clojure as it's widely used there but haven't really found a format that comes close to EDN when it comes to succinctness and features.

Some of the neat features: custom literals / tagged elements whose support can be added at runtime/compile time (dates can be represented, parsed and turned into proper dates in your language). Also, being able to namespace data inside of it makes things a bit easier to manage without having to resort to nesting or other hacks. Very human friendly, plus machine friendly.

Biggest drawback so far seems to be parsing performance, although I'm not sure if that's actually about the format itself, or about the small adoption of the format meaning not many parsers focusing on speed have been written.


Your list is like a graveyard of my dreams and hopes. Anything that doesn't validate the format of the underlying data is pretty much dead to me...

The problem with most of these is they're useless to describe the data. Honestly, it is completely not useful to have the following to describe data:

email => string

name => string

dob => string

IMHO, it is akin to having a dictionary (like Oxford English) read like:

email - noun

name - noun

birthday - noun

It says next to nothing except, yes, they are nouns. All too often I waste time fighting nils and bullshit in fields or duplicating validation logic all over the place.

"Oh wow, this field... is a string..? That's great... smiles gently except... THERE SHOULD NOT BE EMOJI IN MY FUCKING UUID, SCHEMA-CHUD. GET THE FUCK OFF MY LAWN!"


It sounds to me like XML with a DTD & XSD would solve your problem. XML is no longer fashionable, but its validation is Turing-complete.

If you want automatic built-in string validation, one option that seems particularly interesting is to use a variant of Lua patterns, which are weaker and easier to understand than regular expressions, but still provide a significant degree of "sanity" for something like an email. The original version works on bytes and not runes, but you could simply write a parser that works on runes instead, and the pattern-matching code is just 400 old and battle-tested lines of C89. You might want to add one extension: allow for escape sequences to be treated as a single character (hence included in repetition operators and adding the capability to match quoted strings); with this extension, I think you could implement full email address validation:

https://i.stack.imgur.com/YI6KR.png

Lua patterns have also shown up in other places, such as BSD's httpd, and an implementation for Rust:

https://www.gsp.com/cgi-bin/man.cgi?section=7&topic=PATTERNS

https://github.com/stevedonovan/lua-patterns

http://lua-users.org/wiki/PatternsTutorial


Amazon Ion [1] supports schema [2] and it all looks quite nice to me. Maybe it deserves wider adoption.

[1] https://amzn.github.io/ion-docs/ [2] https://amzn.github.io/ion-schema/


My experience is that validation quickly becomes surprisingly complex, to the point of being infeasible to express in a message format.

Not only are the constraints very hard to express (remember that one 2000 char regexp that really validates email addresses?), they are also contextual: the correct validation in an Android client is not the same as on the server side. Eg you might want to check uniqueness or foreign key constraints that you cannot check on the client. Sometimes you want to store and transmit invalid messages (eg partially completed user input). And then you have evolving validation requirements: what do you do with the messages from three years ago that don't have field X yet?

Unfortunately I don't think you can express what you need in a declarative format. Even minimal features such as regexp validation or enums have pitfalls.

I think it's better to bite the bullet and implement the contextually required validation on each system boundary, for any message crossing boundaries.


I agree with this, something RON/JSON-like with type annotations would be great:

    {
      "isTrue":false:Boolean,
      "id":"123e4567-e89b-12d3-a456-426614174000":UUID
    }

Sounds like your issue is that UUID is NOT a string, but a 128-bit integer?

>THERE SHOULD NOT BE EMOJI IN MY FUCKING UUID

thanks for the lolz


Still early, but here's my baby I hope can improve things:

website with grammar spec: https://tree-annotation.org/

prototype of a JSON/YAML alternative for JS: https://github.com/tree-annotation/tao-data-js

same thing, even less finished for C#: https://github.com/tree-annotation/tao-data-csharp

working on it constantly, more to come soon


XML and XML Schema solved this more than 20 years ago. It had to be replaced with JSON by the web developers though, so they could just “eval() it” to get their data.

XML with RelaxNG (https://relaxng.org/) would have made life so much better than using XML Schema, but, as they say, that ship has long since sailed.

All except the easily written by humans part. Which is kind of a key part.

If all the smart people like you used XML, how come it was so painful to use and it died?

Because it offered all those things the parent mentioned, but that made it too complex. You either provide a schema and get the benefits of it describing your data, or you don't.

I had a chance to use SOAP at one point. It was an F5 device, and I used a Python library. What I really liked is that when it connected to the device, it downloaded the schema, and then used that to generate an object. At that point you just communicated with the device like you did with any object in Python.

We abandoned it for inferior technologies like REST and JSON, because they were harder to use from JS, as parent mentioned.


Parent didn't say it was harder to use from JS. Parent said "It had to be replaced with JSON by the web developers though, so they could just “eval() it” to get their data."

First of all, I was there 20 years ago. I had to deal with XML, XSLT, one kind of Java XML parsers that didn't fully do what I needed, another kind of Java XML parsers that didn't fully do what I needed. And oh boy was it a pain. I just wanted to get a few properties of a bunch of entities in a bigger XML document, that's all. Big fail.

Second, JSON always had a parser in JS, so I don't know where that eval nonsense is coming from.

Third, JS actually had the best dev UX for XML of all languages 20 years ago. Maybe you know JavaScript from Node.js, but 20 years ago it ran exclusively in web browsers, which even then were pretty good at parsing XML documents. The browser of course had a JS DOM traversal API known to every single JS developer, and very soon (although TBH I can't remember if before or after JSON) it also had XPath querying functions, all built in.

XML was so bad that its replacement came from the language where it was actually easiest to use. Think about that for a second.

So the answer to the question "Why was XML replaced?" is not "Because webdevs lol".

I suspect it was because it has both content and attributes, which all but guarantees it's impossible to create a bunch of simple, common data structures from it (like JSON does).


> Second, JSON always had a parser in JS, so I don't know where that eval nonsense is coming from.

Firstly, it sounds like XML ran over your dog or something. Sorry to hear about that. It wasn’t particularly hard to use at all, and if you’re dealing with the possibility of emojis in your JSON UUIDs in 2021, one might even say it’s easier to use.

If you’re referring to JSON.parse() in “had a parser” above, then you have a temporal problem. Regarding eval(), it’s suggested right in the original RFC for JSON. Check it out. Web developers at the time were following that advice.


Another issue is that due to their age, a lot of XML tools ignore the existence of Unicode (or UTF-8).

> The world desperately needs a replacement for YAML.

The world desperately needs support for YAML 1.2, which solves the problems the article addresses fairly completely (largely in the “default” Core schema[0], but more completely with the support for schemas in general), plus a bunch of others, and has for more than a decade. But YAML 1.2 libraries aren’t available for most languages.

[0] Not actually an official default, but it reflects a cleanup of the YAML 1.1 behavior without the optional types, so it's default-ish. Back when it looked like YAML 1.3 might happen in some reasonably-near future, team members indicated that the JSON schema for YAML (not to be confused with the JSON Schema spec) would be the explicit default YAML schema in 1.3, which has a lot to recommend it.
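The 1.1-vs-1.2 difference is visible in the boolean-resolution rules themselves. A rough sketch in plain Python (the regexes paraphrase the YAML 1.1 bool type and the 1.2 Core schema; no YAML library needed):

```python
import re

# YAML 1.1 resolves y/n/yes/no/on/off (in several casings) to booleans;
# the YAML 1.2 Core schema only resolves true/false.
BOOL_1_1 = re.compile(
    r"^(?:y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF)$")
BOOL_1_2 = re.compile(r"^(?:true|True|TRUE|false|False|FALSE)$")

for scalar in ("NO", "on", "true"):
    print(scalar,
          bool(BOOL_1_1.match(scalar)),   # 1.1: treated as a bool?
          bool(BOOL_1_2.match(scalar)))   # 1.2 Core: treated as a bool?
# NO   True  False  -> only YAML 1.1 eats Norway
# on   True  False
# true True  True
```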


Nope nope nope. YAML is awful and needs to die. The more you look at it the worse it gets. The basic functionality is elegant (at least until you consider stuff like The Norway Problem), but the advanced parts of YAML are batshit insane.

“The Norway Problem" is a YAML 1.1 problem, of which there are many.

What advanced parts of YAML are you talking about that remain problems in YAML 1.2?


From the article:

> The most tragic aspect of this bug, however, is that it is intended behavior according to the YAML 2.0 specification.


The article is simply, factually wrong; there is no “YAML 2.0 specification” [0], and everything they point to is YAML 1.1, and addressed in YAML 1.2 (the most recent YAML spec, from 2009.)

[0] https://yaml.org/


You seem pretty quick to disregard TOML. I switched all my JSON and YAML for TOML. Do you care to detail what is missing?

TOML quickly breaks down with lots of nested arrays of objects. For example:

    a:
      b:
      - c: 1
      - d:
        - e: 2
        - f:
            g: 3
Turns into this, which is unreadable:

    [[a.b]]
    c = 1

    [[a.b]]
    [[a.b.d]]
    e = 2

    [[a.b.d]]
    [a.b.d.f]
    g = 3

TOML also has a few restrictions, such as not supporting mixed-type arrays like [1, "hello", true], or arrays at the root of the data. JSON can represent any TOML value (as far as I know), but TOML cannot represent any JSON value.

At my company we use YAML a lot for table-driven tests (e.g. [1]), and this not only means lots of nested arrays, but also having to represent pure data (i.e. the expected output of a test), which requires a format that supports encoding arbitrary "pure" data structures of arrays, numbers, strings, booleans, and objects.

[1] https://github.com/sanity-io/groq-test-suite/


Looks fine to me:

    [[a.b]]
    c = 1
    d = [
       { e = 2 },
       { f = { g = 3 } }
    ]

An improvement, but the original YAML is still significantly better, in my opinion.

Also many (most? all?) serializers don't let you control which fields are serialized inline vs not. So if you have a program that generates configuration, you're going to end up with the original unreadable form anyway.

S-expressions are super easy to parse and are fairly easy for humans to read. See e.g. using s-expressions in OCaml: https://dev.realworldocaml.org/data-serialization.html

Apropos of this, in Clojure-land the idiomatic serialization is, EDN [1], which is pretty ergonomic to work with IMO, since in most cases it is the same as a data-literal in Clojure.

My feeling is that :keywords reduce the need and temptation to conflate strings and boolean/enumerations that occurs when there's no clear way to convey or distinguish between a string of data and a unique named 'symbol'. I miss them when I'm in Pythonland.

[1] https://www.compoundtheory.com/clojure-edn-walkthrough/


S-expressions inherit all the trouble with data types from JSON (dates, times, booleans, integer size, number vs. numeric string).

You get neat ways of nesting data, but that is not enough for a robust and mistake-resilient configuration language.

The problem isn't parsing in itself. The problem is having clear semantics, without devolving into full SGML DTDs (or worse still, XML Schemas).
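To the point about parsing being the easy part: a toy s-expression reader fits in a dozen lines of Python (`parse_sexp` is invented here as a sketch), yet everything it returns is still an untyped string:

```python
def parse_sexp(src: str):
    """Minimal s-expression reader: returns nested lists of string atoms."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        return tokens[pos], pos + 1  # every atom is just a string

    expr, _ = read(0)
    return expr

print(parse_sexp("(country NO (versions 10.1 10.10))"))
# ['country', 'NO', ['versions', '10.1', '10.10']]
# The *shape* is unambiguous, but NO, 10.1 and 10.10 are still bare
# strings: whether they are bools, floats or versions is out of band.
```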


> S-expressions inherits all trouble with data types from json (dates, times, booleans, integer size, number vs numeric string).

Hm, not sure that's true: S-expressions only define the "shape" of the data, not its semantics. EDN https://github.com/edn-format/edn is for all purposes S-expressions, and it has support for custom literals and more, to avoid "the trouble with data types from JSON".


Yes, EDN is S-expressions plus a bunch of semantic rules. Parsing EDN is quite a bit more complex than just parsing S-expressions, just because you need to support a bunch of built in types, as well as arbitrary exensions through 'tags'.

The tag system is quite brilliant though.


Jsonnet hasn't taken off because it's Turing-complete. It's a really great language for generating JSON, but not a replacement for JSON.

I’ve used most of the technologies you listed. Cue is the best, and the only one with strong theoretical foundations. I’ve been using it for some time now and won’t go back to the others.

> The world desperately needs a replacement for YAML.

For situations like TFA you really want a configuration language that behaves exactly like you think it will, and since you don't have to interop with other organizations you don't really need a global standard.

Moreover, broadly used config languages can be somewhat counterproductive to that goal. Take JSON as an example; idiomatic JSON serdes in multiple programming languages has discrepancies in minint, maxfloat, datetime, timezone, round-tripping, max depth, and all kinds of other nuanced issues. Existing tooling is nice when it does what you expect, but for a no-frills, no-surprises configuration language I would almost always just prefer to use the programming language itself or otherwise write a parser if that doesn't suffice (e.g., in multilingual projects).

Mildly off-topic: The problem here, more or less, was that the configuration change didn't have the desired effect on an in-memory representation of that configuration. We can mitigate that at the language level, but as a sanity check it's also a good idea to just diff the in-memory objects and make sure the change looks kind of like what you'd expect.


You don't need wide adoption for internal projects in an organization, but you do want great toolchain support.

For example, the fact that NestedText is a Python library means a Python team could use it, but it's a poor fit for an organization whose other teams use Go and JavaScript/TypeScript.

We use YAML for much more than configuration, by the way. I feel like YAML hits a nice sweet spot where it's usable for almost everything.


> and since you don't have to interop with other organizations

Until you have to, and all hell breaks loose?

Now, the example of codepages maybe isn't really appropriate to companies, but is still a good enough metaphor?


I don't think YAML is going anywhere, largely because it was the first format to prioritize readability and conciseness, and has used that advantage to achieve critical mass.

It's far more productive to push for incremental changes to the YAML spec (or even a fork of it) to make it more sane and better defined. Things like a StrictYAML subset mode for parsers in other popular languages.


> It's far more productive to push for incremental changes to the YAML spec

The problems this article raises and strictyaml purports to address were addressed in YAML 1.2, already supported in Python via ruamel.yaml. YAML 1.2 addresses much of this in the Core schema, which is the closest successor to the default behavior of earlier spec versions, and does so more completely in its support for schemas generally, which define both the supported "built-in" tags (roughly, types) and how they are matched from the low-level representation, which consists only of strings, sequences, and maps. (Incidentally, those are the only three tags of the "Failsafe" schema; there's also a "JSON" schema between Failsafe and Core, which has tags corresponding to the types supported by JSON.)


JSON5 is the best option currently. A fair number of tools in the JS ecosystem support it.

JSON5 is better than JSON on my points, but it has downsides compared to YAML. For example, YAML is very good at multiline strings that don't require any sort of quoting, and knows to remove preceding indentation:

  foo: |
    "This is a string that goes across
    multiple lines," he wrote.
   
In JSON5, you'd have to write:

  {
    foo: "\"This is a string that goes across \
  multiple lines,\" he wrote."
  }
This sort of ergonomic approach is why YAML is so well-liked, I think. (Granted, YAML's use of obscure Perl-like sigils to indicate whitespace mode is annoying, but it does cover a lot of situations.)
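The `|` behavior amounts to stripping the block's indentation and keeping the newlines. A toy sketch of that rule (not a YAML parser, just an illustration of what the block scalar buys you):

```python
def literal_block(indented_lines, indent):
    """Rough sketch of YAML's '|' literal block scalar: drop the common
    indentation, preserve newlines, no quoting or escaping required."""
    return "\n".join(line[indent:] for line in indented_lines) + "\n"

lines = ['    "This is a string that goes across',
         '    multiple lines," he wrote.']
print(literal_block(lines, 4))
```

The embedded double quotes pass through untouched, which is exactly what the JSON5 version above has to escape.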

YAML is also great at arrays, mimicking how you'd write a list in plaintext:

  foo:
  - "hello"
  - 42
  - true

Also RON: https://github.com/ron-rs/ron

A bit like JSON5, but I believe even more advanced.


You might look at JSON Next variants (if you remember - "classic" JSON is a subset of YAML), see https://github.com/json-next/awesome-json-next

My own little JSON Next entry / format is called JSON 1.1 or JSONX, that is, JSON with eXtensions, see https://json-next.github.io


The list is missing http://www.relaxedjson.org/

Also, there's no explanation what <..-..> and <..+..> do.


I will keep using YAML because I don't want to learn the pitfalls of your alternatives. With YAML everyone is complaining about the pitfalls, and therefore everyone is aware of them. A random replacement may not have this particular problem, but it may have other problems that remain unknown.

Thanks for this list, I’ve never heard of Ion. I’ll consider it for config and even replacing Avro & Protobuf in future projects.

Besides this issue, what's wrong with YAML?

YAML had a worse example, once.

For the ease of entering time units YAML 1.1 parsed any set of two digits, separated by colons, as a number in sexagesimal (base 60). So 1:11:00 would parse to the integer 4260, as in 1 hour and 11 minutes equals 4260 seconds.

Now try plugging MAC addresses into that parser.

The most annoying part is that the MAC addresses would only be mis-parsed if there were no hex digits in the string. Like the bug in this post, it could only be reproduced with specific values.

Generally, if you're doing implicit typing, you need to keep the number of cases as low as possible, and preferably error out in case of ambiguity.
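The YAML 1.1 base-60 rule above is easy to sketch in plain Python (a simplified illustration, not the actual PyYAML resolver):

```python
def parse_sexagesimal(scalar: str) -> int:
    """Mimic YAML 1.1's base-60 integer rule: colon-separated digit groups,
    each group treated as one base-60 'digit'."""
    value = 0
    for group in scalar.split(":"):
        value = value * 60 + int(group)
    return value

print(parse_sexagesimal("1:11:00"))   # 4260 -- 1 hour 11 minutes, in seconds
print(parse_sexagesimal("00:11:22"))  # 682 -- an all-digit MAC address fragment
```

A MAC address like aa:bb:cc:dd:ee:ff fails the digit check and stays a string, which is why the bug only fires on the unlucky all-numeric addresses.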


> For the ease of entering time units YAML 1.1 parsed any set of two digits, separated by colons, as a number in sexagesimal (base 60).

This is a mind-boggling level of idiocy. Even leaving aside the MAC address problem, this conversion treats "11:15" (= 675) different from "11:15:00" (= 40500), even though those denote the same time, while treating "00:15:00" (15 minutes past midnight) and "15:00" (3 in the afternoon) the same.


You know you've fucked up when you have to remove features from the spec (which they did in YAML 1.2).

On the other hand, you know that you did well, when a direct competitor would look exactly the same minus some undesired features.

> YAML had a worse example, once.

It had it literally at the same time as it had the problem in the article (the article refers to YAML 2.0, a nonexistent spec, and to PyYAML, a real parser which supports only YAML 1.1).

Both the unquoted-YES/NO-as-boolean and sexagesimal literals were removed in YAML 1.2. (As was the 0-prefixed-number-as-octal mentioned in a sibling comment.)


One that really surprised/confused me was that PyYAML (and the YAML spec) attempts to interpret any 0-prefixed string as an octal number.

There was a list of AWS account IDs that parsed just fine until someone added one that started with a 0 and had no digits greater than 7 in it, after which our parser started spitting out decidedly different values than we were expecting. Fixing it was easy, but figuring out what in the heck was going on took some digging.
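A simplified sketch of the trap (just the octal branch of YAML 1.1-style implicit typing, with everything else left as a string -- not the real resolver):

```python
def resolve_scalar(scalar: str):
    """If a 0-prefixed scalar contains only octal digits, implicit typing
    silently reinterprets it as a base-8 integer."""
    if len(scalar) > 1 and scalar[0] == "0" and all(c in "01234567" for c in scalar[1:]):
        return int(scalar, 8)
    return scalar  # everything else stays a string in this sketch

print(resolve_scalar("0123456712"))  # 21913034 -- the account ID is gone
print(resolve_scalar("0123456789"))  # '0123456789' -- an 8 or 9 saves it
```
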


We had a Grafana dashboard where one of the columns was a short Git hash. One day, a commit got the hash `89e2520`, which Grafana's frontend helpfully decided to display as "+infinity". Presumably it was parsing 89E+2520.
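Python's float parser shows the same failure mode (an illustration of the general pitfall, not Grafana's actual code path):

```python
import math

git_hash = "89e2520"     # a perfectly ordinary short commit hash
value = float(git_hash)  # read as 89 x 10^2520, which overflows to infinity
assert math.isinf(value)

# A hash containing any other letter can't be misread as scientific notation:
try:
    float("a3f9c01")
except ValueError:
    print("not a number")
```
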

Ha, that reminds me of some work I was doing just yesterday, implementing a custom dictionary for a postgres full text search index. Postgres has a number of mappings that you can specify, and it picks which one based on a guess of what the data represents. I got bit by a string token in this same format, because it got interpreted as an exponential number.

Sounds like the core issue is that a hexadecimal number was encoded as a string?

slightly related, on my microwave 99 > 100, even 61 > 100

I try to optimize my microwave button pushing too. I also have a +30 seconds button, so for 1:30 I can hit "1,3,0,Start" or "+30" three times and save a press!

Why does your microwave compare numbers?

It doesn’t compare them, it just counts down.

If I enter 1-3-0-start, I get 90 seconds of cooking. If I enter 9-9-start, I get 99 seconds of cooking, so in that sense, 99 > 130.

If I want about 90 seconds, I’ll use 88 as it’s faster to enter (fewer finger movements).
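That ordering quirk is just arithmetic on the digit entry (a toy model of the keypad, not any particular microwave's firmware):

```python
def cook_seconds(keypad_digits: str) -> int:
    """The last two digits are 'seconds' (which may exceed 59); the rest are minutes."""
    minutes, seconds = divmod(int(keypad_digits), 100)
    return minutes * 60 + seconds

print(cook_seconds("130"))  # 90 -- 1 minute 30 seconds
print(cook_seconds("99"))   # 99 -- so entering 99 cooks longer than 130
print(cook_seconds("61"))   # 61 -- and 61 "beats" 100 (which is only 60 seconds)
```
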


I've done the same thing for decades! Soul mates?

You might like this one as well.

Load soap into the dishwasher after emptying rather than after loading. If the soap dispenser is closed, the dishes are dirty.


My rule is that loading the dishwasher means that one loads all the available dishes, and runs it, even if it's only x% full. We use the (large) sink as an input buffer.

If the dishwasher has dishes in it and it's not running, they're clean.


This is exactly our algorithm as well. I can't really imagine flipping it the other way, since leaving dirty dishes in a dishwasher will just let them completely dry out, making it more likely they won't get fully clean when the cycle is eventually run.

Rinse until visually clean, then put in dishwasher.

This doubles the time required to do the dishes, defeating much of the purpose of the dishwasher.

Idk, to me it's not about time but effort. Rinsing is just pleasant.

That’s not a zero-copy algorithm. The algorithm with using the soap dispenser being closed as a flag is zero-copy.

I want to have two dishwashers. One with the dirty dishes and one with the clean dishes. So you never have to put the dishes away. They go from the clean dishwasher to the table to the dirty one. And then flip them.

This idea comes up periodically on Reddit. [0] has a few posts from people who have installed them, mostly for bachelors.

[0] https://www.reddit.com/r/self/comments/ayr9c/when_im_rich_im...


There’s a community near here with a high fraction of Orthodox Jews. One condo I toured in my 20s had two dishwashers and without thinking about why they did it, I commented how I thought that was awesome that you’d never need to put dishes away. (They of course installed two dishwashers for orthodox separation of dishes from each other.)

Blasphemy! I do the inverse. You're wrong. /s

insert code flame war here


Vi Hart - "How to Microwave Gracefully"

https://www.youtube.com/watch?v=T9E0zSpULFY


Not the OP, but I have the same problem. For some reason that escapes me, pressing the “10 sec” button 7 times produces 00 70 instead of 01 10. If you then press the “1 min” button you get 01 70

Most microwaves (in the USA) do this, at least in my experience.

They treat the ":" like a sum of two sexagesimal numbers, rather than a sexagesimal digit separator.


How else would you prove it's turing complete and can run Doom?

The worst tragedy of this is the security implications of subtly different parsers. As your application surface increases, you're likely to mix languages (and thus different parsers), which means that the same input data will produce different output data depending on whether your parser replaces, truncates, ignores, or otherwise attempts to automatically "fix up" the data. A carefully crafted document could exploit this to trick your data storage layer into storing truncated data that elevates privileges or sets zero cost, while your access control layer that ignores or replaces the data is perfectly happy to let the bad document pass by.

And here's something else to keep you up at night: Just think of how many unintentional land mines lurk in your serialized data, waiting to blow up spectacularly (or even worse, silently) as soon as you attempt to change implementation technologies!

This is why I've been so anal about consistent decoder behavior in Concise Encoding https://github.com/kstenerud/concise-encoding/blob/master/ce...

https://concise-encoding.org/


This is exactly why configuration/serialization formats should make as few assumptions about value types as possible. Once parsing's done, everything should be a string (or possibly a symbol/atom, if the program ingesting such a file supports those), and it should be up to the application to convert values to the types it expects. This is Tcl's approach, and it's about as sensible as it gets.

...which is why it pains me to admit that in my own project for a Tcl-like scripting/config language[1] I missed the float v. string issue, so it'll currently "cleverly" return different types for 1.2 (float) v. 1.2.3 (atom). Coincidentally, I started work on a "stringy" alternative interpreter that hews closer to Tcl's philosophy (to fix a separate issue - namely, to avoid dynamically generating atoms, and therefore avoid crashing the Erlang VM when given potentially-adversarial input), so I'm gonna fix that case for at least the "stringy" mode (by emitting strings instead of numbers, too), knocking out two birds with one stone for the upcoming 0.3.0 release :)

----

[1]: https://otpcl.github.io, for those curious


It’s reasons like this that I want my configuration languages to be explicit and unambiguous. This is why I use JSON or if I want a human friendly format, TOML. Strings are always “quoted” and numbers are always unquoted 1.2, it can never accidentally parse one as the other. The convenience of omitting quotes is just not worth the potential for ambiguity or edge cases to me.

> Once parsing's done, everything should be a string

Or give a schema to the parser, defining what type is expected in each field.


Yes, that looks like a right way to handle this problem without ignoring YAML spec. Define what to parse upfront.
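A string-first parse plus an explicit, upfront schema can be sketched like this (hypothetical helper names, not any particular library's API):

```python
def apply_schema(raw: dict, schema: dict) -> dict:
    """Coerce string-only parsed values using an explicit schema, so the
    parser never has to guess a value's type."""
    converters = {
        "str": str,
        "int": int,
        "float": float,
        "bool": lambda s: {"true": True, "false": False}[s.lower()],
    }
    return {key: converters[schema[key]](value) for key, value in raw.items()}

raw = {"country": "NO", "port": "8080"}     # everything is a string after parsing
schema = {"country": "str", "port": "int"}
print(apply_schema(raw, schema))            # {'country': 'NO', 'port': 8080}
```

Norway survives because the schema, not the scalar's spelling, decides that `country` is a string.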

> The most tragic aspect of this bug, however, is that it is intended behavior according to the YAML 2.0 specification.

This is one of those great ideas that sadly one needs experience to realize are really bad ideas. Every new generation of programmers has to relearn it.

Other bad ideas that resurface constantly:

1. implicit declaration of variables

2. don't really need a ; as a statement terminator

3. assert should not abort because one can recover from assert failures


I agree with the general observation, but the need for ";" ? Quite a few languages (over a few generations) have been doing fine without the semicolon. Just to mention two: python and haskell. (Yes, python has the semicolon but you'll only ever use it to put multiple statements on a single line.)

> Yes, python has the semicolon but you'll only ever use it to put multiple statements on a single line.

This is also true of Haskell btw.


Another interesting example is Lua. It's a free-form language without semicolons. It's not indentation-sensitive.

Lua does have semicolons!

It even has semicolon insertion, but because the language is carefully designed, this doesn't cause problems, and most users can go a lifetime without knowing about it.

Our coding style requires semicolons for uninitialized variables, so you'll see

    local x;
    if flag then
       x = 12 
    else
       x = 24
    end

As a way of marking that the lack of initialization is deliberate. `local x = nil` is used only if x might remain nil.

I don't like saying that it's semicolon insertion because it might give people the idea that the semicolons work similarly to JavaScript. In Lua, inserting a semicolon is always optional and it's a stylistic matter (like in your example). It even allows putting multiple statements on the same line without a semicolon.

    -- Two assignment statements
    x = 10 y = 20

> I agree with the general observation, but the need for ";" ? Quite a few languages (over a few generations) have been doing fine without the semicolon. Just to mention two: python and haskell. (Yes, python has the semicolon but you'll only ever use it to put multiple statements on a single line.)

But then it's inconsistent and has unnecessary complexity because now there's one (or more) exceptions to the rules to remember: when the ';' is needed. And of course if you get it wrong you'll only discover it at runtime.

"Consistent applications of a general rule" is preferable to "An easier general rule but with exceptions to the rule".


Have you ever used Python? If you did you really wouldn't be saying this. There isn't an exception. The semicolon is used to put multiple statements on a single line. That's its only use, and that's the only time it's 'needed' - no exceptions.

But python has instead the "insert \ sometimes" rule, which isn't better.

> Have you ever used Python? If you did you really wouldn't be saying this. There isn't an exception.

For the ';', perhaps not. For the token that is used to terminate (or separate) statements? Yes, the ';' is an exception to the general rule of how to terminate statements.

The semicolon also works on some sorts of statements and not others, throwing errors only at runtime.

It's easier to remember one rule than many.


Honestly, the rule is "don't use semicolons in Python". I don't think there's a single one in the large codebase I work with, and there's really no reason at all to use it other than maybe playing code golf.

It's not a language in which you ever need be saving bytes on the source code. Just use a new line and indent. It's more readable and easier.


There are no exceptions. You only need it if/when you want to put multiple statements on a single line. That's its sole purpose.

And I'd also add that it's something that you almost never do. One practical use is writing single line scripts that you pass to the interpreter on the command line. E.g. `python -c 'print("first command"); print("second command")'`

If you don't know about the `;` at all in python then you are 100% fine.


When you use ; and possibly {, }, code statements / blocks are specified redundantly (indentation + separators), which can cause inconsistent interpretation of code by compiler / readers.

I find it much, much easier to look at code and parse blocks via indentation, than the many ways and exceptions of writing ; and {, }, while an extra or missing ';' or {} easily remains unspotted and leads to silly CVEs.


Haskell has the semicolon for the same reason!

> implicit declaration of variables

This is so true. I really like Julia and I know that explicitly declaring variables would be detrimental to adoption but I prefer it to the alternative, which is this: https://docs.julialang.org/en/v1/manual/variables-and-scopin...


What do think of implicit member access (C++, Java, C#) vs explicit (python, javascript)? Is there a concrete argument one way or the other?

I feel like I prefer explicit

    self.member = value
    this.member = value
vs implicit

    member = value
But clearly C++/Java/C# people are happy with implicit ... though many of them try to make it explicit by using a naming convention.

That was my single biggest pet-peeve of C++. A variable appears in the middle of a member function? Good luck figuring out what owns it. Is it local? Owned by the class? The super-class? (And in that case - which one?)

The added mental load of tracking variables' sources builds up.


FWIW, most C++ style guides recommend writing member variables like mVariableName or variable_name_ so they're easy to distinguish from local variables, and modern C++ doesn't generally make much use of inheritance, so there's usually only one class it could belong to.

The fact that people introduce naming conventions to keep track of member variables is probably the biggest condemnation of implicit member access. People clearly need to know this, so you'd better make it explicit.

It's actually a bit surprising that this is one thing that javascript does better than Java. In most other areas, it's Java that's (sometimes overly) explicit.


I can tell for certain that as a JS/Python man, every time I look through Java code I have to spend a bit of time when stumbling upon such access, until I remember that it's a thing in Java. Pity that Kotlin apparently inherited it.

But at least, to my knowledge, in Java these things can't turn out to be global vars. Having this ‘feature’ in JS or Python would be quite a pain in the butt.


F#, Kotlin, Python, Nim and many others all seem to get by fine without semicolons as statement terminators.

In Python, a newline is a token and serves as a statement terminator.

What I'm referring to is the notion that:

    a = b c = d;
can be successfully parsed with no ; between b and c. This is true, it can be. But then it makes errors difficult to detect, such as:

    a = b
    *p;
Is that one statement or two?

This is one of those great ideas that sadly one needs experience to realize are really bad ideas. Every new generation of programmers has to relearn it.

It's a bad idea because ASCII already includes dedicated characters for field separator, record separator and so on. These could easily be made displayable in a text editor if you wanted just as you can display newlines as ↲. Anyone who invents a format that involves using normal printable characters as delimiters and escaping them when you need them, is, I feel very confident in saying, grotesquely and malevolently incompetent and should be barred from writing software for life. CSV, JSON, XML, YAML, all guilty.


The obvious first step toward the brighter future is to refrain from using any and all software that utilizes the malevolent formats you mentioned. Doing otherwise would mean simply being untrue to one's own conscience and word.

> It's a bad idea because ASCII already includes dedicated characters for field separator, record separator and so on.

ASCII is over 60 years old and separators haven't caught on yet; what's different now?

> These could easily be made displayable in a text editor if you wanted just as you can display newlines as ↲.

Can you name a common text editor with support for ASCII separators? It's a lot easier to use delimiters and escaping than to change every text editor in the world.

> Anyone who invents a format that involves using normal printable characters as delimiters and escaping them when you need them, is, I feel very confident in saying, grotesquely and malevolently incompetent and should be barred from writing software for life. CSV, JSON, XML, YAML, all guilty.

All of the formats you rant about are widely used, well supported, and easy to edit with a text editor - none of these are true of ASCII separators. People chose formats they can edit today instead of formats they might be able to edit in the future. All of these formats have some issues but none of the designers were incompetent.


US-ASCII only has four information separators, and I believe they can only be used in a four-layer schema with no recursion, sort of like CSV (if your keyboard didn’t have a comma or quote or return key). When you need to pass an object with records of fields inside a field you’re out of luck, and everyone has to agree on quoting or encoding or escaping again.

I think SGML (roll your own delimiters and nesting) was pretty close to the Right Thing,™ but ISO has the specs locked down so everyone had a second-hand understanding of it.


how do you write them though

Ctrl-\, Ctrl-], Ctrl-^ and Ctrl-_ for file, group, record and unit separator, respectively.

However, your tty driver, terminal or program are all likely to eat them or munge them. Also, virtually nothing actually uses these characters for these purposes.
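Using them in code is simple enough, even if nothing displays them nicely (a minimal sketch):

```python
US, RS = "\x1f", "\x1e"  # ASCII unit separator and record separator

records = [["country", "NO"], ["port", "8080"]]
encoded = RS.join(US.join(fields) for fields in records)

decoded = [record.split(US) for record in encoded.split(RS)]
print(decoded)  # round-trips without any quoting or escaping
```

Of course, the moment a field can itself contain a separator byte you're back to escaping, which is part of why these never caught on.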


virtually nothing actually uses these characters for these purposes.

Right. Which is why we have all these hilarious escaping and interpolation problems. Any why programmers will never be taken seriously by real engineers. It's like we have cement mixed and ready to go but we decide to go and forage for mud instead and think that makes us cleverer than the cement guys.


> your tty driver, terminal or program are all likely to eat them or munge them

Maybe that has something to do with this?


I’m surprised that with your experience you come to such unbalanced conclusions. Everything in engineering is about trade-offs, and while your conclusions may be indisputable for the design goals of D, they may be wrong in other contexts.

1. If I scribble some one-time code, the probability of an error coming from implicit declarations is of the same order of magnitude as missing edge cases or not getting the algorithm right, for most people. The extra convenience may well be worth it.

2. I would relax this: it should be clear to the programmer where a statement ends.

3. Going on with a warning is a sane strategy in some situations. I'd happily ruin my car engine to drive out of the desert. The assert might have been too strict, and I know something about the data, so the program can ignore the assert failure.


> 1. If I scribble some one time code

.... and here is another entry for Walter's list of bad ideas:

4. "It's okay. I will use this code only once"


My favorite Red Green quote is “now, this is only temporary … unless it works.”

Your rationale in this and your followups are exactly what I'm talking about.

1. You're actually right if the entire program is less than about 20 lines. But bad programs always grow, and implicit declaration will inevitably lead you to have a bug which is really hard to find.

2. The trouble comes from programmer typos that turn out to be real syntax, so the compiler doesn't complain, and people tend to be blind to such mistakes so don't see it. My favorite actual real life C example:

    for (i = 0; i < 10; ++i);
    {
        do_something();
    }
My friend who coded this is an excellent, experienced programmer. He lost a day trying to debug this, and came to me sure it was a compiler bug. I pointed to the spurious ; and he just laughed.

(I incorporated this lesson into D's design, spurious ; produce a compiler error.)

3. I used to work for Boeing on flight critical systems, so I speak about how these things are really designed. Critical systems always have a backup. An assert fail means the system is in an unknown, unanticipated state, and cannot be relied on. It is shut down and the backup is engaged. The proof of this working is how incredibly safe air travel is.


> 3. I used to work for Boeing on flight critical systems, so I speak about how these things are really designed. Critical systems always have a backup. An assert fail means the system is in an unknown, unanticipated state, and cannot be relied on. It is shut down and the backup is engaged.

I ask you to reconsider your assumptions. How did this play out in the 737 MAX crashes? Was there a backup AoA sensor? Did MCAS properly shut down and backup engaged? Was manual overriding the system not vital knowledge to the crew?

You don’t have to answer. I probably wouldn’t get it anyway.

But rest assured that I won’t try to program flight control and I strongly appreciate your strive for better software.


> How did this play out in the 737 MAX crashes?

They didn't follow the rule in the MCAS design that a single point of failure cannot lead to a crash.

> Was manual overriding the system not vital knowledge to the crew?

It was, and if the crew followed the procedure they wouldn't have crashed.


I disagree with most of what you said but I want to specifically call out:

> 3. Go on with a warning is a sane strategy in some situations.

No, if it's sometimes OK to continue, then you should not assert it.

Assert means "I assert this will always be true, and if it's not our runtime is in unknown/bad state."

If you think you can recover, or partially recover, throw/return appropriate error, and go into emergency/recovery mode.


Your reactor is boiling. Your control software shut down with assertion failed: temperature too high, cannot display more than 3 digits.

Downvote me if you want to open a bug ticket with the vendor and wait a week for the fix.

Upvote me if you’d give it a try to restart with a switch to ignore assertions.

You may abstain if you never shipped a bug.

Edit: not to forget that this website runs on lisp which violates all three. Was it really a bad choice for the website?


> Your reactor is boiling. Your control software shut down with assertion failed: temperature too high, cannot display more than 3 digits.

Several points:

1. Most such critical components have several different and independent implementations, with an analog backup where possible.

2. You are arguing that one specific safety-critical case, which 99.999% of programmers will never face, should somehow inform the design of a general-purpose programming language.

3. Even if you are working in such a safety-critical situation, you should not rely on an assertion bypass, but have a separate emergency procedure which bypasses all the checks and tries to force the issue. (Ever seen a --force flag?)

Because what happens in reality is that a developer encounters a bug (maybe while it's still in development), notices it can be bypassed by disabling assertions (or they are disabled by default), and logs it as a low-priority bug that never gets fixed.

Then a decade later, me or someone like me is cursing you because your enterprise app just shit the bed and is generating tons of assertion warnings even when it's running normally, so I have to figure out which of them are "just normal" program flow and which one just caused an outage.

I have never experienced a situation like you described, but I have experienced the behavior I wrote above too many times.

Bottom line is:

- don't assert if you don't mean it

- if you need bypass for various runtime checks, code one in explicitly.

Edit: Hacker News is written in Arc, which is a Scheme dialect. Arc doesn't have assertions as far as I can tell.

Arc doesn't have its own runtime and runs on Racket, which has optional assertions that exit the runtime if they fail: https://docs.racket-lang.org/ts-reference/Utilities.html


I agree with this. Nuclear reactors are a special case of systems where removing energy from the system makes it more unsafe, because it generates its own energy and without a control system it will generate so much energy that it destroys itself (and due to the nature of radiation, destroys the surrounding suburbs too).

With most systems, the safest state is off. CNC machine making a weird noise? Smash that e-stop. Computer overheating? Unplug it. With this in mind, "assert" transitions the system from an undefined state to an inoperative state, which is safer.

That isn't to say that that you want bugs in your code, and that energizing some system is free of consequences. Your emergency stop of your mill just scrapped a $10,000 part. Unplugging your server made your website go down and you lost a million dollars in revenue. But, it didn't kill someone or burn the building down, so that's nice.


Modern nuclear reactors are designed and built with the expectation that when they melt down, the results aren't catastrophic (at least for the outside world).

See my previous reply. Your reactor design is susceptible to a single point of failure, and, how do I say it strongly enough, is an utterly incompetent design. Bypassing assertions is not the answer.

If it ignores part of the spec, I don't think "strictyaml" is the correct name here. Instead, if it interprets everything as string, perhaps "stringyaml" would have been more accurate, though I'm sure that's not as good PR.

I'm reminded of the discussion we had a few days ago about environment variables; one problem there is that env variables are always strings, and sometimes you do want different types in your config. But clearly having the system automatically interpret whether it's a string or something else is a major source of bugs. Maybe having an explicit definition of which field should be which type would help, but then you end up with the heavy-handed XML with its XSD schema.

Or you just use JSON, which is light-weight, easy to read, but unambiguous about its types. I guess there's a good reason it's so popular.

Maybe other systems like yaml and environment variables should only ever be used for strings, and not for anything else, and I suppose replacing regular yaml with 'strictyaml' could play a role there. Or cause unending confusion, because it does violate the spec.


> JSON, which is [...] unambiguous about its types

With the one exception that for floating-point values the precision is not specified in the JSON spec and is thus implementation-defined[1], which may lead to its own issues and corner cases. It is for sure better than YAML's 'NO' problem, but depending on your needs JSON may have issues as well.

[1]: https://stackoverflow.com/questions/35709595/why-would-you-u...


Also JSON's complete lack of many commonly used types, and no way to define any new ones.

Isn't that a problem with most of these config languages, though? XML is the only one where I think this might be possible.

Allowing you to define types is quite uncommon, but many config languages allow more types than JSON (so more than boolean, number, string, list, dict). Date datatypes are a big one and are provided by about every second JSON variant, in addition to TOML, ION and others.

>If it ignores part of the spec, I don't think "strictyaml" is the correct name here.

The article didn't fully explain it but strictyaml requires a typed schema or defaults to string (or list or dict) if one is not provided. So it strictly follows the provided schema.


That makes a big difference indeed. It wasn't clear to me from the article, but string yaml + optional schema sounds like a useful combination.

“saneyaml” would not make for bad PR

I was helping out a friend of mine in the risk department of a Big 4; he was parsing CSV data from a client's portfolio. Once he started parsing it, he was getting random NaNs (pandas' nan type, to be more accurate).

I couldn't get access to the original dataset but the column gave it away. Namibia's 2-letter ISO country code is NA—which happens to be in pandas' default list of NaN equivalent strings.

It was a headache and a half...


Verbatim from the docs, on read-csv:

    na_values : scalar, str, list-like, or dict, default None

    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
You fix it by using `keep_default_na=False`, by the way.
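A minimal reproduction of the Namibia bug and the fix (the column names are made up):

```python
import io
import pandas as pd

csv = "country,exposure\nNO,4.0\nNA,1.2\nGB,3.1\n"

# Default behaviour: the string "NA" (Namibia) is swallowed as missing data.
naive = pd.read_csv(io.StringIO(csv))
assert naive["country"].isna().any()

# The fix: disable pandas' built-in list of NA-equivalent strings.
fixed = pd.read_csv(io.StringIO(csv), keep_default_na=False)
assert list(fixed["country"]) == ["NO", "NA", "GB"]
```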


That looks like an interesting hard-coded check, I wonder what it intended to fix.

There’s some analysis in this twitter thread: https://twitter.com/badedgecases/status/1368362392573317120

tl;dr: there are a bunch of fields of various types that arrive as strings, and they get coerced but without paying attention to which field should have which type


What I am most baffled by with Yaml is the fact that it’s a superset of JSON.

Whenever an input accepts YAML you can actually pass in JSON there and it’ll be valid

It really surprised me when I found out, and ever since then I use JSON whenever possible, since it's much stricter

https://en.m.wikipedia.org/wiki/JSON#YAML


> Whenever an input accepts YAML you can actually pass in JSON there and it’ll be valid

...unless your parser strictly implements YAML 1.1, in which case you should be careful to add whitespace around commas (and a few other minor things). This is a valid JSON that some YAML parsers will have problems with:

    {"foo":"bar","\/":10e1}
The very first result Google gives me for "yaml parser" is https://yaml-online-parser.appspot.com, which breaks on the backslash-forward slash sequence.
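The `\/` escape is indeed valid JSON; Python's stdlib parser accepts it without complaint (a quick check):

```python
import json

# "\/" is a legal JSON escape for "/", and 10e1 is a legal number literal.
doc = json.loads('{"foo":"bar","\\/":10e1}')
assert doc == {"foo": "bar", "/": 100.0}
```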

> Whenever an input accepts YAML you can actually pass in JSON there and it’ll be valid

Strictly speaking, this is only true of YAML 1.2, not YAML 1.0-1.1 (the article here addresses YAML 1.1 behavior, the headline example of which was removed in YAML 1.2 twelve years ago), though it calls YAML 1.1 “YAML 2.0”, which doesn’t actually exist.

Of course, there are lots of features, like custom types, that JSON doesn’t support, but you can still use YAML’s JSON-style syntax instead of actual JSON, for them.


Yes this is usually the best way. If you need some features for code reuse there are several preprocessors. I personally use Dhall to configure everything and then convert it to JSON for my application to consume. It is a lot more powerful than YAML and has a very safety-oriented type system.

> it’s equally true that extremely strict type systems require a lot more upfront and the law of diminishing returns applies to type strictness - a cogent answer to the question “why is so little software written in haskell?“

I was with the article up until that point. I don't agree that diminishing returns with regards to type strictness applies linearly. Term-level Haskell is not massively harder than writing most equivalent code in JavaScript — in fact I'd say it's easier and you reap greater benefit. Perhaps it's a different story when you go all-in on type-level programming, but I'm not sure that's what the author was getting at. This smells of the Middle Ground logical fallacy to me. Or of course the comment was tongue-in-cheek and I'm overreacting.


I had to rewrite some JavaScript code in Postgres recently that measured the overlap between different elevation ranges. In JS I had to write it myself and deal with the edge cases and bugs. In Postgres I just used the range type and some operators. It was brilliant in comparison. The tiny effort of learning it was worth it. The list of data types I use all the time is bigger than just strings, numbers and booleans. Serialisation formats should support them, particularly as text format standards often already exist for a lot of them. Give me WKT geometry and ISO-formatted dates. It's not that difficult and totally worth it.
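For comparison, here's a sketch of the hand-rolled overlap logic the range type replaces, with the touching-ranges edge case that's easy to get wrong:

```python
def overlap(a, b):
    """Length of the intersection of two (lo, hi) elevation ranges."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return max(0, hi - lo)  # clamp: disjoint ranges overlap by zero

assert overlap((100, 500), (300, 800)) == 200
assert overlap((0, 100), (100, 200)) == 0   # touching ranges: edge case
```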

That law of diminishing returns might actually apply, I am not 100% sure. But more powerful type systems allow for the more complex composition of more complex interfaces in a safe manner. Think of higher-level modules and data structures. Or dependent types and input handling. Or linear types and resource handling.

I agree. I would say that Erlang goes ~80% of the way compared to Haskell's type system and the last 20% really matter, to the point that in many cases I find myself not really using Erlang's (optional) type system at all. Better type coverage and more descriptive types allow the compiler to infer more and I'd say this is the opposite of diminishing returns.

Norwegian here. I’d say the problem is YAML, not Norway :D

That author's blog post sent me down a rabbit hole of insanity with YAML and the PyYAML parser idiosyncrasies.

First, he mentions "YAML 2.0" but there's no such reference to "2.0" on yaml.org or in Google/Bing searches. Yaml.org and Wikipedia say YAML is at 1.2. The other commenters in this thread clarified that the older "YAML 1.1" is what the author is referring to.

Ok, if we look at the official YAML 1.1 spec[1], it has this excerpt for implicit bool conversions:

   y|Y|yes|Yes|YES|n|N|no|No|NO
  |true|True|TRUE|false|False|FALSE
  |on|On|ON|off|Off|OFF

But the pyyaml code excerpts[2][3] from resolver.py has this:

  u'tag:yaml.org,2002:bool',
  re.compile(ur'''^(?:yes|Yes|YES|n|N|no|No|NO
              |true|True|TRUE|false|False|FALSE
              |on|On|ON|off|Off|OFF)$''', re.X),
The programmer omitted the single character options of 'y' and 'Y' but it still has 'n' and 'N' ?!? The lack of symmetry makes the parser inconsistent.
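The asymmetry is easy to confirm with nothing but the quoted pattern and the stdlib `re` module:

```python
import re

# The resolver pattern as quoted above: 'n'/'N' present, 'y'/'Y' missing.
BOOL_RE = re.compile(r'''^(?:yes|Yes|YES|n|N|no|No|NO
            |true|True|TRUE|false|False|FALSE
            |on|On|ON|off|Off|OFF)$''', re.X)

assert BOOL_RE.match("n") is not None   # resolved as a boolean
assert BOOL_RE.match("y") is None       # falls through to a plain string
assert BOOL_RE.match("NO") is not None  # the Norway problem
```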

And btw for trivia... PyYAML also converts strings with leading zeros to numbers like MS Excel: https://stackoverflow.com/questions/54820256/how-to-read-loa...

[1] https://yaml.org/type/bool.html

[2] 2020 latest: https://github.com/yaml/pyyaml/blob/ee37f4653c08fc07aecff69c...

[3] 2006 original : https://github.com/yaml/pyyaml/blob/4c570faa8bc4608609f0e531...


You can catch this with yamllint (https://github.com/adrienverge/yamllint):

    % cat countries.yml 
    ---
    countries:
      - US
      - GB
      - NO
      - FR

    % yamllint countries.yml 
    countries.yml
      5:4       warning  truthy value should be one of [false, true]  (truthy)

YAML seems like a really neat idea, but over time I have come to regard it as too complicated for me to use for configuration.

My personal favorite is TOML, but I would even prefer plain JSON over YAML

The last thing I want at 2 AM, when trying to figure out if an outage is due to a configuration change, is having to think about whether each line of my configuration is doing the thing I want.

YAML prizes making data look nicely formatted over simplicity and precision. That, for me, is not a tradeoff I am willing to make.


They all have their downsides.

JSON:

- no comments, unless you fake them with dummy properties, which doesn't work if your configuration has a schema that disallows extra properties

- no trailing commas; makes editing more annoying

- no raw strings

YAML:

- the automatic type coercion

- the many ways to encode strings ( https://yaml-multiline.info/ )

- the roulette wheel of whether this particular parser is anal about two-space indentation or accepts anything as long as it's used consistently

- the roulette wheel of whether this particular parser supports uncommon features like anchors

TOML:

- runtime footguns in automated serialization ( https://news.ycombinator.com/item?id=24853386 )

- hard to represent deeply-nested structures, unless you switch to inline tables which are like JSON but just different enough to be annoying


For hand-writing I love jsonnet, which produces JSON, is much more convenient to write, and has some templating, functions etc. https://jsonnet.org/

You wouldn't serialize data structures to jsonnet though, you'd just generate JSON.


This makes me sad. It's 2021 and we still haven't figured out how to serialize configuration in a format that is easy to edit and predictable.

This is the problem space I'm targeting with https://concise-encoding.org/

* Text AND binary, so that humans can edit easily and machines can transmit it energy- and bandwidth-efficiently.

* Carefully designed spec to avoid ambiguities (and their security implications).

* Strong type support so you're not using all kinds of incompatible hacks to serialize your data.

* Versioned, because there's no such thing as the perfect format.

* Also, the website is 32k bytes ;-)


+ Has binary format.

+ Avoids ambiguities.

- The format seems to feel the need to support everything, including things I am not sure are actual use cases (what's the point of the Markup element, for example? What does Metadata save us compared to just including it in the document, given that parsers must parse it anyway?). This must make implementations more complex and costly, and makes reading the text format more difficult.

- Not a fan of octal notation. At 3 AM I'm not sure I wouldn't confuse 0 and o in certain fonts. Does anyone even use it these days?

- Unquoted strings were discussed in the thread; I'd like to point out that it's very easy to make an unquoted string not "text-safe" (according to the spec) without noticing it, at which point the document is invalid.

Just add white-space (maybe a user pasted a string from somewhere without noticing whitespace at the end or forgot the rules), a dot, an exclamation or a question mark. Having surprises like that is IMHO worse than a consistent quoting method.

Basically all the things I don't like are about the format supporting a bit too much. YAML 1.1 should teach us more is sometimes less.


Alright that's two votes against unquoted strings so far (plus my wife agrees so that's three against!)

I put in octal because it was trivial to implement after the others. The canonical format when it's stored or being sent is binary, and a decoder shouldn't be presenting integers in octal (that would just be weird). But a human might want octal when inputting data that will be converted to the binary format.

Markup is for presentation data, UI layouts, etc, but with full type support rather than all the hacky XML+whatever solutions that many UI toolkits are adopting. Also, having presentation data in binary form is nice to have.


Well, unquoted strings work when a format is built for that. If the default were "it's text unless we see the special sequences", it would be better for unquoted strings. But even then, there are too many special characters in this format IMHO.

I saw there's a 'Media' type in the spec. It seems the type is actually for serializing files, but there's no "name" (or call it "description") field. Of course we could accomplish this with a separate field, but then again the entire type's functionality could be accomplished with a u8x array and a string field. So if you're specifying this type at all, you might as well add a name field to make it useful.


The media object is for embedding media within a document (an image, a sound, an animation, some bytecode to execute in a sandbox, or whatever). It's not intended to be used as an archive format for storing files (which, as you said, could be trivially accomplished with a byte array for the data, a string for the file name, and some metadata like permissions etc). A file is just one way among many to store media (in this case as an entry in a hierarchical database - the filesystem - keyed by filename). CE is only interested in the media itself, not the database technology.

The media object is a way to embed media data directly into a document such that the receiving end will have some idea of how to deal with it (from its media type). It won't have or need a "file name" because it's not intended to be stored in a filesystem, but rather to be used directly by an application. Yes, it could be built up from the primitives, but then you lose the canonical "media" type, and everyone invents their own incompatible compound types (much like what happened with dates in JSON and XML).


OK, after more discussion and thought:

- I'm removing the metadata type. You're right that it's not really gaining us anything.

- I'm changing strings so they always must be quoted. This actually simplifies a lot of things.

Thanks for the critique!


I'm skimming through the human readable spec, and it seems decent, but I noticed the spec allows unquoted strings. What's the reasoning for this? In my experience unquoted strings cause nothing but trouble, and are confusing to humans who may interpret them as keywords.

Any reason for not using RFC2119 keywords in the spec? Using them should make the spec easier to read.


> I noticed the spec allows unquoted strings. What's the reasoning for this? In my experience unquoted strings cause nothing but trouble, and are confusing to humans who may interpret them as keywords.

Unquoted strings are much nicer for humans to work with. All special keywords and object encodings are prefixed with sigils (@, &, $, #, etc), so any bare text starting with a letter is either a string or an invalid document, and any bare text starting with a numeral is either a number or an invalid document.

> Any reason for not using RFC2119 keywords in the spec? Using them should make the spec easier to read.

I use a superset of those keywords to give more precision in meaning: https://github.com/kstenerud/concise-encoding/blob/master/ce...


If strings are always unambiguously detectable, why allow quoting them at all? Having two representations for the same data means you can't normalize a document unambiguously. I can understand that having barewords seems cleaner for things like map keys, but I am not convinced that it's a worthwhile tradeoff.

An important feature of RFC2119 keywords is that they're always capitalized (ie. the keyword is "MUST", not "Must", or "must"). This makes requirements and recommendations stand out amid explanatory text, improving legibility. For example, RFC2119 itself uses MUST and must with different meanings.


> If strings are always unambiquously detectable, why allow quoting them at all?

Because strings can contain whitespace and other structural characters that would confuse a parser.

> Having two representations for the same data means you can't normalize a document unambiguously.

The document will always be normalized unambiguously in binary format. The text format is a bit more lenient because humans are involved.

The idea is that the binary format is the source of truth, and is what is used in 90% of situations. The text format is only needed as a conduit for human input, or as a human readable representation of the binary data when you need to see what's going on.

> An important feature of RFC2119 keywords is that they're always capitalized (ie. the keyword is "MUST", not "Must", or "must").

Hmm good point. I'll add that.


Update: I'm removing unquoted strings. Thanks for the critique!

Nice! I like some concepts that this format proposes, but the `@` and `|` modifiers feel a bit too "loaded".

It's a compromise; there are only so many letters, numbers, and symbols available in a single keystroke on all keyboards, and I don't want there to be any ambiguity with numbers and unquoted strings (e.g. interpreting the unquoted string value true as the boolean value true).

So everything else needs some kind of initiator and/or container syntax to logically separate it from the other objects when interpreted by a human or machine.


We had such: XML. With proper editor support it is easy. I guess it needs rediscovery /s ;)

I used XML and didn't like it:

- A proper editor was never around.

- Closing tags were verbose.

- Attributes vs tags was confusing.

- It didn't map "naturally" to common data types, like lists, maps, integers, floats, etc.


Don't forget about namespaces, another fiddly bit of XML that caused all sorts of problems and headaches.

You've just used XML tech as it was designed to post this comment.

XML is serialization. I hardly believe you were concerned about serialization while posting your comment, or thought about the attributes-vs-tags distinction.

This page uses requests to a server for multi-user editing. But it is easy to build a truly serverless (like a file) document with the same interface:

    data:text/html,<html><ul>Host: <span class=host contenteditable>example.com
Change it, save it, done. Web handles input of lists, maps, integers, float and much more.

You are right. XML is great for encoding the DOM. However, I didn't find it practical for interfacing with humans, due to the concerns I raised.

It is not practical to edit plain text in binary:

    636f 756e 7472 6965 733a 0a2d 2047 420a
    2d20 4945 0a2d 2046 520a 2d20 4445 0a2d
It is not practical to edit Excel documents in plain text:

    <?xml version="1.0"?>
    <Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:o="urn:schemas-microsoft-com:office:office"
      xmlns:x="urn:schemas-microsoft-com:office:excel"
      xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
      xmlns:html="http://www.w3.org/TR/REC-html40">
      <Worksheet ss:Name="Sheet1">
        <Table>
          <Row>
            <Cell><Data ss:Type="String">ID</Data></Cell>
Tim Berners-Lee's browser was a browser-editor. Can't you see the parallels?

XML with convenient UI tools to edit it should have fit the bill. Yet, for whatever reason, a convenient UI tool would never happen to be there when needed, and so, scarred and tired of manually editing XML, the world embraced YAML.

> XML with a convenient UI tools to edit should have fit the bill.

"You need this special tool to work" immediately and instantly rules out "easy to edit". Or makes the debate irrelevant: every format is easy to edit if you have "a convenient UI" to do it for you.


The fault was in XML editing; pure data authoring is hard. We have a convenient UI, the web browser. Think of it as literate programming, a way to merge the man page and the configuration file.

And a plain text editor is a "widely deployed special tool to work". The actual data is

    countries:\n- GB\n- IE\n- FR\n- DE\n- NO
Or

    636f 756e 7472 6965 733a 0a2d 2047 420a
    2d20 4945 0a2d 2046 520a 2d20 4445 0a2d

Opening XMLs in ZIP containers is easy! Just spin up Word. :)


> - the automatic type coercion

Only when you "unmarshal" to an untyped data structure and then make assumptions about the type. I've used yaml with a go application, and it can't interpret NO as a bool when the field is a string.


Correct, like TFA.

Btw, the reason Haskell isn’t used more isn’t the type system per se, as all types can be inferred at compilation time. People will sometimes even use this feature to see if GHCi guesses the type correctly first time (by correctly I mean exactly how the user wants; technically it’s always correct) and save themselves some time writing it, either with an extension or just copy & paste from the interpreter window.

Where it gets hairy is that most programming languages have a low entrance barrier. To write Haskell effectively you’ve got to unlearn a lot of ingrained bad habits and dive into the “mathematical” aspect of the language. Not only have you got monads, there’s a plethora of other types you need to get comfortable with, and a whole branch of mathematics talking about types (though you don’t need to know that a field called category theory exists to use it).

However, since most people just want to write X, or just want to hire a dev team at a price they can afford, Haskell is rarely the first-choice language.


In my opinion the mathematical concepts and abstractions are not the issue with Haskell. The issue is that it's a pain to use in practice because of:

1. Really annoying to do any kind of i/o

2. Extremely poor interoperability with non-Haskell code

3. (opinion) Unpleasant, inconsistent, hairy syntax


This comment was buried in a thread, but I'm bringing it out because it's very relevant to the conversation:

https://news.ycombinator.com/item?id=26679728

> the article refers to YAML 2.O, a nonexistent spec, and to PyYAML, a real parser which supports only YAML 1.1.

> Both the unquoted-YES/NO-as-boolean and sexagesimal literals were removed in YAML 1.2.


Yeah, I'd bet that YAML "two point oh" (rather than "two point zero") doesn't exist ! :p

I will never understand why YAML didn't just require quoted strings. Did the creator not anticipate how many problems the ambiguity would cause?

Never's a strong word; it seems quite easy to understand why, to me. You've got ease-of-use reasons, historical reasons like the misguided Robustness Principle, etc.

And these sort of things happen time and time again.

And although officially JSON requires quoted strings, almost none of the parsers actually enforce that, and so you will find a huge amount of JSON out there that is not actually compliant with the official spec.

Just like browsers have huge hacks in them to handle misformed HTML.


> And although officially JSON requires quoted strings, almost none of the parsers actually enforce that

What programming language? I'm not familiar with those parsers, the ones I know of very much do enforce quoted strings.

> you will find a huge amount of JSON out there that is not actually compliant with the official spec

The parsers I use all follow the current JSON RFC specification, and I've never encountered any JSON from APIs which they reject.

> Just like browsers have huge hacks in them to handle misformed HTML.

Web browsers do deal with a variety of things, not so much JSON parsers in my experience.


I think the point is that they accept more than the spec dictates - do your JSON parsers accept e.g. the vs code config file (JSON with comments) or JSON with unquoted keys?

The most commonly used parsers only accept valid JSON - including the one included within most JS runtimes (JSON.stringify/parse). VSCode explicitly uses a `jsonc` parser, the only difference being that it strips comments before it parses the JSON. There's also such thing as `json5`, which has a few extra features inspired by ES5. None of them are unquoted strings. I've never come across anything JSON-like with unquoted strings other than YAML, and everything not entirely compliant with the spec has a different name.

Can you name a JSON parser which accept comments or unquoted keys?

I've never seen one


IIRC, Gson accepts unquoted keys.

If you want no misunderstandings, be explicit. This applies to YAML and life in general. There's an annoying but fairly accurate saying about assumptions that applies.

If you want something to be a specific type, you better have an explicit way of indicating that. If you say quotes will always indicate a string, great. Of course we know it's not that simple, since there are character sets to consider.

The safest answer is to do something like XML with DTDs. But that imposes a LOT of overhead. Naturally we hate that, so we make some "convention over configuration" choices. But eventually, we hit a point where the invisible magic bites us.

This is one case where tests would catch the problem, if those tests are thorough enough - explicitly testing every possibility or better yet, generative testing.
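A sketch of the generative idea, using JSON as the serializer under test: round-trip every two-letter code and demand it comes back as the same string. JSON passes this property; a YAML 1.1 loader would presumably fail it at 'NO'.

```python
import itertools
import json
import string

# Property: any ISO-style two-letter code must survive a round trip unchanged.
for a, b in itertools.product(string.ascii_uppercase, repeat=2):
    code = a + b
    parsed = json.loads(json.dumps({"country": code}))
    assert parsed["country"] == code, code
```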


Or just opening your browser and trying out norwegian on a QA environment.

I don't understand why Haskell gets brought up in the middle of an otherwise interesting and useful article. This sort of thing cannot happen in Haskell. And while Haskell is not universally admired, I can't recall seeing Haskell's flavor of type inference being a reason why someone claimed to dislike Haskell.

I have never gotten far into a project and thought, "my config files are too verbose. I wish there were clever shorthands."

Does Yaml have any sort of strict mode?

I imagine I could find a linter that disallows implicit strings.


Not YAML by itself, but there are libraries that parse a YAML-like format that is typed. For example this one: https://hitchdev.com/strictyaml/. Technically, it is not compatible with the YAML spec.

There exist a couple of mainstream languages that are full of these sorts of interesting behaviors; one of them is supposedly cool and productive and the other supposedly ugly and evil.

The "Wat" talk has quite a few examples and is hilarious.

https://www.destroyallsoftware.com/talks/wat


And yet I don't see anyone complain about bash, which is arguably far worse than those two. When things get hard in bash, you start to see Python scripts in CI and the whole thing becomes a complete unreadable mess.

> I don't see anyone complain about bash

You're not looking very hard then. But really:

> When things get hard on bash, you will start to see python scripts

That's kinda the thing innit? Unless the system specifically only allows shell scripts (something I don't think I've ever encountered though I'm sure it exists) it's quite easy to just use something else when bash sucks, so while people will absolutely complain about it they also have an escape: don't use bash.

When a piece of software uses YAML for its configuration though, you don't really have such an option.

Furthermore, bash being a relatively old technology people know to avoid it, or what the most common pitfalls are. Though they'll still fall into these pitfalls regularly.


There is a lot of elitism around bash, like the "Arch btw" thing but far worse, because a lot of important things depend on it.

PowerShell has been working on Linux for quite a while now and doesn't seem to get any attention, even though it has nice IDE support and copies the good things about bash.


It doesn't copy all the good things about the Unix shell though.

The reason people are comfortable with the POSIX shell is that you use the same syntax for typing commands manually as you do in scripts. But you're going to have a hard time finding people who prefer writing:

    Remove-Item some/directory -recursive
Rather than

    rm -fr some/directory
People who write shell scripts often don't see themselves as writing a "program". They are just automating things they would do manually. Going to an IDE in this case is not something you'd consider.

I happen to be very aware of all the pitfalls in POSIX shell, and it's rare that I see a shellscript where I cannot immediately point out multiple potential problems, and I definitely agree that most scripts should probably be written in a language that doesn't contain so many guns aimed at the user's feet. I'm just pointing out a likely reason why people are not adopting powershell in the huge numbers that Microsoft may have hoped for.


Nonsense. This is the same in PowerShell:

    rm -r -f some/directory

Bash is a total disaster, I complain about it all the time. Unfortunately, rather like JS, it's unavoidable.

I'd not consider bash a

1. mainstream

2. programming language

(of course technically it is a programming language, but it is also more precisely a scripting language)


Python vs JavaScript?

Python vs PHP also.

> full of these sorts of interesting behavior

I don’t think that applies to Python - it’s quite strongly (although not statically) typed. I agree that it does apply to JavaScript and PHP.


I think this applies to Python pretty well. Although it's certainly not as bad as PHP, most JS traps also exist in Python (falsy values, optional glitchy semicolons, function-scoped variables, mutable closures). There are many JS-specific traps like this, and also Python-specific ones (like static fields also being instance fields, and Python version and library dependency hell). However, I find it easier to avoid them in JS than in Python, with TypeScript, avoiding classes, ...
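Two of those traps are easy to demonstrate in a few lines (falsy values and late-binding closures):

```python
# Falsy values: empty containers and zero are all "false" in conditions.
assert not "" and not 0 and not [] and not {}

# Late-binding closures: each lambda captures the variable i itself, not
# its value at creation time, so all three return the final value of i.
funcs = [lambda: i for i in range(3)]
assert [f() for f in funcs] == [2, 2, 2]
```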

JavaScript and PHP is the correct pairing.
