Son – A minimal subset of JSON for machine-to-machine communication (github.com)
47 points by seagreen on Mar 14, 2017 | 70 comments

Author here. I originally started this project because I wanted a consistent way to serialize JSON so that the serialized bytes would hash the same way every time.

As I worked on it though I realized it might be of general interest to people. Thus the example in the README of piping JSON through multiple tools without generating trivial changes that mess up diffs.

Most of the decisions I made were clear: no insignificant whitespace, object keys must be ordered, etc.
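
A rough sketch of the hashing use case in Python, using only the standard library. `json.dumps` with `sort_keys` and compact separators approximates two of the rules above (ordered keys, no insignificant whitespace). It is not a conforming Son serializer, since it can still emit exponent notation and `1.0` for floats. The name `stable_hash` is made up for illustration.

```python
import hashlib
import json

def stable_hash(value) -> str:
    """Hash a JSON-compatible value via a compact, key-sorted serialization."""
    text = json.dumps(
        value,
        sort_keys=True,          # object keys in a fixed order
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # write non-ASCII characters unescaped
    )
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two spellings of the same document hash identically.
a = json.loads('{"b": 1, "a": [1, 2]}')
b = json.loads('{"a":[1,2],"b":1}')
print(stable_hash(a) == stable_hash(b))
```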

There are two things I'm still not sure about:

+ Son doesn't provide escape sequences for any Unicode character that JSON allows to be written unescaped. This includes U+007f (ASCII "delete"). Will that cause a problem for many programs? All the other ASCII control characters are required to be escaped by JSON, U+007f is the only one left out.

+ Son doesn't allow trailing zeros in fractions. This means you can't serialize `1.0`, you have to serialize it as `1`.

I was confident in the decision to take out scientific notation (it would be cool if JSON parsers actually treated numbers as being in scientific notation and tracked significant digits, but they don't so I feel like that ship has sailed). Trailing zeros are different though because some JSON generators do use them to distinguish integers from fractions. The problem is that many parsers don't care about them, so you end up in a situation where parsers are tossing out information about documents, meaning they can't serialize them faithfully again which is the whole point of Son.
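
To make the two number rules concrete, here is a minimal Python sketch that rewrites a JSON number token without exponent notation and without trailing zeros in the fraction. The helper name `canonical_number` is hypothetical, and collapsing every spelling of zero (including `-0.0`) to `0` is an assumption of this sketch, not something stated above.

```python
from decimal import Decimal

def canonical_number(token: str) -> str:
    """Rewrite a JSON number token with no exponent and no trailing zeros."""
    d = Decimal(token)            # exact decimal value, no float rounding
    if d == 0:
        return "0"                # assumption: all zeros collapse to "0"
    text = format(d, "f")         # fixed-point notation, never scientific
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text

for token in ["1.0", "1e3", "1.50", "7e0", "123.456"]:
    print(token, "->", canonical_number(token))
```

Going through `Decimal` rather than `float` keeps the digits exactly as written, which matters if the goal is a faithful round trip.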

instead of messing with JSON and making it less human friendly (arguably the thing that made JSON so popular in the first place), why not just use something designed for efficient machine-to-machine transfer?

Basically making JSON more machine friendly and stricter to parse undermines the original reason to use JSON. If efficiency and absolute correctness are important use something not designed to be forgiving to humans :)

Postel's law: things should parse JSON, but probably only emit SON. That gets you the best of both worlds: you can assume all messages internal to your system are in a canonicalized form (and thus do things like hashing them with dumb tools), but your system still interoperates with other systems that don't bother to canonicalize, parsing their inputs and sending responses they understand.

>making it less human friendly (arguably the thing that made JSON so popular in the first place)

I would not call JSON human friendly, given its lack of comments.

You might be interested in JSON5 (http://json5.org/) which goes the opposite direction of Son and makes JSON more human-writeable. It's the best, cleanest "JSON+" I've found so far.

How about Bencode[1], the data format used in Bittorrent? Values and encodings are already bijective, the format is extremely simple and it's already standardized and described. The only feature "SON" has which Bencode doesn't support is floating-point values, and as you imply above they are very likely to be mutilated in JSON or a subset thereof.

[1] https://en.wikipedia.org/wiki/Bencode
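
For readers who haven't seen Bencode, a small Python encoder sketch shows why each value has exactly one encoding: integers are `i<digits>e`, strings are length-prefixed, and dictionary keys are emitted in sorted order. The function name `bencode` is just for illustration, and input validation is minimal.

```python
def bencode(value) -> bytes:
    """Encode ints, byte/text strings, lists and dicts as Bencode."""
    if isinstance(value, bool):
        raise TypeError("Bencode has no boolean type")
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Keys are strings and are written sorted by their raw bytes.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")

print(bencode({"b": 1, "a": ["x"]}))
```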

This is the second time Bencode's been mentioned in the context of Son. I didn't know about it before, it definitely looks interesting! However if you're sending data to a service that only accepts JSON then it's obviously not an option.

Fair enough. Backwards compatibility is useful in some contexts.

By the way, I love that Bencode is bijective. That's a really nice property for a serialization format, and even though I don't mention it on the README it's one of the goals of Son.

I am tempted to suggest either:

1. Keep 1.0 as a special case to maintain the int/float distinction (it's a float; calling it a fraction is kinda-of-a-lie).

2. Refuse to handle floats at all, at which point people can pass [ <mantissa>, <exponent> ] for reals or [ <numerator>, <denominator> ] for rationals.

There is of course (3), "build a compliance suite and claim the parsers that toss out information are Incorrect", but that doesn't seem compatible with your postel-ish goals.
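
Option 2 can be sketched in Python with the standard `decimal` and `fractions` modules. The helper names are hypothetical, and stripping trailing zeros from the mantissa (so each value has a single pair) is an assumption of this sketch.

```python
from decimal import Decimal
from fractions import Fraction

def as_mantissa_exponent(token: str) -> list:
    """Return [mantissa, exponent] with value == mantissa * 10**exponent."""
    sign, digits, exponent = Decimal(token).normalize().as_tuple()
    mantissa = int("".join(map(str, digits)))
    return [-mantissa if sign else mantissa, exponent]

def as_ratio(token: str) -> list:
    """Return [numerator, denominator] in lowest terms."""
    f = Fraction(token)
    return [f.numerator, f.denominator]

print(as_mantissa_exponent("123.456"))
print(as_ratio("123.456"))
```

Both results are plain integer arrays, so they survive any JSON parser that handles integers of that size exactly.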

> it's a float; calling it a fraction is kinda-of-a-lie

This happens not to be correct. By specification JSON numbers are just a series of characters, arranged in a certain way: https://tools.ietf.org/html/rfc7159#section-6

In practice though many JSON parsers will parse non-integer numbers to floats.

> 2. Refuse to handle floats at all, at which point people can pass [ <mantissa>, <exponent> ] for reals or [ <numerator>, <denominator> ] for rationals.

This is a really interesting idea. If you're writing something that's super important like medical software it would probably be worth considering. However, my goals are just to make minimal changes that improve JSON some while still keeping it fairly readable, so I think that means I should stick with allowing `123.456` or whatever. I'd like to try to keep an open mind on this though.

Ooh, my apologies wrt 'kind-of-a-lie'.

I think all the parsers I've used inflate 1 to an int and 1.0 to a float ... that or they don't, and I just thought they did.

Either way, many thanks for the correction.

You might find the discussions around XDR (eXternal Data Representation) helpful in this regard.

[1] https://tools.ietf.org/html/rfc4506.html

Is XDR in wide use? Don't think I've ever seen it in the wild.

Edit: Apparently used by NFS and ZFS.

heh, that's funny. Every Linux, FreeBSD, MacOS, and most of the Windows systems. :-) It is how you encode data for ONCRPC. It's fast, it can encode anything, and it's dense (no 'convert this to isolatin-8891 first' step). It's also from the 80's so not as visible :-)

But perhaps more interesting is the really great machine to machine communication of complex types research that went on in the 80's. Some was interesting (CORBA), some was really scary (ASN.1), and some very fast and minimalist (XDR).

Oh, it's what is also called Sun RPC? I didn't know they renamed it.

I guess it's one of the things that fly under one's radar if one isn't actively using a protocol.

On Windows a similar niche is (mostly "was") filled by Microsoft RPC, an implementation of DCE RPC, which forms the underlying protocol for DCOM.

From https://www.iana.org/assignments/service-names-port-numbers/... :-)

   sunrpc	111	tcp	SUN Remote Procedure Call	[Chuck_McManis]	[Chuck_McManis]						
   sunrpc	111	udp	SUN Remote Procedure Call	[Chuck_McManis]	[Chuck_McManis]

Ha! :-)

If you're looking to be able to consistently hash JSON objects you might want to look at Ben Laurie's objecthash: https://github.com/benlaurie/objecthash

It describes a consistent way to hash an object without defining a new format.

"Object members must be sorted by ascending lexicographic order of their keys."

How should the following be serialized?

{"öp":1, "op":0}

Couldn't you solve that by sorting by the underlying byte order? Since it is required to be UTF-8 that should be fast and unambiguous.

You could, and that's partially what prompted the question, but that's not quite what the specification says currently.


> and unambiguous.

var x = "\u0307\u0323\u0073"

var y = "\u0073\u0323\u0307"

These are arguably the same grapheme when displayed on a screen, but they would be considered different keys in SON, and sorted in different spots. I'm actually unsure if most diffing systems would catch that they are different or not, and if so, how they would represent that difference to the end user (since just showing the grapheme would be useless as they are the same).

The specification is silent on normalization, so a SON pipe could output an object with those as keys (and still meet spec), but not be accepted by a conforming SON parser (that did unicode normalization on incoming keys). So really, the specification should say which of the canonical normalized forms to use, or that keys specifically must not be normalized/modified in any way. This works fine if you're running something like JavaScript -> JavaScript -> JavaScript, but would fail if a (currently conforming) system in the middle defaults to normalized unicode representations. In other words, would a pipeline like Firefox -> Swift -> Python -> Swift -> Nodejs -> Swift -> Firefox round-trip correctly, even if they all had conforming SON implementations (of the current specification)?
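
The normalization concern can be reproduced with Python's standard `unicodedata` module. The strings below put the base letter `s` first and vary only the order of the two combining marks, which is an assumption about what the example intends; the two differ when compared code point by code point but become equal once a normalizing layer rewrites them.

```python
import unicodedata

# "s" + dot above + dot below, and the same marks in the other order.
x = "\u0073\u0307\u0323"
y = "\u0073\u0323\u0307"

# Compared code point by code point, the two keys are distinct.
print(x == y)

# After NFC normalization the two keys are the same string.
print(unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y))
```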

> not be accepted by a conforming SON parser (that did unicode normalization on incoming keys)

I don't think RFC 7159 actually mentions Unicode normalization at all. If it was in there I think it would be mentioned in the section on string comparisons, but that just talks about going unescaped codepoint by unescaped codepoint: https://tools.ietf.org/html/rfc7159#section-8.3

EDIT: Does Unicode normalization change over time? If so we definitely have to leave it out and just go codepoint by codepoint, because we don't want the validity of a Son document to change.

Good catch. I kind of slacked on this since I knew it would be a solvable problem once everything else got worked out.

Right now the reference implementation sorts by comparing codepoints one by one. When it reaches a codepoint that's unequal or nonexistent it orders the string with the lesser or nonexistent codepoint first.

So in your example, after serializing from JSON and deserializing to Son we get `{"op":0,"öp":1}`.
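
The same ordering can be sketched in Python, whose built-in string comparison already works code point by code point and puts a strict prefix first. This is not the reference implementation, only an illustration; `son_like_dumps` is a made-up name and it handles just the top-level object.

```python
import json

def son_like_dumps(obj: dict) -> str:
    """Serialize a flat object with keys in ascending code point order."""
    ordered = {key: obj[key] for key in sorted(obj)}
    return json.dumps(ordered, separators=(",", ":"), ensure_ascii=False)

print(son_like_dumps({"öp": 1, "op": 0}))

# UTF-8 preserves code point order, so sorting the encoded bytes
# gives the same result as sorting the decoded strings.
keys = ["öp", "op", "o", "z"]
print(sorted(keys) == sorted(keys, key=lambda k: k.encode("utf-8")))
```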

Two remaining questions:

1. What's the most unambiguous way to describe this process?

2. Right now the comparison is on unescaped strings. Should it be on escaped strings instead?

There's an issue open to discuss this here: https://github.com/seagreen/Son/issues/1

EDIT: I am curious what Unicode recommends for language-aware sorting, even though we're not going to use it. Is this the right place to look? http://unicode.org/reports/tr10/

EDIT2: RFC 7159 has language about equality that's relevant to comparisons. I'm confident now we're on the right track: https://tools.ietf.org/html/rfc7159#section-8.3

Isn't that invalid? You can't have two object members with the same name, can you? I would assume that should be impossible to emit.

Um, the first character of each key is different -- they are not the same.

Ah, I see. That wasn't obvious to me when I first looked.

Isn't sorting of Unicode characters defined under Unicode? I think the correct answer for "how do I sort my unicode strings" is "defer to unicode".

That may be quite expensive in the end, depending on object size and content.

Of those two? That depends on whether your language is German or Swedish.

Alphabetical order changes from language to language. If you don't specify a collation, you can run into edge cases.

And you can't specify "a collation" in a way that will be guaranteed to mean exactly the same thing next year. I don't believe there's a way to do a permanently stable sort if you also want it to reflect the current state of Unicode. (An obvious example is that an entire language might have been added to Unicode since you serialized something with those code points in it.)

NTFS, which has to build a b-tree of filenames in a permanently stable sort order, solved this by writing the collation table to the disk when it's formatted, and never changing it again. Which means the Windows shell still has to re-sort the filenames because the currently "correct" collation may be different.

Hmm, then maybe the easiest solution is to sort numerically by unicode code point, and accept that combining characters will look weird.

Or are there problems with that as well that I'm not considering (beyond it being less human readable)?

I think the way to go is definitely by code point. We don't want to pull any more of the Unicode spec into Son than we have to.

Happily sorting is only used to order members within Son objects. Individual strings are never sorted, so there isn't a worry about sorted strings coming out messy.

Great point about new languages meaning language-aware sorting systems will definitely have breaking changes.

Cute name, but 1) it's totally un-Googleable, and 2) it doesn't mean anything.

I wanted to name it something like "mson" for "minimal-JSON", but that's already taken.

EDIT: Not meaning anything is a plus! But I see your point about being un-googleable.

ASN.1 uses the name DER Distinguished Encoding Rules for this purpose.

There are some other edge cases that you aren't considering (or that I missed)

Min and Max integer values - JavaScript has some pretty tight limits here

Keys should be in lexicographic order — you might want to be more specific. Is there a permitted subset of Unicode? Which normalization rule must be used?

There's an issue for min and max number values here: https://github.com/seagreen/Son/issues/3

Son seems like a natural place to put restrictions on them, but I'm not sure if there's a way to do it that (A) still makes for a clean spec and (B) still allows everything floats can encode to be used.

Yeah, as with the trailing zero, you're being burned by JavaScript's decision to just have a Number type. As far as js is concerned 1 and 1.0 are the same number.
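
Both effects are easy to see from Python, where `float` is an IEEE 754 double like JavaScript's Number. This is only an illustration of the limits being discussed, not of any Son rule.

```python
import json

# Python's json module keeps the int/float distinction when parsing...
print(type(json.loads("1")).__name__, type(json.loads("1.0")).__name__)

# ...but once both are stored as doubles, as in JavaScript, they are equal.
print(float(json.loads("1")) == json.loads("1.0"))

# Doubles represent every integer exactly only up to 2**53;
# beyond that, neighbouring integers collapse to the same value.
big = 2**53
print(float(big) == float(big + 1))
```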

On an unrelated note. If human readability isn't a primary concern, I'd recommend requiring that < > & all be escaped, this prevents some crappy attacks

Also, no top level lists!
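
The escaping suggestion could look like this in Python. Note that it runs against Son's rule of not escaping characters JSON allows unescaped, so it is shown only to illustrate the idea; `dumps_html_safe` is a hypothetical helper.

```python
import json

def dumps_html_safe(value) -> str:
    """Serialize to JSON with <, > and & written as \\uXXXX escapes."""
    text = json.dumps(value)
    # In JSON output these characters can only occur inside string
    # literals, so a textual replacement cannot change the structure.
    return (text.replace("<", "\\u003c")
                .replace(">", "\\u003e")
                .replace("&", "\\u0026"))

out = dumps_html_safe({"k": "</script>&"})
print(out)
print(json.loads(out) == {"k": "</script>&"})
```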

Why no top level lists?

Ha! I guess I'm showing my age. There used to be an associated vulnerability, which is why some frameworks won't let you do it. It appears that the vulnerability was fixed in what are now fairly old versions of browsers.

More info at http://stackoverflow.com/questions/16289894/is-json-hijackin...

JSCON: JavaScript Canonical Object Notation

This is a great name. I'll probably stick with Son because it's already released, but honestly this would have been better.

How about "System Object Notation"?

How about sson?

Umm... why? The only rationale for this project I can see is this:

> Piping JSON through multiple programs creates lots of trivial changes, which makes it hard to do things like take meaningful diffs.

But you can always put JSON through a pretty-printer which puts values into canonical form before diffing. I wouldn't bother turning it into a formal specification. And you mention 'No insignificant whitespace' in the README, so it's not like your format makes line-by-line diffs any clearer.

It's as if IT hasn't had enough solutions in search of a problem...

> I wouldn't bother turning it into a formal specification.

Too late!

> But you can always put JSON through a pretty-printer which puts values into canonical form before diffing.

Son is a starting point for building such a pretty printer. It takes care of messy details like eliminating the redundancies in string and number encoding, so all you have to do to specify the pretty-printer format is say where you want your newlines and how much to indent by.

> Son is a starting point for building such a pretty printer.

I don't understand: such pretty printers already exist (e.g. jq, which does a whole lot more [0]). If you're transmitting JSON and want to diff two documents, just pipe them into jq or another pretty-printer with key ordering, then use one of many existing line-by-line diff tools.

[0]: github.com/stedolan/jq
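
The workflow described here (pretty-print with sorted keys, then run a line-by-line diff) can be sketched with Python's standard library instead of jq. The function names are made up for illustration.

```python
import difflib
import json

def pretty(text: str) -> list:
    """Re-serialize JSON with sorted keys and one value per line."""
    return json.dumps(json.loads(text), indent=2, sort_keys=True).splitlines()

def json_diff(old: str, new: str) -> list:
    """Line diff of two JSON documents after pretty-printing both."""
    return list(difflib.unified_diff(pretty(old), pretty(new), lineterm=""))

# Key order and whitespace differences disappear; value changes remain.
print(json_diff('{"a":1,"b":2}', '{ "b": 2, "a": 1 }'))
print("\n".join(json_diff('{"a":1,"b":2}', '{"a":1,"b":3}')))
```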

  $ echo "1." | jq .
jq is great, but doesn't preserve float/int distinction. There may be other canonicalization details that pretty printers don't address.

It might be simpler to change jq and/or other pretty printers (though there might be implementation or design reasons not to).

But the other advantage of a new project like SON is positioning not engineering, in our heads, not in the code: tool for the job. If so, the page should spend more time on what that job is (for when people search for that job).

And, as the jq example shows, tools not designed for this job mightn't do it exactly right.

EDIT I was reading offline, so didn't see the sibling reply.

I love that we both started our comments with "jq is great". I hadn't seen yours either=)

jq is great. You can actually see how it's used to test the reference implementation of Son here (https://github.com/seagreen/Son/blob/master/implementation/t...) using `jq --compact-output --sort-keys .`

Unfortunately, jq doesn't provide flags to control scientific/non-scientific notation or which characters are escaped, meaning if you want very tight control over the JSON generated it's not a full option.

(Consider that the motivating example in the Son README isn't all you might want to use Son for. For instance some people need consistent hashing of serialized JSON documents).

Ah, understood - thanks for clarifying!

Side note: I respect your positivity and steadfastness in the face of a skeptical HN response.

Sure thing. And thanks, that made me smile.

I've implemented similarly constrained string representations for numerics for a config file format, but I ended up allowing both "1" and "1.0" so I could distinguish integer and floating point literals ("1.00" was still forbidden, as was "0.0"... zero was stored in a specially tagged way).

The exponent syntax in JSON is fun because it allows numerous representations of 0 and 1 using e.g. 0e1 and 7e0

> I ended up allowing both "1" and "1.0" so I could distinguish integer and floating point literals

Definitely the toughest call when making a JSON subset like this.

You could use JSON as a host: https://github.com/cognitect/transit-format

I do like the comment about having the S-expression format for data interchange defined since 1997.

Isn't this just a reinvention of Canonical JSON [1]? There were efforts [2] to standardize it, although it seems this fizzled out.

[1] http://wiki.laptop.org/go/Canonical_JSON

[2] https://datatracker.ietf.org/doc/draft-staykov-hu-json-canon...

Thanks for bringing Canonical JSON up. There's an issue to investigate it here: https://github.com/seagreen/Son/issues/7

It doesn't appear that Canonical JSON does anything about redundant escape sequences in JSON. I'm still looking into it to be sure. This is a big part of the motivation behind the Son spec, and represents about half the EBNF: https://github.com/seagreen/Son/blob/master/son.ebnf#L19

For anyone else who knows of more JSON subsets, if you report them here: https://housejeffries.com/page/7 it would be really appreciated. I definitely don't want to do duplicate work.

"Object keys must be unique."

How should the following be serialized (if at all)?


Note that I'm using the \u forms even though they aren't allowed by the specification, because HN sometimes doesn't show Unicode combining characters properly, and so that it's clear to see that it's a different byte pattern.

The plan is to go codepoint by codepoint in ascending order:


See this issue for the discussion: https://github.com/seagreen/Son/issues/1

My current understanding is that RFC 7159 doesn't require Unicode normalization to be performed. If it does, we're in trouble: https://tools.ietf.org/html/rfc7159

Why not use something like Protobufs and Cap'n'Proto instead of yet another serialization format?

They require a schema. A schemaless binary serialization like CBOR or BSON is more amenable to JSON interchange.

How about http://msgpack.org/?

Son's a niche tool to allow talking to JSON services in a more consistent way. It's not meant as a competitor to any other data format.

EDIT: s/alternative/competitor

I always find myself back with JSON. The advantage is that it can be parsed easily in any language -- everything has a JSON parser.

protobufs are fairly widespread... from action script to erlang to haskell to visual basic.

Last time I tried (6 months ago) I couldn't find a proto3 implementation for Haskell that was complete ( https://github.com/google/proto-lens doesn't do Any, https://github.com/alphaHeavy/protobuf doesn't do proto3). Maybe one exists, or I could use proto2 and rewrite the other system I was working with.

I decided it was easier in my case to just output JSON.

Dang, `protobuf` doesn't support proto3? I'm kind of a Haskell die-hard, this is a big hole in our ecosystem=(

You don't really need proto3, though. Proto2 is compatible and proto3 mostly only removes features.

> Piping JSON through multiple programs creates lots of trivial changes, which makes it hard to do things like take meaningful diffs.

That's the virtue of having a format which offers a canonical representation for data.

> No insignificant whitespace.

I.e., it's not human-readable. It also has no decent way to exchange binary data (there is no byte-sequence type: one must either use Base64 or an array of integers, neither of which is space-efficient).

It'd be nice to have a format which is both human-readable and has a canonical representation. Fortunately, such a thing already exists (and I've even linked to it once already today), and has since 1997: http://people.csail.mit.edu/rivest/Sexp.txt

Here's a JSON example (from http://json.org/example.html):

    {
        "glossary": {
            "title": "example glossary",
            "GlossDiv": {
                "title": "S",
                "GlossList": {
                    "GlossEntry": {
                        "ID": "SGML",
                        "SortAs": "SGML",
                        "GlossTerm": "Standard Generalized Markup Language",
                        "Acronym": "SGML",
                        "Abbrev": "ISO 8879:1986",
                        "GlossDef": {
                            "para": "A meta-markup language, used to create markup languages such as DocBook.",
                            "GlossSeeAlso": ["GML", "XML"]
                        },
                        "GlossSee": "markup"
                    }
                }
            }
        }
    }
Here it is in SON:

    {"glossary":{"GlossDiv":{"GlossList":{"GlossEntry":{"Abbrev":"ISO 8879:1986","Acronym":"SGML","GlossDef":{"GlossSeeAlso":["GML","XML"],"para":"A meta-markup language, used to create markup languages such as DocBook."},"GlossSee":"markup","GlossTerm":"Standard Generalized Markup Language","ID":"SGML","SortAs":"SGML"}},"title":"S"},"title":"example glossary"}}
Here it is in an advanced S-expression representation:

    (glossary "example glossary"
              (div S
                   (entry SGML
                          (sort SGML)
                          (term "Standard Generalized Markup Language")
                          (acronym SGML)
                          (abbrev "ISO 8879:1986")
                          (def "A meta-markup language, used to create markup languages such as DocBook."
                               (see-also GML XML))
                          (see markup))))
And here it is in its canonical representation:

    (8:glossary16:example glossary(3:div1:S(5:entry4:SGML(4:sort4:SGML)(4:term36:Standard Generalized Markup Language)(7:acronym4:SGML)(6:abbrev13:ISO 8879:1986)(3:def72:A meta-markup language, used to create markup languages such as DocBook.(8:see-also3:GML3:XML))(3:see6:markup))))
What, you'd like something which is immune to 7-bit/8-bit or email mangling? Here's the same data in transport format:

All three of those S-expression formats can be losslessly converted to one another. Ordering is exactly as specified (they are lists, not unordered or ordered dicts — although one can understand them as dicts, if desired).
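
The canonical form shown above is simple enough to sketch: every atom is written as its byte length, a colon, and the raw bytes, and every list is wrapped in parentheses. The Python below is an illustrative encoder only (it ignores display hints and the other two representations), and `canonical_sexp` is a made-up name.

```python
def canonical_sexp(node) -> bytes:
    """Encode nested lists of strings/bytes as a canonical S-expression."""
    if isinstance(node, (list, tuple)):
        return b"(" + b"".join(canonical_sexp(child) for child in node) + b")"
    data = node if isinstance(node, bytes) else str(node).encode("utf-8")
    # Length prefix in decimal, then the atom's bytes verbatim.
    return str(len(data)).encode("ascii") + b":" + data

print(canonical_sexp(["see", "markup"]))
print(canonical_sexp(["def", "x", ["see-also", "GML", "XML"]]))
```

Because atoms are length-prefixed, no quoting or escaping is needed, which also covers arbitrary binary data.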
