
Son – A minimal subset of JSON for machine-to-machine communication - seagreen
https://github.com/seagreen/Son
======
seagreen
Author here. I originally started this project because I wanted a consistent
way to serialize JSON so that the serialized bytes would hash the same way
every time.

As I worked on it, though, I realized it might be of general interest to
people. Hence the example in the README of piping JSON through multiple tools
without generating trivial changes that mess up diffs.

Most of the decisions I made were clear: no insignificant whitespace, object
keys must be ordered, etc.

There are two things I'm still not sure about:

+ Son doesn't provide escape sequences for any Unicode character that JSON
allows to be written unescaped. This includes U+007F (ASCII "delete"). Will
that cause a problem for many programs? All the other ASCII control characters
are required to be escaped by JSON; U+007F is the only one left out. (See the
sketch at the end of this comment.)

+ Son doesn't allow trailing zeros in fractions. This means you can't
serialize `1.0`; you have to serialize it as `1`.

I was confident in the decision to take out scientific notation (it would be
cool if JSON parsers actually treated numbers as being in scientific notation
and tracked significant digits, but they don't so I feel like that ship has
sailed). Trailing zeros are different, though, because some JSON generators do
use them to distinguish integers from fractions. The problem is that many
parsers don't care about them, so you end up in a situation where parsers
toss out information about documents and can no longer serialize them
faithfully, which defeats the whole point of Son.
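
As a quick illustration of the U+007F question, here's how CPython's `json`
module behaves (a sketch; the exact output depends on its `ensure_ascii`
flag). Generators already disagree about this character:

    import json

    # With the default ensure_ascii=True, everything outside printable
    # ASCII is escaped, including DEL:
    print(json.dumps("\x7f"))                      # "\u007f"

    # With ensure_ascii=False, only U+0000..U+001F must be escaped,
    # so DEL comes through raw and unescaped:
    print(json.dumps("\x7f", ensure_ascii=False))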

~~~
throwawaysed
Instead of messing with JSON and making it less human friendly (arguably the
thing that made JSON so popular in the first place), why not just use
something designed for efficient machine-to-machine transfer?

Basically, making JSON more machine-friendly and stricter to parse undermines
the original reason to use JSON. If efficiency and absolute correctness are
important, use something not designed to be forgiving to humans :)

~~~
snakeanus
>making it less human friendly (arguably the thing that made JSON so popular
in the first place)

I would not call JSON human friendly, given its lack of comments.

~~~
seagreen
You might be interested in JSON5 ([http://json5.org/](http://json5.org/))
which goes the opposite direction of Son and makes JSON more human-writeable.
It's the best, cleanest "JSON+" I've found so far.

------
dsp1234
_" Object members must be sorted by ascending lexicographic order of their
keys."_

How should the following be serialized?

{"öp":1, "op":0}

~~~
amock
Couldn't you solve that by sorting by the underlying byte order? Since it is
required to be UTF-8 that should be fast and unambiguous.
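
A sketch in Python of what that would look like (not from the spec, just
illustrating the suggestion): since UTF-8 preserves codepoint order, sorting
by the encoded bytes and sorting by codepoints agree:

    keys = ["öp", "op"]

    # Sort by the raw UTF-8 bytes of each key.
    by_bytes = sorted(keys, key=lambda k: k.encode("utf-8"))

    # Python compares strings codepoint by codepoint, so this is the
    # codepoint ordering.
    by_codepoints = sorted(keys)

    assert by_bytes == by_codepoints == ["op", "öp"]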

~~~
dsp1234
You could, and that's partially what prompted the question, but that's not
quite what the specification says currently.

edit:

 _and unambiguous._

var x = "\u0307\u0323\u0073"

var y = "\u0073\u0323\u0307"

These are arguably the same grapheme when displayed on a screen, but they
would be considered different keys in SON, and sorted in different spots. I'm
actually unsure if most diffing systems would catch that they are different or
not, and if so, how they would represent that difference to the end user
(since just showing the grapheme would be useless, as they render
identically).

The specification is silent on normalization, so a SON pipe could output an
object with those as keys (and still meet the spec) but not be accepted by a
conforming SON parser (one that did Unicode normalization on incoming keys).
So really, the specification should either say which of the canonical
normalized forms to use, or say that keys specifically may not be
normalized/modified in any way. This works fine if you're running something
like JavaScript -> JavaScript -> JavaScript, but would fail if a (currently
conforming) system in the middle defaults to normalized Unicode
representations. In other words, would a pipeline like Firefox -> Swift ->
Python -> Swift -> Nodejs -> Swift -> Firefox round-trip correctly, even if
they all had conforming SON implementations (of the current specification)?

~~~
seagreen
> not be accepted by a conforming SON parser (that did unicode normalization
> on incoming keys)

I don't think RFC 7159 actually mentions Unicode normalization at all. If it
was in there I think it would be mentioned in the section on string
comparisons, but that just talks about going unescaped codepoint by unescaped
codepoint:
[https://tools.ietf.org/html/rfc7159#section-8.3](https://tools.ietf.org/html/rfc7159#section-8.3)

EDIT: Does Unicode normalization change over time? If so we definitely have to
leave it out and just go codepoint by codepoint, because we don't want the
validity of a Son document to change.

------
nerdponx
Cute name, but 1) it's totally un-Googleable, and 2) it doesn't mean anything.

~~~
seagreen
I wanted to name it something like "mson" for "minimal-JSON", but that's
already taken.

EDIT: Not meaning anything is a plus! But I see your point about being
un-Googleable.

~~~
pacaro
ASN.1 uses DER (Distinguished Encoding Rules) for this purpose.

There are some other edge cases that you aren't considering (or that I
missed):

Min and max integer values: JavaScript has some pretty tight limits here (see
the sketch below).

Keys should be in lexicographic order — you might want to be more specific. Is
there a permitted subset of Unicode? Which normalization rule must be used?
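
To make the JavaScript limit concrete (JavaScript numbers are IEEE 754
doubles, which Python's `float` can stand in for here):

    # 2**53 is the last point at which a double can represent every integer.
    n = 2**53
    assert n != n + 1                  # distinct as integers...
    assert float(n) == float(n + 1)    # ...but they collapse as doubles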

~~~
seagreen
There's an issue for min and max number values here:
[https://github.com/seagreen/Son/issues/3](https://github.com/seagreen/Son/issues/3)

Son seems like a natural place to put restrictions on them, but I'm not sure
there's a way to do it that (A) still makes for a clean spec and (B) still
allows everything floats can encode to be used.

~~~
pacaro
On an unrelated note: if human readability isn't a primary concern, I'd
recommend requiring that < > & all be escaped; this prevents some crappy
attacks.

Also, no top level lists!

~~~
seagreen
Why no top level lists?

~~~
pacaro
Ha! I guess I'm showing my age. There used to be an associated vulnerability,
which is why some frameworks won't let you do it. It appears that the
vulnerability was fixed in what are now fairly old versions of browsers.

More info at [http://stackoverflow.com/questions/16289894/is-json-hijacking-still-an-issue-in-modern-browsers](http://stackoverflow.com/questions/16289894/is-json-hijacking-still-an-issue-in-modern-browsers)

------
pwdisswordfish
Umm... why? The only rationale for this project I can see is this:

> Piping JSON through multiple programs creates lots of trivial changes, which
> makes it hard to do things like take meaningful diffs.

But you can always put JSON through a pretty-printer which puts values into
canonical form before diffing. I wouldn't bother turning it into a formal
specification. And you mention 'No insignificant whitespace' in the README, so
it's not like your format makes line-by-line diffs any clearer.

It's as if IT hasn't had enough solutions in search of a problem...

~~~
seagreen
> I wouldn't bother turning it into a formal specification.

Too late!

> But you can always put JSON through a pretty-printer which puts values into
> canonical form before diffing.

Son is a starting point for building such a pretty printer. It takes care of
messy details like eliminating the redundancies in string and number encoding,
so all you have to do to specify the pretty-printer format is say where you
want your newlines and how much to indent by.
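
For instance, a rough approximation in Python (`son_ish` is a hypothetical
helper; it only handles key ordering and whitespace, not the string/number
canonicalization that makes up most of the Son spec):

    import json

    def son_ish(value, indent=None):
        # Sorted keys, no insignificant whitespace (or caller-chosen
        # indentation), and raw UTF-8 instead of \uXXXX escapes.
        separators = (",", ":") if indent is None else (",", ": ")
        return json.dumps(value, sort_keys=True, separators=separators,
                          indent=indent, ensure_ascii=False)

    print(son_ish({"b": 1, "a": [1, 2]}))            # {"a":[1,2],"b":1}
    print(son_ish({"b": 1, "a": [1, 2]}, indent=2))  # pretty-printed form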

~~~
kornish
> Son is a starting point for building such a pretty printer.

I don't understand: such pretty printers already exist (e.g. jq, which does a
whole lot more [0]). If you're transmitting JSON and want to diff two
documents, just pipe them into jq or another pretty-printer with key ordering,
then use one of many existing line-by-line diff tools.

[0]: github.com/stedolan/jq

~~~
hyperpallium

      $ echo "1." | jq .
      1
    

jq is great, but it doesn't preserve the float/int distinction. There may be
other canonicalization details that pretty printers don't address.

It might be simpler to change jq and/or other pretty printers (though there
might be implementation or design reasons not to).

But the other advantage of a new project like SON is positioning, not
engineering: it lives in our heads, not in the code, as the tool for the job.
If so, the page should spend more time on what that job is (for when people
search for that job).

And, as the jq example shows, tools not designed for this job mightn't do it
exactly right.

EDIT I was reading offline, so didn't see the sibling reply.

~~~
seagreen
I love that we both started our comments with "jq is great". I hadn't seen
yours either =)

------
nly
I've implemented similarly constrained string representations for numerics for
a config file format, but I ended up allowing both "1" and "1.0" so I could
distinguish integer and floating point literals ("1.00" was still forbidden,
as was "0.0"... zero was stored in a specially tagged way).

The exponent syntax in JSON is fun because it allows numerous representations
of the same value, e.g. 0e1 for 0 and 7e0 for 7.
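
For example (Python's `json` module accepts all of these):

    import json

    # Many distinct JSON texts denote the same number -- exactly the kind
    # of redundancy a canonical form has to forbid.
    assert json.loads("0e1") == json.loads("0") == 0
    assert json.loads("7e0") == json.loads("7.0") == 7
    assert json.loads("1e0") == json.loads("10e-1") == 1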

~~~
seagreen
> I ended up allowing both "1" and "1.0" so I could distinguish integer and
> floating point literals

Definitely the toughest call when making a JSON subset like this.

------
slowmovintarget
You _could_ use JSON as a host: [https://github.com/cognitect/transit-format](https://github.com/cognitect/transit-format)

I do like the comment about having the S-expression format for data
interchange defined since 1997.

------
bgamari
Isn't this just a reinvention of Canonical JSON [1]? There were efforts [2] to
standardize it, although it seems this fizzled out.

[1]
[http://wiki.laptop.org/go/Canonical_JSON](http://wiki.laptop.org/go/Canonical_JSON)

[2] [https://datatracker.ietf.org/doc/draft-staykov-hu-json-canonical-form/](https://datatracker.ietf.org/doc/draft-staykov-hu-json-canonical-form/)

~~~
seagreen
Thanks for bringing Canonical JSON up. There's an issue to investigate it here:
[https://github.com/seagreen/Son/issues/7](https://github.com/seagreen/Son/issues/7)

It doesn't appear that Canonical JSON does anything about redundant escape
sequences in JSON. I'm still looking into it to be sure. This is a big part of
the motivation behind the Son spec, and represents about half the EBNF:
[https://github.com/seagreen/Son/blob/master/son.ebnf#L19](https://github.com/seagreen/Son/blob/master/son.ebnf#L19)
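
To show what I mean by redundant escapes, JSON lets the same string be
written many ways (checked here with Python's `json` module):

    import json

    # All of these JSON texts decode to the same one-character strings.
    assert json.loads('"\\u0041"') == json.loads('"A"') == "A"
    assert json.loads('"\\/"') == json.loads('"/"') == "/"

A canonical form has to pick exactly one spelling for each.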

For anyone else who knows of more JSON subsets: reporting them here would be
really appreciated:
[https://housejeffries.com/page/7](https://housejeffries.com/page/7). I
definitely don't want to duplicate work.

------
dsp1234
_" Object keys must be unique."_"

How should the following be serialized (if at all)?

{"a":0,"\u0073\u0323\u0307":1,"\u1E69":2,"Z":3}

Note that I'm using the \u forms even though they aren't allowed by the
specification, because HN sometimes doesn't render Unicode combining
characters properly, and this way it's clear that the byte patterns differ.

~~~
seagreen
The plan is to go codepoint by codepoint in ascending order:

`{"Z":3,"a":0,"\u0073\u0323\u0307":1,"\u1E69":2}`

See this issue for the discussion:
[https://github.com/seagreen/Son/issues/1](https://github.com/seagreen/Son/issues/1)

My current understanding is that RFC 7159 doesn't require Unicode
normalization to be performed; if it does, we're in trouble:
[https://tools.ietf.org/html/rfc7159](https://tools.ietf.org/html/rfc7159)
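
To see why normalization would be trouble, a quick Python check (NFC composes
the combining marks from dsp1234's example into the precomposed U+1E69, so a
normalizing parser would suddenly see duplicate keys):

    import unicodedata

    k1 = "\u0073\u0323\u0307"  # 's' + combining dot below + combining dot above
    k2 = "\u1E69"              # precomposed 's with dot below and dot above'

    assert k1 != k2                                # distinct codepoint sequences
    assert unicodedata.normalize("NFC", k1) == k2  # ...but NFC-equivalent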

------
mixedCase
Why not use something like Protobufs and Cap'n'Proto instead of yet another
serialization format?

~~~
kevin_thibedeau
They require a schema. A schemaless binary serialization like CBOR or BSON is
more amenable to JSON interchange.

~~~
k__
How about [http://msgpack.org/](http://msgpack.org/)?

~~~
seagreen
Son's a niche tool to allow talking to JSON services in a more consistent way.
It's not meant as a competitor to any other data format.

EDIT: s/alternative/competitor

------
zeveb
> Piping JSON through multiple programs creates lots of trivial changes, which
> makes it hard to do things like take meaningful diffs.

That's the virtue of having a format which offers a canonical representation
for data.

> No insignificant whitespace.

I.e., it's not human-readable. It also has no decent way to exchange binary
data (there is no byte-sequence type: one must either use Base64 or an array
of integers, neither of which is space-efficient).
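
(For a sense of the overhead, a quick Python check: Base64 costs 4 output
characters per 3 input bytes, and an array of integers is several times
worse:)

    import base64, json

    blob = bytes(range(256))
    assert len(base64.b64encode(blob)) == 344    # ~4/3 of 256, before quoting
    assert len(json.dumps(list(blob))) > 3 * len(blob)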

It'd be nice to have a format which is _both_ human-readable _and_ has a
canonical representation. Fortunately, such a thing already exists (and I've
even linked to it once already today), and has since _1997_:
[http://people.csail.mit.edu/rivest/Sexp.txt](http://people.csail.mit.edu/rivest/Sexp.txt)

Here's a JSON example (from
[http://json.org/example.html](http://json.org/example.html)):

    
    
        {
            "glossary": {
                "title": "example glossary",
            "GlossDiv": {
                    "title": "S",
              "GlossList": {
                        "GlossEntry": {
                            "ID": "SGML",
                  "SortAs": "SGML",
                  "GlossTerm": "Standard Generalized Markup Language",
                  "Acronym": "SGML",
                  "Abbrev": "ISO 8879:1986",
                  "GlossDef": {
                                "para": "A meta-markup language, used to create markup languages such as DocBook.",
                    "GlossSeeAlso": ["GML", "XML"]
                            },
                  "GlossSee": "markup"
                        }
                    }
                }
            }
        }
    

Here it is in SON:

    
    
        {"glossary":{"GlossDiv":{"GlossList":{"GlossEntry":{"Abbrev":"ISO 8879:1986","Acronym":"SGML","ID":"SGML","GlossDef":{"GlossSeeAlso":["GML","XML"],"para":"A meta-markup language, used to create markup languages such as DocBook."},"GlossSee":"markup","GlossTerm":"Standard Generalized Markup Language","SortAs":"SGML"}},"title":"S"},"title":"example glossary"}}
    

Here it is in an advanced S-expression representation:

    
    
        (glossary "example glossary"
                  (div S
                       (entry SGML
                              (sort SGML)
                              (term "Standard Generalized Markup Language")
                              (acronym SGML)
                              (abbrev "ISO 8879:1986")
                              (def "A meta-markup language, used to create markup languages such as DocBook."
                                   (see-also GML XML))
                              (see markup))))
    

And here it is in its canonical representation:

    
    
        (8:glossary16:example glossary(3:div1:S(5:entry4:SGML(4:sort4:SGML)(4:term36:Standard Generalized Markup Language)(7:acronym4:SGML)(6:abbrev13:ISO 8879:1986)(3:def72:A meta-markup language, used to create markup languages such as DocBook.(8:see-also3:GML3:XML))(3:see6:markup))))
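
The canonical form above is just length-prefixed atoms inside parentheses; a
minimal Python sketch of the encoding (hypothetical helpers, following the
Rivest draft):

    def atom(b: bytes) -> bytes:
        # Canonical S-expressions length-prefix every atom, so no quoting
        # or escaping rules are ever needed.
        return str(len(b)).encode() + b":" + b

    def lst(*items: bytes) -> bytes:
        return b"(" + b"".join(items) + b")"

    # The (4:sort4:SGML) fragment from the example above:
    assert lst(atom(b"sort"), atom(b"SGML")) == b"(4:sort4:SGML)"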
    

What, you'd like something which is immune to 7-bit/8-bit or email mangling?
Here's the same data in transport format:

    
    
        {KDg6Z2xvc3NhcnkxNjpleGFtcGxlIGdsb3NzYXJ5KDM6ZGl2MTpTKDU6ZW50cnk0OlNHTUwoNDpz
        b3J0NDpTR01MKSg0OnRlcm0zNjpTdGFuZGFyZCBHZW5lcmFsaXplZCBNYXJrdXAgTGFuZ3VhZ2Up
        KDc6YWNyb255bTQ6U0dNTCkoNjphYmJyZXYxMzpJU08gODg3OToxOTg2KSgzOmRlZjcyOkEgbWV0
        YS1tYXJrdXAgbGFuZ3VhZ2UsIHVzZWQgdG8gY3JlYXRlIG1hcmt1cCBsYW5ndWFnZXMgc3VjaCBh
        cyBEb2NCb29rLig4OnNlZS1hbHNvMzpHTUwzOlhNTCkpKDM6c2VlNjptYXJrdXApKSkp}
    

All three of those S-expression formats can be _losslessly_ converted to one
another. Ordering is exactly as specified (they are lists, not unordered or
ordered dicts — although one can understand them as dicts, if desired).

