
Son: A minimal subset of JSON for machine-to-machine communication - ingve
https://github.com/seagreen/Son
======
niftich
The github for this project doesn't use this terminology, but the keyword here
is ' _canonicalization_ '. That's the process in which you transform a
document which can take any number of forms to a specific document that only
takes one form.

You do this to ensure that documents that contain the same "information" will
look identical on the wire, not just after-the-fact once you've processed them
into some execution-specific data structure.
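As a sketch (in Python, using the stdlib `json` module - this illustrates the general idea of canonicalization, not the Son spec itself), it can be as simple as re-serializing with sorted keys and no optional whitespace:

```python
import json

# Two JSON documents carrying the same "information" in different surface forms.
doc_a = '{"b": 2, "a": 1}'
doc_b = '{"a":1,\n  "b": 2}'

def canonicalize(text: str) -> str:
    """Re-serialize JSON deterministically: sorted keys, no whitespace."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))

# Same information, identical bytes on the wire.
assert canonicalize(doc_a) == canonicalize(doc_b) == '{"a":1,"b":2}'
```

(A real canonical form also has to pin down number and string encodings, which is exactly what Son and ASN.1 DER worry about.)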

For example, X.509 certificates use ASN.1 DER, a restricted subset of ASN.1
BER where each value takes a deterministic form, so that two certificates that
contain the same information will look identical when serialized into bytes.
This is a stricter application of ASN.1, while, say, when you talk to a
directory service over LDAP, you can speak BER, the looser encoding, because
the exact bytes by which you're making yourself understood don't really
matter.

Son is a canonicalized form of JSON, and the author maintains a page of JSON
subsets and supersets [1] that lists two other canonicalized encodings, both
of them containing silly flaws in their design. Son is a better effort.

[1] [https://housejeffries.com/page/7](https://housejeffries.com/page/7)

~~~
nly
A stricter _subset_ of JSON doesn't interest me without more specification
surrounding how numerics should be decoded.

How do I decode 18446744073709551615? What is the behaviour of a decoder that
can't handle numeric types of that size? Should it cause a parse error? Should
it truncate to 1.84467e+19? Should it overflow to 4294967295?

Presumably if you want a canonical textual representation, the parse error is
the only acceptable solution.
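To make the three behaviors concrete (Python sketch; Python's own parser happens to take the lossless route via arbitrary-precision ints):

```python
import json

big = "18446744073709551615"  # 2**64 - 1: too large for uint32 or an exact double

# Python's json module keeps full precision (arbitrary-precision int):
assert json.loads(big) == 18446744073709551615

# A decoder that stores numbers as IEEE 754 doubles silently rounds:
assert int(float(big)) == 18446744073709551616  # off by one, no error raised

# And a naive 32-bit wrap-around gives yet another answer:
assert 18446744073709551615 % 2**32 == 4294967295
```

Three conforming-looking decoders, three different values - which is why a canonical format arguably has to treat the lossy cases as parse errors.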

~~~
d33
> What is the behaviour of a decoder that can't handle numeric types of that
> size?

Is this actually relevant? It sounds like an implementation detail to me, and
it seems to boil down to using an arbitrary-size bigint. If you can't handle a
bigint because it's too big for your machine, that sounds like a problem with
the system, not the parser.

~~~
Spivak
We can play this game all day. What if I pass an integer that takes 32GiB of
memory to store? Also the system's fault if the resources aren't available?

~~~
d33
I'd say so. This is kind of a security question - the data is valid from the
protocol's perspective; it's the contents that are malicious. It's up to the
developer to decide how to handle that - if it's super important to be able to
process such big numbers, modify the parser to handle that (for example by
storing them on the HDD).

Same goes for another serialization format: bencoding. It supports arbitrarily
nested data structures, and most implementations are recursive, which can
easily result in stack overflow if you mess up the implementation. But I
wouldn't blame the standard - that's the price you pay for flexibility. At the
end of the day, you need to sanitize your input anyway and keep your edge
cases in mind. Same goes for XML bombs...

~~~
rmrfrmrf
I would think the JS part of JSON would preclude any massive integers. It's
still an IEEE 754 double-precision float at the end of the day, so your max
safe integer is (2^53)-1.
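The 2^53 boundary is easy to demonstrate (Python sketch; Python floats are the same IEEE 754 doubles JS numbers are):

```python
# Up to 2**53 - 1, every integer has an exact double representation.
MAX_SAFE = 2**53 - 1  # 9007199254740991, JS's Number.MAX_SAFE_INTEGER

assert float(MAX_SAFE) != float(MAX_SAFE - 1)  # still distinguishable

# One past the limit, adjacent integers collapse to the same double:
assert float(2**53) == float(2**53 + 1)  # precision silently lost
```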

------
maltalex
JSON loses its appeal once you're not targeting browsers and/or anything else
that uses Javascript. Sure, it's simple and familiar, but it has serious
drawbacks such as the time and CPU resources it takes for parsing and the lack
of schema.

For those of you interested, I suggest using Google's protobufs or even full
blown gRPC for such endeavors.

See discussion about gRPC here:
[https://news.ycombinator.com/item?id=12344995](https://news.ycombinator.com/item?id=12344995)

Protobufs: [https://developers.google.com/protocol-buffers/](https://developers.google.com/protocol-buffers/)

~~~
perlgeek
Please correct me if I'm wrong, but I've always had the impression that
protobufs introduce more coupling than JSON.

Example: when you add a new element to the API response, the recipient needs
to update their schema.

Is that correct? If yes, how do you deal with that in a microservice
architecture where you have little control over consumers of your API?

~~~
weego
I don't understand the issue. The contract between producer and consumer
should never be fragile around additional values existing.

~~~
zbentley
> The contract between producer and consumer should never be fragile around
> additional values existing.

... it should never be fragile _where fragility might cost more than it gains
you_. The open/closed principle is the right thing to follow in most cases,
but not all.

For example, choking on unrecognized additional fields by default in
development is often a big time/bug-saver, since it gives you an early "hint"
that things might not be speaking the same dialect to each other.

Similarly, if you have a message convention that is heavy on "modifiers" which
control irreversible things, sometimes it's better to be fragile rather than
count on orchestration systems to work perfectly 100% of the time. Suppose you
have a system for buying bicycles with a required field of "bike model", and
modifiers of "color", "wheel size", etc. Now let's say your servers all got
patched to expect a new modifier of "frame size" . . . except for one server
which didn't restart due to a deployment bug, and is still serving traffic. If
a client sends that unexpected additional modifier to the "old" server, and it
gets ignored, this could result in sending a production order for an expensive
unit with the wrong specifications. If not caught, that cost could be
compounded by sending the wrong unit to the customer and pissing them off.
Now, this is clearly not a problem with the message format, but rather with
the orchestration/release system. But bugs in orchestration layers are
_incredibly common_ , and baking a bit of fragility into the messaging layer
at the right places can help mitigate those bugs' impact.
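The "fragile by default" behavior described above can be sketched as a decoder that rejects modifiers it doesn't recognize (Python; the schema and field names are hypothetical, made up for the bicycle example):

```python
import json

# Hypothetical schema: the fields THIS server version knows about.
EXPECTED_FIELDS = {"bike_model", "color", "wheel_size"}

def strict_decode(text: str) -> dict:
    """Decode an order, but fail fast on fields this version doesn't know."""
    obj = json.loads(text)
    unknown = set(obj) - EXPECTED_FIELDS
    if unknown:
        raise ValueError(f"unrecognized fields: {sorted(unknown)}")
    return obj

strict_decode('{"bike_model": "X1", "color": "red"}')  # accepted

try:
    # A client speaking the newer dialect hits the un-patched server:
    strict_decode('{"bike_model": "X1", "frame_size": 54}')
except ValueError as e:
    print(e)  # unrecognized fields: ['frame_size']
```

The stale server rejects the order loudly instead of silently shipping the wrong bike.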

------
fanf2
This spec mostly concentrates on simplifying the syntax of numbers, but it
does not do anything about one of the big omissions in JSON: it does not
specify the range of integers or the precision of fractions. There are a lot
of interoperability problems in that omission.

(The other thing SON does is require unique keys, which is a good improvement
that fixes an interop bug that can cause security vulnerabilities.)
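The duplicate-key hazard is easy to show (Python sketch; `object_pairs_hook` is the stdlib's escape hatch for seeing raw key/value pairs):

```python
import json

# Lenient parsers typically keep the LAST duplicate silently - and two
# parsers that disagree on which value "wins" is exactly the interop bug
# that turns into a security hole.
assert json.loads('{"a": 1, "a": 2}') == {"a": 2}

# A parser in the spirit of Son's unique-key rule rejects duplicates:
def no_dupes(pairs):
    keys = [k for k, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate object keys")
    return dict(pairs)

try:
    json.loads('{"a": 1, "a": 2}', object_pairs_hook=no_dupes)
except ValueError as e:
    print(e)  # duplicate object keys
```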

~~~
mehrdadn
> but it does not do anything about one of the big omissions in JSON: it does
> not specify the range of integers or the precision of fractions

Could you give an example of a storage format that _does_ do this and the
problems that doing so avoids?

~~~
majewsky
I'm not working with binary protocols that often, so I cannot point to an
existing protocol. However, if I had to implement a serialization format that
doesn't need to be human-readable, I would never use number literals. I would
write numbers as one byte whose value means e.g. "the next 4 bytes are an
uint32 in little-endian", followed by just these 4 bytes.

Then you also don't have to worry about whitespace, because there is none. The
next field can follow directly afterwards, because the type header byte
defines how long exactly the current field is. (For variable-length strings,
use something like netstrings.)

The only problem that I can see with this is that Son wants to ensure
deterministic encoding, but in my example, 0 could be encoded as an integer of
any size or signedness. You could add a rule like "encode every integer as the
smallest possible type", but that may have practical implications elsewhere.
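The tag-byte scheme described above might look like this (Python sketch using the stdlib `struct` module; the tag value is hypothetical):

```python
import struct

TAG_U32 = 0x01  # hypothetical tag: "the next 4 bytes are a little-endian uint32"

def encode_u32(n: int) -> bytes:
    # One type-tag byte, then exactly 4 payload bytes. No whitespace, no
    # delimiters - the tag alone tells the reader how far the field extends.
    return struct.pack("<BI", TAG_U32, n)

def decode_u32(data: bytes) -> int:
    tag, value = struct.unpack_from("<BI", data)
    assert tag == TAG_U32
    return value

assert encode_u32(1000) == b"\x01\xe8\x03\x00\x00"
assert decode_u32(encode_u32(1000)) == 1000
```

The determinism problem is visible here too: 0 could just as well carry an int8 or uint64 tag, so a canonical variant would have to mandate one encoding per value.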

I wouldn't be too worried about it, seeing how half-baked Son appears to be.
For example:

> Object members must be sorted by ascending lexicographic order of their
> keys.

Yeah right, because there is only one lexicographic order in the world.

~~~
pegasuscollins
> Yeah right, because there is only one lexicographic order in the world.

The specification tells you all strings (keys) must be UTF8. There _is_ only
one lexicographic ordering of UTF8 strings (RFC3629, page 2). This should
probably be explicitly mentioned in the linked specification though.
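The RFC 3629 property in question is that byte-wise comparison of UTF-8 agrees with code-point order, which is checkable (Python sketch; Python compares strings by code point):

```python
# Sorting UTF-8 strings by their encoded bytes gives the same order as
# sorting by Unicode code point - the property RFC 3629 notes.
words = ["zebra", "Ångström", "apple", "日本", "a"]

by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_codepoints = sorted(words)  # Python's str comparison is by code point

assert by_bytes == by_codepoints
```

(This is code-point order, not locale-aware collation - which is the distinction the sibling comment about LC_COLLATE is pointing at.)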

~~~
majewsky
> There is only one lexicographic ordering of UTF8 strings.

Then why is LC_COLLATE still a thing?

Also, you don't get to second-guess a spec. If it does not specify everything
that it needs to, it deserves the "half-baked" label.

------
matharmin
I see many comments here on the efficiency of binary formats versus JSON.
While it is certainly possible, especially with a well-designed and
implemented protocol, just remember that you don't automatically get massive
gains by using a binary protocol:

  * JSON compresses well, and gzipped JSON is usually smaller than a corresponding binary format without compression.

  * Being such a common format, practically every language has a highly optimized JSON parser. Just because a binary format can be faster to parse, doesn't mean the implementation is necessarily faster.
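The compression point is easy to see on a typical repetitive payload (Python sketch with stdlib `gzip`; the record shape is made up for illustration):

```python
import gzip
import json

# A repetitive payload, typical of API responses with many similar records.
records = [{"id": i, "status": "active", "region": "us-east-1"}
           for i in range(200)]

raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

# The repeated keys and values compress away almost entirely.
assert len(compressed) < len(raw) // 2
print(len(raw), "->", len(compressed))
```

A binary format avoids paying for the keys in the first place, but as the comment notes, compression closes much of the gap - at the cost of some CPU cycles.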

As a specific example, I've had experience in the past where the most common
protobuf implementation for Ruby at the time was an order of magnitude slower
than the default JSON parser (it has probably improved by now).

So by all means use a binary format for (internal) machine-to-machine
communication if the performance gains are significant, but don't just assume
that JSON will be too inefficient.

~~~
alfanick
> Just because a binary format can be faster to parse, doesn't mean the
> implementation is necessarily faster.

Except for "JSON compresses well, and gzipped JSON is usually smaller than a
corresponding binary format without compression" - gzip takes some cycles...

------
js8
I wish people wouldn't use JSON for M2M communication and used e.g. CBOR
instead. It's such a waste of resources.

In the 70's, using text formats for protocols was understandable (7-bit serial
lines..). Today, it is not.

~~~
emodendroket
I think people like the ease of human-readable formats for debugging.

~~~
IshKebab
Just decode the data before logging it...

I mean, I can't think of many situations where I'm debugging something by
looking at raw bytes when I couldn't easily just dump the decoded
representation instead.

People don't use BCD for ease of debugging do they?

~~~
emodendroket
I mean, maybe it's worth doing anyway, but it's an extra bit of friction. Plus,
with a text format you can probably avoid issues where the binary doesn't
decode properly because of version changes, a different environment, etc.

~~~
foobarian
In Javaland BSON for production messages performs close to JDK serialization,
and for debugging it's not hard to override the content types at runtime (or
log the messages in readable form where they are getting serialized).

------
algesten
I remember reading this a while back. I think the SON specification could
tighten up a few more things, like number ranges and further string escaping
problems.

[http://seriot.ch/parsing_json.php](http://seriot.ch/parsing_json.php)

~~~
seagreen
Limiting the size of numbers is intentionally left out of Son right now,
because I'm not sure where that limit should be. My hope is that we eventually
have a good schema language for JSON (something like JSON Schema, though I'm
not sure about that one in particular) and then people can set number ranges
in their schemas based on their particular use case.

There shouldn't be any remaining string escaping problems though! If you find
one definitely open a GitHub issue.

------
mlawren
Somewhat off-topic, but I had a need recently for a simple serialization
format that satisfied the following:

    
    
        * Support for undef/null
        * Support for binary data
        * Support for UTF8 strings
        * Universally-recognized canonical form for hashing
        * Trivial to construct on the fly from SQLite triggers
    

Of all of the formats I looked at nothing matched. So I created Bifcode
[[https://metacpan.org/pod/Bifcode](https://metacpan.org/pod/Bifcode)]. It is
a bit of a mix between Netstrings, Bencode and JSON. It is not really human
readable, but it is extremely simple and robust, easy to generate and parse,
and therefore relatively secure.

I'm not promoting it for any particular use case, and I only have a Perl
implementation. Posting here as I think some of the audience in this thread
may find it useful.

------
joelthelion
If I could change JSON, I would add comments to it, not make it simpler.

~~~
IshKebab
HJSON.

~~~
joelthelion
Thanks, that looks useful.

------
tzahola
Daily reminder that there _must_ be a sane timeline where telcos didn't
proliferate ASN.1 to death, and machine-to-machine communication was solved
decades ago.

------
kbumsik
No exponential notation?? I don't see why this change makes it suitable for
machine-to-machine communication. It looks aimed at human readability rather
than at machines. It surely removes some branches, but I don't think it will
improve decoding performance much. At the very least this needs to provide a
convincing reason to use it over JSON, e.g. a performance comparison.

~~~
seagreen
I opened an issue for this here:
[https://github.com/seagreen/Son/issues/10](https://github.com/seagreen/Son/issues/10)

I'm curious about how often enough trailing zeros occur in real world
documents for lack of exponential notation to have a noticeable impact on
document size.

------
grogenaut
Man, I want the opposite. If it's machine-to-machine I can use a machine to
decode it. For JSON I want comments, trailing commas, and a few other things I
can't think of at 9am on a Saturday.

~~~
seagreen
You're looking for supersets of JSON, see the last section here:
[https://housejeffries.com/page/7](https://housejeffries.com/page/7). I'm
partial to JSON5.

------
c54
I love the parser/state machine diagrams for all the matchers, does anyone
(OP?) know what tool was used to generate those?

~~~
alfanick
Please do read:

> # Notes
>
> + The diagrams were generated with
> [GrammKit](https://github.com/dundalek/GrammKit).

------
iainmerrick
First bullet point:

\- No exponential notation

That's going to give you some pretty big numbers. It would be better to
_require_ exponential notation.

~~~
seagreen
[https://github.com/seagreen/Son/issues/10](https://github.com/seagreen/Son/issues/10)

------
k__
How about JSAN?

With a pre-defined record structure you could simply use arrays instead of
objects and save a lot of space.
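The saving is easy to measure (Python sketch; the record fields are hypothetical, and the field order would have to be agreed out-of-band):

```python
import json

# Hypothetical record type with a pre-agreed field order:
FIELDS = ("id", "name", "active")

as_objects = json.dumps([{"id": 1, "name": "a", "active": True},
                         {"id": 2, "name": "b", "active": False}])
as_arrays = json.dumps([[1, "a", True],
                        [2, "b", False]])

# The key names are paid for once in the shared schema, not once per record.
assert len(as_arrays) < len(as_objects)
```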

------
Deid2AMa
Why not just use MsgPack or CBOR?

------
alfanick
CBOR, FlatBuffers, ProtoBuf, MsgPack, BSON - so many better choices.

Please _do not_ make JSON a standard for M2M communication, it's as bad as
calling Electron apps "native".

~~~
megous
It's already a standard for M2M communication - it's not as if you see people
writing JSON by hand when doing anything on the web.

