
Show HN: ION, a JSON alternative – Versatile, compact, fast, binary data format - VStack
https://github.com/jjenkov/iap-tools-java
======
efaref
Or you could use an actual RFC standard[1] that has numerous high-quality
implementations[2].

[1] [https://tools.ietf.org/html/rfc7049](https://tools.ietf.org/html/rfc7049)
[2] [http://cbor.io/](http://cbor.io/)

~~~
VStack
We have already looked at CBOR and MessagePack. They lack the ION tables for
compact arrays of data. Here is a link to our comparison to other data
formats. [http://tutorials.jenkov.com/iap/ion-vs-other-
formats.html](http://tutorials.jenkov.com/iap/ion-vs-other-formats.html)

~~~
efaref
I don't really see how ION tables are an improvement over arrays of arrays,
e.g.:

    
    
        {
          "headers": [ "a", "b", "c" ],
          "rows": [
            [ 1, 2, 3 ],
            [ 4, 5, 6 ]
          ]
        }
    

Furthermore, ION appears to require you to know the length of your data up
front, whereas you could use CBOR unspecified length arrays to stream data
from your database without precalculating the table length. It seems like
quite a niche format, though. Most data is not truly tabular.
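
As a rough illustration of the streaming point, here is a minimal Python sketch of a CBOR indefinite-length array as described in RFC 7049: the outer array starts with `0x9f` and ends with the `0xff` break code, so the writer never needs the row count. The helper names (`cbor_uint`, `stream_rows`) are made up for this sketch, and it only handles integers 0..23 and rows shorter than 24 items.

```python
import io

def cbor_uint(n: int) -> bytes:
    """Encode a small unsigned integer (0..23) as a single CBOR byte."""
    if not 0 <= n <= 23:
        raise ValueError("this sketch only handles 0..23")
    return bytes([n])

def stream_rows(rows, out) -> None:
    """Write rows as an indefinite-length array of definite-length arrays."""
    out.write(b"\x9f")                        # start of indefinite-length array
    for row in rows:                          # rows may be a generator or DB cursor
        if len(row) > 23:
            raise ValueError("this sketch only handles rows of up to 23 items")
        out.write(bytes([0x80 | len(row)]))   # definite-length array header
        for value in row:
            out.write(cbor_uint(value))
    out.write(b"\xff")                        # "break" stop code ends the array

buf = io.BytesIO()
stream_rows(iter([[1, 2, 3], [4, 5, 6]]), buf)
encoded = buf.getvalue()
print(encoded.hex())
```

The total number of rows is never written, which is what lets the sender start emitting bytes before the query has finished.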

Also, in the table it claims "Yes" under support for "Cyclic references", and
yet further down the page:

> ION has support for expressing cyclic references between objects. At this
> point this support is not 100% finalized.

So surely this should be "Yes(*)"?

~~~
VStack
First of all, lots of results sent back from backend services (or databases)
are arrays of objects. So no, tables are not a "niche" format - tables are
heavily used.

Second, an array of arrays could mean anything. You have no semantics telling you
whether the arrays are independent or if the first array is an array of
columns for the following arrays. ION Tables add that semantic information.

Third, yes, we could encode everything as text or as raw bytes and leave it up
to the user to make sense of it. But that is exactly what we are trying to
avoid with ION. We want to give devs a decent standard data format to use,
that doesn't require a lot of data encoding choices up front. The encoding
options have been thought through already, and sensible choices already made
which you can just follow.

Fourth, you can nest ION tables inside ION tables, and thus create a more
compact representation of an object graph. Using JSON / CBOR you would need a
lot of nested arrays inside arrays to emulate that. Possible, but not exactly
pretty.
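
To make the table argument concrete, here is a minimal Python sketch of the general idea: an array of uniform objects is stored as one list of column names plus rows of values, so the property names appear only once. This is not ION's actual encoding; the `columns`/`rows` layout and the function names are assumptions for illustration, and JSON is used only to measure the size difference.

```python
import json

def to_table(objects):
    """Turn a list of uniform dicts into column names plus rows of values."""
    if not objects:
        return {"columns": [], "rows": []}
    columns = list(objects[0].keys())
    rows = [[obj[column] for column in columns] for obj in objects]
    return {"columns": columns, "rows": rows}

def from_table(table):
    """Rebuild the list of dicts from the table form."""
    return [dict(zip(table["columns"], row)) for row in table["rows"]]

objects = [{"id": i, "zip": "1000", "city": "Copenhagen"} for i in range(100)]
table = to_table(objects)

as_objects = json.dumps(objects, separators=(",", ":"))
as_table = json.dumps(table, separators=(",", ":"))
print(len(as_objects), len(as_table))
```

The explicit `columns` key is what carries the semantics: a reader does not have to guess that the first array names the fields of the following ones.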

~~~
VStack
Yes also, the TLV nature of ION means that you have to buffer the ION data
while writing it, in case you don't know the full size of the embedded data ahead
of time. However, this is really only a problem for very big messages. HTTP
has worked like that for a long time already, and HTTP servers have proven
capable of scaling pretty well, won't you agree?

Additionally, if a message does not include its length up front, you are just
trading faster write time for slower read time. A node receiving dynamically
sized data then has the problem of knowing how much data to allocate for the
full message, plus the receiver has to inspect the data as it comes in to see
where it ends. With an ION message the receiver knows within the first 4-5
bytes (typically) how big that message will be, and can thus copy the
following bytes directly into the perfectly allocated memory area, without
having to examine them any further. Since data is most often read more times
than written (e.g. data written to a file), we felt that making a tradeoff
that favors read speed over write speed made sense.
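
A minimal Python sketch of the read-side benefit described above, using a made-up framing (one type byte, one length-of-length byte, then a big-endian length). This is not ION's real byte layout, only an illustration of reading the length first and then copying the payload into a buffer of exactly the right size.

```python
import io

def write_field(type_code: int, payload: bytes) -> bytes:
    """Frame a payload as: type byte, length-of-length byte, length, payload."""
    length = len(payload)
    length_bytes = length.to_bytes(max(1, (length.bit_length() + 7) // 8), "big")
    return bytes([type_code, len(length_bytes)]) + length_bytes + payload

def read_field(stream):
    """Read one framed field; the payload is copied without being inspected."""
    type_code, length_size = stream.read(2)
    length = int.from_bytes(stream.read(length_size), "big")
    buffer = bytearray(length)     # allocate exactly once, at the final size
    stream.readinto(buffer)        # bulk copy, no scanning for an end marker
    return type_code, bytes(buffer)

framed = write_field(1, b"x" * 70000)
type_code, payload = read_field(io.BytesIO(framed))
print(len(framed) - len(payload))  # size of the header in bytes
```

The cost sits on the writer, which must know `len(payload)` before emitting the header and therefore has to buffer the payload first.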

------
woah
On one of these pages, the claim is made that Protobufs is not self-
describing, and therefore cannot be used for "network applications". It seems
that "self-describing" here means that the format includes key names, instead
of compressing them by using numbers like protobufs does. I can't understand
why having field names is going to make a difference for anyone. Once you are
setting up a system to deal with a specific format of data, why not just
include a protobufs schema?

~~~
ctz
Here's a use case where protobuf is terrible because it isn't self-describing:
write a wireshark plugin which parses and pretty-prints protobuf messages for
human consumption.

You can't, because such a plugin would have to have a-priori knowledge of the
schema in use.

~~~
VikingCoder
You mean something like this, which does exactly that?

[https://code.google.com/archive/p/protobuf-
wireshark/](https://code.google.com/archive/p/protobuf-wireshark/)

Sure, you don't know the NAME of the field, but you can see the ID of it.

------
dplgk
Binary format makes me believe it's not human-readable. How does this
compare in size to gzipped JSON? JSON overhead is fairly small (some quotes,
colons, brackets and keys) - it's no XML.
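
One way to check the gzipped-JSON baseline for your own data is a quick measurement like the Python sketch below. The record shape is invented for illustration, and the numbers it prints say nothing about ION itself.

```python
import gzip
import json

# Hypothetical record set; substitute a sample of your real payloads.
records = [{"id": i, "name": f"user{i}", "active": i % 2 == 0} for i in range(200)]

raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
compressed = gzip.compress(raw)

print(len(raw), len(compressed))
```

Repeated key names compress well, so the gap between the two numbers grows with the number of uniform records.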

~~~
VStack
Yes, binary formats are not easily readable in a text editor. But, it is
actually possible to convert ION to an XML format and back again without loss
of information (we have not implemented this yet). This should make it easier
to read messages during debug - especially because you don't need to know the
schema for the given message to convert it to XML.

Regarding GZipped JSON, it is true that GZipped JSON is small. But, due to the
CRIME and BREACH attacks it is not recommended to compress data sent over
encrypted connections (TLS).

If you look at our performance benchmarks page you can see a list of
serialized length comparisons. As you can see, as soon as you send a few
objects in an ION table, the difference is big. More than what you normally
can gain with GZip (except perhaps for String).
[http://tutorials.jenkov.com/iap/ion-performance-
benchmarks.h...](http://tutorials.jenkov.com/iap/ion-performance-
benchmarks.html#serialized-length)

Furthermore, GZip only helps with transfer time, and actually slows down
parsing time. If you look at our performance benchmarks you will see that ION
parsing time is a lot faster than JSON. Additionally, if you really, really
want high speed you do not parse ION (or JSON) into Java objects. You process
the data directly in its binary form. If you look at our read-and-use
benchmark you can see just how big a speed difference that gives. ION is
designed for being processed directly. JSON isn't as good for that purpose.

Finally, ION is designed for fast arbitrary hierarchical navigation. JSON is
not.

------
umanwizard
Note for anyone who was as confused as I was: this is not the same thing as
Amazon's internal typed JSON format, which is also called Ion.

~~~
rix0r
Well, why would it be? That's internal.

Hasn't been published as far as I can Google.

~~~
umanwizard
I don't work at Amazon anymore so I thought maybe they had open-sourced it

~~~
jjenkov
Yes, a guy told us that Amazon has an internal data format called ION. We
googled for it, but didn't find it, so we assumed Amazon wants to keep it
internal.

~~~
umanwizard
Not that you are under any obligation to care, but this will now be very
annoying for Amazon employees.

"Oh, you need to call that service using Ion, not Json. No, the other Ion..."

~~~
pinkunicorn
Not to mention if they try to use it internally, they might end up with
package name conflicts..

------
klodolph
> As you can see, an ION field can contain values that are up to 2^120 bytes
> long. If you need to encode larger blocks of data than that, you would need
> to break it up into multiple fields.

Har, har.

------
kentonv
I'm looking at: [http://tutorials.jenkov.com/iap/ion-vs-other-
formats.html](http://tutorials.jenkov.com/iap/ion-vs-other-formats.html)

As the author of Protobuf v2 (the version that was open sourced by Google), I
object to some of the "no"s in the protobuf column.

(Note: I no longer work on Protobuf, and I did not invent the format. I do
work on and did invent Cap'n Proto.)

> Protobuf apparently isn't great at encoding raw bytes either (according to
> their own website).

Protobuf can handle raw bytes just fine, using the "bytes" type. There is no
special encoding done on bytes; parsing and encoding is done by memcpy(). I'm
curious to know what part of the web site you interpret as saying otherwise.
It's entirely possible that the web site contains confusing language, but a
citation would have been a good idea here.

> Schema / Class Id > Self describing

The Protobuf libraries have extensive support for manipulating dynamic schemas
and transmitting schemas over the wire. See the "Descriptor" and
"DynamicMessage" APIs. This is mentioned on the web site:

[https://developers.google.com/protocol-
buffers/docs/techniqu...](https://developers.google.com/protocol-
buffers/docs/techniques#self-description)

> Even if these compact objects do not contain any property names, they are
> still self describing enough that you can see where fields start and end,
> plus their data type, without an external schema. You cannot do that with
> Protobuf (as far as we know).

You absolutely can do that with Protobuf. This is what the "protoc
--decode_raw" flag does, and it should be clear enough from reading the
encoding.

[https://developers.google.com/protocol-
buffers/docs/encoding](https://developers.google.com/protocol-
buffers/docs/encoding)
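
As a rough illustration of what schema-less decoding looks like, here is a minimal Python sketch that walks the Protobuf wire format as documented on the encoding page: each field starts with a varint key holding `(field_number << 3) | wire_type`. The function names are made up for this sketch, and it does not handle groups or recurse into sub-messages; it is not the `protoc` implementation.

```python
def read_varint(data: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_raw(data: bytes):
    """Return (field_number, wire_type, value) triples without a schema."""
    fields = []
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:        # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:      # fixed 64-bit, kept as raw bytes
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:      # length-delimited: string, bytes, sub-message
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:      # fixed 32-bit, kept as raw bytes
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields

# Two fields taken from the examples in the Protobuf encoding documentation.
message = bytes.fromhex("089601") + bytes.fromhex("120774657374696e67")
print(decode_raw(message))
```

Without the schema the decoder sees field numbers and wire types but not names, and it cannot tell whether a length-delimited value is a string, raw bytes, or a nested message.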

> Cyclic references

While it's true that Protobuf doesn't support these, I hope you've considered
the denial-of-service vulnerabilities they tend to create if the receiver is
not expecting them. Please ensure that cyclic references are only allowed in
cases where the app opted into it.

Relatedly, overlapping references / backreferences ("Copy" in your table)
potentially leads to an amplification attack where a small message on the wire
turns out to be much, much larger when traversed. If applications cannot
defend themselves from huge payloads by setting a message size limit, then
you'll need to give them some other way.
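
A minimal Python sketch of the amplification problem and one possible defense, using a made-up message model: a list of entries that are either literal data or a "copy" of earlier entries. Nothing here reflects ION's real Copy field; the entry layout, `expanded_size`, and `BudgetExceeded` are assumptions for illustration.

```python
class BudgetExceeded(Exception):
    """Raised when resolving copies would exceed the receiver's size limit."""

def expanded_size(entries, index, budget):
    """Size of entry `index` after resolving copies, capped at `budget` bytes."""
    sizes = []
    for position, (kind, body) in enumerate(entries[: index + 1]):
        if kind == "data":
            size = len(body)
        elif kind == "copy":
            # Only backward references are allowed, which also rules out cycles.
            if any(not 0 <= ref < position for ref in body):
                raise ValueError("copy must point at an earlier entry")
            size = sum(sizes[ref] for ref in body)
        else:
            raise ValueError(f"unknown entry kind {kind!r}")
        if size > budget:
            raise BudgetExceeded(f"entry {position} expands to {size} bytes")
        sizes.append(size)
    return sizes[index]

# 30 small entries: each one copies the previous entry twice, doubling the size.
entries = [("data", b"0123456789")] + [("copy", [k, k]) for k in range(29)]

print(expanded_size(entries, 3, budget=1000))
try:
    expanded_size(entries, 29, budget=1 << 20)
except BudgetExceeded as error:
    print("rejected:", error)
```

The message on the wire stays tiny, so a limit on wire size alone does not help; the budget has to apply to the expanded form.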

> All of the formats (except perhaps Protobuf) supports arbitrary hierarchical
> navigation of the encoded data, without first converting it to objects.

Protobuf supports this, and in fact should be an unqualified "Yes" rather than
"Yes(*)" like the others. Protobuf encoding is very similar to ION's. Sub-
messages are length-delimited, which seems to be exactly the advantage you're
claiming that ION has.

Note that none of these formats support random access in the way that Cap'n
Proto does.

In summary, I believe Protobuf deserves a "yes" in: "Raw bytes", "Good at raw
bytes", "Schema / Class Id", "Arbitrary hierarchical navigation", and "Self
describing".

~~~
VStack
If that is really true, then we will of course update the comparison page.
However, we have put it together from what we were able to find in Google
Protocol Buffer's own docs + stack overflow + googling. It is entirely
possible that we made mistakes.

Sending schema over the wire is not a good solution for anything else than
point-to-point communication. An intermediate node would need every single
schema transmitted along with every single message, or have another way to
keep the schemas cached. That becomes complicated.

The Protobuf documentation says very clearly that you cannot see when one
message ends and another begins. Then a protobuf message is not fully self
describing. This might be easy to add, but it doesn't have it (according to
Protobuf's own docs).

We have looked at Cap'n Proto - but late in the process where we had already
looked at quite a lot of formats. From what I can see, Cap'n Proto is pretty
much just a binary struct. That is pretty close to what we wanted to do with
ION, except we wanted it to be compact on the wire too. We have seen that
Cap'n Proto has a compaction mechanism, but we have not yet had time to
analyze and compare it to ION's. Cap'n Proto with compaction would be very
similar to ION - on a conceptual level.

However, we need to make space for some IAP specific fields coming later in
the process (like cache references, column stores and more). Stuff that is IAP
specific. That is why we chose to roll with our own encoding in the first
place.

~~~
kentonv
> Sending schema over the wire is not a good solution for anything else than
> point-to-point communication.

OK, let's back up a moment. I am not entirely sure what "Schema / Class Id" in
your table means. Your table claims Protobuf doesn't support it, but the text
below is unclear on what you think Protobuf doesn't support.

You frequently use the term "self-describing", but this could have two
meanings:

1) Like JSON, where the names of all fields appear in the message, so that a
human can read the message easily without external information.

2) Limited self-description in which field values can be identified and
parsed, but their names are not available (perhaps replaced by numeric tags or
indexes).

Protobuf can support (1) by including the schema in the payload. I agree this
is not commonly useful.

Protobuf supports (2) natively, by virtue of being a TLV format (just like
ION).

Re-reading the page, it sounds like you are assuming the Protobuf format
cannot be deciphered at all without the schema, but this simply isn't true.

If you meant something else, please explain.

> The Protobuf documentation says very clearly that you cannot see when one
> message ends and another begins.

I wrote that documentation. It doesn't mean what you think (my fault,
perhaps). What it's saying is that the top-level message is a series of tag-
value pairs with no explicit indication of where that series ends (on the
assumption that you already know, e.g. based on EOF). Thus, if you concatenate
two whole messages without adding any delimiter then it will look like one big
message containing all the fields from both. However, each field within the
message _is_ clearly delimited, and sub-messages are length-delimited and
therefore skippable.
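
The concatenation behavior can be seen with two one-field messages. The Python sketch below also shows one common way to add the missing delimiter, a varint length prefix per message; the `frame` and `unframe` helpers are invented for this sketch and are not part of any Protobuf API.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes, pos: int):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def frame(messages) -> bytes:
    """Prefix every message with its length so the boundaries survive."""
    return b"".join(encode_varint(len(m)) + m for m in messages)

def unframe(data: bytes):
    """Split a framed stream back into the original messages."""
    messages, pos = [], 0
    while pos < len(data):
        length, pos = decode_varint(data, pos)
        messages.append(data[pos:pos + length])
        pos += length
    return messages

first = bytes.fromhex("0801")    # field 1 = 1
second = bytes.fromhex("0802")   # field 1 = 2
print((first + second).hex())    # indistinguishable from one message with two fields
print(unframe(frame([first, second])))
```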

> We have looked at Cap'n Proto - but late in the process where we had already
> looked at quite a lot of formats. From what I can see, Cap'n Proto is pretty
> much just a binary struct. That is pretty close to what we wanted to do with
> ION, except we wanted it to be compact on the wire too. We have seen that
> Cap'n Proto has a compaction mechanism, but we have not yet had time to
> analyze and compare it to ION's. Cap'n Proto with compaction would be very
> similar to ION - on a conceptual level.

ION is a TLV encoding like Protobuf. Cap'n Proto is fixed offsets + pointers.
These are vastly different styles of encoding that enable different modes of
use. You can certainly debate which is better but I don't think it's correct
to describe the formats as "pretty close", unless you consider all binary
formats to be "pretty close" to each other.

~~~
jjenkov
You are right, the term "self describing" as used in our docs could be more
clear. Being self describing means that you do not need a schema to make sense
of a stream of data of that format.

However, there is also a degree to which a data format can be self describing.
A CSV file is reasonably self describing because you can see where one field
ends and the next begins (at the comma / separator), and where one record ends
and the next begins (new line). With a header line of column names a CSV file
becomes more self describing, as you now also have a name indicating the
semantic meaning of fields in that column. If a CSV file could somehow contain
a specification of the data type of each column, it would be even more self
describing etc.

This is what we are trying to achieve with ION. If you need speed, you can
omit most of the meta data like property names etc. If you need messages to be
self describing, you can add a lot of meta data (like class / schema names +
version, property names etc.).

I apologize for having written incorrect documentation. If you wrote those
docs for Google Protocol Buffers, part of that is on you. They are not exactly
crystal clear ;-) (our docs aren't either - still working on them!)

Thank you for clearing up that Protobuf fields can be distinguished in a
stream of Protobuf fields, even without schema. That was unclear to me before
now. By the way, that is pretty clear in Cap'n Proto - your invention, right?
So - better docs already!

And - thank you for clearing up the difference in the encoding of Cap'n Proto.
Any link to where I can read about that encoding style in more details?

~~~
kentonv
> Any link to where I can read about that encoding style in more details?

Hmm, I'm not aware of any literature other than what's on the Cap'n Proto web
site. You can of course find the Cap'n Proto encoding documented here:

[https://capnproto.org/encoding.html](https://capnproto.org/encoding.html)

The format is, of course, a lot like how in-memory data structures are laid
out in C (fields of a struct have fixed offsets; variable-size fields are
behind pointers). Unlike native pointers, though, Cap'n Proto's pointers are
designed to be relocatable and easy to bounds-check, and they contain just
enough type information for the message to be minimally self-describing (so
that you can e.g. make a copy of a particular sub-object without knowing its
schema).
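
To illustrate the "fixed offsets + pointers" style in general terms, here is a minimal Python sketch with an invented layout: two integers at fixed offsets, followed by an (offset, length) pair that points at a variable-size name. This is not Cap'n Proto's actual encoding, which is documented at the link above.

```python
import struct

# Hypothetical layout: id, age, name_offset, name_length (little-endian uint32).
HEADER = struct.Struct("<IIII")

def build(record_id: int, age: int, name: str) -> bytes:
    """Pack the fixed part, then append the variable-size name behind a pointer."""
    name_bytes = name.encode("utf-8")
    return HEADER.pack(record_id, age, HEADER.size, len(name_bytes)) + name_bytes

def read_age(buf: bytes) -> int:
    """Fixed offset: jump straight to byte 4, no scanning of earlier fields."""
    return struct.unpack_from("<I", buf, 4)[0]

def read_name(buf: bytes) -> str:
    """Follow the pointer, bounds-checking it before touching the data."""
    offset, length = struct.unpack_from("<II", buf, 8)
    if offset + length > len(buf):
        raise ValueError("pointer out of bounds")
    return bytes(buf[offset:offset + length]).decode("utf-8")

buf = build(7, 42, "Ada")
print(read_age(buf), read_name(buf))
```

In a TLV format the reader would instead walk the fields in order, skipping each one by its length, until it reaches the field it wants.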

------
leeoniya
i have found that most of my huge json data is from uniform recordsets.
there's a great json-compatible encoder for such cases that stores them in a
format that's CSV-esque:

[https://github.com/WebReflection/JSONH](https://github.com/WebReflection/JSONH)

~~~
VStack
That might work for a web app, but not for mobile apps, or backend to backend
communication.

~~~
leeoniya
anything that used json before would benefit from JSONH. things that abused
json before, would be better off switching to a different format that is less
restrictive than json, like ION or msgpack [1]

[1] [http://msgpack.org/index.html](http://msgpack.org/index.html)

------
kevinSuttle
Is there a spec? Is this a Java API for it?

See also: [https://github.com/edn-format/edn](https://github.com/edn-
format/edn)

~~~
VStack
Yes please check:

Specification:
[http://tutorials.jenkov.com/iap/index.html](http://tutorials.jenkov.com/iap/index.html)

Benchmarks: [http://tutorials.jenkov.com/iap/ion-performance-
benchmarks.h...](http://tutorials.jenkov.com/iap/ion-performance-
benchmarks.html)

Tutorials: [http://tutorials.jenkov.com/iap-tools-
java/index.html](http://tutorials.jenkov.com/iap-tools-java/index.html)

Here is also a recent Infoq.com article [http://www.infoq.com/articles/IAP-
Fast-HTTP-Alternative?utm_...](http://www.infoq.com/articles/IAP-Fast-HTTP-
Alternative?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=global).

~~~
profeta
the "message structure" page is 404ing. all the rest hints to a designed-by-
committee buzzword thingy.

~~~
jjenkov
Here is the Message Structure doc [http://tutorials.jenkov.com/iap/iap-
message-structure.html](http://tutorials.jenkov.com/iap/iap-message-
structure.html) . We are not as far with the core IAP protocol as we are with
ION, but we will get there.

Designed-by-committee - really? You pull that?

------
dmitrygr
HTTP ERROR 404

Problem accessing /iap/message-structure.html. Reason:

    
    
        Page not found

~~~
mike_hock
Pfff 404, that's so HTTP. Why aren't you using IAP?

~~~
jjenkov
That is coming... but browsers aren't supporting IAP (yet)!

------
blakecallens
[https://media.giphy.com/media/1M9fmo1WAFVK0/giphy.gif](https://media.giphy.com/media/1M9fmo1WAFVK0/giphy.gif)

------
shockzzz
Is there an ION vs. Thrift comparison?

~~~
VStack
Not yet. We have been asked to compare ION to Flatbuffers, Cap'n Proto,
Thrift, Avro, Transit, BSON and several other encodings. However, writing the
benchmarks and going through the features systematically is a lot of work, so
we have not yet had the time to go through them all.

------
Matthias247
I think the most interesting difference to the usual serialization formats
might be the copy and reference types. I'm a little bit undecided whether they
might be a brilliant idea or not. The decision whether to support copy or not
puts some extra effort on the serializer and deserializer, but the end
result is the same as what you would have without a copy field mechanism. The
support of cyclic references makes a big change, because you can't directly
model them with technologies. You might also have trouble using these data
structures in some programming languages or libraries (e.g. if you are only
using immutable types or want to use only value types). However for some kind
of data it seems to make sense to support cyclic data, as GraphQL and Falcor
have also added support for that.

I also don't see that many use cases for the table structure. I have deployed
thousands of RPC APIs into production, and I can't recall having the need for
it. And even if you need it, using an object with 2 arrays in it would be just
fine.

I also looked through the IAP documentation (btw. bad name => ipod accessory
protocol) because it's quite related to what I'm working on. I think that the
shown basic communication patterns are correct, but from the documentation I
can't really get a feeling what I could expect from an IAP library. Would it
be some low level messaging system (like MQTT, ZeroMQ, etc.) or would higher
level communication patterns (request/response, notifications) also be built
in. There are no predefined message formats for RPC listed in the
documentation which would outline that. The WAMP specification
([http://wamp.ws](http://wamp.ws)) e.g. makes it clearer what I could expect
from such a protocol. I'm not sure whether we need a new low level messaging
protocol or if the work should be more focused on adding higher level
semantics on top of it.

E.g. I think some pattern that I really need in my domain is remote object
synchronization, which means the status of an object on the server gets
automatically pushed towards all interested clients and is continuously updated
during changes (=> e.g. to build something like Firebase). Of course one can
build something like that on top of basic messages by defining subscribe and
update messages in the API, but I'm wondering if it's worthwhile to add
something like that directly in the protocol. On the one hand this is also a
special case of the subscription pattern which is also listed here, on the
other hand it can not directly be implemented with the subscription
possibilities of many message broker systems, because they won't send you the
current state of an object after subscription but will only forward you a
message after the value changes for the next time.
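
A minimal in-process Python sketch of that pattern: on subscription the client first receives a snapshot of the current state, and afterwards only the changes. The class and message names are hypothetical and are not taken from IAP, WAMP, or any broker.

```python
class SyncedObject:
    """Server-side object whose state is pushed to every subscriber."""

    def __init__(self, state):
        self._state = dict(state)
        self._subscribers = []

    def subscribe(self, callback):
        """Register a subscriber and immediately send it the current state."""
        self._subscribers.append(callback)
        callback(("snapshot", dict(self._state)))

    def update(self, **changes):
        """Apply changes and forward only the delta to all subscribers."""
        self._state.update(changes)
        for callback in self._subscribers:
            callback(("update", dict(changes)))

received = []
thermostat = SyncedObject({"temp": 20})
thermostat.subscribe(received.append)
thermostat.update(temp=21)
print(received)
```

The snapshot sent from `subscribe` is the part that a plain publish/subscribe broker usually leaves out.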

The connection and sequence definition in IAP looks a little bit redundant to
me at first glance. I really think there is a need for message ordering and
you must support it. The question for me is then if you don't need message
ordering, why not put the message into a separate channel and let
channels/streams always be ordered (like in HTTP/2)? Overhead for channel
creation? Or to setup channels during creation either as ordered or unordered
and keep that for the lifetime of the channel?

~~~
VStack
Matthias, you also don't need a binary protocol. XML would work. JSON too. But
binary is faster and more compact. Same with the table construct. You can work
around not having it, but now ION has it built in. You don't make the mistake
of serializing an array of JSON objects because you are busy. The objects are
serialized as an ION table - not a list of ION object fields. If you ever send
an array of objects across the wire with ION (IAP Tools), you will be using
the table mode automatically. You save bandwidth and parsing time
automatically. Who doesn't want that - even if you don't need it?

Regarding Copy and Reference, the support for them is still not very good (=
not automatic). But imagine your service executes an SQL JOIN query, and in
that result a lot of objects are repeated (e.g. same zip + city for a lot of
objects). The Copy field can be used to include the zip + city fields just
once, and after that refer to them later with a Copy field. That is shorter
than including them again. These two fields still need some work to have full
support, but we are working on it.
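
The JOIN example can be sketched in Python as follows. This only shows the general idea of replacing repeated values with a reference to their first occurrence; the `("value", ...)` and `("copy", index)` tuples are invented for illustration and are not ION's Copy field encoding.

```python
def encode_with_copies(values):
    """Replace repeated values with a reference to their first position."""
    first_seen = {}
    encoded = []
    for position, value in enumerate(values):
        if value in first_seen:
            encoded.append(("copy", first_seen[value]))
        else:
            first_seen[value] = position
            encoded.append(("value", value))
    return encoded

def decode_with_copies(encoded):
    """Resolve every copy by looking up the already decoded position."""
    decoded = []
    for kind, body in encoded:
        decoded.append(decoded[body] if kind == "copy" else body)
    return decoded

# Hypothetical (zip, city) pairs as they might repeat in a JOIN result.
places = [("8000", "Aarhus"), ("1000", "Copenhagen"), ("8000", "Aarhus")]
encoded = encode_with_copies(places)
print(encoded)
```

Because a copy always points backwards, the decoder can resolve it in a single pass.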

Right now ION is the most well-defined part of IAP. The network protocol
itself is still not 100% finalized. But, now that we are close to being done
with ION (we still have extra fields to add as extended types), we can move
forward with the IAP core protocols and semantic protocols. If we do not
define a standard semantic protocol for remote object synchronization, IAP
will be designed so that you can plug in your own semantic protocol to meet
that need.

------
k__
I consider no need for a schema a plus of MessagePack. But the Nos are all
red, haha.

