On the other hand, they're not using this for the boto schemas, which seems like a natural place to show it can capture real-world schemas. That makes it hard for me to believe this has much traction.
The SDKs use Smithy[1], which is tailored for defining and generating services and SDKs; Ion is more of a pure data serialization format. It's definitely niche, but my org uses it in a few places and it has some nice properties that fit our case well: a rapidly evolving schema, most clients caring about only a small subset of attributes, and the ability to apply multiple different schemas based on region or business.
It's the sort of thing where I'd advise exploring other options first and only using it if the whys[2] really resonate with you because it definitely comes with some overhead.
I would be interested to see how this compares to something like msgpack [1] in performance and final size of the binary. Msgpack has been my go-to for binary serialization for years due to how simple and fast it is, and how easy it is to make it work with native Clojure data structures.
That comparison would depend heavily on what you're storing.
Ion has the option of using symbol tables to replace strings (e.g. in struct/map keys or in values). So if your benchmark had a large number of records with similar structures, I'd expect Ion to pull ahead. On the other hand, if each record had nothing in common, I'd expect them to perform similarly.
One feature of the Ion libraries that I've liked is that the parser will take any of the formats and figure out what to do with it (text, binary, compressed binary). It's one less thing to worry about: you can switch encodings later without breaking consumers, you can write plain-text Ion when you're testing, etc.
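To make both points concrete, here's a rough sketch using the Python msgpack and amazon.ion packages (package names, APIs, and defaults are my assumptions, not anything official):

    import msgpack
    from amazon.ion import simpleion

    # Many records sharing the same struct keys: binary Ion should encode
    # each repeated field name only once via its local symbol table.
    records = [{'sku': 'A%05d' % i, 'title': 'Widget', 'in_stock': True}
               for i in range(1000)]

    ion_binary = simpleion.dumps(records, binary=True)
    ion_text = simpleion.dumps(records, binary=False)
    print(len(ion_binary), len(msgpack.packb(records)))

    # Same entry point for both encodings: the reader detects binary Ion
    # by its leading version marker.
    assert simpleion.loads(ion_binary)[0]['title'] == 'Widget'
    assert simpleion.loads(ion_text)[0]['title'] == 'Widget'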
Symbol tables, compression, etc. seem one level of abstraction above what msgpack provides. Such features could be implemented on top of vanilla msgpack as long as all parties agree on the msgpack schema.
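A minimal sketch of that idea (all names hypothetical; both sides have to agree on the key table out of band):

    import msgpack

    # Pre-shared 'symbol table': producer and consumer must use the same one.
    KEYS = ['sku', 'title', 'in_stock']
    KEY_ID = {k: i for i, k in enumerate(KEYS)}

    def pack(record):
        # Replace string keys with small integers before packing.
        return msgpack.packb({KEY_ID[k]: v for k, v in record.items()})

    def unpack(blob):
        # strict_map_key=False lets msgpack accept the integer keys back.
        fields = msgpack.unpackb(blob, strict_map_key=False)
        return {KEYS[i]: v for i, v in fields.items()}

    assert unpack(pack({'sku': 'A1', 'in_stock': True})) == {'sku': 'A1', 'in_stock': True}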
Saw this the other day, but the multiple types of null kind of turned me off - e.g. `null.int`, `null.float`, `null.null`. Is there a good justification for this? Seems like a kluge in any case.
Seems like the justification would be keeping the type information when going back and forth through Ion. It's more "multiple nullable types" than "multiple types of null":
    userBirthDay: null            <-- ok, but what type is it? String? Int? Timestamp?
    userBirthDay: null.timestamp  <-- ok, it's a timestamp-typed variable, but we don't know the value. Yay, happy programmer.
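The libraries preserve that type, too. E.g. with the Python amazon.ion package (module and attribute names from memory, so treat this as a sketch):

    from amazon.ion import simpleion

    v = simpleion.loads('null.timestamp')
    print(v.ion_type)  # IonType.TIMESTAMP: the value is null, but the type survives
    # ...and it survives a round trip through the binary encoding too.
    assert simpleion.loads(simpleion.dumps(v)).ion_type == v.ion_type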
Am I reading this right that it's a binary format for real-time streaming data, similar to Avro, but can include arbitrarily deep nested structures unlike Avro?
Ion has a binary format but is not specifically about real-time streaming. It is a JSON replacement.
Ion originated 10+ years ago from the Amazon catalog team - the team that kept data about the hundreds of millions of items available on Amazon. Nearly every team in the company called the catalog to get information about items all the time - scanning the entire catalog, parts of the catalog, millions of individual item lookups every second, etc.
They did the math, and some very large percentage of the network traffic in Amazon Retail's data centers was catalog data. If that data, then in XML or JSON, were sent in a more compact format it would save some ridiculous number of millions of dollars every year. So Ion was born, and eventually open-sourced.
Why do you single out Avro and not any of the hundreds of other ser/de systems? Is it what you know best? Is there something specific about Avro that makes it feel particularly similar? https://github.com/maximveksler/awesome-serialization
People keep inventing new ones because the old ones suck, or because they think the old ones suck. Look at all the discontent around JSON (no comments!), and how violently people react when someone tries to layer a little extra on top, like JSON-LD. Then there are all the things like YAML and TOML that try to be a little better but are widely thought to be a little worse. And that's just the human-readable data formats; the binary ones are a whole other zoo.
I work in a shop where Kafka and Avro are everywhere. If I worked somewhere else I might make reference to something else that was front-of-mind all the time.
lol, the internal docs on this at Amazon were something very close to "we invented this before Avro and we think that's probably a better choice if you need binary serialization."
I have never used Ion so I can't speak to its use in practice, but I haven't had much of an issue with msgpack. It's faster than JSON and more compact than JSON, without being any more difficult than any JSON library I've used. It's an almost-universal good for me; the only thing you lose is the ability to easily introspect the messages when there's an issue.
Honestly, if you're in a case where you absolutely know none of these work for you and you can prove you need another, you're probably just going to write your own. And that's a vanishingly rare case.
It's a superset of JSON with an isomorphic binary encoding, additional data types (blobs, s-exps, timestamps, symbols, etc.), better number handling, annotations, and the ability to pre-share symbol metadata for more efficient binary encoding (similar to how protobuf encodes fields, but optional).
You can write Ion by hand (like JSON) and share it without a schema (unlike protobuf). There are fewer ways to express values than in YAML, but more data types.
Having S-exps is convenient for writing DSLs in a data language that’s easily readable from other languages.
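For a feel of the text form, here's a hand-written document exercising most of that (all values made up):

    // 'product::' is a type annotation
    product::{
      id: 42,
      price: 19.99,                    // a decimal by default, not a binary float
      created: 2023-11-01T12:00:00Z,   // native timestamp
      tags: [electronics, sale],       // symbols
      rule: (if (> qty 10) discount),  // s-expression, handy for DSLs
      thumbnail: {{aGVsbG8=}}          // blob, base64 in the text form
    }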
Pros of Ion vs CBOR:
Wider range of data types - Ion supports decimals, symbols, blobs, and clobs as native types; CBOR only approximates some of these with tags or byte strings.
Optional schemas and annotations - Ion allows attaching type/schema information to data for validation purposes. CBOR has no schema support.
Text format - Ion provides a human-readable text format for data interchange, CBOR is binary only.
Maturity - Ion has been used in production at Amazon since 2009; CBOR is a newer standard (RFC 7049, 2013).
Language support - More mature library ecosystem around Ion vs CBOR which is still gaining adoption.
Pros of CBOR vs Ion:
Standardized - CBOR is an IETF standard; Ion is an Amazon-originated format that is open source but not a formal standard.
Simplicity - CBOR has a smaller set of basic data types making it simpler to implement.
Used in other standards - CBOR is used in data formats like COSE for crypto operations and CWT for web tokens.
Efficiency - The CBOR binary format can have a smaller encoding size than Ion's (rough size check below).
JSON interoperability - CBOR is designed to map cleanly onto the JSON data model. Ion is a superset of JSON, so Ion data doesn't always map back to JSON.
In summary, Ion has richer data typing and schema capabilities and a long production history. But CBOR is simpler, standardized, and gaining momentum - especially in crypto and web standards using it as a binary encoding basis.
So Ion may be better for applications dealing with complex, annotated data. But CBOR has advantages for an efficient binary interchange format, particularly when standards compatibility is important.
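On the efficiency point, it's easy to check for a given payload, e.g. with the cbor2 and amazon.ion Python packages (APIs assumed; note that tiny single documents flatter CBOR, since binary Ion carries a version marker and a local symbol table):

    import cbor2
    from amazon.ion import simpleion

    doc = {'id': 42, 'name': 'widget', 'tags': ['a', 'b']}
    print(len(cbor2.dumps(doc)))                   # CBOR: no per-document table overhead
    print(len(simpleion.dumps(doc, binary=True)))  # binary Ion: marker + symbol table included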
This is just the data serialization format, you have to build any other functionality yourself. We do have a pattern on a few of our APIs where there's a big fixed schema (i.e. it's just a struct and you can't do GraphQL things like following references and hydrating them into objects) and clients select the subset of attributes they want and we only return that. It's useful for reducing response sizes but the main benefit is we can pretty easily track which attributes are actually used over time. That helps us deprecate attributes with a lot less pain.
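A stripped-down sketch of that pattern (hypothetical names, not our actual service code):

    from collections import Counter

    ATTRIBUTE_USAGE = Counter()  # which attributes clients actually request

    def select_attributes(record, requested):
        # Return only the requested subset of the fixed schema, and count
        # usage so rarely-requested attributes can be deprecated later.
        ATTRIBUTE_USAGE.update(requested)
        return {k: record[k] for k in requested if k in record}

    item = {'sku': 'A1', 'title': 'Widget', 'weight_kg': 1.2}
    print(select_attributes(item, ['sku', 'title']))  # {'sku': 'A1', 'title': 'Widget'}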
https://news.ycombinator.com/item?id=29284428 (2 years ago, 229 comments)
https://news.ycombinator.com/item?id=23921610 (3 years ago, 110 comments)
https://news.ycombinator.com/item?id=11546098 (7 years ago, 163 comments)