On the other hand, they're not using this for the boto schemas, which seems like a natural place to show it can capture real-world schemas. That makes it hard for me to believe this has much traction.
The SDKs use Smithy[1], which is tailored for defining and generating services and SDKs; Ion is more of a pure data serialization format. It's definitely niche, but my org uses it in a few places and it has some nice properties that fit our case well: a rapidly evolving schema, most clients caring about only a small subset of attributes, and the ability to apply multiple different schemas based on region or business.
It's the sort of thing where I'd advise exploring other options first and only using it if the whys[2] really resonate with you because it definitely comes with some overhead.
I would be interested to see how this compares to something like msgpack [1] in performance and final size of the binary. Msgpack has been my go-to for binary serialization for years due to how simple and fast it is, and how easy it is to make it work with native Clojure data structures.
That comparison would depend heavily on what you're storing.
Ion has the option of using symbol tables to replace strings (e.g. in struct/map keys or in values). So if your benchmark had a large number of records with similar structures, I'd expect Ion to pull ahead. On the other hand, if each record had nothing in common, I'd expect them to perform similarly.
One feature of the Ion libraries that I've liked is that the parser will take any of the formats and figure out what to do with it (text, binary, compressed binary). It's one less thing to worry about: you can switch encodings later without breaking consumers, you can write plain-text Ion when you're testing, etc.
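To make both points concrete, here's a rough sketch using the Python msgpack and amazon.ion packages (package names, APIs, and defaults are my assumptions, not anything official):

    import msgpack
    from amazon.ion import simpleion

    # Many records sharing the same struct keys: binary Ion should encode
    # each repeated field name only once via its local symbol table.
    records = [{'sku': 'A%05d' % i, 'title': 'Widget', 'in_stock': True}
               for i in range(1000)]

    ion_binary = simpleion.dumps(records, binary=True)
    ion_text = simpleion.dumps(records, binary=False)
    print(len(ion_binary), len(msgpack.packb(records)))

    # Same entry point for both encodings: the reader detects binary Ion
    # by its leading version marker.
    assert simpleion.loads(ion_binary)[0]['title'] == 'Widget'
    assert simpleion.loads(ion_text)[0]['title'] == 'Widget'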
Symbol tables, compression, etc. seem one level of abstraction above what msgpack provides. Such features could be implemented on top of vanilla msgpack as long as all parties agree on the msgpack schema.
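A minimal sketch of that idea (all names hypothetical; both sides have to agree on the key table out of band):

    import msgpack

    # Pre-shared 'symbol table': producer and consumer must use the same one.
    KEYS = ['sku', 'title', 'in_stock']
    KEY_ID = {k: i for i, k in enumerate(KEYS)}

    def pack(record):
        # Replace string keys with small integers before packing.
        return msgpack.packb({KEY_ID[k]: v for k, v in record.items()})

    def unpack(blob):
        # strict_map_key=False lets msgpack accept the integer keys back.
        fields = msgpack.unpackb(blob, strict_map_key=False)
        return {KEYS[i]: v for i, v in fields.items()}

    assert unpack(pack({'sku': 'A1', 'in_stock': True})) == {'sku': 'A1', 'in_stock': True}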
Saw this the other day, but the multiple types of null kind of turned me off - e.g. `null.int`, `null.float`, `null.null`. Is there a good justification for this? Seems like a kluge in any case.
Seems like the justification would be keeping the type information when going back and forth through Ion. It's more "multiple nullable types" than "multiple types of null":
    userBirthDay: null            <-- ok, but what type is it? String? Int? Timestamp?
    userBirthDay: null.timestamp  <-- ok, it's a timestamp-typed variable, but we don't know the value. Yay, happy programmer.
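The libraries preserve that type, too. E.g. with the Python amazon.ion package (module and attribute names from memory, so treat this as a sketch):

    from amazon.ion import simpleion

    v = simpleion.loads('null.timestamp')
    print(v.ion_type)  # IonType.TIMESTAMP: the value is null, but the type survives
    # ...and it survives a round trip through the binary encoding too.
    assert simpleion.loads(simpleion.dumps(v)).ion_type == v.ion_type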
Am I reading this right that it's a binary format for real-time streaming data, similar to Avro, but can include arbitrarily deep nested structures unlike Avro?
Ion has a binary format but is not specifically about real-time streaming. It is a JSON replacement.
Ion originated 10+ years ago from the Amazon catalog team - the team that kept data about the hundreds of millions of items available on Amazon. Nearly every team in the company called the catalog to get information about items all the time - scanning the entire catalog, parts of the catalog, millions of individual item lookups every second, etc.
They did the math, and some very large percentage of the network traffic in Amazon Retail's data centers was catalog data. If that data, then in XML or JSON, were sent in a more compact format it would save some ridiculous number of millions of dollars every year. So Ion was born, and eventually open-sourced.
Why do you single out Avro and not any of the hundreds of other ser/de systems? Is it what you know best? Is there something specific about Avro that makes it feel particularly similar? https://github.com/maximveksler/awesome-serialization
People keep inventing new ones because the old ones suck, or because they think the old ones suck. Look at all the discontent around JSON (no comments!), and how violently people react when someone tries to layer a little extra on top, like JSON-LD. Then there are all the things like YAML and TOML that try to be a little better but are widely thought to be a little worse. And that's just the human-readable data formats; the binary ones are a whole other zoo.
I work in a shop where Kafka and Avro are everywhere. If I worked somewhere else I might make reference to something else that was front-of-mind all the time.
lol, the internal docs on this at Amazon were something very close to "we invented this before Avro and we think that's probably a better choice if you need binary serialization."
I have never used Ion so I can't speak to its use in practice, but I haven't had much of an issue with msgpack. It's faster than JSON and more compact than JSON, without being any more difficult than any JSON library I've used. It's an almost-universal good for me; the only thing you lose is the ability to easily introspect the messages when there's an issue.
Honestly, if you're in a case where you absolutely know none of these work for you and you can prove you need another, you're probably just going to write your own. And that's a vanishingly rare case.
It's a superset of JSON with an isomorphic binary encoding, additional data types (blobs, s-exps, timestamps, symbols, etc.), better number handling, annotations, and the ability to pre-share symbol metadata for more efficient binary encoding (similar to how protobuf encodes fields, but optional).
You can write Ion by hand (like JSON) and share it without a schema (unlike protobuf). There are fewer ways to express values than in YAML, but more data types.
Having S-exps is convenient for writing DSLs in a data language that’s easily readable from other languages.
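For a feel of the text form, here's a hand-written document exercising most of that (all values made up):

    // 'product::' is a type annotation
    product::{
      id: 42,
      price: 19.99,                    // a decimal by default, not a binary float
      created: 2023-11-01T12:00:00Z,   // native timestamp
      tags: [electronics, sale],       // symbols
      rule: (if (> qty 10) discount),  // s-expression, handy for DSLs
      thumbnail: {{aGVsbG8=}}          // blob, base64 in the text form
    }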
Pros of Ion vs CBOR:
Wider range of data types - Ion supports decimals, symbols, blobs, and clobs as native types; CBOR only approximates some of these with tags or byte strings.
Optional schemas and annotations - Ion allows attaching type/schema information to data for validation purposes. CBOR has no schema support.
Text format - Ion provides a human-readable text format for data interchange, CBOR is binary only.
Maturity - Ion has been used in production at Amazon since 2009; CBOR is a newer standard (RFC 7049, 2013).
Language support - More mature library ecosystem around Ion vs CBOR which is still gaining adoption.
Pros of CBOR vs Ion:
Standardized - CBOR is an IETF standard; Ion is an Amazon-originated format that is open source but not a formal standard.
Simplicity - CBOR has a smaller set of basic data types making it simpler to implement.
Used in other standards - CBOR is used in data formats like COSE for crypto operations and CWT for web tokens.
Efficiency - The CBOR binary format can have a smaller encoding size than Ion's (rough size check below).
JSON interoperability - CBOR is designed to map cleanly onto the JSON data model. Ion is a superset of JSON, so Ion data doesn't always map back to JSON.
In summary, Ion has richer data typing and schema capabilities and a long production history. But CBOR is simpler, standardized, and gaining momentum - especially in crypto and web standards using it as a binary encoding basis.
So Ion may be better for applications dealing with complex, annotated data. But CBOR has advantages for an efficient binary interchange format, particularly when standards compatibility is important.
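On the efficiency point, it's easy to check for a given payload, e.g. with the cbor2 and amazon.ion Python packages (APIs assumed; note that tiny single documents flatter CBOR, since binary Ion carries a version marker and a local symbol table):

    import cbor2
    from amazon.ion import simpleion

    doc = {'id': 42, 'name': 'widget', 'tags': ['a', 'b']}
    print(len(cbor2.dumps(doc)))                   # CBOR: no per-document table overhead
    print(len(simpleion.dumps(doc, binary=True)))  # binary Ion: marker + symbol table included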
This is just the data serialization format, you have to build any other functionality yourself. We do have a pattern on a few of our APIs where there's a big fixed schema (i.e. it's just a struct and you can't do GraphQL things like following references and hydrating them into objects) and clients select the subset of attributes they want and we only return that. It's useful for reducing response sizes but the main benefit is we can pretty easily track which attributes are actually used over time. That helps us deprecate attributes with a lot less pain.
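A stripped-down sketch of that pattern (hypothetical names, not our actual service code):

    from collections import Counter

    ATTRIBUTE_USAGE = Counter()  # which attributes clients actually request

    def select_attributes(record, requested):
        # Return only the requested subset of the fixed schema, and count
        # usage so rarely-requested attributes can be deprecated later.
        ATTRIBUTE_USAGE.update(requested)
        return {k: record[k] for k in requested if k in record}

    item = {'sku': 'A1', 'title': 'Widget', 'weight_kg': 1.2}
    print(select_attributes(item, ['sku', 'title']))  # {'sku': 'A1', 'title': 'Widget'}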
https://news.ycombinator.com/item?id=29284428 (2 years ago, 229 comments)
https://news.ycombinator.com/item?id=23921610 (3 years ago, 110 comments)
https://news.ycombinator.com/item?id=11546098 (7 years ago, 163 comments)