
Amazon open-sources Ion – a binary and text interchangeable, typed JSON-superset - machinagod
https://github.com/amznlabs/ion-java
======
haberman
To think about the difference between serialization formats, here's an analogy
I hope will help.

Protocol Buffers (and I think Thrift, and maybe Avro) are sort of like C or
C++: you declare your types ahead of time, and then you take some binary
payload and "cast" it (parse it actually) into your predefined type. If those
bytes weren't actually serialized as that type, you'll get garbage. On the
plus side, the fact that you declared your types statically means that you get
lots of useful compile-time checking and everything is really efficient. It's
also nice because you can use the schema file (i.e., .proto files) to declare
your schema formally and document everything.

JSON and Ion are more like a Python/Javascript object/dict. Objects are just
attribute-value bags. If you say it has field fooBar at runtime, now it does!
When you parse, you don't have to know what message type you are expecting,
because the key names are all encoded on the wire. On the downside, if you
misspell a key name, nothing is going to warn you about it. And things aren't
quite as efficient because the general representation has to be a hash map
where every value is dynamically typed. On the plus side, you never have to
worry about losing your schema file.
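
To make the "attribute-value bag" point concrete, here's a minimal Python sketch (field names are illustrative): a misspelled key silently creates a new entry rather than raising an error.

```python
# Dynamic "attribute-value bag" behavior: nothing checks key names.
record = {"fooBar": 42}

# A misspelled write silently creates a brand-new key instead of failing.
record["fooBaar"] = 43

# A misspelled read gives back a default (or a runtime KeyError),
# with no warning before the program runs.
value = record.get("fooBarr")
```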

I think this is a case where "strongly typed" isn't the clearest way to think
about it. It's "statically typed" vs. "dynamically typed" that is the useful
distinction.

~~~
jcrites
That's a great analogy! However, I do think strongly typed vs. weakly typed
has a role in thinking about this, just a different dimension than the one
you're describing. Let's say we come across a JSON structure that looks like
this:

    {"start": "2007-03-01"}

Is that a timestamp? Maybe! Does it support a time within the day? Perhaps I
can write "2007-03-01T13:00:00" in ISO 8601 format if we're lucky. Can I
supply a time zone? Who knows for sure? It's weakly typed data. The actual
specification of that type of that field lives in a layer on top of JSON, if
it's even specified at all. It might be "specified" only in terms of what the
applications that handle it can parse and generate. I could drop that value
into Excel and treat it as all sorts of different things.
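
To make that concrete, here's a small Python sketch: the JSON layer hands back a plain string, and any richer interpretation of it lives entirely in application code.

```python
import json
from datetime import date

doc = json.loads('{"start": "2007-03-01"}')

# JSON itself only says this is a string; whether it is a date,
# a datetime, or an opaque label is the application's guess.
raw = doc["start"]

# One possible interpretation, chosen by the application, not by the format:
parsed = date.fromisoformat(raw)
```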

Ion by comparison has a specific data type for timestamps defined in the spec
[1]. The timestamp has a canonical representation in both text and binary
form. For this reason, I know that "2007-02-23T20:14:33.Z" and
"2007-02-23T12:14:33.079-08:00" are valid Ion timestamp text values. In this
instance I would describe Ion as strongly typed and JSON as weakly typed. Or,
as the Ion documentation puts it, "richly typed".

To make an analogy, weakly typed is the Excel cell that can store whatever
value you put in it, or the PHP integer 1 which is considered equal to "1"
(loose equality). Strongly typed is the relational database row with a column
described precisely by the table schema. Weakly typed is the CSV file;
strongly typed is the Ion document.
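
For contrast, here's the PHP loose-equality example next to a strongly typed runtime's behavior, sketched in Python:

```python
# In PHP, (1 == "1") evaluates to true: the string is coerced to an integer.
# Python, like most strongly typed languages, keeps the types apart:
loose_equal = (1 == "1")

# Crossing the type boundary requires an explicit, deliberate conversion:
converted_equal = (int("1") == 1)
```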

[1] [http://amznlabs.github.io/ion-docs/spec.html](http://amznlabs.github.io/ion-docs/spec.html)

~~~
haberman
Ion has more data types than JSON, it's true. Ion has a timestamp type and
JSON does not, so you could say it's "richer" if you want, but that just means
"it has more types."

However I don't think it's accurate to say that the typing of Ion is any
"stronger." Both Ion and JSON are fully dynamically typed, which means that
types are attached to every value on the wire. It's just that without an
actual timestamp type in JSON, you have to encode timestamp data into a more
generic type.

~~~
jcrites
The notions of "strong" and "weak" typing have never been particularly well-
defined, but I think my usage is in line with their usual meaning:
[https://en.wikipedia.org/wiki/Strong_and_weak_typing](https://en.wikipedia.org/wiki/Strong_and_weak_typing)

> Some programming languages make it easy to use a value of one type as if it
> were a value of another type. This is sometimes described as "weak typing".

Strong typing makes it difficult to use a value of one type as if it were
another. In PHP, you can compare the integer value 1 to the string value "1"
and the equality test returns boolean true. Conflating integer 1 and string
"1" is weak typing. A data format that expresses the concept of the timestamp
1999-12-31T23:14:33.079-08:00 using the same fundamental type as the string
"Party like it's 1999!" is what I would call weakly typed.

Ion does not make it easy to use a string as if it were a timestamp or vice
versa. It has types like arbitrary precision decimals, or binary blobs, that
can't easily be represented in a strongly-typed way in JSON. You can certainly
invent a representation, like specifying strings as ISO 8601 for timestamps,
or an array of numbers for binary -- actually, wait, how about a
base64-encoded string instead? Where there's choice there's ambiguity. These
concepts of "type" live in the application layer in JSON, instead of in the
data layer like they do in Ion.
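
The "where there's choice there's ambiguity" point can be sketched in Python: the same four bytes have several reasonable JSON spellings, and nothing in the document tells a reader which one to expect.

```python
import base64
import json

payload = b"\xde\xad\xbe\xef"

# Choice 1: an array of numbers.
as_numbers = json.dumps(list(payload))

# Choice 2: a base64-encoded string.
as_base64 = json.dumps(base64.b64encode(payload).decode("ascii"))

# Both are valid JSON for the same logical value; the convention
# lives in the application layer, not in the data format.
```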

Note as well that stronger is my term. The Ion documentation says "richly-
typed". Certainly Ion does not include every type in the world. Perhaps a
future serialization framework might capture "length" with a unit of "meters",
or provide a currency type with unit "dollars", and if that existed I'd call
it stronger-(ly?)-typed or more richly typed than Ion. In that case, the data
layer would prevent you from accidentally converting "3 inches" to "3
centimeters", since those would be different types. That would be
stronger typing than an example where you simply have the integer 3, and it's
the application's job to track which integers represent inches, and which
represent centimeters. So perhaps "strong" and "weak" are not the best terms,
so much as "stronger" and "weaker".

~~~
haberman
By your definition, any language with strings is weakly typed, since you can
always interpret a string as being something else. Strongly/weakly typed has
never been a particularly useful description (as the page you linked notes),
and I think it's particularly unhelpful here.

~~~
jcrites
> By your definition, any language with strings is weakly typed, since you can
> always interpret a string as being something else

No, I wouldn't say that's the case. For example, in PHP you can literally
write:

    if (1 == "1") { ...

... and the condition evaluates to true. You can do similar things in Excel;
Excel doesn't even really differentiate between those two values in the first
place. (At least that's how it seems as a casual user.)

This is not the case in strongly typed programming languages that have strings
such as C++ or Java. You can _convert_ from one type to another, sure, by
_explicitly_ invoking a function like atoi() or Integer.toString(), but the
conversion is deliberate and so it is strongly typed. A variable containing a
string (java.lang.String) cannot be compared against one containing a
timestamp (java.util.Date) by accident. An Ion timestamp is a timestamp and
can't be conflated with a string, although it can be _converted_ to one.

Edit: The set of types that are built in, in conjunction with how those types
are expressed in programming languages (e.g. timestamp as java.util.Date,
decimal as java.math.BigDecimal, blob as byte[]), is why I'd call Ion strongly
typed or richly typed in comparison to JSON. Specifically, scalar values that
frequently appear in common programs can be expressed with distinctly typed
scalar values in Ion. I don't know if there's a good formal definition. You
could probably define a preorder on programming languages or data formats
based simply on the number of distinct scalar or composite types (so in that
sense, yes, it's the fact that Ion has more). However it goes beyond that
subjectively. Subjectively it's about how often you have to, in practice,
convert from one type to another in common tasks. There is no clear way to
represent an arbitrary-precision decimal in JSON, or a byte array, or a
timestamp -- so you must "compress" those types down into a single JSON type
like string-of-some-format or array-of-number; and _several_ different scalar
types must _all_ map to that same JSON type, which creates the risk of
conflating values of different _logical_ types but the same _physical_ JSON
type with each other. There's no obvious or built-in way to reconstruct the
original type with fidelity. There's no self-describing path back from
"1999-12-31T23:14:33.079-08:00" and "DEADBEEFBASE64" back to those original
types.
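
A small Python sketch of that "compression" of several logical types into one physical type: a timestamp and an ordinary string both arrive as JSON strings, and the reader can no longer tell them apart.

```python
import json
from datetime import datetime, timezone

stamp = datetime(1999, 12, 31, 23, 14, 33, tzinfo=timezone.utc)

# Both logical types must be flattened to the same physical JSON type:
doc = json.dumps({
    "when": stamp.isoformat(),        # timestamp -> string
    "note": "Party like it's 1999!",  # string    -> string
})

# After the round trip, both fields are just str; the original logical
# types are gone unless the application re-imposes them.
decoded = json.loads(doc)
```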

I subjectively call JSON weakly typed because its types are not adequate to
uniquely store the common scalar data types that I work with in programs that
I write. I call Ion strongly typed because it typically can. I acknowledged
earlier that a data format would be _even more strongly typed_ if it was
capable of representing not just the type "integer", but "integer length
meters". Ion does not have this kind of type built in, though its annotations
feature could be used to describe that a particular integer value represents a
length in meters.

~~~
haberman
> You can't misuse any kind of Ion value that is a string as if it were a
> timestamp without performing an explicit conversion.

The same is true of JSON. There is no difference, except that Ion has a
timestamp type and JSON does not.

If you disagree, please identify what characteristic of Ion's design makes it
more strongly typed than JSON, other than the set of types that is built in.

~~~
dietrichepp
You are choosing a definition of strong typing that supports your argument,
but the argument is over the meaning of strong typing to begin with. It's not
as if there's some universally accepted definition of strong typing. Like
functional programming, functional purity, object oriented, etc.—none of these
terms are universally defined.

~~~
haberman
The fact that "strong typing" has no universal definition is exactly why I
think it's not useful.

~~~
jcrites
I hate feeling like I'm nitpicking, but I don't think that's true. I think
they do have a well-accepted definition, which appears in Wikipedia, in
assorted articles online, and in computer science publications. Here are some
examples of CS publications that describe a research contribution in terms of
strong typing:

> Strong typing of object-oriented languages revisited. This paper is
> concerned with the relation between subtyping and subclassing and their
> influence on programming language design. [...] The type system of a
> language can be characterized as strong or weak and the type checking
> mechanism as static or dynamic.
> [http://dl.acm.org/citation.cfm?id=97964](http://dl.acm.org/citation.cfm?id=97964)

> GALILEO: a strongly-typed, interactive conceptual language. Galileo, a
> programming language for database applications, is presented. Galileo is a
> strongly-typed, interactive programming language designed specifically to
> support semantic data model features (classification, aggregation, and
> specialization), as well as the abstraction mechanisms of modern programming
> languages (types, abstract types, and modularization).
> [http://dl.acm.org/citation.cfm?id=3859](http://dl.acm.org/citation.cfm?id=3859)

> Design and implementation of an object-oriented strongly typed language for
> distributed applications.
> [http://dl.acm.org/citation.cfm?id=99813](http://dl.acm.org/citation.cfm?id=99813)

> Strongly typed heterogeneous collections. (Oleg Kiselyov et al.)
> [http://dl.acm.org/citation.cfm?id=1017488](http://dl.acm.org/citation.cfm?id=1017488)

> Strongly typed genetic programming. Genetic programming is a powerful method
> for automatically generating computer programs via the process of natural
> selection [but] there is no way to restrict the programs it generates to
> those where the functions operate on appropriate data types. [When] programs
> manipulate multiple data types and contain functions designed to operate on
> particular data types, this can lead to unnecessarily large search times
> and/or unnecessarily poor generalization performance. Strongly typed genetic
> programming (STGP) is an enhanced version of genetic programming that
> enforces data-type constraints and whose use of generic functions and
> generic data types makes it more powerful than other approaches to type-
> constraint enforcement
> [http://dl.acm.org/citation.cfm?id=1326695](http://dl.acm.org/citation.cfm?id=1326695)

The argument that the terms have no universal definition cannot be sound in
light of their widespread use in computer science publications, even in the
title and abstract. Perhaps what you mean to say is that the terms don't have
a _completely unambiguous_ or formal definition. That's probably true, but not
all CS terms do. The words are contextual and exist on a spectrum, in the
sense that a _strongly-typed thing_ is typically in comparison to a _more-
weakly-typed thing_ [1]. However, the fact that they're widely used by CS
researchers is why I think we should reject the argument that they don't have
a universal definition or are not useful. CS researchers like Oleg Kiselyov
use the term when describing their papers and characterizing their
contributions.

[1] This is true for static and dynamic typing as well: they exist in degrees.
Rust can verify type proofs that other languages can't regarding memory
safety. Some languages can verify that integer indexes into an array won't go
out of bounds. Thus it's not the case that a given language is either
statically typed or dynamically typed; rather, each aspect of how it works can
be characterized on a spectrum from statically verified to dynamically
verified.

~~~
haberman
> I think they do have a well-accepted definition [...] [You] shouldn't
> confuse your dislike for them for the absence of a well-accepted definition
> that's widely used in computer science literature.

Just upthread, you said:

> The notions of "strong" and "weak" typing have never been particularly well-
> defined

And the Wikipedia article you cited
([https://en.wikipedia.org/wiki/Strong_and_weak_typing](https://en.wikipedia.org/wiki/Strong_and_weak_typing))
says:

> These terms do not have a precise definition

The Wikipedia article also says:

> A number of different language design decisions have been referred to as
> evidence of "strong" or "weak" typing. In fact, many of these are more
> accurately understood as the presence or absence of type safety, memory
> safety, static type-checking, or dynamic type-checking.

Also on Wikipedia
([https://en.wikipedia.org/wiki/Type_system](https://en.wikipedia.org/wiki/Type_system)):

> Languages are often colloquially referred to as "strongly typed" or "weakly
> typed". In fact, there is no universally accepted definition of what these
> terms mean. In general, there are more precise terms to represent the
> differences between type systems that lead people to call them "strong" or
> "weak".

...which is exactly what I'm saying in this entire thread.

It's very strange to me how you really seem to want other people to be on
board with your particular interpretation of what everybody (even you, 13
hours ago) agrees is not a very well-defined concept.

------
leef
Finally! I've had to live the JSON nightmare since I left Amazon.

Some of the benefits over JSON:

* Real date type

* Real binary type - no need to base64 encode

* Real decimal type - invaluable when working with currency

* Annotations - You can tag an Ion field in a map with an annotation that says, e.g. its compression ("csv", "snappy") or its serialized type ('com.example.Foo').

* Text and binary format

* Symbol tables - this is like automated jsonpack.

* It's self-describing - meaning, unlike Avro, you don't need the schema ahead of time to read or write the data.
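
On the decimal point in particular: JSON numbers typically decode to binary floats, which is exactly what you don't want for currency. A quick Python illustration, using the stdlib's `Decimal` as a stand-in for the kind of exact type Ion's decimal provides:

```python
import json
from decimal import Decimal

# JSON numbers decode to binary floats by default, so cents drift:
float_total = sum(json.loads("[0.10, 0.20]"))  # not exactly 0.3

# A real decimal type keeps exact base-10 arithmetic:
exact_total = Decimal("0.10") + Decimal("0.20")
```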

~~~
nikolay
Okay, but they did a really poor job marketing it in this release. Plus, if
it's used within Amazon, why is it Java-only so far?

~~~
makoz
Amazon's mainly a Java shop, not sure if that helps you.

~~~
nikolay
Not really the case - a lot of major projects (like boto and AWS CLI) are in
Python.

~~~
majewsky
These are only client-side interfaces. The server-side is usually much larger.

~~~
nikolay
True, but my point was that there's enough talent at Amazon, working on SDKs
and other projects, and there are precedents where even more complex projects
such as JMESPath have wide support [0].

[0]: [http://jmespath.org/libraries.html](http://jmespath.org/libraries.html)

------
kazinator
I Consider this Harmful (TM) and will oppose its adoption in every
organization where I have an opportunity to voice such an opinion. (In its
present form, to be clear!)

There is no need to have a null which is fragmented into null.timestamp,
null.string, and so on. It will complicate processing. Even when you know that
the type of some element is timestamp, you must still worry about whether it
is null and what that means.

There should be just one null value, which is its own type. A given datum is
either permitted to be null OR something else like a string. Or it isn't; it
is expected to be a string, which is distinct from the null value; no string
is a null value.

It's good to have a read notation for a timestamp, but it's not an elementary
type; a timestamp is clearly an aggregate and should be understood as
corresponding to some structure type. A timestamp should be expressible using
that structure, not only as a special token.

This monstrosity is not exhibiting good typing; it is not good static typing,
and not good dynamic typing either. Under static typing we can have some
"maybe" type instead of null.string: in some places we definitely have a
string, while in others we have a "maybe string", a derived type which admits
the possibility that a string is there, or isn't. Under dynamic typing, we can
superimpose objects of different types in the same places; we don't need a
null version of string, since we can have "the" one and only null object
there.

This looks like it was invented by people who live and breathe Java and do not
know any other way of structuring data. Java uses statically typed references
to dynamic objects, and each such reference type has a null in its domain so
that "object not there" can be represented. But just because you're working on
a reference implementation in such a language doesn't mean you cannot
_transcend_ the semantics of the implementation language. If you want to
propose some broad interoperability standard, you practically _must_.

~~~
jonhohle
> It will complicate processing.

In practice, it doesn't. If you want to know if an IonValue is null, ask it
with #isNull. If you don't care about the null's type, ignore it. On the other
hand, the type is an additional form of metadata which allows overloading the
meaning of a value.

nulls can also be annotated, so Ion doesn't really have the concept of a
singular shared null sentinel.

More so than JSON, Ion often uses nulls to differentiate presence from value
(that is, the lack of a field in a struct has a different meaning than the
presence of that field with a null value). Since nulls are objects, they can
be tested separately from the lack of a field definition.
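
The presence-versus-value distinction looks like this in Python terms (a rough analogue only; Ion's typed nulls carry more information than `None`):

```python
a = {}               # field absent entirely
b = {"name": None}   # field present, holding a null value

# Absence and null-valued presence are distinguishable tests:
absent = "name" not in a
present_null = "name" in b and b["name"] is None

# A naive .get() collapses the two cases, which is the trap:
collapsed = a.get("name") is None and b.get("name") is None
```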

> a timestamp is clearly an aggregate and should be understood as
> corresponding to some structure type.

Timestamps are structured types with a literal representation that is
explicitly modeled in the specification. You're free to ignore it and use a
custom schema for representing time, but you've moved any validation into your
application at that point and are no better off than JSON.

~~~
aphexairlines
I think the concern is that if you take an IonValue and cast it to IonText,
then call stringValue(), you'll get an exception somewhere if the document
contained a null value.

It recalls the nullability arguments between the ML family and the C/Java
family.

kazinator is asking for safer document semantics and a type-safe API.

------
wyc
This reminds me a lot of Avro:

[https://avro.apache.org/docs/current/](https://avro.apache.org/docs/current/)

They both have self-describing schemas, support for binary values, JSON-
interoperability, basic type systems (Ion seems to support a few more field
types), field annotations, support for schema evolution, code generation not
necessary, etc.

I think Avro has the additional advantages of being production-tested in many
different companies, a fully-JSON schema, support for many languages, RPC
baked into the spec, and solid performance numbers found across the web.

I can't really see why I'd prefer Ion. It looks like an excellent piece of
software with plenty of tests, no doubt, but I think I could do without
"clobs", "sexprs", and "symbols" at this level of representation, and it might
actually be better if I do. Am I missing something?

~~~
umanwizard
Amazon invented Ion because yaml, Avro, etc. didn't exist at the time. Ion is
actually pretty old.

The timing of open-sourcing it mystifies me a bit. Maybe Amazon is trying to
become more open-source friendly, like Microsoft did?

Perhaps more likely: they're planning on making public some internal APIs
that use Ion heavily?

~~~
kbenson
Are we talking about a different YAML, or has Ion existed for ~15 years?

~~~
umanwizard
Woah, you're right. I thought YAML was much younger. I stand corrected.

Ion is definitely from last decade, at least. So most of the speculation in my
post stands.

~~~
jonhohle
The JSON RFC was published right around the time Ion was being created,
making Ion about 10 years old this year. It was not clear at the time that
JSON would become so popular ;-) and at the same time Ion fixed a lot of its
weak areas (numeric types, dates, struct/object syntax, etc.)

------
jonhohle
Big congrats to Todd, Almann, Chris, Henry, and everyone else who made this
happen.

Several years ago, I wouldn't have imagined this possible and I'm a little
bummed that I left before it happened.

Like leef said above, I'm glad to have Ion as an option again.

~~~
tveita
I am curious why you "wouldn't have imagined this possible": is the reason
technical or political?

~~~
serge2k
Political.

Amazon doesn't open source things, as a general rule. It can be done but it is
a lot of jumping through hoops and they generally need good reasons to do it
(as opposed to a lack of good reasons not to).

------
desdiv
Interestingly enough a JSON alternative named "ION" was just posted as a Show
HN[0] about three months ago.

So now not only do we have the problem of redundant and mutually incompatible
protocols (cue obligatory xkcd), but we have _so many_ such protocols that
name collision is becoming an extra problem.

[0]
[https://news.ycombinator.com/item?id=11027319](https://news.ycombinator.com/item?id=11027319)

~~~
umanwizard
I mentioned this Ion in that thread, if anyone is interested in the ensuing
discussion:
[https://news.ycombinator.com/item?id=11028205](https://news.ycombinator.com/item?id=11028205)

------
drawkbox
Binary values can be stored as base64 in regular old JSON as well. Yes, that
is bigger, but it's the same way email/MIME handles binary chunks: messages
and attachments are base64-encoded, and we do this every day. Base64 bloats
the payload by about a third, so larger content could be compressed before
encoding and decompressed after, or even encrypted/decrypted on either end in
the software/app layer.

No need for a new protocol when doing it that way for basic things, if you
need more binary (busy messaging/real-time) there are plenty of alternatives
to JSON.

I love the simplicity of JSON, and so do others; it is successful, so many
try to attach themselves to that success. But the success came from it being
so damn simple. Most additions just complicate and add verbosity, echoing the
XML and SOAP wars which spawned the plain and simple JSON. Adding complexity
is easy and anyone can do it; good engineers take complexity and make it
simple, and that is damn difficult.

~~~
jonhohle
> Binary values can be stored as base64 in regular old JSON as well

But in JSON you'd encode that Base64 as a string, and the application must
know that the data isn't really a string but a blob in some type of encoding.
That probably means wrapping it in another struct to provide that metadata.
Ion provides a terse way of doing the same while maintaining data integrity:

    'image/gif'::{{ R0lGODlhAQABAIABAP8AAP///yH5BAEAAAEALAAAAAABAAEAAAICRAEAOw== }}

The 'image/gif' annotation is application specific, but all consumers know
that the contents of that value are binary. In the binary Ion representation,
those 43 bytes are encoded as a 45-byte value (one byte for the type marker
and a second for the length, in this case; as few as 47 bytes with the
annotation and a shared symbol table), making the binary representation very
efficient for transferring binary data.
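
The size arithmetic checks out in a quick sketch: the 60-character base64 text above decodes to the 43 raw bytes of a tiny GIF, so base64-in-JSON pays the roughly 4/3 overhead (plus quotes) that binary Ion avoids.

```python
import base64

b64_text = "R0lGODlhAQABAIABAP8AAP///yH5BAEAAAEALAAAAAABAAEAAAICRAEAOw=="
raw = base64.b64decode(b64_text)

# 43 raw bytes of GIF data cost 60 characters as base64 (plus two quote
# characters inside a JSON string), versus 43 bytes + a couple of bytes
# of type/length header in binary Ion.
text_cost = len(b64_text)   # 60
binary_cost = len(raw)      # 43
```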

Since Ion is a superset of JSON, it's by definition more complex, but the
complexity isn't unapproachable. Most of the engineers I worked with assumed
it was JSON until coming across timestamps, annotations, or bare word symbols.

------
deathanatos
I can't decide if "JSON-superset" is technically accurate or not.

JSON's string literals come from JavaScript, and JavaScript only sort of has
a Unicode string type. So the \u escape in both languages encodes a UTF-16
code _unit_, not a code _point_. That means in JSON, the single code point
U+1F4A9 "Pile of Poo" is encoded thusly:

    "\ud83d\udca9"

JSON specifically says this, too,

    Any character may be escaped.  If the character is in the Basic
    Multilingual Plane (U+0000 through U+FFFF), then it may be
    represented as a six-character sequence: a reverse solidus, followed
    by the lowercase letter u, followed by four hexadecimal digits that
    encode the character's code point.  The hexadecimal letters A though
    F can be upper or lowercase.  So, for example, a string containing
    only a single reverse solidus character may be represented as
    "\u005C".

    [… snip …]

    To escape an extended character that is not in the Basic Multilingual
    Plane, the character is represented as a twelve-character sequence,
    encoding the UTF-16 surrogate pair.  So, for example, a string
    containing only the G clef character (U+1D11E) may be represented as
    "\uD834\uDD1E".

Now, Ion's spec says only:

    U+HHHH	\uHHHH	4-digit hexadecimal Unicode code point

But if we take it to mean code _point_ , then if the value is a surrogate…
what should happen?

Looking at the code, it _looks_ like the above JSON will parse:

    1. Main parsing of \u here:
       https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L2429-L2434

    2. which is called from here, and just appended to a StringBuilder:
       https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L1975

My Java isn't that great though, so I'm speculating. But I'm not sure what
_should_ happen.

This is just one of those things that the first time I saw it in JSON/JS… a
part of my brain melted. This is all a technicality, of course, and most JSON
values should work just fine.

~~~
jrgv
> But if we take it to mean code point, then if the value is a surrogate… what
> should happen?

Surrogates are code points. The spec does not say what should happen if the
surrogate is invalid (for example, if only the first surrogate of a surrogate
pair is present), but neither does the JSON spec.

Java internally also represents non-BMP code points using surrogates. So,
simply appending the surrogates to the string should yield a valid Java string
if the surrogates in the input are valid.
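
Python's `json` module behaves the way jrgv describes: a valid pair is joined into one code point, while a lone surrogate parses but can't survive a trip back to UTF-8. A quick sketch:

```python
import json

# A surrogate pair in JSON escapes decodes to the single code point
# U+1F4A9 ("Pile of Poo"):
s = json.loads('"\\ud83d\\udca9"')

# A lone surrogate parses, but the resulting string cannot be
# encoded back to UTF-8:
lone = json.loads('"\\ud83d"')
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```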

------
escherize
Is there a source for benchmarks/reviews for the various ways to represent
data? As far as I see it, there are a lot of them that I'd like to hear
pros/cons for: json, edn + transit (my fave), yaml, google protobufs, thrift
(?), as well as Ion.

And where does Ion fit here?

~~~
jcrites
Ion's advantage is that it's both strongly-typed with a rich type system, as
well as self-describing.

Data formats like JSON and XML can be somewhat self-describing, but they
aren't always completely so. Both tend to embed more complex data types
as either strings with implied formats, or nested structures. (Consider: How
would you represent a timestamp in JSON such that an application could
unambiguously read it? An arbitrary-precision decimal? A byte array?) I'm not
familiar with EDN, but it appears to be in a similar position as JSON in this
regard. ProtocolBuffers, Thrift, and Avro require a schema to be defined in
advance, and only work with schema-described data as serialization layers. Ion
is designed to work with self-describing data that might be fairly complex,
and have no compiled-ahead-of-time schema.

Ion makes it easy to pass data around with high fidelity even if intermediate
systems through which the data passes understand only part of the data but not
all of it. A classic weakness of traditional RPC systems is that, during an
upgrade where an existing structure gains an additional field, that structure
might pass through an application that doesn't know about the field yet. Thus
when the structure gets deserialized and serialized again, the field is
missing. The Ion structure by comparison can be passed from the wire to the
application and back without that kind of loss. (Some serialization-based
frameworks have solutions to this problem too.)
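
A compact Python sketch of that upgrade hazard (the class and field names are made up for illustration): an intermediary built against the old schema drops the new field on the round trip, while a self-describing pass-through keeps it.

```python
import json

wire = '{"id": 7, "retries": 3}'  # "retries" was added in a newer version

# An intermediary compiled against the old schema only knows "id"
# (OldOrder is a hypothetical stand-in for a generated class):
class OldOrder:
    def __init__(self, id):
        self.id = id

    def to_json(self):
        return json.dumps({"id": self.id})

known = {k: v for k, v in json.loads(wire).items() if k in ("id",)}
reserialized = OldOrder(**known).to_json()  # "retries" is silently lost

# A self-describing pass-through keeps the full structure intact:
passthrough = json.dumps(json.loads(wire))
```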

One downside is that its performance tends to be worse than schema-based
serialization frameworks like Thrift/ProtoBuf/Avro where the payload is
generally known in advance, and code can be generated that will read and
deserialize it. Another downside is that it's difficult to isolate Ion-aware
code from the more general purpose "business logic" in an application, due to
the absence of a serialization layer producing/consuming POJOs; instead it's
common to read an Ion structure from the wire and access it directly from
application logic.

~~~
brandonbloom
EDN supports dates, etc, too.

However, it doesn't support blobs. I'm conflicted about this point. On one
hand, small blobs can occasionally be useful to send within a larger payload.
On the other hand, small blobs almost always become large blobs, and so I'd
rather plan for out-of-band (preferably even content addressable)
representations of blobs.

------
eyan
Surprised nobody mentioned CBOR ([http://cbor.io](http://cbor.io)) yet. Aka
RFC 7049
([http://tools.ietf.org/html/rfc7049](http://tools.ietf.org/html/rfc7049)).

~~~
LVB
It is referenced in the Ion docs: [http://amznlabs.github.io/ion-docs/index.html](http://amznlabs.github.io/ion-docs/index.html)

~~~
brianolson
They complain about how CBOR is a superset of JSON data types and so some CBOR
values (like bignum) might not down-convert to JSON cleanly, and then in _the
next paragraph_ they talk about how Ion is a superset of JSON data types
including 'arbitrary sized integers'. Bad doubletalk. Boo. (I have implemented
CBOR in a couple languages and like it. Every few months we get to say, "oh
look, _another_ binary JSON.")

------
vparikh
Wasn't this solved already by the BSON specification -
[http://bsonspec.org](http://bsonspec.org)? Sure, this gives you a definition
of types, but that could easily be done using standard JSON metadata for each
field. I find BSON simpler and more elegant.

~~~
duskwuff
BSON is awful.

* It doesn't have "true" types in the sense that Ion does. It's basically just a binary serialization of JSON, with extra stuff.

* Despite being a binary format, it's actually bulkier than JSON in most situations.

* It removes any semblance of canonicity from many representations. A number, for instance, can potentially be represented by any of at least 3 types (double, int32, and int64).

* It has signed 32-bit length limits all over the place. Not that I'd _want_ to be storing 2GB of data in a single JSON document either, but it's not even _possible_ to do so with BSON!

* It requires redundant null bytes in unpredictable places. For instance, all strings must be stored with a trailing null byte, which is included in their length. There's also a trailing null byte at the end of a document for no reason at all.

* It is unabashedly Javascript-specific, containing types like "JavaScript code with scope" which are meaningless to other languages.

* It also contains some MongoDB-specific cruft, such as the "ObjectID" and "timestamp" types (the latter of which, despite its name, cannot actually be used to store time values).

* It contains numerous "deprecated" and "old" features (in version 1.0!) with no guidance as to how implementations should handle them.

~~~
_wmd
Most of this comes from BSON also being the internal storage format for a
database server. For example, at least the redundant string NULs make it
possible to use C library functions without copying, the unpacked ints allow
direct dereferencing, etc.

I've no clue about the trailing NUL on the record itself, perhaps a safety
feature?

~~~
duskwuff
> I've no clue about the trailing NUL on the record itself, perhaps a safety
> feature?

Could be. Or perhaps there's enough code paths in common between string
parsing and document parsing that they decided to put a trailing null byte on
both.

Stepping back a bit, though, the fact that BSON is optimized for "direct" use
in C code is really scary. That suggests that any failure to completely
validate BSON data could open up vulnerabilities in C code manipulating it.

------
Ericson2314
> Decimal maintains precision: -0. != -0.0

What? This means their "arbitrary-precision decimals" are actually isomorphic
to (Rational x Natural).

~~~
alextgordon
The use of != there is very confusing, but what they mean is that Ion stores a
precision along with each number, not that -0 != -0.0

e.g. in Python:

    
    
        >>> from decimal import Decimal as D
        >>> 2 * D("1.0")
        Decimal('2.0')
    
        >>> 2 * D("1.000")
        Decimal('2.000')
    
        >>> D("1.0") == D("1.000")
        True

~~~
Ericson2314
That just means == is a "lossy" equivalence relation. I'd rather the precision
be truly observable: every number is "infinite precision". One can always
include a natural as an extra field if one cares about empirical precision.

~~~
tjonker
I'm having a bit of trouble parsing this, but Ion decimal values are not
"infinite precision". Every decimal has a very specific, finite precision.
It's a standard "coefficient and exponent" model, with no specification-
enforced limit on either.

~~~
Ericson2314
The idea is each number is adequately precise in that additional precision
would mean additional useless trailing zeros.

------
saosebastiao
Do any of the popular message serialization formats have first class support
for algebraic data types? It seems like every one I've researched has to be
hacked in some way to provide for sum types.
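The usual hack is a tag-field convention layered on top of plain maps; a minimal sketch of that pattern (the `"type"` field name is just a common convention, not any standard):

```python
import json

# Encode a sum type Shape = Circle(radius) | Rect(w, h) with a tag field.
circle = {"type": "circle", "radius": 2.0}
rect = {"type": "rect", "w": 3.0, "h": 4.0}

def area(data: str) -> float:
    shape = json.loads(data)
    # Dispatch on the tag; nothing stops a producer from sending an
    # unknown or missing tag, which is exactly the fragility at issue.
    if shape["type"] == "circle":
        return 3.14159 * shape["radius"] ** 2
    if shape["type"] == "rect":
        return shape["w"] * shape["h"]
    raise ValueError(f"unknown variant: {shape['type']}")

assert area(json.dumps(rect)) == 12.0
```

Nothing in the format enforces that exactly one variant's fields are present, which is what first-class sum types would give you.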

~~~
QuercusMax
Protocol buffers support oneof, which is a union type.
[https://developers.google.com/protocol-
buffers/docs/proto#on...](https://developers.google.com/protocol-
buffers/docs/proto#oneof)

(Insert joke here about Google engineers just copying around protobufs.)

------
kevinSuttle
Would like to see a comparison to EDN. [https://github.com/edn-
format/edn](https://github.com/edn-format/edn)

------
akavel
Can anyone share links to some examples, showcasing the differentiating
features vs. json? I couldn't easily find any via the main link

------
userbinator
Almost every time I see yet another structured data format I'm surprised at
the number of people who haven't ever heard of ASN.1, despite it forming the
basis of _many_ protocols in widespread use.

~~~
_wmd
Usual ASN.1 caveat: parsing its specifications requires money and a lot of
time, implementing many of its encodings (e.g. unaligned PER) is a lifetime's
work, and even the simpler ones thousands of eyes haven't managed to get right
despite years of effort (see OpenSSL, NSPR, etc.).

ASN.1 also has a million baroque types (VideotexString, anyone?) where most
people just need "string", "small int", "big int", etc.

Some more on BER parsing hell here: [https://mirage.io/blog/introducing-
asn1](https://mirage.io/blog/introducing-asn1)

~~~
userbinator
_Usual ASN.1 caveat: parsing its specifications requires money and a lot of
time, implementing many of its encodings (e.g. unaligned PER) is a lifetime's
work_

...unless you're Fabrice Bellard, who apparently wrote one just because it was
one of the minor obstacles on the way to writing a full LTE base station:

[http://www.bellard.org/ffasn1/](http://www.bellard.org/ffasn1/)

------
cm3
A question for frontend devs: Will H2 being binary on the wire inspire more
use of binary data representations as well, with conversion to JSON only on
the client? Passing around JSON or XML across a big SOA (or micro-services)
architecture is a waste of cycles and doesn't have types attached for
reliability and security.

~~~
voltagex_
Do you mean passing around binary between backend services and then having a
binary->JSON "proxy" behind whatever is receiving AJAX requests from the
client?

~~~
cm3
My idea was that the client (HTML+JS) will transform the binary data into JSON
or skip the conversion and process it directly. Seeing how fast JS engines
have become and the amount of typed binary arrays processed in JavaScript, I
believe it's a viable approach. But I'm not a frontend dev, so I can't be
certain.

~~~
kevinSuttle
Sounds a lot like the ActionScript compiler, which compiled their EcmaScript-
style scripting language into bytecode for the Flash runtime.

------
blake8086
How does Ion help with schema evolution? I see it mentioned, but not
described.

~~~
jonhohle
In practice there are three properties that help with schema evolution:

1) Open types - typically applications consuming Ion data do not restrict the
fields included (that is, they gracefully ignore, and often even pass along,
additional fields). Schemas may grow while remaining backwards compatible with
existing software.

2) Type annotations allow embedding schema information into a datagram without
the need to agree on special fields. Datagrams may have multiple values at the
top level, so it's possible to provide multiple representations without
introducing a new top-level container.

3) The only data that might need to be shared between a producer and a
consumer is a SymbolTable, which may be applicable to several schemas and may
be shared inline if necessary. Otherwise, objects in a datagram are always
inspectable and discoverable without additional metadata.
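The "open types" property described above is essentially the tolerant-reader pattern; a minimal sketch in plain Python, with hypothetical field names:

```python
# A v1 consumer reads only the fields it knows about and passes the
# rest through untouched, so a v2 producer can add fields safely.
def handle_order(order: dict) -> dict:
    known = {"id": order["id"], "amount": order["amount"]}
    # Unknown fields are not an error; forward them unchanged.
    extras = {k: v for k, v in order.items() if k not in known}
    processed = {**known, "amount": round(known["amount"], 2)}
    return {**processed, **extras}

# A newer producer added "currency"; the old consumer still works.
v2_order = {"id": 7, "amount": 19.999, "currency": "USD"}
result = handle_order(v2_order)
assert result["currency"] == "USD"
```

The key design choice is that the consumer treats its schema as a lower bound on the data, not an exact description of it.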

------
tn13
This appears to be something in between JSON and Protocol Buffers. I wonder
under what conditions Ion makes more sense than either.

~~~
jonhohle
One significant advantage is you can opt in to sharing schemas - without
requiring all consumers to have your schema. Like a lot of Amazon's internal
data formats, Ion is designed to support backwards compatible schemas as well
(that is, adding additional fields does not break existing consumers).

It has isomorphic text and binary representations as part of the standard,
making debugging or optimized transport a config option.

The type system is significantly richer than JSON and maps well to several
languages (internally Amazon uses it with C, C++, Perl, Java, Ruby, etc.).

S-Expressions.

~~~
tantalor
> without requiring all consumers to have your schema

Then how is the client supposed to handle the data? Guessing?

> backwards compatible schemas

> text and binary representations

> type system

> maps well to several languages

Protos have all these.

> S-Expressions

Okay? Is that useful?

~~~
mattdeboard
> Okay? Is that useful?

I bet it's super useful if you have sexprs in your data.

~~~
jonhohle
It's interesting to have languages written for and in the same specification
as your data.

------
viraptor
So far, most of the interesting bits I see in Ion are covered in YAML (which
is also JSON-superset). Most of the rest are extra types, which YAML allows
you to implement. The only really missing bit is the binary encoding... but
that seems unrelated to the text format itself.

This really looks like a NIH specification.

~~~
tjonker
Ion's equivalently-expressive text and binary formats are absolutely central
to its design, and IMO one of its most compelling features. You don't have to
choose between "human readable" or "compact and fast"; you can switch between
them at will. This helps Ion meet the requirements of a broader set of
applications, eliminating the cost, complexity, and impedance-mismatch
problems you get by transforming between multiple serialization formats.

~~~
viraptor
I get that a binary format is nice, but I just don't get why, instead of
adding a binary format to an existing good text format, Amazon decided to
first extend a poor text format and then add a binary format to that.

Basically: Ion == JSON + extra features + binary format spec. But Ion ~= YAML
+ binary format spec. You're going to write a new serializer/deserializer in
both cases anyway, but in the second one, at least you get the text part for
free in almost any language available.

------
coldcode
Are there any other implementations besides Java? I would be using it from iOS.

------
voltagex_
Open question to anyone reading this: Would you use Ion if you were designing
a new house-wide message queue? (e.g. broadcast messages to
/Home/Lounge/Lights/ to turn on/off)

~~~
tantalor
No that's overkill, just use JSON.

~~~
userbinator
Even JSON is overkill for an application like that. Pure binary would be my
choice.

------
intrasight
I use this [http://dataprotocols.org/tabular-data-
package/](http://dataprotocols.org/tabular-data-package/)

------
kilink
Things I dislike about Ion, having used it while at Amazon:

\- IonValues are mutable by default. I saw bugs where cached IonValues were
accidentally changed, which is easy to do: IonSequence.extract clears the
sequence [1], adding an IonValue to a container mutates the value (!) [2],
etc.

\- IonValues are not thread-safe [3]. You can call makeReadOnly() to make them
immutable, but then you'll be calling clone since doing anything useful (like
adding it to a list) will need to mutate the value. While it says IonValues
are not even thread-safe for reading, I believe this is not strictly true.
There was an internal implementation that would lazily materialize values on
read, but it doesn't look like it's included in the open source version.

\- IonStruct can have multiple fields with the same name, which means it can't
implement Map. I've never seen anyone use this (mis)feature in practice, and I
don't know where it would be useful.

\- Since IonStruct can't implement Map, you don't get the Java 8 default
methods like forEach, getOrDefault, etc.

\- IonStruct doesn't implement keySet, values, spliterator, or stream, and
thus doesn't play well with the Java 8 Stream API.

\- Calling get(fieldName) on an IonStruct returns null if the field isn't
present. But the value might also be there and be null, so you end up having
to do a null check AND call isNullValue(). I'm not convinced it's a worthwhile
distinction, and would have preferred a single way of doing it. You can
already call containsKey to check for the presence of a field.

\- In practice most code that dealt with Ion was nearly as tedious and verbose
as pulling values out of an old-school JSONObject. Every project seemed to
have a slightly different IonUtils class for doing mundane things like pulling
values out of structs, doing all the null checks, casting, etc. There was some
kind of adapter for Jackson that would allow you to deserialize to a POJO, but
it didn't seem like it was widely used.

[1] [https://github.com/amznlabs/ion-
java/blob/master/src/softwar...](https://github.com/amznlabs/ion-
java/blob/master/src/software/amazon/ion/IonSequence.java#L457)

[2] [https://github.com/amznlabs/ion-
java/blob/master/src/softwar...](https://github.com/amznlabs/ion-
java/blob/master/src/software/amazon/ion/IonValue.java#L103-L112)

[3] [https://github.com/amznlabs/ion-
java/blob/master/src/softwar...](https://github.com/amznlabs/ion-
java/blob/master/src/software/amazon/ion/IonValue.java#L119-L140)
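The absent-vs-null distinction described above maps onto the same two-step dance you'd do with a plain dict, where a key can be missing entirely or present with None; this is a sketch of the pattern, not the ion-java API:

```python
struct = {"a": 1, "b": None}  # "b" is present but null; "c" is absent

def get_required(d: dict, field: str):
    # Two separate checks - presence, then null-ness - mirroring the
    # containsKey + isNullValue dance the comment complains about.
    if field not in d:
        raise KeyError(f"missing field: {field}")
    value = d[field]
    if value is None:
        raise ValueError(f"field is null: {field}")
    return value

assert get_required(struct, "a") == 1
```

Whether "present but null" and "absent" should be distinguishable at all is exactly the design question the comment raises; collapsing them would simplify every caller.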

~~~
drudru11
This is a good critique. Have you found anything better?

------
incepted
> <groupId>software.amazon.ion</groupId>

Why not "com.amazon.ion", like thousands of other existing packages?

~~~
dwb
They just want to use their shiny new-gTLD domain:
[http://amazon.software](http://amazon.software)

------
stolsvik
Are there any object marshalling/serialization solution for Ion? (Like GSON,
Jackson)

~~~
machinagod
It _is_ possible to adapt Jackson (with minimal effort) to use Ion, since it's
very similar to Jackson's native JSON format.

------
voltagex_
I wonder how difficult this would be to port to C#?

------
breatheoften
Why this instead of Clojure's "transit"?

