To think about the difference between serialization formats, here's an analogy I hope will help.
Protocol Buffers (and I think Thrift, and maybe Avro) are sort of like C or C++: you declare your types ahead of time, and then you take some binary payload and "cast" it (parse it actually) into your predefined type. If those bytes weren't actually serialized as that type, you'll get garbage. On the plus side, the fact that you declared your types statically means that you get lots of useful compile-time checking and everything is really efficient. It's also nice because you can use the schema file (i.e. .proto files) to declare your schema formally and document everything.
JSON and Ion are more like a Python/Javascript object/dict. Objects are just attribute-value bags. If you say it has field fooBar at runtime, now it does! When you parse, you don't have to know what message type you are expecting, because the key names are all encoded on the wire. On the downside, if you misspell a key name, nothing is going to warn you about it. And things aren't quite as efficient because the general representation has to be a hash map where every value is dynamically typed. On the plus side, you never have to worry about losing your schema file.
I think this is a case where "strongly typed" isn't the clearest way to think about it. It's "statically typed" vs. "dynamically typed" that is the useful distinction.
That's a great analogy! However, I do think strongly typed vs. weakly typed has a role in thinking about this, just a different dimension than the one you're describing. Let's say we come across a JSON structure that looks like this:
{"start": "2007-03-01"}
Is that a timestamp? Maybe! Does it support a time within the day? Perhaps I can write "2007-03-01T13:00:00" in ISO 8601 format if we're lucky. Can I supply a time zone? Who knows for sure? It's weakly typed data. The actual specification of that type of that field lives in a layer on top of JSON, if it's even specified at all. It might be "specified" only in terms of what the applications that handle it can parse and generate. I could drop that value into Excel and treat it as all sorts of different things.
Ion by comparison has a specific data type for timestamps defined in the spec [1]. The timestamp has a canonical representation in both text and binary form. For this reason, I know that "2007-02-23T20:14:33.Z" and "2007-02-23T12:14:33.079-08:00" are valid Ion timestamp text values. In this instance I would describe Ion as strongly typed and JSON as weakly typed. Or, as the Ion documentation puts it, "richly typed".
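Concretely, a rough side-by-side sketch (illustrative values only):

    Ion:   {start: 2007-02-23T12:14:33.079-08:00}       // a first-class timestamp value
    JSON:  {"start": "2007-02-23T12:14:33.079-08:00"}   // just a string; the timestamp interpretation lives in a layer above JSON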
To make an analogy, weakly typed is the Excel cell that can store whatever value you put in it, or the PHP integer 1 which is considered equal to "1" (loose equality). Strongly typed is the relational database row with a column described precisely by the table schema. Weakly typed is the CSV file; strongly typed is the Ion document.
Ion has more data types than JSON, it's true. Ion has a timestamp type and JSON does not, so you could say it's "richer" if you want, but that just means "it has more types."
However I don't think it's accurate to say that the typing of Ion is any "stronger." Both Ion and JSON are fully dynamically typed, which means that types are attached to every value on the wire. It's just that without an actual timestamp type in JSON, you have to encode timestamp data into a more generic type.
> Some programming languages make it easy to use a value of one type as if it were a value of another type. This is sometimes described as "weak typing".
Strong typing makes it difficult to use a value of one type as if it were another. In PHP, you can compare the integer value 1 to the string value "1" and the equality test returns boolean true. Conflating integer 1 and string "1" is weak typing. A data format that expresses the concept of the timestamp 1999-12-31T23:14:33.079-08:00 using the same fundamental type as the string "Party like it's 1999!" is what I would call weakly typed.
Ion does not make it easy to use a string as if it were a timestamp or vice versa. It has types like arbitrary precision decimals, or binary blobs, that can't easily be represented in a strongly-typed way in JSON. You can certainly invent a representation, like specifying strings as ISO 8601 for timestamps, or an array of numbers for binary -- actually, wait, how about a base64-encoded string instead? Where there's choice there's ambiguity. These concepts of "type" live in the application layer in JSON, instead of in the data layer like they do in Ion.
Note as well that stronger is my term. The Ion documentation says "richly-typed". Certainly Ion does not include every type in the world. Perhaps a future serialization framework might capture "length" with a unit of "meters", or provide a currency type with unit "dollars", and if that existed I'd call it stronger-(ly?)-typed or more richly typed than Ion. In that case, the data layer would prevent you from accidentally converting "3 inches" to "3 centimeters", since those would be different types. That would be stronger typing than an example where you simply have the integer 3, and it's the application's job to track which integers represent inches, and which represent centimeters. So perhaps "strong" and "weak" are not the best terms, so much as "stronger" and "weaker".
By your definition, any language with strings is weakly typed, since you can always interpret a string as being something else. Strongly/weakly typed has never been a particularly useful description (as the page you linked notes), and I think it's particularly unhelpful here.
> By your definition, any language with strings is weakly typed, since you can always interpret a string as being something else
No, I wouldn't say that's the case. For example, in PHP you can literally write:
if (1 == "1") { ...
... and the condition evaluates to true. You can do similar things in Excel; Excel doesn't even really differentiate between those two values in the first place. (At least that's how it seems as a casual user.)
This is not the case in strongly typed programming languages that have strings such as C++ or Java. You can convert from one type to another, sure, by explicitly invoking a function like atoi() or Integer.toString(), but the conversion is deliberate and so it is strongly typed. A variable containing a string (java.lang.String) cannot be compared against one containing a timestamp (java.util.Date) by accident. An Ion timestamp is a timestamp and can't be conflated with a string, although it can be converted to one.
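A minimal Java sketch of that deliberate-conversion point:

    public class ExplicitConversion {
        public static void main(String[] args) {
            int i = Integer.parseInt("1");     // deliberate String -> int conversion
            String s = Integer.toString(1);    // deliberate int -> String conversion
            System.out.println(i + " / " + s);
            // System.out.println("1" == 1);   // does not compile: incomparable types String and int
        }
    }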
Edit: The set of types that are built in, in conjunction with how those types are expressed in programming languages (e.g. timestamp as java.util.Date, decimal as java.math.BigDecimal, blob as byte[]), is why I'd call Ion strongly typed or richly typed in comparison to JSON. Specifically, scalar values that frequently appear in common programs can be expressed with distinctly typed scalar values in Ion. I don't know if there's a good formal definition. You could probably define a preorder on programming languages or data formats based simply on the number of distinct scalar or composite types (so in that sense, yes, it's the fact that Ion has more). However it goes beyond that subjectively. Subjectively it's about how often you have to, in practice, convert from one type to another in common tasks. There is no clear way to represent an arbitrary-precision decimal in JSON, or a byte array, or a timestamp -- so you must "compress" those types down into a single JSON type like string-of-some-format or array-of-number; and several different scalar types must all map to that same JSON type, which creates the risk of conflating values of different logical types but the same physical JSON type with each other. There's no obvious or built-in way to reconstruct the original type with fidelity. There's no self-describing path from "1999-12-31T23:14:33.079-08:00" and "DEADBEEFBASE64" back to those original types.
I subjectively call JSON weakly typed because its types are not adequate to uniquely store common scalar data types that I work with in programs that I write. I call Ion strongly typed because it typically can. I acknowledged earlier that a data format would be even more strongly typed if it was capable of representing not just the type "integer", but "integer length meters". Ion does not have this kind of type built in, though its annotations feature could be used to describe that a particular integer value represents a length in meters.
> You can't misuse any kind of Ion value that is a string as if it were a timestamp without performing an explicit conversion.
The same is true of JSON. There is no difference, except that Ion has a timestamp type and JSON does not.
If you disagree, please identify what characteristic of Ion's design makes it more strongly typed than JSON, other than the set of types that is built in.
You are choosing a definition of strong typing that supports your argument, but the argument is over the meaning of strong typing to begin with. It's not as if there's some universally accepted definition of strong typing. Like functional programming, functional purity, object oriented, etc.—none of these terms are universally defined.
I hate feeling like I'm nitpicking, but I don't think that's true. I think they do have a well-accepted definition, which appears in Wikipedia, in assorted articles online, and in computer science publications. Here are some examples of CS publications that describe a research contribution in terms of strong typing:
> Strong typing of object-oriented languages revisited. This paper is concerned with the relation between subtyping and subclassing and their influence on programming language design. [...] The type system of a language can be characterized as strong or weak and the type checking mechanism as static or dynamic. http://dl.acm.org/citation.cfm?id=97964
> GALILEO: a strongly-typed, interactive conceptual language. Galileo, a programming language for database applications, is presented. Galileo is a strongly-typed, interactive programming language designed specifically to support semantic data model features (classification, aggregation, and specialization), as well as the abstraction mechanisms of modern programming languages (types, abstract types, and modularization). http://dl.acm.org/citation.cfm?id=3859
> Strongly typed genetic programming. Genetic programming is a powerful method for automatically generating computer programs via the process of natural selection [but] there is no way to restrict the programs it generates to those where the functions operate on appropriate data types. [When] programs manipulate multiple data types and contain functions designed to operate on particular data types, this can lead to unnecessarily large search times and/or unnecessarily poor generalization performance. Strongly typed genetic programming (STGP) is an enhanced version of genetic programming that enforces data-type constraints and whose use of generic functions and generic data types makes it more powerful than other approaches to type-constraint enforcement http://dl.acm.org/citation.cfm?id=1326695
The argument that the terms have no universal definition cannot be sound in light of their widespread use in computer science publications, even in the title and abstract. Perhaps what you mean to say is that the terms don't have a completely unambiguous or formal definition. That's probably true, but not all CS terms do. The words are contextual and exist on a spectrum, in the sense that a strongly-typed thing is typically in comparison to a more-weakly-typed thing [1]. However, the fact that they're widely used by CS researchers is why I think we should reject the argument that they don't have a universal definition or are not useful. CS researchers like Oleg Kiselyov use the term when describing their papers and characterizing their contributions.
[1] This is true for static and dynamic typing as well: they exist in degrees. Rust can verify type proofs that other languages can't regarding memory safety. Some languages can verify that integer indexes into an array won't go out of bounds. Thus it's not the case that a given language is either statically typed or dynamically typed; rather, each aspect of how it works can be characterized on a spectrum from statically verified to dynamically verified.
> I think they do have a well-accepted definition [...] [You] shouldn't confuse your dislike for them for the absence of a well-accepted definition that's widely used in computer science literature.
Just upthread, you said:
> The notions of "strong" and "weak" typing have never been particularly well-defined
> A number of different language design decisions have been referred to as evidence of "strong" or "weak" typing. In fact, many of these are more accurately understood as the presence or absence of type safety, memory safety, static type-checking, or dynamic type-checking.
> Languages are often colloquially referred to as "strongly typed" or "weakly typed". In fact, there is no universally accepted definition of what these terms mean. In general, there are more precise terms to represent the differences between type systems that lead people to call them "strong" or "weak".
...which is exactly what I'm saying in this entire thread.
It's very strange to me how you really seem to want other people to be on board with your particular interpretation of what everybody (even you, 13 hours ago) agrees is not a very well-defined concept.
> This is true for static and dynamic typing as well: they exist in degrees. Rust can verify type proofs that other languages can't regarding memory safety. Some languages can verify that integer indexes into an array won't go out of bounds. Thus it's not the case that a given language is either statically typed or dynamically typed
Memory safety and static/dynamic typing are orthogonal. C is statically typed but memory unsafe. Rust is statically typed but memory safe (except in unsafe blocks). Lua is dynamically-typed but memory safe.
I agree that it's possible to mix elements of static and dynamic typing in a single language. C++ is generally statically typed, but also supports dynamic_cast<>.
But generally speaking, static and dynamic typing have a very precise definition. Something that carries around type information at runtime is dynamically typed. Something that does type analysis at compile time so that the runtime doesn't need to carry type information is statically typed.
I generally agree, except the "type" of JSON numbers isn't well-defined with respect to precision and binary-vs-decimal floating point representation. An application that cares deeply about either aspect of numbers can't rely on JSON alone to ensure that the values are properly interpreted by all consumers.
That is a good point in that it is a very accurate reading of the JSON spec. In practice many (even most) JSON implementations don't give applications access to any precision beyond what an IEEE double can represent. So while you may take advantage of arbitrary precision in JSON and be fine according to the spec, your users will probably suffer data loss unless they are very picky about what JSON library they use. For example, JSON.parse() in JavaScript is out.
It's more than just precision, it's making sure that the same value comes out that went in, and that things haven't been subtly altered via unintended conversions between decimal and binary floating-point representations. Obviously this is quite important when you've got both text and binary formats.
Some applications really need decimal values, and some really need IEEE floats. Ion can accurately and precisely denote both types of data, making it easier to ensure that the data is handled properly by both reader and writer.
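In the text form the distinction is visible in the literal syntax itself, as I read the Ion spec:

    19.99     // decimal: exact, arbitrary precision
    19.99d0   // also a decimal, written with a 'd' exponent
    19.99e0   // float: binary, IEEE 754 double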
That's a good description, but I'd say that we have a strongly <-> weakly typed axis and a statically <-> dynamically typed axis here. Or I might actually prefer to name the first axis poorly <-> richly typed.
                 poorly typed <-------------> richly typed
    dynamic      CSV, INI          JSON       YAML, Ion
    static       Bencode           ASN.1      Protobuf
What I mean by "richly typed" is that you would never read a timestamp off the wire and not know that it's a timestamp. By comparison, with CSV or INI files, you just have strings everywhere. Formats on the richly typed side have separate and explicit types for binary blobs and text, for example.
Sure, I think your "poorly typed" vs. "richly typed" axis just refers to how many built-in types it has. It's true that CSV and INI only have one type (string). And it's true that when more types are built in, you have fewer cases where you have to just stuff your data into a specially-formatted string.
Finally! I've had to live the JSON nightmare since I left Amazon.
Some of the benefits over JSON:
* Real date type
* Real binary type - no need to base64 encode
* Real decimal type - invaluable when working with currency
* Annotations - You can tag an Ion field in a map with an annotation that says, e.g. its compression ("csv", "snappy") or its serialized type ('com.example.Foo').
* Text and binary format
* Symbol tables - this is like automated jsonpack.
* It's self-describing - meaning, unlike Avro, you don't need the schema ahead of time to read or write the data.
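For instance, a hypothetical Ion text record touching several of the items above (field names and values are made up):

    {
      created: 2016-04-18T10:30:00Z,            // real timestamp type
      price: 19.99,                             // arbitrary-precision decimal
      payload: {{aGVsbG8=}},                    // blob (base64 only in the text form; raw bytes in binary)
      config: 'com.example.Foo'::{retries: 3}   // annotation naming the value's serialized type
    }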
Sounds a lot like Apple's property list format, which shares almost everything you listed in common, except for annotations and symbol tables.
Its binary format was introduced in 2002!
Edit: Property lists only support integers up to 128 bits in size and double-precision floating point numbers. On top of those, Ion also supports infinite precision decimals.
Plists are nifty, but the text format's XML-based, which makes it too complex and too verbose to be a general-purpose alternative to something like JSON.
(plutil "supports" a json format, but it's not capable of expressing the complete feature set of the XML or binary formats.)
Like Property Lists, the binary format is TLV encoded as well. Ion has a more compact binary representation for the same data and additional types and metadata. Also, IIRC, Plist types are limited to 32-bit lengths for all data types. The binary Ion representation has no such restriction (though in practice sizes are often limited by the language implementation).
True, but my point was that there's enough talent at Amazon, working on SDKs, and others, and there are precedents where even more complex projects such as JMESPath have wide support [0].
I'm sure there are ion bindings for every language in common use at Amazon. But a huge percentage of Amazon code is Java, so presumably this one was the best maintained and documented.
I doubt it, when I was there Ion was used by only a handful of Java teams doing backend work. It was also horribly documented and supported at the time (3.5 years ago).
I am still in Amazon and Ion is definitely the most widely used library around. It has among the best documented code and some of the extensions that have been built on top of Ion are simply amazing.
Precision is the number of significant digits. Oracle guarantees the portability of numbers with precision ranging from 1 to 38.
Scale is the number of digits to the right (positive) or left (negative) of the decimal point. The scale can range from -84 to 127.
Worth noting this isn't specifically an Oracle thing; most financial systems need to be sure that they can store currency numbers accurately, and this convention is widely used to ensure that.
Typically when dealing with currencies scale is only used to represent the units less than whole unit of the currency, i.e. cents and pence. But there isn't anything that restricts it from being used to accommodate larger numbers with the use of negative scales.
<CcyNtry>
<CtryNm>UNITED STATES OF AMERICA (THE)</CtryNm>
<CcyNm>US Dollar</CcyNm>
<Ccy>USD</Ccy>
<CcyNbr>840</CcyNbr>
<CcyMnrUnts>2</CcyMnrUnts>
</CcyNtry>
The CcyMnrUnts property denotes 2 decimal places for the sub-unit of the dollar.
So for the above example of 99999.999 you would store an amount of 99999999 and a scale of 3.
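For example, a minimal Java sketch using BigDecimal's unscaled-value-plus-scale constructor (hypothetical amount):

    import java.math.BigDecimal;
    import java.math.BigInteger;

    public class ScaledAmount {
        public static void main(String[] args) {
            // unscaled value 99999999 with scale 3 reconstructs 99999.999 exactly
            BigDecimal amount = new BigDecimal(BigInteger.valueOf(99999999L), 3);
            System.out.println(amount);  // prints 99999.999
        }
    }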
So that it can be an integer with an arbitrary length, or a float/double without precision problems. You can also let your own integer classes do the parsing (which might even be able to handle complex types).
After all, everything in JSON is a string since it doesn't have a binary format and it shouldn't cause a huge overhead to do the parsing yourself (that might depend on the library, though).
I find that many financial technology companies opt to store currency as strings. The small overhead is typically well worth freedom from floating-point errors.
Any text format is technically a string. Just because some numeric token has no double quotes around it doesn't mean it isn't a string (in that representation).
It's just that we can have some lexical rules that if that piece of text has the right kind of squiggly tail or whatever, it is always treated as a decimal float instead of every programmer working with it individually having to deal with an ad hoc string to decimal type conversion.
Until you need to deal with decimals instead of floats; then you are going to hate yourself, because you have to pull in some third-party library since the language treats every single number as a float (and floating-point errors are a lot more common than most people think, even when adding together simple numbers).
Sure, but most other languages have built-in support for decimal types. Java has BigDecimal, as does Ruby, Python has the decimal module, C# has System.Decimal, the list goes on.
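A quick Java sketch of why that matters; the decimal type keeps the exact value where the double does not:

    import java.math.BigDecimal;

    public class DecimalVsDouble {
        public static void main(String[] args) {
            System.out.println(0.1 + 0.2);  // 0.30000000000000004 (binary floating point)
            System.out.println(new BigDecimal("0.1").add(new BigDecimal("0.2")));  // 0.3 (exact decimal)
        }
    }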
Javascript doesn't even have proper integers to guarantee that this works correctly; it's a really sad state.
I just did some ownership percentage stuff where it's not uncommon to go 16 decimal places out...working with JavaScript on this was a pain. Never thought I'd care about that .00000000000001 difference hah...
Ion has functions to turn Ion into JSON which will, of course, lose information. Annotations are dropped, decimals turn into a JSON number type which may lose precision, etc.
I Consider this Harmful (TM) and will oppose the adoption in every organization where I have an opportunity to voice such. (In its present form, to be clear!)
There is no need to have a null which is fragmented into null.timestamp, null.string and whatever. It will complicate processing. Even if you know the type of some element is timestamp, you must still worry about whether it is null and what that means.
There should be just one null value, which is its own type. A given datum is either permitted to be null OR something else like a string. Or it isn't; it is expected to be a string, which is distinct from the null value; no string is a null value.
It's good to have a read notation for a timestamp, but it's not an elementary type; a timestamp is clearly an aggregate and should be understood as corresponding to some structure type. A timestamp should be expressible using that structure, not only as a special token.
This monstrosity is not exhibiting good typing; it is not good static typing, and not good dynamic typing either. Under static typing we can have some "maybe" type instead of null.string: in some representations we definitely have a string. In some other places we have a "maybe string", a derived type which gives us the possibility that a string is there, or isn't. Under dynamic typing, we can superimpose objects of different type in the same places; we don't need a null version of string since we can have "the" one and only null object there.
This looks like it was invented by people who live and breathe Java and do not know any other way of structuring data. Java uses statically typed references to dynamic objects, and each such reference type has a null in its domain so that "object not there" can be represented. But just because you're working on a reference implementation in such a language doesn't mean you cannot transcend the semantics of the implementation language. If you want to propose some broad interoperability standard, you practically must.
In practice, it doesn't. If you want to know if an IonValue is null, ask it with #isNull. If you don't care about the null's type, ignore it. On the other hand, the type is an additional form of metadata which allows overloading the meaning of a value.
nulls can also be annotated, so Ion doesn't really have the concept of a singular shared null sentinel.
More so than JSON, Ion often uses nulls to differentiate presence from value (that is, the lack of a field in a struct has a different meaning than the presence of that field with a null value). Since nulls are objects, they can be tested separately from the lack of a field definition.
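A rough Java sketch of that check, assuming ion-java's IonStruct exposes containsKey/get/isNullValue as I recall (treat the exact signatures as assumptions):

    import software.amazon.ion.IonStruct;
    import software.amazon.ion.IonValue;

    class FieldPresence {
        // Distinguish "field absent" from "field present but null".
        static String describe(IonStruct struct, String fieldName) {
            if (!struct.containsKey(fieldName)) {
                return "absent";                 // no such field in the struct
            }
            IonValue value = struct.get(fieldName);
            if (value.isNullValue()) {
                return "present, but null";      // could be null, null.timestamp, null.string, ...
            }
            return "present: " + value;
        }
    }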
> a timestamp is clearly an aggregate and should be understood as corresponding to some structure type.
Timestamps are structured types with a literal representation that is explicitly modeled in the specification. You're free to ignore it and use a custom schema for representing time, but you've moved any validation into your application at that point and are no better off than JSON.
I think the concern is that if you take an IonValue and cast it to IonText, then call stringValue(), you'll get an exception somewhere if the document contained a null value.
It recalls the nullability arguments between the ML family and the C/Java family.
kazinator is asking for safer document semantics and a type-safe API.
Sorry, this isn't true. The Java and C++ implementations were developed simultaneously, by different authors. The Java side has always been more full-featured, and has always had a larger user base by at least an order of magnitude. Today IonJava is among the most widely-consumed libraries within Amazon.
They both have self-describing schemas, support for binary values, JSON-interoperability, basic type systems (Ion seems to support a few more field types), field annotations, support for schema evolution, code generation not necessary, etc.
I think Avro has the additional advantages of being production-tested in many different companies, a fully-JSON schema, support for many languages, RPC baked into the spec, and solid performance numbers found across the web.
I can't really see why I'd prefer Ion. It looks like an excellent piece of software with plenty of tests, no doubt, but I think I could do without "clobs", "sexprs", and "symbols" at this level of representation, and it might actually be better if I do. Am I missing something?
What do you mean by they both have self-describing schemas? In order to read or write Avro data, an application needs to possess a schema for that data -- the specific schema that the data was written with, and (when writing) the same schema that a later reader expects to find. This means the data is not self-describing.
Ion is designed to be self-describing, meaning that no schema is necessary to deserialize and interact with Ion structures. It's consequently possible to interact with Ion in a dynamic and reflective way, for example, in the same way that you can with JSON and XML. It's possible to write a pretty-printer for a binary Ion structure coming off the wire without having any idea of or schema for what's inside. Ion's advantage over those formats is that it's strongly typed (or richly typed, if you prefer). For example, Ion has types for timestamps, arbitrary-precision decimals like for currency, and can embed binary data directly (without base64 encoding), etc.
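As a rough sketch of that pretty-printer idea (assuming the ion-java builder entry points, IonSystemBuilder and IonTextWriterBuilder, behave as I recall; illustrative, not canonical):

    import software.amazon.ion.IonReader;
    import software.amazon.ion.IonSystem;
    import software.amazon.ion.IonWriter;
    import software.amazon.ion.system.IonSystemBuilder;
    import software.amazon.ion.system.IonTextWriterBuilder;

    class IonPrettyPrint {
        // Turn any Ion payload (text or binary) into pretty-printed Ion text,
        // with no schema or prior type knowledge at all.
        static String prettyPrint(byte[] ionPayload) throws java.io.IOException {
            IonSystem system = IonSystemBuilder.standard().build();
            IonReader reader = system.newReader(ionPayload);
            StringBuilder out = new StringBuilder();
            IonWriter writer = IonTextWriterBuilder.pretty().build(out);
            writer.writeValues(reader);  // types travel with the data, so nothing else is needed
            writer.close();
            return out.toString();
        }
    }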
I wouldn't try to say that one or the other is better across the board. Rather, they have tradeoffs and relative strengths in different circumstances. Ion is in part designed to tackle scenarios like where your data might live a really long time, and needs to be comprehensible decades from now (whether you kept track of the schema or not, or remember which one it was); and needs to be comprehensible in a large distributed environment where not every application might possess the latest schema or where coordinating a single compile-time schema is a challenge (maybe each app only cares about some part of the data), and so on. Ion is well-suited to long-lived, document-type data that's stored at rest and interacted with in a variety of potentially complex ways over time. Data data. In the case of a simple RPC relationship between a single client and service, where the data being exchanged is ephemeral and won't stick around, and it's easy to definitively coordinate a schema across both applications, a typical serialization framework is a fine choice.
I think it depends on what level you're referring to. If you mean record-level, then I concede that it's not self-describing. However, looking at the suggested use cases, it seems that it's "self-describing" in that you'll always be able to decode data stored according to what the documentation recommends:
"Avro data is always serialized with its schema. Files that store Avro data should always also include the schema for that data in the same file. Avro-based remote procedure call (RPC) systems must also guarantee that remote recipients of data have a copy of the schema used to write that data."
That's interesting. I didn't know that about Avro. Does the framework take responsibility for including the schema and defining a format consisting of schema plus data, or is that the responsibility of the application layer? It sounds like that might just be a convention or best practice recommended in the documentation, rather than a technical property of Avro itself.
If it's the application's responsibility to bundle the schema in Avro, then one difference is that Ion takes responsibility for embedding schema information along with each structure and field. Ion is also capable of representing data where there is no schema (analogy: a complex document like an HTML5 page), or working efficiently with large structures without deserializing everything even if the application needs data in just one field.
Another platform in contrast with Ion is Apache Parquet [1]. Parquet's support for columnar data means that it can serialize and compress table-like data extremely efficiently (it serializes all values in one column, followed by the next, until the end of a chunk -- enabling efficient compression as well as efficient column scans). Ion by comparison would serialize each row and field within it in a self-describing way (even though that information is redundant, in this particular case, since all rows are the same). Great flexibility and high fidelity at the expense of efficiency.
Avro files have a header which has metadata including the schema as well as things like compression codec (supports deflate and snappy), and all of the implementations that I have used (java and python bindings mostly) just do this in the background.
Another fun thing is that avro supports union types, so to make things nullable you just union[null, double] or whatever.
But one of the best things about avro (and parquet for that matter) is that it is well supported by the hadoop ecosystem.
In the spec[1] there is a definition of an "object container file" which includes the schema, and is the default format used whenever you save an Avro file. You can even use it whenever sending Avro data through the wire, if you don't mind paying the extra space cost.
I think libraries generally take care of stuffing the schema into the wire protocol, and I have a hunch you're right in that it's implementation-defined.
I like that in this regard, any individual record in Ion is standalone. I can think of a few ways that could come in handy, e.g., a data packet of nested mixed-version records. Did not know about Parquet, thanks!
There are some use cases where record-level self description is very useful. For example when dealing with small records in a database or NoSQL store or message queue that could be written by multiple versions of applications. To cover that case well with Avro where records are not self describing really requires something like a schema registry and embedding a schema id with each record (e.g. http://www.confluent.io/blog/schema-registry-kafka-stream-pr... ).
The intent of the stored schema isn't really for self-description. A typical use case for Avro is data storage over long periods of time.
It is expected that the schema will evolve at some point during this time. Therefore you still need to specify a target schema to read the data into which is allowed to be different than the stored schema. Avro then maps the stored data into the target schema by using the stored schema. Most avro libraries expect you to get the target schema from a separate source before reading data.
The JSON RFC was published right around the time Ion was being created, making Ion about 10 years old this year. It was not clear at the time that JSON would become so popular ;-) and at the same time Ion fixed a lot of its weak areas (numeric types, dates, struct/object syntax, etc.)
That's not true. Avro at least existed at the time. However they wanted something self-describing to replace JSON/XML usage. Avro is better suited as a data storage format rather than a transit oriented format. Of course, both Ion and Avro can be used for either, but Avro will give you better compression on disk, but Ion is less cumbersome since it doesn't require a schema
A big problem with Avro, BSON, and many other "binary JSON" formats is that they're not isomorphic with JSON, they have a bunch of additional stuff added on. There's Avro documents that don't have direct JSON equivalents. Which means that when a human needs to read the data, you have to transform it into some third format.
It's a core feature of Ion that the text and binary representations are isomorphic. You can take any Ion binary document and pretty-print it as an Ion text document that is exactly equivalent. You can edit that document and send it into your application, which will be guaranteed to be able to read it. Or you can take your hand-authored text data and transcode it into binary, and know that any Ion application can handle it without any extra effort.
> A big problem with Avro, BSON, and many other "binary JSON" formats is that they're not isomorphic with JSON, they have a bunch of additional stuff added on. There's Avro documents that don't have direct JSON equivalents. Which means that when a human needs to read the data, you have to transform it into some third format.
Also, field order matters in Avro but not in JSON. That bit me pretty hard once... fortunately, I found out that Python's JSON library lets you read a JSON file into an OrderedDict instead of a plain dict, so I was able to get around it.
> but I think I could do without "clobs", "sexprs", and "symbols" at this level of representation
I'm by no means a real Lisp programmer, but even I find S-Expressions more natural to write and process. And the simplicity of it allows for great editing tools too. This may be personal, but I always found JSON clunky.
Hear, hear! Posted it as I've been raving about ion (particularly s-exp support) to non-amazonians, but credit goes all to them.
It's particularly interesting to see the fixes and improvements from the actual open source cleanup effort getting to (many) internal production services.
Amazon doesn't open source things, as a general rule. It can be done but it is a lot of jumping through hoops and they generally need good reasons to do it (as opposed to a lack of good reasons not to).
Interestingly enough a JSON alternative named "ION" was just posted as a Show HN[0] about three months ago.
So now not only do we have the problem of redundant and mutually incompatible protocols (cue obligatory xkcd), but that we have so many such protocols that name collision is becoming an extra problem.
Binary values can be stored as base64 in regular old JSON as well. Yes that is bigger but same as email/MIME binary chunks are converted to base64. Email messages and attachments are handled this way, we do this everyday. Base64 does bloat by 40%ish, so the larger content could be compressed/decompressed prior to base64 encoding it and vice versa or even encrypted/decrypted on either end in software/app layer.
No need for a new protocol when doing it that way for basic things, if you need more binary (busy messaging/real-time) there are plenty of alternatives to JSON.
I love the simplicity of JSON, so do others and it is successful so many try to attach on to that success. The success part was that it was so damn simple though, most attachments just complicate and add verbosity, echoes back to XML and SOAP wars which spawned the plain and simple JSON. Adding complexity is easy and anyone can do it, good engineers take complexity and make it simple, that is damn difficult.
> Binary values can be stored as base64 in regular old JSON as well
But in JSON you'd encode that Base64 as a string and the application must know that the data isn't really a string but a blob of some type of encoding. That probably means wrapping in another struct to provide that metadata. Ion provides a terse method of doing the same while maintaining data integrity:
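A hypothetical sketch of what that can look like in Ion text (the base64 payload here is truncated to just the "GIF89a" header bytes for brevity):

    { photo: 'image/gif'::{{R0lGODlh}} }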
The 'image/gif' annotation is application specific, but all consumers know that the contents of that value are binary. In the binary Ion representation, those 43 bytes are encoded as a 45-byte value (one byte for the type marker and a second for the length in this case; as little as 47 with the annotation and a shared symbol table), making the binary representation very efficient for transferring binary data.
Since Ion is a superset of JSON, it's by definition more complex, but the complexity isn't unapproachable. Most of the engineers I worked with assumed it was JSON until coming across timestamps, annotations, or bare word symbols.
I can't decide if "JSON-superset" is technically accurate or not.
JSON's string literals come from JavaScript, and JavaScript only sortof has a Unicode string type. So the \u escape in both languages encodes a UTF-16 code unit, not a code point. That means in JSON, the single code point U+1f4a9 "Pile of Poo" is encoded thusly:
"\ud83d\udca9"
JSON specifically says this, too,
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
[… snip …]
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
Now, Ion's spec says only:
U+HHHH \uHHHH 4-digit hexadecimal Unicode code point
But if we take it to mean code point, then if the value is a surrogate… what should happen?
Looking at the code, it looks like the above JSON will parse:
1. Main parsing of \u here:
https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L2429-L2434
2. which is called from here, and just appended to a StringBuilder:
https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L1975
My Java isn't that great though, so I'm speculating. But I'm not sure what should happen.
This is just one of those things that the first time I saw it in JSON/JS… a part of my brain melted. This is all a technicality, of course, and most JSON values should work just fine.
> But if we take it to mean code point, then if the value is a surrogate… what should happen?
Surrogates are code points. The spec does not say what should happen if the surrogate is invalid (for example, if only the first surrogate of a surrogate pair is present), but neither does the JSON spec.
Java internally also represents non-BMP code points using surrogates. So, simply appending the surrogates to the string should yield a valid Java string if the surrogates in the input are valid.
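A small Java sketch of that point; the surrogate pair from the example above comes out as a single code point:

    public class SurrogatePair {
        public static void main(String[] args) {
            String poo = "\ud83d\udca9";  // U+1F4A9 written as the surrogate pair from the JSON example
            System.out.println(poo.length());                             // 2 (UTF-16 code units)
            System.out.println(poo.codePointCount(0, poo.length()));      // 1 (code point)
            System.out.println(Integer.toHexString(poo.codePointAt(0)));  // 1f4a9
        }
    }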
Is there a source for benchmarks/reviews for the various ways to represent data? As far as I see it, there are a lot of them that I'd like to hear pros/cons for: json, edn + transit (my fave), yaml, google protobufs, thrift (?), as well as Ion.
MessagePack is quite fast and the newest version has binary fields, but it lacks the rich datatypes like decimals and timestamps mentioned by another commenter. If Ion is as fast and has adequate language support, it sounds like it would be a good first choice for a new project.
Ion's advantage is that it's both strongly-typed with a rich type system, as well as self-describing.
Data formats like JSON and XML can be somewhat self-describing, but they aren't always completely. Both tend to need to embed more complex data types as either strings with implied formats, or nested structures. (Consider: How would you represent a timestamp in JSON such that an application could unambiguously read it? An arbitrary-precision decimal? A byte array?) I'm not familiar with EDN, but it appears to be in a similar position as JSON in this regard. ProtocolBuffers, Thrift, and Avro require a schema to be defined in advance, and only work with schema-described data as serialization layers. Ion is designed to work with self-describing data that might be fairly complex, and have no compiled-ahead-of-time schema.
Ion makes it easy to pass data around with high fidelity even if intermediate systems through which the data passes understand only part of the data but not all of it. A classic weakness of traditional RPC systems is that, during an upgrade where an existing structure gains an additional field, that structure might pass through an application that doesn't know about the field yet. Thus when the structure gets deserialized and serialized again, the field is missing. The Ion structure by comparison can be passed from the wire to the application and back without that kind of loss. (Some serialization-based frameworks have solutions to this problem too.)
One downside is that its performance tends to be worse than schema-based serialization frameworks like Thrift/ProtoBuf/Avro where the payload is generally known in advance, and code can be generated that will read and deserialize it. Another downside is that it's difficult to isolate Ion-aware code from the more general purpose "business logic" in an application, due to the absence of a serialization layer producing/consuming POJOs; instead it's common to read an Ion structure from the wire and access it directly from application logic.
However, it doesn't support blobs. I'm conflicted about this point. On one hand, small blobs can occasionally be useful to send within a larger payload. On the other hand, small blobs almost always become large blobs, and so I'd rather plan for out-of-band (preferably even content addressable) representations of blobs.
> Another downside is that it's difficult to isolate Ion-aware code from the more general purpose "business logic" in an application, due to the absence of a serialization layer producing/consuming POJOs; instead it's common to read an Ion structure from the wire and access it directly from application logic.
This is indeed a common pitfall, especially since traversing Ion is slow and expensive. I've squeezed up to 30% performance gain by converting Ion data to POJOs up front and just using those.
They complain about how CBOR is a superset of JSON data types and so some CBOR values (like bignum) might not down-convert to JSON cleanly, and then in the next paragraph they talk about how Ion is a superset of JSON data types including 'arbitrary sized integers'.
Bad doubletalk. Boo. (I have implemented CBOR in a couple languages and like it. Every few months we get to say, "oh look, _another_ binary JSON.")
@LVB, thanks for that. RTFM-ing made me think twice about adopting CBOR or going with Ion. I'll also mention Velocypack (https://github.com/arangodb/velocypack) while here.
Wasn't this solved already by the BSON specification - http://bsonspec.org ? Sure this allows you a definition of types, but this could easily be done using standard JSON meta data for each field. I find BSON simpler and more elegant.
* It doesn't have "true" types in the sense that Ion does. It's basically just a binary serialization of JSON, with extra stuff.
* Despite being a binary format, it's actually bulkier than JSON in most situations.
* It removes any semblance of canonicity from many representations. A number, for instance, can potentially be represented by any of at least 3 types (double, int32, and int64).
* It has signed 32-bit length limits all over the place. Not that I'd want to be storing 2GB of data in a single JSON document either, but it's not even possible to do so with BSON!
* It requires redundant null bytes in unpredictable places. For instance, all strings must be stored with a trailing null byte, which is included in their length. There's also a trailing null byte at the end of a document for no reason at all.
* It is unabashedly Javascript-specific, containing types like "JavaScript code with scope" which are meaningless to other languages.
* It also contains some MongoDB-specific cruft, such as the "ObjectID" and "timestamp" types (the latter of which, despite its name, cannot actually be used to store time values).
* It contains numerous "deprecated" and "old" features (in version 1.0!) with no guidance as to how implementations should handle them.
Yes, and not only that. It is also inherently insecure, while JSON is, together with msgpack, the only fast and secure serialization format out there. The problem is the encoding of objects and code without any checksumming, so it can be trivially tampered with, leading to very nice exploits, mostly remote.
YAML does most of those and more, and can be made quite secure by limiting the allowed types to the absolute, trusted minimum, but this is, for example, not implemented in the Perl backend, only the Python one. By default YAML is extremely insecure.
There are more new readable and typed JSON variants out there. E.g. jzon-c should be faster than ion, but there are also Hjson and SJSON. See https://github.com/KarlZylinski/jzon-c
Most of this comes from BSON also being the internal storage format for a database server. For example, at least the redundant string NULs make it possible to use C library functions without copying, the unpacked ints allow direct dereferencing, etc.
I've no clue about the trailing NUL on the record itself, perhaps a safety feature?
> I've no clue about the trailing NUL on the record itself, perhaps a safety feature?
Could be. Or perhaps there's enough code paths in common between string parsing and document parsing that they decided to put a trailing null byte on both.
Stepping back a bit, though, the fact that BSON is optimized for "direct" use in C code is really scary. That suggests that any failure to completely validate BSON data could open up vulnerabilities in C code manipulating it.
That just means == is a "lossy" equivalence relation. I rather the precision be truely observable----every number is "infinite precision". Once can always include natural as extra field if one cares about empirical precision.
I'm having a bit of trouble parsing this, but Ion decimal values are not "infinite precision". Every decimal has a very specific, finite precision. It's a standard "coefficient and exponent" model, with no specification-enforced limit on either.
The "!=" means "not the same value according to the Ion data model".
The Ion value 0.0 has one digit of precision (after the decimal point), while the value 0.00 has two. In the Ion data model, those are two distinct values, and conforming implementations must maintain the distinction.
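Java's BigDecimal draws the same kind of distinction, which may help as an analogy (a sketch, not the Ion implementation):

    import java.math.BigDecimal;

    public class PrecisionDistinct {
        public static void main(String[] args) {
            BigDecimal a = new BigDecimal("0.0");
            BigDecimal b = new BigDecimal("0.00");
            System.out.println(a.equals(b));          // false: same number, different precision
            System.out.println(a.compareTo(b) == 0);  // true: numerically equal
        }
    }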
Do any of the popular message serialization formats have first class support for algebraic data types? It seems like every one I've researched has to be hacked in some way to provide for sum types.
Almost every time I see yet another structured data format I'm surprised at the number of people who haven't ever heard of ASN.1, despite it forming the basis of many protocols in widespread use.
Usual ASN.1 caveat: parsing its specifications requires money and a lot of time, implementing many of its encodings (e.g. unaligned PER) is a lifetime's work, and even the simpler ones thousands of eyes haven't managed to get right despite years of effort (see OpenSSL, NSPR, etc)
ASN.1 also has a million baroque types (VideotexString, anyone?) where most people just need "string", "small int", "big int", etc.
> Usual ASN.1 caveat: parsing its specifications requires money and a lot of time, implementing many of its encodings (e.g. unaligned PER) is a lifetime's work
...unless you're Fabrice Bellard, who apparently wrote one just because it was one of the minor obstacles on the way to writing a full LTE base station:
We heard of it and we despise it. It's the most horrible structured data format out there in the wild. Even worse than XML, and this is quite something.
A question for frontend devs: Will H2 being binary on the wire inspire more use of binary data representations as well, with conversion to JSON only on the client? Passing around JSON or XML across a big SOA (or micro-services) architecture is a waste of cycles and doesn't have types attached for reliability and security.
Do you mean passing around binary between backend services and then having a binary->JSON "proxy" behind whatever is receiving AJAX requests from the client?
My idea was that the client (HTML+JS) will transform the binary data into JSON or skip the conversion and process it directly. Seeing how fast JS engines have become and the amount of typed binary arrays processed in JavaScript, I believe it's a viable approach. But I'm not a frontend dev, so I can't be certain.
In practice there are three properties that help with schema evolution:
1) Open types - typically applications consuming Ion data do not restrict the fields included (that is, they gracefully ignore, and often even pass along, additional fields). Schemas may grow while remaining backwards compatible with existing software.
2) Type annotations allow embedding schema information into a datagram without the need for agreeing on special fields. Datagrams may have multiple values at the top level, so it's possible to provide multiple representations without introducing a new top-level container.
3) The only data that might need to be shared between a producer and consumer is a SymbolTable, which may be applicable to several schemas and may be shared inline if necessary. Otherwise, objects in a datagram are always inspectable and discoverable without additional metadata.
This appears to be something in between JSON and Protocol Buffers. I wonder under what conditions Ion makes more sense than either.
One significant advantage is you can opt in to sharing schemas - without requiring all consumers to have your schema. Like a lot of Amazon's internal data formats, Ion is designed to support backwards compatible schemas as well (that is, adding additional fields does not break existing consumers).
It has isomorphic text and binary representations as part of the standard, making debugging or optimized transport a config option.
The type system is significantly richer than JSON and maps well to several languages (internally Amazon uses it with C, C++, Perl, Java, Ruby, etc.).
Yes, super useful. Depends on how your application needs to use S-expressions. You could define DSLs for a very expressive and complex rules-engine with S-Expressions forming your rules. Now you can write your rules as text, pass it around and build rule evaluators of those expressions all on top of Ion.
So far, most of the interesting bits I see in Ion are covered in YAML (which is also JSON-superset). Most of the rest are extra types, which YAML allows you to implement. The only really missing bit is the binary encoding... but that seems unrelated to the text format itself.
Ion's equivalently-expressive text and binary formats is absolutely central to its design, and IMO one of its most compelling features. You don't have to choose between "human readable" or "compact and fast", you can switch between them at will. This helps Ion meet the requirements of a broader set of applications, eliminating the cost and complexity and impedance-mismatch problems you get by transforming between multiple serialization formats.
I get that binary format is nice, but I just don't get why instead of adding binary format to an existing good text format Amazon decided to first extend a poor text format and then add binary to that.
Basically: Ion == JSON + extra features + binary format spec. But Ion ~= YAML + binary format spec. You're going to write a new serializer/deserializer in both cases anyway, but in the second one, at least you get the text part for free in almost any language available.
Open question to anyone reading this: Would you use Ion if you were designing a new house-wide message queue? (e.g. broadcast messages to /Home/Lounge/Lights/ to turn on/off)
Maybe, when Ion gets support for most major languages. I won't touch it now because it means going Java for every application that reads or writes to that queue. Not because of Java, but because it's only one language. It should get on par with the support for the other formats listed in the comments before one can be confident using it.
At least the JVM supports multiple languages (Scala, Clojure etc.), but if there's a spec it shouldn't be hard for anyone to add support for other languages.
This is offtopic, but I'm looking into having JSON schemas on another Mosquitto topic so that clients can request it, kinda like SOAP's WSDL (recovering C# programmer here).
Things I dislike about Ion, having used it while at Amazon:
- IonValues are mutable by default. I saw bugs where cached IonValues were accidentally changed, which is easy to do: IonSequence.extract clears the sequence [1], adding an IonValue to a container mutates the value (!) [2], etc.
- IonValues are not thread-safe [3]. You can call makeReadOnly() to make them immutable, but then you'll be calling clone since doing anything useful (like adding it to a list) will need to mutate the value. While it says IonValues are not even thread-safe for reading, I believe this is not strictly true. There was an internal implementation that would lazily materialize values on read, but it doesn't look like it's included in the open source version.
- IonStruct can have multiple fields with the same name, which means it can't implement Map. I've never seen anyone use this (mis)feature in practice, and I don't know where it would be useful.
- Since IonStruct can't implement Map, you don't get the Java 8 default methods like forEach, getOrDefault, etc.
- IonStruct doesn't implement keySet, values, spliterator, or stream, and thus doesn't play well with the Java 8 Stream API.
- Calling get(fieldName) on an IonStruct returns null if the field isn't present. But the value might also be there and be null, so you end up having to do a null check AND call isNullValue(). I'm not convinced it's a worthwhile distinction, and would have preferred a single way of doing it. You can already call containsKey to check for the presence of a field.
- In practice most code that dealt with Ion was nearly as tedious and verbose as pulling values out of an old-school JSONObject. Every project seemed to have a slightly different IonUtils class for doing mundane things like pulling values out of structs, doing all the null checks, casting, etc. There was some kind of adapter for Jackson that would allow you to deserialize to a POJO, but it didn't seem like it was widely used.