I'm building a general-purpose data format for the modern age. The old ones are too bulky, too insecure, and too limiting.
* Secure: As a tightly specified format, Concise Encoding doesn't suffer from the security problems that the more loosely defined formats do. Everything is done one way only, leaving less of an attack surface.
* Efficient: As a twin binary/text format, Concise Encoding retains the text-based ease-of-use of the old text formats, but is stored and transmitted in the simpler and smaller binary form, making it more secure, easier on the energy bill, and easier on the planet.
* Versatile: Supports all common types natively. 90% of users won't need any form of customization.
* Future-proof: As a versioned format, Concise Encoding can respond to a changing world without degenerating into deprecations and awkward encodings or painting itself into a corner.
* Plug and play: No extra compilation steps or special description formats or crazy boilerplate.
Similar, yes, but Ion repeats a number of mistakes from the past:
* The text format doesn't have a version specifier, so any future changes to the text format will break existing codecs and documents.
* Uses ISO 8601, which has bad defaults (local timezone) and doesn't support real time zones (only offsets). ISO 8601 is also too big for what it does and non-free, resulting in inconsistent implementations (and thus security problems).
* Doesn't support chunking (so you must always know the full length before you start encoding).
* Lists must be prefixed with their length, so you must know the length before you start encoding. Converting a list from the text format to binary format would require loading the entire list into memory first.
* Only supports "struct" encoding, so you can't have real maps (for example using integers or dates as keys).
* Doesn't support arbitrarily long scalar values.
* Doesn't support NaN or infinity.
* Doesn't allow comments to be preserved when converting to binary.
* Doesn't support arrays.
* Doesn't support recursive data or references.
* Doesn't have a URL or UUID or error type.
* Integer, date, and decimal float encodings are inefficient.
* Allows multiple sections, each with their own version declaration as a feature for mixing specification versions in the same document. This means that ALL codecs must support ALL versions forever, increasing bloat and attack surface.
I think your format is closer to ideal than a few others I have seen. Like json, you don't have to prefix size of lists or objects, which is a plus (or in some cases needed) for streaming. Strings can be chunked, which can serve the same purpose for huge strings. The power of json is in its simplicity though. And a binary format should retain that as much as possible. The big advantage of binary formats is not the 'conciseness' (there are compression algorithms that can help with that, where needed) but the reduction in lines of code and CPU cycles needed to serialize and deserialize. Json requires escaping/unescaping of strings and base-2 vs base-10 conversion of numbers. A binary format should mostly solve these and beyond that be as simple to implement bug-free and secure as possible.
Yes, that is exactly the aim of the binary format. It has a few basic concepts like being byte-oriented, 1-byte headers for most things, ULEB128 encoding for large values, same chunking mechanism for all arrays and string-likes, same "open/close" mechanism for all container types, etc.
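For anyone who hasn't run into ULEB128 before, here is a minimal Python sketch of the general technique: 7 payload bits per byte, least significant group first, with the high bit set on every byte except the last. The function names are made up for illustration, and this only shows generic ULEB128, not the exact field layout of any particular Concise Encoding type.

```python
from typing import Tuple


def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as ULEB128."""
    if value < 0:
        raise ValueError("ULEB128 encodes unsigned integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits go out first
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def uleb128_decode(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one ULEB128 value; return (value, offset after it)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated ULEB128 value")
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7
```

Small values cost a single byte (`uleb128_encode(127)` is `b"\x7f"`), and the decoder never needs to know the width up front, which is what makes it a reasonable fit for "large values" in a byte-oriented format.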
The binary codec is VERY simple, and can be trivially implemented for an async-safe or otherwise constrained environment. In fact, I expect that many implementations will only build the binary codec, since you could just pass any recorded binary data through enctool [1] or whatever to see or manipulate its contents as a human. Most systems would have no need to process the text format.
My ideal binary format would contain exactly the types supported by json + binary - arbitrary size numbers. It would be something like this, maybe:
- 'n' - null
- 'f' - false
- 't' - true
- 'i' - int (64 bit, little endian, two's complement)
- 'd' - double (64 bit, ieee-754)
- 'a' - array start
- 'o' - object start (must contain even number of values)
- 'e' - array/object end
- 's' - utf-8 string chunk (64 bytes, must be followed by string chunk or string)
- 'b' - binary chunk (64 bytes, must be followed by binary chunk or binary)
- 0x80-0xbf - utf-8 string (0-63 bytes)
- 0xc0-0xff - binary (0-63 bytes)
It is at least an order of magnitude simpler to implement than almost all of its contenders. It is just not as concise and doesn't support a rich set of data types (but then again, json does quite well without a timestamp type). It supports streaming. It has a unique representation for every value (if you ignore that ieee-754 can represent many integers and objects have no defined ordering). Reading a single byte can tell you in every case the type and the amount of payload data that follows.
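To give a feel for how little code that takes, here is a rough Python sketch of a decoder for the tag list above. It is only an illustration under a few assumptions that the list doesn't spell out: doubles are little-endian like the ints, object keys are hashable (so the result fits in a `dict`), and all the function names are invented.

```python
import struct
from io import BytesIO

END = object()  # sentinel for the 'e' (array/object end) tag


def _read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of input")
    return data


def _read_chunked(stream, tag, chunk_tag, final_base):
    # Full 64-byte chunks come first; a short piece (0-63 bytes) ends the value.
    parts = []
    while tag == chunk_tag:
        parts.append(_read_exact(stream, 64))
        tag = _read_exact(stream, 1)[0]
    if not final_base <= tag < final_base + 0x40:
        raise ValueError("chunk not followed by a chunk or final piece of the same kind")
    parts.append(_read_exact(stream, tag - final_base))
    return b"".join(parts)


def _read_items(stream):
    # Read values until the matching 'e' tag.
    items = []
    while True:
        item = read_value(stream)
        if item is END:
            return items
        items.append(item)


def read_value(stream):
    tag = _read_exact(stream, 1)[0]
    if tag == ord("n"):
        return None
    if tag == ord("f"):
        return False
    if tag == ord("t"):
        return True
    if tag == ord("i"):
        return struct.unpack("<q", _read_exact(stream, 8))[0]
    if tag == ord("d"):
        # Assumption: little-endian, to match the int encoding.
        return struct.unpack("<d", _read_exact(stream, 8))[0]
    if tag == ord("a"):
        return _read_items(stream)
    if tag == ord("o"):
        items = _read_items(stream)
        if len(items) % 2:
            raise ValueError("object must contain an even number of values")
        return dict(zip(items[0::2], items[1::2]))
    if tag == ord("e"):
        return END
    if tag == ord("s") or 0x80 <= tag <= 0xBF:
        # Decode only after joining, since a chunk boundary may split a character.
        return _read_chunked(stream, tag, ord("s"), 0x80).decode("utf-8")
    if tag == ord("b") or tag >= 0xC0:
        return _read_chunked(stream, tag, ord("b"), 0xC0)
    raise ValueError(f"unknown tag byte 0x{tag:02x}")


# Example document: an object with an int field and an array field.
doc = b"".join([
    b"o",
    b"\x82id", b"i", struct.pack("<q", 7),
    b"\x84tags", b"a", b"t", b"n", b"\xc2\x00\x01", b"e",
    b"e",
])
value = read_value(BytesIO(doc))  # {'id': 7, 'tags': [True, None, b'\x00\x01']}
```

A real implementation would also want a nesting depth limit and a cap on total string size, since the sketch will happily recurse or buffer as far as the input tells it to.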
> Uses ISO 8601, which has bad defaults (local timezone) and doesn't support real time zones (only offsets). ISO 8601 is also too big for what it does and non-free, resulting in inconsistent implementations (and thus security problems).
Ion doesn’t use ISO8601; it has its own timestamp specification, which is free, and much smaller than ISO8601; it’s effectively a subprofile of a profile specified in a W3C note of ISO8601, but that relationship is mostly of historical interest, since it isn’t specified as a (sub)profile, but independently. It also has a UTC default.
> Lists must be prefixed with their length, so you must know the length before you start encoding.
True.
> Converting a list from the text format to binary format would require loading the entire list into memory first.
You’d have to process the entire list before starting to write it to the final format, because the format is optimized for read efficiency not write efficiency. But you don’t need to read the whole list into memory; you can use a state machine with a very small window to count the items in the list, write the total, and then start copying items from text to binary, or you could just count as you are processing items into a scratch file, then write the length and copy the scratch file contents.
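A rough Python sketch of the scratch-file variant, just to show that memory use stays constant. The function name is made up, the items are assumed to arrive already encoded one at a time, and the fixed 4-byte little-endian count is a placeholder for illustration, not Ion's actual length encoding.

```python
import shutil
import struct
import tempfile
from typing import BinaryIO, Iterable


def write_counted_list(encoded_items: Iterable[bytes], out: BinaryIO) -> int:
    """Stream items to a scratch file while counting, then emit count + items."""
    count = 0
    with tempfile.TemporaryFile() as scratch:
        for item in encoded_items:
            scratch.write(item)  # only one item is held in memory at a time
            count += 1
        # Placeholder prefix: 4-byte little-endian item count.
        out.write(struct.pack("<I", count))
        scratch.seek(0)
        shutil.copyfileobj(scratch, out)  # copies in fixed-size blocks
    return count
```

The same shape works for a byte-length prefix: take `scratch.tell()` after the loop instead of counting items.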
Length prefixing variable length data values is kind of important though; I’d consider it a major strike against a binary format if it didn’t do that. (Though Ion does have the problem that its length records are also variable length.)
> Only supports "struct" encoding, so you can't have real maps (for example using integers or dates as keys).
You can, since it supports type annotations and has formats that, with annotations, can communicate that. (A list would probably be the normal choice.)
> Doesn't support arbitrarily long scalar values.
Yes, it does, which is why the length fields for most values, including most scalars, in the binary format use VarUInt.
> Doesn't support NaN or infinity.
Yes, it does. (It doesn’t, in text format or data model, distinguish different NaNs, but it supports NaN, +Inf, and -Inf.)
> Integer, date, and decimal float encodings are inefficient.
Ion has arbitrary precision exact decimals, not decimal floating point, so, no, it doesn’t have an inefficient encoding for decimal floating point.
> it’s effectively a subprofile of a profile specified in a W3C note of ISO8601, but that relationship is mostly of historical interest, since it isn’t specified as a (sub)profile, but independently. It also has a UTC default.
Oh good, that's an improvement. Unfortunately, nobody thought to include real time zones :/
> You’d have to process the entire list before starting to write it to the final format, because the format is optimized for read efficiency not write efficiency.
Knowing how many objects are in the list won't help efficiency because you still don't know how big each element is. So you still need to walk through the list regardless. With that in mind, there's no advantage to a size field over an end marker, but there are disadvantages.
> Length prefixing variable length data values is kind of important though; I’d consider it a major strike against a binary format if it didn’t do that. (Though Ion does have the problem that its length records are also variable length.)
I did add typed arrays to CE to support efficient storage of monosized data types such as bool and int and float. Those have chunked size prefixes.
> You can, since it supports type annotations and has formats that, with annotations, can communicate that. (A list would probably be the normal choice.)
I'm just not a fan of requiring users to massage data structures and annotations to get basic type behaviors. Technically you can get everything you need from XML too, but the costs...
>> Doesn't support arbitrarily long scalar values.
> Yes, it does, which is why the length fields for most values, including most scalars, in the binary format use VarUInt.
Ah cool, didn't know that, thanks!
>> Doesn't support NaN or infinity.
> Yes, it does. (It doesn’t, in text format or data model, distinguish different NaNs, but it supports NaN, +Inf, and -Inf.)
As I understood it, the text format didn't have "nan" or "inf" literals... unless I missed it somewhere?
https://concise-encoding.org
Reference implementation (golang): https://github.com/kstenerud/go-concise-encoding
Enctool, converter for playing around with the format: https://github.com/kstenerud/enctool