
Data serialization - soasme
https://enqueuezero.com/data-serialization.html
======
rollulus
Dealing with data as my day job, I've become highly sensitive to schemas. And
I hate those "schemaless" (aka schema-on-read) serialization formats more and
more. No, there is no schemaless; there is a schema, but it is buried in your
code in a convoluted way on each line where you read and interpret your
deserialised data, and all the tests and assumptions you have there are a
horrible representation of your schema. That is an important dimension on
which to compare serialization formats, and JSON is a complete loser in that
sense. Not to mention how flawed its numeric type is.

~~~
josephg
I really wish we had a JSON 2 format which could fix JSON's obvious
shortcomings. I would like to see:

- Support for Maps and Sets (unlike objects, maps allow arbitrary types to be
used as keys)

- A standard Date format

- Embedded binary blobs. No idea how to do this and keep it human readable,
but when you need this it's super useful. Maybe something similar to WS's
binary message encoding.

- Arbitrary precision integer support. This is particularly useful for
cryptocurrencies and for interoperability with 64 bit integers in other
languages. And bigints are coming to JavaScript:
[https://github.com/tc39/proposal-bigint](https://github.com/tc39/proposal-bigint)

- Maybe even fix JSON's weird unicode encoding:
[http://timelessrepo.com/json-isnt-a-javascript-subset](http://timelessrepo.com/json-isnt-a-javascript-subset)

Unfortunately it seems like nobody 'owns' JSON enough to give JSON 2.0 the
political weight it would need for cross-language support.
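
For illustration, a minimal Python sketch of the 64-bit integer problem:
decoders that map JSON numbers to IEEE-754 doubles (as JavaScript's built-in
parser does) silently lose precision above 2^53. Python's `parse_int` hook is
used here only to emulate such a decoder; the payload is hypothetical.

    import json

    # Hypothetical payload: a 64-bit integer, e.g. a cryptocurrency amount.
    big = 2**63 - 1                               # 9223372036854775807
    text = json.dumps(big)                        # "9223372036854775807"

    # Emulate a decoder that stores all numbers as IEEE-754 doubles.
    as_double = json.loads(text, parse_int=float)
    print(as_double)                              # 9.223372036854776e+18
    assert int(as_double) != big                  # precision silently lost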

~~~
teekert
I'm not an expert, just a bio-data-scientist learning every day... But
wouldn't YAML be what you are looking for?

~~~
pletnes
JSON is for machines, YAML is for humans, as a general rule. They're mostly
compatible feature-wise.
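
A small illustration of that compatibility, assuming the third-party PyYAML
package (`pip install pyyaml`): a YAML loader will generally accept a JSON
document as-is, since YAML 1.2 is a near-superset of JSON.

    import yaml

    # A plain JSON document is also valid YAML flow syntax.
    doc = '{"name": "test", "values": [1, 2, 3]}'
    print(yaml.safe_load(doc))   # {'name': 'test', 'values': [1, 2, 3]}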

~~~
teekert
Yaml is well readable by a human but it's much to tightly defined to be "for
humans". It's for machines but readable by humans I'd say. Plus it's got sets
and you can embed csv's (just 2 things of the top of my head).

------
q3k
re: protobufs

> It requires the program doing data parsing work to have the generated
> library as well. This would generally cause problem when schema modified.

That's not the case, and is the exact reason why you need to specify tag
numbers in protos - so that you can make your schema forward and backward
compatible when decoding/encoding from/to the wire format.
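
A minimal sketch of why tag numbers buy that compatibility. This is
deliberately not the real protobuf wire format, just a toy tag-length-value
encoding: because the decoder is keyed by field number, it can skip tags it
doesn't know, so an old reader tolerates messages from a new writer.

    def encode(fields):
        """fields: dict mapping tag number -> bytes value.
        Toy format: 1-byte tag, 1-byte length, then the value."""
        out = bytearray()
        for tag, value in fields.items():
            out += bytes([tag, len(value)]) + value
        return bytes(out)

    def decode(data, known_tags):
        """Keep only tags this (possibly older) schema knows about;
        unknown tags are skipped instead of breaking the parse."""
        result, i = {}, 0
        while i < len(data):
            tag, length = data[i], data[i + 1]
            if tag in known_tags:
                result[tag] = data[i + 2 : i + 2 + length]
            i += 2 + length
        return result

    # A "v2" writer adds tag 3; a "v1" reader knowing only tags 1 and 2
    # still decodes the message without any regenerated library.
    wire = encode({1: b"alice", 2: b"admin", 3: b"added-in-v2"})
    print(decode(wire, known_tags={1, 2}))   # {1: b'alice', 2: b'admin'}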

------
deathanatos
I would also recommend CBOR ([http://cbor.io/](http://cbor.io/)); one can
think of it as a binary form of JSON. It has a few advantages:

  * datetime objects
  * binary blobs

It is very similar to MsgPack in nature. However, MsgPack, particularly on
Python, handles the text/bytes separation poorly, and CBOR is backed by an
RFC.
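
A small sketch of those two advantages, using the third-party cbor2 package
as one possible implementation (assumption: `pip install cbor2`):

    from datetime import datetime, timezone
    import cbor2

    payload = {
        "created": datetime(2018, 1, 1, tzinfo=timezone.utc),  # tagged datetime
        "blob": b"\x00\x01\x02\xff",                           # raw bytes, no Base64
    }
    wire = cbor2.dumps(payload)
    print(cbor2.loads(wire))   # both values round-trip with their types intact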

~~~
camgunz
CBOR is MessagePack. At least cbor-ruby started with the MessagePack sources.
The story is that Carsten took MessagePack, wrote a standard and added some
things he wanted, and called it something else.

I wrote [1] a pretty comprehensive (and admittedly biased) critique of the
CBOR standard last year.

[1]
[https://news.ycombinator.com/item?id=14072598](https://news.ycombinator.com/item?id=14072598)

Disclaimer: I wrote and maintain a MessagePack implementation.

------
linschn
S-Expressions are missing. Language support is poor (unless you're
programming in a Lisp dialect, in which case it's built in), but I do believe
they are the best serialization format out there:

[http://wiki.c2.com/?XmlIsaPoorCopyOfEssExpressions](http://wiki.c2.com/?XmlIsaPoorCopyOfEssExpressions)

They have a canonical representation:

[https://en.wikipedia.org/wiki/Canonical_S-expressions](https://en.wikipedia.org/wiki/Canonical_S-expressions)

I swear I have seen a proposal for an efficient binary representation
somewhere but I can't find it.

~~~
wodenokoto
As far as I understand, S-expressions are completely code-as-data, so how do
you protect yourself from malicious code execution when loading S-expressions?

~~~
kazinator
By simply not passing any of these parsed expressions to your _eval_ function.

ANSI Common Lisp presents a pitfall here in that it features read-time
evaluation via the #. (hash dot) syntax. For instance, #.(+ 2 2) produces the
object 4. After seeing #., the reader scans the (+ 2 2) expression, evaluates
it immediately, and substitutes the result. When reading untrusted data in
Common Lisp, the `*read-eval*` variable must be set to _nil_ to disable
hash-dot.

Lisps that don't have a read-time-eval escape mechanism don't require any
such precaution.
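
The same principle carries over to other languages; as a Python analogy,
`ast.literal_eval` parses literal data without ever executing it, unlike
`eval`:

    import ast

    untrusted = "[1, 2, {'key': 'value'}]"
    print(ast.literal_eval(untrusted))   # safe: parses literals only

    malicious = "__import__('os').system('echo pwned')"
    try:
        ast.literal_eval(malicious)      # rejected: not a literal
    except ValueError:
        print("refused to evaluate")
    # eval(malicious) would have run the shell command.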

------
plq
A serialization format has no inherent relation to schemas. Also, schema
validation can be as strict or as lax as you want it to be.

I wrote Spyne (in Python) mainly to abstract the schema away from the
serialization format. The protocols are totally pluggable, and your code
cares only about the models, not the serialization format. The latest alpha
came out not long ago. Check it out if this sounds interesting.

[https://github.com/arskom/spyne](https://github.com/arskom/spyne)

Code generator: [http://spyne.io](http://spyne.io)

~~~
mountainplus
This sounds interesting and I would have liked to read more but currently I'm
getting timeout issues trying to visit spyne.io. Could you please check?

------
jihadjihad
Protobufs and similar projects (gRPC, Cap'n Proto) seem really interesting,
but I haven't yet come across a time at work where it made sense to the team
to adopt them. Maybe that's just my own inexperience, but the serialization
scheme is low on the list compared to optimizing DB queries, getting rid of
bloat in the app, etc. I've been waiting for an excuse to adopt this stuff at
work because it seems really cool!

~~~
buckhx
Defining your schema upfront and having types generated in multiple languages
is the real value add. The performance is a nice cherry on top.

~~~
kitd
More specifically, being able to agree on schemas and communicate them to
other collaborators is a big win when trying to avoid interfacing bugs.

OTOH schemas seem less necessary if you are working on a completely self-
contained app.

------
udp
Regarding JSON, this page says:

 _> Performance is not good when dataset is huge. Program usually needs to
load all data into memory first._

This is just downright false. There are plenty of SAX-style JSON parsers.
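
For example, a sketch with the third-party ijson package (the file name and
structure are hypothetical), which walks one huge top-level JSON array
element by element:

    import ijson

    with open("huge.json", "rb") as f:
        # "item" selects each element of the top-level array; only one
        # element is materialized in memory at a time.
        for record in ijson.items(f, "item"):
            print(record)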

~~~
occamrazor
Yes. This is a surprising claim in the OP, especially after it says that XML
has high performance.

~~~
soasme
I removed the performance piece, mostly because it's meaningless to discuss
performance outside the context of a specific implementation and benchmarks.

The catch above is exactly my point, so I tweaked the wording to:
`Performance is generally not good when the dataset is huge, unless you use a
library that supports streaming parsing or writing.`

------
jdeca568
Cool, in the ~same vein, SBE is definitely worth a look.

[https://github.com/real-logic/simple-binary-encoding](https://github.com/real-logic/simple-binary-encoding)

------
tannhaeuser
Don't we do CSV/TSV anymore? It's certainly the most basic schema-less data
format imaginable, and you can even have field names using the convention
that the first row contains them.

It might be considered out of fashion today in our staged this-vs-that
culture, but I don't see anything wrong with it. Tab-separated values are
just data fields separated by a distinguished character, with rows separated
by another distinguished character; as simple as it gets, and no API required.

~~~
imtringued
It's complicated enough that you need a library to properly use it, and
simultaneously simple enough that people think they don't need a library.
Then there are the billions of possible variations: nobody can agree on a
common column separator (is it ',', ';' or '\t'?), the newline separator, or
whether the first row is a header, and you can only store data in first
normal form. CSV is so terrible that even Excel spreadsheets are a better
data exchange format.
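
A quick Python sketch of why the "trivial" approach breaks: quoted fields may
contain the delimiter and even newlines, which a naive split mishandles in
both ways.

    import csv, io

    raw = 'name,notes\n"Smith, John","line one\nline two"\n'

    # Naive parsing: splits inside the quoted field on both the comma
    # and the embedded newline.
    print(raw.splitlines()[1].split(","))
    # ['"Smith', ' John"', '"line one']

    # A real CSV parser handles the quoting correctly.
    print(list(csv.reader(io.StringIO(raw))))
    # [['name', 'notes'], ['Smith, John', 'line one\nline two']]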

~~~
tannhaeuser
Nicely said :) but come on. CSV parsing is trivial.

------
vanderZwan
I wonder why FlatBuffers aren't a more popular option[0]. The format isn't
even mentioned in the article. I have never used it (since my needs were not
that advanced), but it seems to combine quite a few amazing qualities of
other serialization formats, while adding a few of its own. It even has a
version without a schema called "FlexBuffers"[1]. I would expect anyone
considering ProtoBuffers and MsgPack to give FlatBuffers a look.

Two five-minute lightning talks by Van Oortmerssen, who created the format
(another reason to take a closer look IMO), that show off some of the
powerful features:

 _Lightning Talk: FlatBuffers (2015)_

[https://www.youtube.com/watch?v=olmL1fUnQAQ](https://www.youtube.com/watch?v=olmL1fUnQAQ)

 _Going Further with FlatBuffers_

[https://www.youtube.com/watch?v=90ND0yQVYg8](https://www.youtube.com/watch?v=90ND0yQVYg8)

[0]
[https://google.github.io/flatbuffers/](https://google.github.io/flatbuffers/)

[1]
[https://google.github.io/flatbuffers/flexbuffers.html](https://google.github.io/flatbuffers/flexbuffers.html)

~~~
vvanders
I've used it in production, can confirm it is awesome.

FlatBuffers destroys ProtoBuf both in terms of allocations (1 vs N, where N
is usually huge) and deserialization speed.

You also get explicit control over your data layout in memory when you write
the message. So if you know what your read patterns are up-front you can get
amazing cache usage even in managed languages.

------
mabbo
One topic not brought up in this is versioning/change management. While I
have my criticisms of JSON, most JSON parsers have the advantage that if I
decide to add a new field to my object, they'll just consider my old data
without that field to have a null value there. XML, same thing.

How does protobuf or MsgPack handle that? Aren't they both trying to align
data by byte number at some deep level? Or do I not understand them?

~~~
neuland
Protobuf has supported this for as long as I've used it. You add an optional
field for the new data. And there are a couple of other considerations.
(Docs: [0])

Fields have a number, which you can/should mark as reserved after you take
the field out. Then future developers won't reuse a number that meant
something else in the past. [1]

Since version 3, all fields are optional (they removed the `required` field
feature). So your app has to check that the fields you want (in that version
of your app) are present.

[0] [https://developers.google.com/protocol-buffers/docs/proto3#updating](https://developers.google.com/protocol-buffers/docs/proto3#updating)

------
drej
> Performance is not good when dataset is huge. Program usually needs to load
> all data into memory first.

You could use a streaming parser or, more easily, NDJSON (assuming your data
is huge because it's a list of stuff). Just save each JSON object as a line
in a file and then stream the file line by line, parsing only one line at a
time. That's fairly fast and allocates very little memory.
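
A minimal sketch of that approach (file name hypothetical):

    import json

    with open("records.ndjson") as f:
        for line in f:                 # one JSON document per line
            record = json.loads(line)  # only this line is ever in memory
            print(record)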

------
k__
I read somewhere that the problem with MsgPack is that JSON has a rather fast
parser built into JavaScript that beats the MsgPack parser. So you would have
to check whether the saved bandwidth is enough to justify the slower parsing.

It would be interesting to see whether this still holds true with a WASM
implementation.

~~~
udp
I used to think binary formats for network protocols were a really good idea
until I found out how big the TCP header itself can be. Saving a few bytes
from your payload by using binary representations of integers doesn't make a
huge difference when the TCP header alone can be up to 60 bytes.

~~~
dmm
Just to add another dimension to your analysis, if you're sending large
binaries the 33% overhead of Base64 adds up pretty quickly. It may not apply
to your use case but if you are delivering images or video over a websocket it
can make a big difference.
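
The overhead is easy to measure; a tiny sketch:

    import base64, os

    blob = os.urandom(300_000)            # ~300 KB of binary data
    encoded = base64.b64encode(blob)
    print(len(encoded) / len(blob))       # 1.333...: the 33% overhead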

~~~
dchest
On the other hand, gzip can reclaim most of that 33% with Huffman encoding.

~~~
Const-me
Only in some cases. Also it increases latency.

Here’s my old article about Microsoft’s binary XML format:
[http://const.me/articles/net-tcp/](http://const.me/articles/net-tcp/). As
you can see, for some payloads it compresses XML to 9% of the original size.

The data contract serializer included in .NET framework (and even in .NET
Core) can read & write objects directly from/to this binary serialization
format, without intermediate text XML anywhere.

------
manigandham
> Use Thrift if you're developing RPC services.

There's gRPC now, which uses Protobuf messages and is much better than Thrift.

[https://grpc.io](https://grpc.io)

~~~
trumpeta
Could you expand on that a bit? I'm interested in promoting gRPC over thrift
at work, but so far the only benefits I see are the HTTP/2 transport which
allows for better load balancing and request tracing on the transport level.

~~~
manigandham
gRPC with Protobuf has been faster in our usage and also has more development
these days compared to Thrift. It's also simpler and better designed by just
sticking to a single well-tested serialization format.

HTTP/2 is also a big advantage since it's standardized and easily integrates
into many existing load balancers and proxies like Envoy and nginx, both of
which now natively support gRPC directly too.

------
guohaochuan
Our company implemented two RPC protocols (don't ask me why...).

One is based on JSON, the other is a modified Thrift. The schemaless nature
is among the many problems of JSON, but it's not the most annoying one. That
would be the lack of binary support.

Imagine having to build an upload/download storage service with JSON-RPC...
So, as a rule of thumb, I would consider:

1. Web facing? JSON would do.

2. RPC? Pick Protobuf or Thrift or any protocol supporting native types,
including binary.

------
dmm
I'm working on a product where I had to serialize 100KB+ BLOBs in two places.
The first was over stdout from a C program to an Elixir application, and the
second was over a binary websocket, specifically a Phoenix channel.

In both cases I'm using MsgPack. It was easy enough to implement and I like
how it has real numeric types.
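
A small sketch of that with the third-party msgpack package (assumption:
`pip install msgpack`): bytes stay bytes and integers stay integers, with no
Base64 detour.

    import msgpack

    blob = bytes(range(16))
    wire = msgpack.packb({"id": 42, "payload": blob})
    decoded = msgpack.unpackb(wire)
    assert decoded["payload"] == blob   # round-trips as real binary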

------
willvarfar
Here are some benchmarks for Python and Java:
[https://stackoverflow.com/questions/9884080/fastest-packing-of-data-in-python-and-java](https://stackoverflow.com/questions/9884080/fastest-packing-of-data-in-python-and-java)

------
cozzyd
If you're storing tons of numbers you might consider scientific formats like
HDF5, ROOT or FITS.

~~~
pletnes
Add NetCDF4, which builds upon HDF5.

~~~
cozzyd
Ah yes! I've actually used it before (to read in atmospheric data) but I
couldn't remember the name.

------
gglitch
OT, but I'm excited to see TiddlyWiki in the wild.

~~~
soasme
nice catch bro.

------
souenzzo
Check out EDN. It's like JSON, but you can extend it.

