
Bond – An extensible framework for working with schematized data - dons
https://microsoft.github.io/bond/manual/bond_cs.html
======
gregwebs
The Bond compiler is written in Haskell: [http://blog.nullspace.io/bond-
oss.html](http://blog.nullspace.io/bond-oss.html)

It's about time, considering that Microsoft Research has been one of the main
funders of work on the Haskell compiler.

~~~
oscargrouch
I saw this yesterday and was pretty happy about it, but then.. saw that the
compiler was coded in Haskell..

This makes it pretty "unportable" because it's the same kind of dependency as
the Java VM. How can I distribute code with this library, with a dependency
like that, asking people to download the whole of GHC?!

Unfortunately, for libraries that should be embedded in third-party code, the
reality beyond C/C++ is pretty harsh.. for full applications the reality is
different.. but for embedded libraries.. despite the fact that I liked the
solution for something I'm doing, I had to pass because of this small detail..
and I'm too busy to write a parser in C++ to make this more portable in source
code form.. so I had to go back to protobuf :/

~~~
sapek
You need Haskell only to build the Bond compiler. Once you do that, you get a
native, stand-alone executable for your system (you don't need Haskell to run
the Bond compiler). You use the Bond compiler as part of the build process for
programs using Bond. Programs using Bond don't have any Haskell dependency.
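
For concreteness, a rough sketch of the flow based on the C# manual (the
schema and the names here are illustrative):

    // example.bond -- input to gbc, the Haskell-built compiler binary:
    //
    //   namespace Example
    //
    //   struct Record
    //   {
    //       0: string Name;
    //       1: vector<int32> Items;
    //   }
    //
    // `gbc c# example.bond` emits plain C# for Example.Record; the code
    // below depends only on Bond's .NET runtime libraries, not on Haskell.
    using Bond;
    using Bond.IO.Safe;
    using Bond.Protocols;

    var src = new Example.Record { Name = "test" };

    // Serialize with the Compact Binary protocol...
    var output = new OutputBuffer();
    var writer = new CompactBinaryWriter<OutputBuffer>(output);
    Serialize.To(writer, src);

    // ...and read it back.
    var input = new InputBuffer(output.Data);
    var reader = new CompactBinaryReader<InputBuffer>(input);
    var dst = Deserialize<Example.Record>.From(reader);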

~~~
oscargrouch
I'm sure there's no problem distributing it in binary form; the problem would
be with source code distribution, given that you add a dependency on the GHC
compiler for every dev that needs to build the program.

I'm working on something that can have a lot of dependencies on third-party
libs, so I need to minimize dependency side-effects.. so despite the fact that
I like this more than protobuf, I'll have to stick with protobuf (Cap'n Proto
had to be rewritten from Haskell to C++ because of this).

PS: Oh no, the Haskell inquisition downvotes (as expected)

~~~
ghc
You're being downvoted because your point is nonsensical. First, GHC is not
the same as requiring Java, since you don't need it to _run_ the code. If you
replaced Java with C++ your first comment would be more accurate. Second,
complaining that you need a Haskell compiler to compile the source code
version... you might as well complain that you need a C++ compiler to compile
C++ code. I know _I_ don't have a C++ compiler installed, so how is installing
GHC any more onerous than installing g++?

~~~
oscargrouch
>First, GHC is not the same as requiring Java, since you don't need it to run
the code

I think you didn't read what I wrote, or more likely, I'm explaining it
poorly (not a native speaker, sorry). This _can_ be a binary that compiles
and creates source code, but it can also often be embedded and used as a
library.. I'm guessing you are using Windows, because you've said you don't
have a C compiler at hand.. but Windows is mostly an end-user thing, and
end-users probably won't care about compiling code.. otherwise a C/C++
compiler is ubiquitous

I'm not complaining about the tool, but about its use as a library, which is
something this also aims to be, and C is a better fit for that because it can
be embedded in any language.. given the compiler is in Haskell, I can't access
the AST for instance, and I can't embed it in my binary, but have to call
another external binary instead.. but at least there's a runtime to embed..
this may be OK for some.. but I was just saying that, despite the protocol
language being very good, I couldn't use it instead of protobuf because I
would have a more limited API and my end program/goal would lose power and
flexibility.

This is a pretty technical explanation; it could be coded in Brainf*ck..
nothing against the language in itself.. it's just that it limits the use
cases of this tool (as compared with protobuf).

~~~
lmm
> end-users probably won't care about compiling code.. otherwise a C/C++
> compiler is ubiquitous

GHC is pretty ubiquitous these days. Any serious Linux distro will have a
package, so it's one line (apt-get install ghc or similar). Even on e.g. a Mac
it's no harder than installing Ruby or Python.

> given the compiler is in Haskell, I can't access the AST for instance, and I
> can't embed it in my binary, but have to call another external binary instead

You could write Haskell. It's a pretty nice language.

More to the point, Haskell does have a C FFI and allows you to build a library
that exposes a C interface that C programs can link against. I don't know
whether the authors have done that here, but the functionality is available.

------
leetrout
Slightly OT: I'm working with data sets that might change, but not often if at
all, which are provided by Elasticsearch. I'm processing the raw data in Flask
(API), munging, joining, and dropping what I don't want going out to the
world.

I've been toying with the idea of using something like PB, Cap'n Proto, or now
Bond to define and track schema changes and centralize marshaling /
serializing logic. I'm not concerned about having RPC. Does this sound like
crazy talk? Anyone else happen to track schemas against schemaless data stores?

(I also like the idea of not having to ship JSON everywhere if I don't want
to.)
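
Concretely, I'm imagining something like the sketch below (Bond IDL, field
names invented): the schema file becomes the tracked source of truth for what
the API ships, and adding a field is an explicit, backward-compatible change.

    // doc.bond -- tracked source of truth for the API's output shape
    namespace Search

    struct Doc
    {
        0: string Id;
        1: string Body;

        // Added later: payloads written before this field existed still
        // deserialize; the field just takes its default value.
        2: string Source = "elasticsearch";
    }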

~~~
seanp2k2
TL;DR: by using more of the available features in Elasticsearch, you can
probably replace all of your external app with Elasticsearch.

A few things:

- Elasticsearch is definitely not schema-less, but it can try to generate a
schema (aka "mapping") for you if you don't give it one:
[http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/mapping.html](http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/mapping.html)

- Elasticsearch has tons of ways to customize the data you get back, so,
unless you really don't want the ES cluster crunching things for you, you can
do a lot of the transformation server-side. You can go so far as to have your
own type + mapping for e.g. a report, which sources data from another type and
transforms it:
[http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-transform.html](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-transform.html)

- This covers both why the schema can't, by nature, be dynamic (so the
argument of "schema-less / dynamic schema" is BS in practice IMO), as well as
how to get data out of one index and into another (e.g. your "report" index
which does scripted transformation).

- Another idea would be to use the scripting module to write a custom "view":
[http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html)

- You can use Groovy, MVEL, JS, or Python for scripts. If you combine this
with how ES lets you do "site plugins", you could make a JS + CSS + HTML site
which is actually served by the ES cluster, which interacts with it and
generates reports or whatever, all without additional infrastructure. Example:
[https://github.com/karmi/elasticsearch-paramedic](https://github.com/karmi/elasticsearch-paramedic)

------
ziedaniel1
It's cool that the .NET version actually JITs specialized serialization and
deserialization code at runtime. This is one place where managed languages
really shine, because emitting bytecode is easier and more portable than
emitting, say, raw x86. It's also safer -- the runtime can verify the memory
safety and type safety of the code.
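
A tiny illustration of the technique (not Bond's actual implementation, which
is far more involved): compile a specialized writer once at runtime from an
expression tree, which the CLR can verify and JIT like any other method.

    // Illustrative sketch only -- not Bond's code. Builds a delegate that
    // writes every public int/string property of T; compiled once, then
    // reusable with no per-call reflection.
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq.Expressions;

    static class RuntimeWriter
    {
        public static Action<BinaryWriter, T> Compile<T>()
        {
            var w = Expression.Parameter(typeof(BinaryWriter), "w");
            var v = Expression.Parameter(typeof(T), "v");
            var body = new List<Expression>();

            foreach (var p in typeof(T).GetProperties())
            {
                // BinaryWriter.Write has overloads for int, string, etc.
                var write = typeof(BinaryWriter)
                    .GetMethod("Write", new[] { p.PropertyType });
                if (write != null)
                    body.Add(Expression.Call(w, write, Expression.Property(v, p)));
            }

            if (body.Count == 0)
                body.Add(Expression.Empty());

            return Expression.Lambda<Action<BinaryWriter, T>>(
                Expression.Block(body), w, v).Compile();
        }
    }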

~~~
Someone
Is it? To start that JIT process, you need to have a class in your code that
the compiler for the .NET version generated. Disk space is cheap nowadays,
even on mobile, so I do not see a big disadvantage in generating the
deserialization code at the same time the source code for the class gets
generated (and if you do things that way, you lose the one-time delay, and you
don't need the code that generates those serializers in your application).

What am I overlooking? What information is known at runtime that isn't already
available at build time? (And no, "the exact CPU/memory/etc. the code runs on"
is not a valid answer. This is C# code, so there always is a runtime that
handles that stuff.)

~~~
sapek
There are two advantages to generating code at runtime:

1) In some scenarios you have information at runtime that allows you to
generate much faster code. The canonical example is untagged protocols, where
the serialized payload doesn't contain any schema information and you get the
schema at runtime. Bond supports untagged protocols (like Avro) in addition to
tagged ones (like protobuf and Thrift), and the C# version generates an
insanely fast deserializer for the untagged case.

2) It allows programmatic customization. If the work is done via codegen'ed
source code, then the only way for a user to do something custom is to change
the code generator to emit modified code. Even if the codegen provides the
ability to do that, it is very hard to maintain such customizations. In Bond,
serialization and deserialization are composed from more generic abstractions:
parsers and transforms. These lower-level APIs are exposed to the user. As an
example, imagine that you need to scrub PII from incoming messages. This is a
bit like deserialization, because you need to parse the payload, and a bit
like serialization, because you need to write the scrubbed data back out. In
Bond you can implement such an operation from those underlying abstractions,
and because you can emit the code at runtime you don't sacrifice performance.
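
For contrast, here is the naive two-pass version of that scrubber using only
the basic documented C# API (a sketch; the redaction callback is
hypothetical). The parser/transform abstractions let you fuse this into a
single pass over the payload, with runtime-emitted code and no intermediate
object.

    // Baseline sketch: parse into an object, redact, write back out.
    // Bond's parser/transform layer can express the same operation as one
    // pass over the payload, skipping the materialized object entirely.
    using System;
    using Bond;
    using Bond.IO.Safe;
    using Bond.Protocols;

    static class Scrubber
    {
        public static ArraySegment<byte> Scrub<T>(ArraySegment<byte> payload,
                                                  Action<T> redact)
        {
            var reader = new CompactBinaryReader<InputBuffer>(new InputBuffer(payload));
            var obj = Deserialize<T>.From(reader);

            redact(obj);  // e.g. record => record.Ssn = ""

            var output = new OutputBuffer();
            var writer = new CompactBinaryWriter<OutputBuffer>(output);
            Serialize.To(writer, obj);
            return output.Data;
        }
    }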

BTW, Bond allows you to do something similar in C++. The underlying
metaprogramming mechanism is different (compile-time template metaprogramming
instead of runtime JIT) but the principle is the same: serialization and
deserialization are not special, but are composed from more generic
abstractions.

~~~
Someone
Ad 1): if you don't know at compile time what kind of objects you will
deserialize, you need to do more than generate the serialization code; you
also need to generate the class. So, basically, you need the entire schema
compiler. I still don't see why separating those two generation steps is a
gain.

Ad 2): does this mean that one can also do efficient schema migration at
deserialization time (rename fields, add fields with default values), or that
one can deserialize to something other than the class that got generated when
the schema was compiled?

~~~
sapek
1) I didn't do a good job explaining this. You are right that if you want to
materialize an object during deserialization you need to know a schema at
build time to generate your class. But the crucial thing is, and this is true
of all similar frameworks, you don't know the schema of the payload at that
point. One big reason you use something like Protobuf, Thrift or Bond is to
get forward/backward compatibility. What this means in essence is that
deserialization is always a mapping between the schema you built your code
with and the schema of the payload. There are two common ways to do that
mapping: (a) the payload has schema information interleaved with the data,
and you branch at runtime based on what you find in the payload (this is what
Protobuf, Thrift and Bond tagged protocols do); (b) you get the schema of the
payload at runtime and use that information to perform the mapping (this is
what Avro and Bond untagged protocols do). The latter case is particularly
suitable for storage scenarios: you read the schema from the file/stream
header and then process many records that have that schema. This is the case
where the ability to emit code at runtime results in a huge performance win:
you JIT a schema-specific deserializer once and amortize that over many
records.
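
In C# terms, the amortization looks roughly like the sketch below, using the
Deserializer class from the manual (the Record type here is a stand-in for a
real generated or attributed schema class):

    // Sketch: pay the codegen cost once, reuse across many records. For
    // untagged protocols, the payload's runtime schema would also be
    // supplied when constructing the Deserializer.
    using System;
    using System.Collections.Generic;
    using Bond;
    using Bond.IO.Safe;
    using Bond.Protocols;

    [Bond.Schema]
    class Record
    {
        [Bond.Id(0)] public string Name { get; set; } = "";
    }

    static class Reader
    {
        public static void ReadMany(IEnumerable<ArraySegment<byte>> payloads)
        {
            // Paid once: schema-specific deserialization code is built here.
            var deserializer =
                new Deserializer<CompactBinaryReader<InputBuffer>>(typeof(Record));

            foreach (var payload in payloads)
            {
                var reader =
                    new CompactBinaryReader<InputBuffer>(new InputBuffer(payload));
                // Reused per record: no further codegen cost.
                var record = deserializer.Deserialize<Record>(reader);
            }
        }
    }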

2) You can do both. You can also do type-safe transformations/aggregations/etc.
on serialized data without materializing any object.

------
sapek
There have been a lot of questions about how Bond compares to Protobuf, Thrift
and Avro. I tried to put some information on this page:
[http://microsoft.github.io/bond/why_bond.html](http://microsoft.github.io/bond/why_bond.html)

------
nly
No RPC? Disappointing. There are so few choices for C and C++ programmers with
regard to battle-tested, easy (read: code generation for decode and dispatch),
language-agnostic RPC.

~~~
bradleyankrom
Have you tried any of the MessagePack RPC implementations? I haven't, but I'm
curious.

~~~
yawniek
I recently evaluated msgpack-rpc and Thrift for a small side project.
Surprisingly, it turned out that msgpack was not only much faster but also way
easier to use (for lots of small messages).

I got around 300k msgs/s throughput with msgpack-d-rpc.

------
a_c
How would this compare with Apache Thrift?

~~~
sapek
See
[https://news.ycombinator.com/item?id=8868045](https://news.ycombinator.com/item?id=8868045)

------
sdave
How does it compare to protobuf and Thrift?

~~~
joncfoo
Quoting apc @
[https://lobste.rs/s/7w6p95/msft_open_sources_production_seri...](https://lobste.rs/s/7w6p95/msft_open_sources_production_serialization_system_written_partially_in_haskell/comments/kh9zpl#c_kh9zpl)

The current offerings (Thrift, Protobufs, Avro, etc.) tend to have similar
opinions about things like schema versioning, and very different opinions
about things like wire format, protocol, performance tradeoffs, etc. Bond is
essentially a serialization framework that keeps the schema logic the same,
while making tasks like wire format, protocol, etc., highly customizable and
pluggable. The idea being that instead of deciding Protobufs isn't right for
you, tearing it down, and starting Thrift from scratch, you just change the
parts that you don't like, but keep the underlying schema logic the same.

In theory, this means one team can hand another team a Bond schema, and if
they don't like how it's serialized, fine, just change the protocol; the
schema doesn't need to change.

The way this works, roughly, is as follows. For most serialization systems,
the workflow is: (1) you declare a schema, and (2) they generate a bunch of
source files with de/serialization code, which you add to a project and
compile into any program that needs to serialize and deserialize data.

In Bond, you (1) declare a schema, and then (2) instead of generating source
files, Bond will generate a de/serializer using the metaprogramming facilities
of your chosen language. So customizing your serializer is a matter of using
the Bond metaprogramming APIs to change the de/serializer you're generating.
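
In C#, the pluggability point looks roughly like this (a sketch;
CompactBinaryWriter and FastBinaryWriter are two of the stock Bond protocols,
and `record` is any Bond schema type): swapping the wire format means
swapping the writer, while the schema and calling code stay put.

    // Same schema, two wire formats: only the protocol writer changes.
    using Bond;
    using Bond.IO.Safe;
    using Bond.Protocols;

    static class Protocols
    {
        public static void WriteBoth<T>(T record)
        {
            var compact = new OutputBuffer();
            Serialize.To(new CompactBinaryWriter<OutputBuffer>(compact), record);

            var fast = new OutputBuffer();
            Serialize.To(new FastBinaryWriter<OutputBuffer>(fast), record);
        }
    }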

~~~
nly
Thrift _has_ pluggable protocols. It comes with 'compact' (protobuf-like),
'dense', 'binary' and JSON out of the box. It also has pluggable transports
and multiple server implementations (threaded, async, etc.). I'm personally
not seeing any innovation here... I think they just wanted their own version
of Thrift so that they could ignore the languages they don't care about.

~~~
antics
> I think they just wanted their own version of Thrift so that they could
> ignore the languages they don't care about.

I'll put it bluntly: you have no idea what you're talking about.

Bond v1 was started when Thrift was not production ready. This is Bond v3.
There is no conspiracy to make Bond hard to use for technology we "don't care
about." In general I'm fine with tempered speculation, but your conclusion
here is just lazy, and we both know it. It contributes nothing to the
conversation, and spreads FUD for no good reason. We can do better, agree?

Now, to address your comments about customization directly: pluggable
protocols are an _example_. The metaprogramming facilities of Bond are
dramatically richer than those of Thrift. A good example of these facilities:
using the standard metaprogramming API and a bit of magic, we have been able
to trick Bond into serializing _a different serialization system's_ schema
types. So, picture Bond inspecting some C# Thrift type (or something), then
populating the core Bond data structures with the data it finds there, and
then serializing it to the wire.

_This_ is the kind of power you get when you construct the serializer in
memory using metaprogramming, and then expose that to the user. The
flexibility is frankly unmatched.

~~~
nly
Microsoft could have pulled a Facebook (they improved the C++ server in
FBThrift) and forked or rewritten Thrift, reusing the IDL and existing wire
formats _if nothing else_, and dropped a kickass C# implementation. The result
would have been that D, Rust, Go, PHP, and Python developers, etc., wouldn't
have had to go off and reinvent the wheel for the Nth time, for negligible
gain in terms of tooling, and an almost certain regression in terms of
ecosystem and acceptance. I don't think it's FUD to express a bit of cynicism
here or cry NIH syndrome.

> I'll put it bluntly: you have no idea what you're talking about. Bond v1 was
> started when Thrift was not production ready. This is Bond v3.

Let me put something bluntly: I don't care who started writing code first.
Microsoft is well over half a decade late to the party. Thrift and Protobufs
have been open source since, what... 2007/8?

And frankly, at least on the C++ front, there's not much to get excited about
with regard to metaprogramming here. The Avro C++ code generator already
produces a small set of template specialisations for each record type, and
they're trivial enough to write manually for any existing classes you wish to
marshal against your schema. std namespace containers are already recognised
through partial template specialisations. MsgPack's implementation also does
this. Other, more general metaprogramming solutions, like Boost Fusion, are
also used by many, in production, for completely bespoke marshalling.

Don't get me wrong, Bond looks really nice, particularly for C# programmers,
and I have respect for the work being done, but I can't get excited about it.
It's kind of like someone announcing yet another JSON library or some new web
framework when what the industry needs is _consensus_ on formats and APIs.
Right now there are so many serialization frameworks that the de facto
standard will just continue to be schema-less JSON and robust tools will
remain largely non-existent.

~~~
sapek
I think your points are valid, especially about wire format compatibility.
Bond and Thrift are in fact close enough that providing Thrift compatibility
is a real possibility. It wasn't high enough on our priority list to make the
cut, but we do in fact use some Thrift-based systems internally. The impedance
mismatch between the type systems of Bond and Avro/Protobuf is higher, so we
will have to see about those.

I hear you on the fragmentation. I know that this doesn't help the community
as an explanation, but big companies like Facebook, Google and Microsoft
really have good reasons to control fundamental pieces of their infrastructure
like serialization. Case in point: Facebook forked their own Thrift project
because, I presume, having it as an Apache project was too restraining.

FWIW, we plan to develop Bond in public and accept contributions from the
community.

------
drivingmenuts
And yet they still can't build a web page that isn't a shitshow.

The main content has horizontal scroll on portrait monitors, and slides under
the transparent fixed div they used for navigation.

