Bond – An extensible framework for working with schematized data (microsoft.github.io)
99 points by dons on Jan 10, 2015 | 41 comments

The Bond compiler is written in Haskell: http://blog.nullspace.io/bond-oss.html

It is about time, considering that Microsoft Research has been one of the main funders of work on the Haskell compiler.

I saw this yesterday and was pretty happy about it, but then I saw that the compiler was written in Haskell.

This makes it pretty "unportable", because it's the same kind of dependency as the Java VM. How can I distribute code with this library, with a dependency like that, asking people to download the whole GHC?!

Unfortunately, for libraries that should be embedded in third-party code, the reality beyond C/C++ is pretty harsh. For full applications the reality is different, but for embedded libraries... Despite the fact that I liked the solution for something I'm doing, I had to pass because of this small detail, and I'm too busy to write a parser in C++ to make this more portable in source code form. So I had to go back to protobuf :/

You need Haskell only to build the Bond compiler. Once you do that, you get a native, stand-alone executable for your system (you don't need Haskell to run the Bond compiler). You use the Bond compiler as part of the build process for programs using Bond. Programs using Bond don't have any Haskell dependency.

Just to emphasize (so that my comment doesn't get misunderstood somehow): this tool is really good and well designed, so thank you for sharing it with us!

I'm sure there's no problem distributing it in binary form. The problem would be in source code distribution, given that you add a dependency on the GHC compiler for every dev who needs to build the program.

I'm working on something that can have a lot of dependencies on third-party libs, so I need to minimize the dependency side effects. So despite the fact that I like this more than protobuf, I'll have to stick with protobuf (Cap'n Proto had to rewrite from Haskell to C++ because of this).

PS: Oh no, the Haskell inquisition downvotes (as expected)

You're being downvoted because your point is nonsensical. First, GHC is not the same as requiring Java, since you don't need it to run the code. If you replaced Java with C++ your first comment would be more accurate. Second, complaining you need a compiler for Haskell to compile the source code version...you might as well complain you need a C++ compiler to compile your code. I know I don't have a C++ compiler installed, so how is installing GHC any more onerous than installing G++?

>First, GHC is not the same as requiring Java, since you don't need it to run the code

I think you didn't read what I wrote or, more likely, I'm explaining it poorly (not a native speaker, sorry). This can be a binary that compiles and creates source code, but it can also, and often will, be embedded and used as a library. I'm guessing you are using Windows because you said you don't have a C compiler at hand, but Windows is mostly an end-user thing, and end users probably won't care about compiling code. Otherwise a C/C++ compiler is ubiquitous.

I'm not complaining about the tool, but about its use as a library, which is something it also aims to be, and C is a better fit for that because it can be embedded in any language. Given that the compiler is in Haskell, I can't access the AST, for instance; I can't embed it in my binary, but have to call an external binary instead. At least there is a runtime to embed. This may be OK for some, but I was just saying that, despite the protocol language being very good, I couldn't use it instead of protobuf because I would have a more limited API and my end program/goal would lose power and flexibility.

This is a purely technical objection; it could have been coded in Brainf*ck and I'd say the same. Nothing against the language itself; it's just that it limits the use cases of this tool (as compared with protobuf).

> end users probably won't care about compiling code. Otherwise a C/C++ compiler is ubiquitous

ghc is pretty ubiquitous these days. Any serious linux distro will have a package so it's one line (apt-get install ghc or similar). Even on e.g. a mac it's no harder than installing ruby or python.

> Given that the compiler is in Haskell, I can't access the AST, for instance; I can't embed it in my binary, but have to call an external binary instead

You could write Haskell. It's a pretty nice language.

More to the point, Haskell does have a C FFI and allows you to build a library that exposes a C interface that C programs can link against. I don't know whether the authors have done that here, but the functionality is available.

No, I'm a long-time Linux user. I don't install G++ because I don't need it (I work in Haskell and Python), and I try not to install unnecessary packages, or I remove them when I'm finished. You do generally need gcc (or clang) on Linux, but not all distros ship g++ with their gcc.

I don't think you really understand what GHC is. GHC can compile Haskell down to C or assembly, and has an FFI to make Haskell embeddable in C. The runtime for GHC is not like the runtime for Java or other VM/interpreter-based languages... Haskell can be compiled and turned into a shared library to be distributed with your code.

Can you please explain how you connected Haskell with Java in your mind? I'm honestly curious.

Slightly OT- I'm working with data sets that might change, but not often if at all, which are provided by Elasticsearch. I'm processing the raw data in Flask (API), munging, joining, and dropping what I don't want going out to the world.

I've been toying with the idea of using something like PB, Cap'n Proto, or now Bond to define and track schema changes and centralize marshaling / serializing logic. I'm not concerned about having RPC. Does this sound like crazy talk? Anyone else happen to track schemas against schemaless data stores?

(I also like the idea of not having to ship JSON everywhere if I don't want to.)
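To make the question concrete, here is a minimal sketch of what "tracking a schema against a schemaless store" could look like in plain Python. This is not Bond, PB, or Cap'n Proto; `SCHEMA_V2` and `conform` are hypothetical names made up for illustration, and a real IDL would give you much more (codegen, wire formats, versioning rules).

```python
# Hypothetical schema: field name -> (type, required, default).
SCHEMA_V2 = {
    "id": (int, True, None),
    "title": (str, True, None),
    "tags": (list, False, []),  # added in v2, so optional with a default
}

def conform(doc, schema):
    """Return a copy of doc with only schema fields, or raise ValueError."""
    out = {}
    for name, (typ, required, default) in schema.items():
        if name in doc:
            if not isinstance(doc[name], typ):
                raise ValueError(f"{name}: expected {typ.__name__}")
            out[name] = doc[name]
        elif required:
            raise ValueError(f"{name}: missing required field")
        else:
            # Copy list defaults so callers never share one mutable list.
            out[name] = list(default) if isinstance(default, list) else default
    return out
```

Running raw documents through something like `conform(raw_doc, SCHEMA_V2)` before they leave the API drops unknown fields and fills defaults for fields added in later schema versions.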

TL;DR by using more of the available features in ElasticSearch, you can probably replace all of your external app with ElasticSearch.

A few things:

- ElasticSearch is definitely not schema-less, but it can try to generate a schema (aka "mapping") for you if you don't give it one: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/c...

- ElasticSearch has tons of ways to customize the data you get back, so, unless you really don't want the ES cluster crunching things for you, you can do a lot of the transformation server-side. You can go so far as to have your own type + mapping for e.g. a report, which sources data from another type and transforms it: http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

- This covers both why the schema can't, by nature, be dynamic (so the argument of "schema-less / dynamic schema" is BS in practice IMO), as well as how to get data out from one index and into another (e.g. your "report" index which does scripted transformation).

- Another idea would be to use the scripting module to write a custom "view": http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

- You can use Groovy, mvel, JS, or Python for scripts. If you combine this with how ES lets you do "site plugins", you could make a JS + CSS + HTML site which is actually served by the ES cluster, which interacts with it and generates reports or whatever all without additional infrastructure. Example: https://github.com/karmi/elasticsearch-paramedic

I'm doing something similar at work. I'd be happy to chat about it if you drop me a line sometime.

It's cool that the .NET version actually JITs specialized serialization and deserialization code at runtime. This is one place where managed languages really shine, because emitting bytecode is easier and more portable than emitting, say, raw x86. It's also safer -- the runtime can verify the memory safety and type safety of the code.

Is it? To start that JIT process, you need to have a class in your code that the compiler for the .NET version generated. Disk space is cheap nowadays, even on mobile, so I do not see a big disadvantage in generating the deserialization code at the same time the source code for the class gets generated (and if you do things that way, you lose the one-time delay, and you don't need the code that generates those serializers in your application).

What am I overlooking? What information is known at runtime that isn't already available at build time? (And no, "the exact CPU/memory/etc. the code runs on" is not a valid answer. This is C# code, so there always is a runtime that handles that stuff.)

There are two advantages to generating code at runtime:

1) In some scenarios you have information at runtime that allows you to generate much faster code. The canonical example is untagged protocols, where the serialized payload doesn't contain any schema information and you get the schema at runtime. Bond supports untagged protocols (like Avro) in addition to tagged ones (like protobuf and Thrift), and the C# version generates an insanely fast deserializer for untagged.

2) It allows programmatic customizations. If the work is done via codegen'ed source code, then the only way for the user to do something custom is to change the code generator to emit modified code. Even if codegen provides the ability to do that, it is very hard to maintain such customizations. In Bond, serialization and deserialization are composed from more generic abstractions: parsers and transforms. These lower-level APIs are exposed to the user. As an example, imagine that you need to scrub PII from incoming messages. This is a bit like deserialization, because you need to parse the payload, and a bit like serialization, because you need to write the scrubbed data. In Bond you can implement such an operation from those underlying abstractions, and because you can emit the code at runtime you don't sacrifice performance.
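A rough sketch of the parser + transform idea from point 2, in Python rather than C#. This is not Bond's API: the toy wire format (1-byte field id, 2-byte length, value) and the names `parse`, `write`, `scrub`, and `EMAIL` are assumptions invented for illustration. The point is only that scrubbing is a transform sitting between a parser and a writer, with no object ever materialized.

```python
import struct

def parse(payload):
    """Yield (field_id, value_bytes) from a toy tagged format:
    1-byte field id, 2-byte big-endian length, then the value."""
    pos = 0
    while pos < len(payload):
        field_id, length = struct.unpack_from(">BH", payload, pos)
        pos += 3
        yield field_id, payload[pos:pos + length]
        pos += length

def write(fields):
    """Serialize (field_id, value_bytes) pairs back to the toy format."""
    return b"".join(struct.pack(">BH", fid, len(v)) + v for fid, v in fields)

def scrub(fields, pii_ids):
    """Transform: blank out PII fields, pass everything else through."""
    for fid, value in fields:
        yield fid, (b"" if fid in pii_ids else value)

EMAIL = 2  # hypothetical field id for a PII field
message = write([(1, b"hello"), (EMAIL, b"a@b.c")])
scrubbed = write(scrub(parse(message), {EMAIL}))
```

Here `scrubbed` is a valid payload in the same toy format, with the email field emptied and field 1 untouched.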

BTW, Bond allows you to do something similar in C++. The underlying meta-programming mechanism is different (compile-time template meta-programming instead of runtime JIT) but the principle that serialization and deserialization are not special but are composed from more generic abstractions is the same.

Ad 1): if you don't know at compile time what kind of objects you will deserialize, you need to do more than generate the serialization code; you also need to generate the class. So, basically, you need the entire schema compiler. I still don't see why separating those two generation steps is a gain.

Ad 2): does this mean that one can also do efficient schema migration at deserialization time (rename fields, add fields with default values), or that one can deserialize to something else than the class that got generated when the schema was compiled?

1) I didn't do a good job explaining this. You are right that if you want to materialize an object during deserialization you need to know a schema at build time to generate your class. But the crucial thing is, and this is true of all similar frameworks, you don't know the schema of the payload at that point. One big reason you use something like Protobuf, Thrift or Bond is to get forward/backward compatibility. What this means in essence is that deserialization is always a mapping between the schema you built your code with and the schema of the payload. There are two common ways to do that mapping: (a) the payload has schema information interleaved with the data and you perform branches at runtime based on what you find in the payload (this is what Protobuf, Thrift and Bond tagged protocols do); (b) you get the schema of the payload at runtime and use that information to perform the mapping (this is what Avro and Bond untagged protocols do). The latter case is particularly suitable for storage scenarios: you read the schema from a file/stream header and then process many records that have that schema. This is the case where having the ability to emit code at runtime results in a huge performance win: you JIT a schema-specific deserializer once and amortize this over many records.

2) You can do both. You can also do type safe transformations/aggregations/etc on serialized data w/o materializing any object.
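To illustrate the (a) vs (b) distinction from point 1, here is a small Python sketch. It is not Bond's wire format or API: the type codes, field layout, and the names `read_tagged` and `compile_untagged` are assumptions for illustration, and a precompiled `struct.Struct` only stands in for a real JIT-emitted deserializer.

```python
import struct

# Toy wire types; the codes and layout are assumptions for illustration.
FORMATS = {"int32": "i", "float64": "d", "uint8": "B"}
CODES = {0: "int32", 1: "float64", 2: "uint8"}

def read_tagged(payload, names):
    """(a) Tagged: every field carries (field id, type code), so the
    reader branches on what it finds, field by field."""
    pos, out = 0, {}
    while pos < len(payload):
        field_id, code = struct.unpack_from("<BB", payload, pos)
        pos += 2
        fmt = "<" + FORMATS[CODES[code]]
        (out[names[field_id]],) = struct.unpack_from(fmt, payload, pos)
        pos += struct.calcsize(fmt)
    return out

def compile_untagged(schema):
    """(b) Untagged: the schema arrives once at runtime; build a
    schema-specific reader once and reuse it for every record."""
    names = [name for name, _ in schema]
    packed = struct.Struct("<" + "".join(FORMATS[t] for _, t in schema))
    def read(payload, offset=0):
        return dict(zip(names, packed.unpack_from(payload, offset)))
    return read, packed.size

# Schema read once (e.g. from a file header), then amortized over records.
schema = [("id", "int32"), ("score", "float64"), ("flag", "uint8")]
read, size = compile_untagged(schema)
records = struct.pack("<idB", 7, 0.5, 1) + struct.pack("<idB", 8, 1.5, 0)
rows = [read(records, i * size) for i in range(2)]
```

The per-record work in the untagged path is a single precompiled unpack, while the tagged path pays for a branch per field on every record.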

There have been a lot of questions on how Bond compares to Protobuf, Thrift and Avro. I tried to put some information on this page: http://microsoft.github.io/bond/why_bond.html

No RPC? Disappointing. There are so few choices for C and C++ programmers with regard to battle-tested, easy (read: code generation for decode and dispatch), language-agnostic RPC.

We are planning to release cross-platform RPC support but it just wasn't ready yet and we didn't want to hold up the core release for it.

Have you tried any of the MessagePack RPC implementations? I haven't but I'm curious.

I recently evaluated msgpack-rpc and Thrift for a small side project. Surprisingly, it turned out that msgpack was not only much faster but also way easier to use (for lots of small messages).

I got around 300k msgs/s throughput with msgpack-d-rpc.

I haven't, although it's really nice that they chose to adopt the Thrift IDL. From a cursory glance however, it doesn't look like the code generator produces any dispatch code. Atm you're still going to need to write a tonne of boilerplate.

How would this compare with Apache Thrift?

How does it compare to protobuf and Thrift?

Quoting apc @ https://lobste.rs/s/7w6p95/msft_open_sources_production_seri...

The current offerings (Thrift, ProtoBuffs, Avro, etc.) tend to have similar opinions about things like schema versioning, and very different opinions about things like wire format, protocol, performance tradeoffs, etc. Bond is essentially a serialization framework that keeps the schema logic stuff the same, but makes things like wire format, protocol, etc., highly customizable and pluggable. The idea being that instead of deciding ProtoBuffs isn’t right for you, and tearing it down and starting Thrift from scratch, you just change the parts that you don’t like, but keep the underlying schema logic the same.

In theory, this means one team can hand another team a Bond schema, and if they don’t like how it’s serialized, fine, just change the protocol, but the schema doesn’t need to.

The way this works, roughly, is as follows. For most serialization systems, the workflow is: (1) you declare a schema, and (2) they generate a bunch of files with source code to de/serialize data, which you can add to a project and compile into programs that need to call functions that serialize and deserialize data.

In Bond, you (1) declare a schema, and then (2) instead of generating source files, Bond will generate a de/serializer using the metaprogramming facilities of your chosen language. So customizing your serializer is a matter of using the Bond metaprogramming APIs to change the de/serializer you’re generating.

That's cool... but, from what I can tell (correct me if I'm wrong), Bond accomplishes this by using common classes for in-memory objects which have no relation to the wire format, and then simply invoking a pluggable wire format at parse/serialize time. This lets you plug in previous-generation serialization protocols like Protobuf, Thrift, or Avro but probably won't allow you to plug in a next-generation zero-copy protocol like Cap'n Proto, SBE, or FlatBuffers, where the in-memory data structure and the wire format are one and the same. If you want to try one of them, you'll still have to rewrite all your code, unfortunately.
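For anyone unfamiliar with the distinction, a toy Python sketch of "parse into a separate object" versus "accessors over the wire bytes". This is not how Cap'n Proto, SBE, or FlatBuffers actually lay out data; the two-int32 layout and the names `PointView` and `parse_point` are made up for illustration.

```python
import struct

class PointView:
    """Zero-copy style accessor: fields are read straight from the buffer."""
    _INT32 = struct.Struct("<i")  # hypothetical layout: x at 0, y at 4

    def __init__(self, buf):
        self._buf = memoryview(buf)  # wraps the bytes without copying them

    @property
    def x(self):
        return self._INT32.unpack_from(self._buf, 0)[0]

    @property
    def y(self):
        return self._INT32.unpack_from(self._buf, 4)[0]

def parse_point(buf):
    """Parse style: materialize a separate in-memory object up front."""
    x, y = struct.unpack_from("<ii", buf, 0)
    return {"x": x, "y": y}

wire = bytearray(struct.pack("<ii", 3, 4))
view = PointView(wire)
parsed = parse_point(wire)

# Mutating the wire bytes is visible through the view, not the parsed copy.
struct.pack_into("<i", wire, 0, 99)
```

After the last line, `view.x` reads the new value from the buffer while `parsed["x"]` still holds the old one, which is the sense in which the view's in-memory representation and the wire format are the same thing.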

Hey Kenton. I see Adam Sapek bumming around the thread, so maybe he'll chime in here, but Bond works essentially by: (1) inspecting the schematized type, and (2) generating code that will quickly walk over that type and write it to a stream. So yes, it would probably require some surgery to make Bond do what Cap'n Proto is doing.

It is interesting to think about how it might work, though...

If you want to follow up, I encourage you to email Adam (adamsap -at- microsoft) or you can ping me and I'll loop him in (aclemmer@microsoft.com).

Thrift has pluggable protocols. It comes with 'compact' (protobuf-like), 'dense', 'binary' and json out of the box. It also has pluggable transports and multiple server implementations (threaded, async, etc). I'm personally not seeing any innovation here... I think they just wanted their own version of Thrift such that they could ignore the languages they don't care about.

> I think they just wanted their own version of Thrift such that they could ignore the languages they don't care about.

I'll put it bluntly: you have no idea what you're talking about.

Bond v1 was started when Thrift was not production ready. This is Bond v3. There is no conspiracy to make Bond hard to use for technology we "don't care about." In general I'm fine with tempered speculation, but your conclusion here is just lazy, and we both know it. It contributes nothing to the conversation, and spreads FUD for no good reason. We can do better, agree?

Now, to address your comments about customization directly: pluggable protocols are an example. The metaprogramming facilities of Bond are dramatically more rich than those of Thrift. A good example of these facilities: using the standard metaprogramming API and a bit of magic we have been able to trick Bond into serializing a different serialization system's schema types. So, picture Bond inspecting some C# Thrift type (or something), and then populating the core Bond data structures with data it finds there, and then serializing it to the wire.

This is the kind of power you get when you construct the serializer in memory using metaprogramming, and then expose that to the user. The flexibility is frankly unmatched.

Microsoft could have pulled a Facebook (They improved the C++ server in FBThrift) and forked or rewritten Thrift, reusing the IDL and existing wire formats if nothing else, and dropped a kickass C# implementation. The result would have been D, Rust, Go, PHP, and Python developers, etc etc, wouldn't have had to go off and reinvent the wheel for the Nth time, for negligible gain in terms of tooling, and an almost certain regression in terms of ecosystem and acceptance. I don't think it's FUD to express a bit of cynicism here or cry NIH Syndrome.

> I'll put it bluntly: you have no idea what you're talking about. Bond v1 was started when Thrift was not production ready. This is Bond v3.

Let me put something bluntly. I don't care who started writing code first. Microsoft are well over half a decade late to the party. Thrift and Protobufs have been public domain since, what... 2007/8?

And frankly, at least on the C++ front, there's not much to get excited about with regard to metaprogramming here. The Avro C++ code generator already produces a small set of template specialisations for each record type, and they're trivial enough to write manually for any existing classes you wish to marshal against your schema. std namespace containers are already recognised through partial template specialisations. MsgPack's implementation also does this. Other more general metaprogramming solutions, like Boost Fusion, are also being used by many, in production, for completely bespoke marshalling.

Don't get me wrong, Bond looks really nice, particularly for C# programmers, and I have respect for the work being done, but I can't get excited about it. It's kind of like someone announcing yet another JSON library or some new web framework when what the industry needs is consensus on formats and APIs. Right now there are so many serialization frameworks that the de-facto standard will just continue to be schema-less JSON and robust tools will remain largely non-existent.

I think your points are valid, especially about wire format compatibility. Bond and Thrift are in fact close enough that providing Thrift compatibility is a real possibility. It wasn't high enough on our priority list to make the cut but we do in fact use some Thrift-based systems internally. The impedance mismatch in the type systems is higher between Bond and Avro/Protobuf so we will have to see about those.

I hear you on the fragmentation. I know that this doesn't help the community as an explanation, but big companies like Facebook, Google and Microsoft really have good reasons to control such fundamental pieces of their infrastructure as serialization. Case in point: Facebook has forked their own Thrift project because, I presume, having it as an Apache project was too restraining.

FWIW, we plan to develop Bond in public and accept contributions from the community.

>> I'll put it bluntly: you have no idea what you're talking about. Bond v1 was started when Thrift was not production ready. This is Bond v3.

> Let me put something bluntly. I don't care who started writing code first. Microsoft are well over half a decade late to the party. Thrift and Protobufs have been public domain since, what... 2007/8?

I think your technical complaints are almost all good and valid. I agree with essentially all of them. I don't want to fork Adam's sibling response here, so I'll leave it at that.

My point is that asserting that Bond was developed so that MSFT could purposefully ignore certain languages is pointedly wrong, and irresponsible considering the dearth of evidence you have to support it. And here you have 3 authoritative comments to say so. I don't understand how you can possibly disagree with this, or be upset that someone would take issue here. It's ok to be wrong.

I previously worked on Thrift and have overseen the development of Bond with Adam as the lead developer.

Your characterization of Thrift is accurate and Bond actually has some of the same architectural roots as Thrift. Those features of Thrift were ones that I wanted to preserve in Bond. But we also wanted to expand that pluggability to allow for even more flexibility than the core Thrift architecture would allow for -- for example, the ability to support Avro-like "untagged" protocols within the same framework. I believe that the core innovation is in how that gets implemented. Also, we believe that performance is a feature -- our testing has shown that Bond noticeably outperforms Thrift and Protocol Buffers in most cases.

There is no conspiracy or intent to "ignore languages" -- we will release additional languages as they are ready and as we can support them as first-class citizens. We also welcome community involvement.

One key differentiator is the limited set of languages Bond currently supports:

"By design Bond is language and platform independent and is currently supported for C++, C#, and Python on Linux, OS X and Windows."

Versus Thrift:

"language bindings - Thrift is supported in many languages and environments C++ C# Cocoa D Delphi Erlang Haskell Java OCaml Perl PHP Python Ruby Smalltalk"

After struggling with thrift's Go binding (it happily generated broken Go code with the Aurora project's thrift file), I'm now skeptical that any of the others really works. I've never encountered a more frustrating project in this space.

Cross-language support in such frameworks takes a lot of effort. Whenever you add a new language you really hit the classic 80-20 problem.

We have support for a few more languages that we are using internally but after having a hard look at the implementations we decided that they weren't up to par for the open source release yet. I hope that we will release more soon. And needless to say, we are open to contributions from the community.

Or something like CBOR or JSONB?

And yet they still can't build a web page that isn't a shitshow.

Main content has horizontal scroll on portrait monitors, which underlaps the transparent fixed div they used for navigation.
