Show HN: Cap'n Proto, by the ex-maintainer of Protocol Buffers (kentonv.github.com)
286 points by kentonv on April 2, 2013 | 85 comments



I have a lot of respect for Kenton and I'm sure this is high quality work.

The idea of a fixed-length encoding of a protobuf-like structure is a good one, and something I have seen explored elsewhere. To me, though, it's unfortunate to create an entirely new schema language for it, instead of just making it an alternate encoding of the existing protobuf data model, which it closely resembles. To have two schema languages that are nearly, but not quite, isomorphic means that you get no interoperability benefits with any existing protobuf-based code or data.

The way I see it, the most important feature of protobufs is not the on-the-wire encoding, but rather the schema. If you start from a schema (and its associated data model), then you can create as many different encodings as you want, and convert between them losslessly at any time, while taking advantage of their different performance characteristics.

For example, Dremel (as described in this paper: http://research.google.com/pubs/pub36632.html) is an extremely fast SQL-based query engine at Google, which powers the Google cloud product BigQuery (which I work on). Because Dremel uses Protocol Buffers as its data model, any set of Protocol Buffers can be dumped into Dremel for fast querying, without any kind of data massaging or converting step. I think this is a really powerful idea.

I can see reasons why Kenton might have chosen not to use .proto schemas directly, but I hope that they can eventually converge, because having a shared schema language is a powerful idea.


This is a good point, and something I considered. There are a few reasons I decided against using the .proto format:

* In .proto format, it's generally assumed that _removing_ a field, or having a gap in field numbers, is OK. In Cap'n Proto, it's not OK, because offsets of subsequent fields are affected.

* The language frankly has a lot of little ugly quirks that I wanted to get away from.

* I want control over my format so that I can add features that protobufs may not be interested in, like the ability to define constants.

* After proto2 I'm kind of tired of working around legacy. :)

* I'm not working for Google anymore, so I don't have any particular reason to cater to what would make the most sense for Google. Why not use Thrift or POJOs or anything else as the base language? :)

I think most of the benefits you describe could be achieved by writing a .proto<->.capnp translator, which should not be very hard given that, as you say, the languages are pretty much isomorphic.
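To illustrate the first point above, here's a toy sketch in Python of why removing a field (or leaving a gap) breaks a fixed-offset layout. This is not Cap'n Proto's actual layout algorithm, which packs fields by size class; it just shows the underlying constraint:

```python
# Toy illustration: in a fixed-offset encoding, each field's offset
# depends on the sizes of every field declared before it.

def layout(fields):
    """fields: list of (name, size_in_bytes) in declaration order.
    Returns {name: byte_offset} with fields packed back to back."""
    offsets, pos = {}, 0
    for name, size in fields:
        offsets[name] = pos
        pos += size
    return offsets

v1 = layout([("year", 2), ("month", 1), ("day", 1)])
assert v1 == {"year": 0, "month": 2, "day": 3}

# Remove "month" from the schema: "day" silently moves, so old
# messages written with v1 are now misread by v2 readers.
v2 = layout([("year", 2), ("day", 1)])
assert v2["day"] == 2   # was 3 -- wire compatibility is broken
```

In protobuf's tag-length-value encoding, by contrast, a removed field is simply skipped, which is why gaps are harmless there.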


Protobufs are awesomely useful, and this seems like a great enhancement. However, I'm disappointed that there's no map representation. That's been a major issue for us with protobufs compared with Thrift or JSON. While you can represent maps as lists of pairs, it's awkward and lacks the key uniqueness constraint.


I agree! I actually intend to add maps at some point, just haven't worked out the details yet. It's a little weird in Cap'n Proto since the in-memory structure is also the wire format, and the right in-memory representation of a map is not perfectly obvious (Hash table? B-tree? Red/black?).


I wrote an implementation of the D-Bus protocol, which has maps, and after doing so I wish it didn't. Maps are a pain for implementors.

For example, the semantics of duplicate keys are not obvious. Do you throw an error, or keep the first value, or keep the last value, or do something else? Whatever the decision, somebody will be unhappy with it. Better to just give the user a list of (key, value) pairs and let them figure out what to do with it (e.g. pass it to a map construction function that behaves as they like).
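A quick sketch of that suggestion: hand the application the raw (key, value) pairs and let it choose its own duplicate-key policy.

```python
# Three reasonable duplicate-key policies, all the caller's choice
# once the wire format just delivers a list of pairs.
pairs = [("colour", "red"), ("size", "L"), ("colour", "blue")]

keep_last = dict(pairs)              # Python's dict keeps the last value
keep_first = {}
for k, v in pairs:
    keep_first.setdefault(k, v)      # first value wins

def strict(pairs):
    """Reject duplicates outright."""
    m = {}
    for k, v in pairs:
        if k in m:
            raise ValueError(f"duplicate key: {k}")
        m[k] = v
    return m

assert keep_last["colour"] == "blue"
assert keep_first["colour"] == "red"
```

Any of these would have been "wrong" for some user if baked into the protocol itself.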


Thumbs up. Maps are important.

Red-black trees are probably the standard. I'd want to sit down and think about how the wire format sharing the in-memory structure affects things. :) Do we need to be able to determine the amount of memory used by the map ahead of time?


> Do we need to be able to determine the amount of memory used by the map ahead of time?

That's a very good question and I debate it a lot. Since Cap'n Proto messages tend to be arena-allocated, dynamically growing space can be problematic. Currently, for lists, you must specify the list size before filling in the data. That could be improved in a few ways, but maps will likely be harder.

If you do know sizes in advance, I think hashtables make the most sense. If we want something that allows dynamic additions and maybe even removals (and can figure out how to reconcile that with arena allocation), B-trees might be a really nice approach. Maybe you could even use a huge mmap'd Cap'n Proto file as a database (warning: very speculative, lots of obstacles to resolve).
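A minimal sketch of why arena allocation makes "specify the size up front" the natural model. (Hypothetical allocator, not Cap'n Proto's API: a bump allocator never frees or resizes, so a list's slot count must be known when it is allocated.)

```python
# Bump allocator sketch: allocation is just advancing a position
# within a contiguous segment.
class Arena:
    def __init__(self, size):
        self.buf = bytearray(size)
        self.pos = 0

    def alloc(self, nbytes):
        off = self.pos
        if off + nbytes > len(self.buf):
            raise MemoryError("segment full")
        self.pos += nbytes           # no free(), no realloc()
        return off

arena = Arena(64)
list_off = arena.alloc(4 * 8)        # a list of 4 eight-byte elements
next_off = arena.alloc(8)            # allocated immediately behind it
assert next_off == 32                # growing the list in place is impossible
```

Anything dynamically growable (a B-tree, a hash table that rehashes) has to work around this by allocating new nodes elsewhere in the arena and leaving the old ones as dead space.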


Okay, protocol buffers are awesome, so as constructive feedback, I think using C++ as the source-generation language is a mistake.

We use Thrift, which uses the same approach, and it's a huge PITA to install the native compiler just to generate some .java files. To the point where, unlike basically everything else, we don't run the thrift compiler locally on dev machines, but instead run it on a Hudson machine with thrift carefully setup and then hope we never have to touch it.

Granted, the runtime library for each language needs to be in that language (or more likely it's C/C++ bindings), but jumping from "the runtime is probably C/C++ under the covers" to "let's use C++ to generate the source code" is a mistake IMO.

Personally, with Thrift, it has kept me from submitting pull requests to help patch/maintain the generated Java bindings (which were pretty bad at one point in time).

Along the same lines, I haven't used it yet, but I'm a wanna-be fan of Scrooge, which uses the Thrift runtime libraries, but does its own source code generation:

https://github.com/twitter/scrooge

I'm not necessarily saying each language's source code should be generated by a program that is also written in that language, as I understand it's nice to reuse the IDL/spec parsing logic ... but, on the other hand, I think that would maximize the quality of each language's generated source code, since the barrier for its own users to maintain/tweak it would be so much lower.

(I understand you live & breathe C/C++, this feedback is just from a "if you want to maximize potential contributions to your project" point of view.)


I absolutely agree, which is why the Cap'n Proto compiler is written in Haskell. :D


Ha, nice! Sorry about that--you caught me skimming the navigation bar and making/projecting assumptions. :-)


Now that does put me off from contributing...

(Not a criticism, just a reference to the GP.)


Aww. Seriously, though, it's pretty easy to write a code generator without understanding Haskell, and most of the work ends up being in the target language.

In the long run I'd like to do something like I did with protoc where you can write plugins in any language, but... more important priorities right now.


What about using Go?


What would be the benefit of Go over Haskell in this case?


If you use Java with Thrift, you should check out Swift:

https://github.com/facebook/swift

Swift is an annotation-based library for creating Thrift types and services. It was created to be an alternative to code generation, which is generally a pain for Java development. Swift generates bytecode at runtime, so you get all the performance advantages of code generation with none of the development disadvantages.

The idea with Swift is that you own the annotated classes and can fully customize them to your liking, independent of the IDL and wire format. For example, you can name the methods whatever you want, add documentation, helper methods, etc. When implementing a client for Thrift service, you only need the methods that you actually call.

You can use swift-generator-cli to generate the initial versions of the classes from existing Thrift IDL, which you then customize and maintain yourself. Alternatively, you can use swift-maven-plugin, which is pure Java (no external C++ binary), to generate code at compile time.

This project is in active development and use at Facebook, so please give it a try and let us know what you think!


I completely disagree. C++ is a great general-purpose language, and 99% of dev machines will have a compiler installed. I know mine doesn't have Java installed.


But do you have the right version of Boost installed? And does it work with both g++ and clang? Which standard library? Etc.

C or very light use of C++ makes most sense for sad reasons.


But you would likely have Java installed if you were interested in generating Java language bindings.


Very good point, very well made.


Hah, there are tens of millions of professional developers out there. The majority of them use Windows and don't have a C++ compiler installed. Look outside the bubble ;)


> it's a huge PITA to install the native compiler just to generate some .java files

I use Thrift too and haven't felt this pain myself. The compiler is easy to install with APT, MacPorts or the usual `./configure && make install` incantation. This said, I would certainly appreciate an implementation of the compiler running on the JVM.


My first question was: whoa, security? But you answer it in the docs, a bit.

http://kentonv.github.com/capnproto/encoding.html (at the end)

How do we know that cycles/deep-nesting are the only possible attacks?

Maybe I'm overreacting - when you say pointers, these are really pointers to other locations in a file, not to code. I'm trying to think of what you could do with malformed data, but the worst thing I can think of is seeking to impossible positions. Maybe you could confuse or crash certain poorly written libraries, maybe by declaring a size for a data structure larger than the file. But the damage seems containable.


Oh yes, security is a high priority. It's definitely wrong if malicious data can crash the receiver. The security notes on that page are not a complete story, just mentioning a couple of the less-obvious concerns. It's my intent that the Cap'n Proto runtime should protect the application from all security issues related to decoding, so that apps don't have to worry about it.

The pointers are not actually native pointers, and when you dereference a pointer, the implementation does bounds checking to make sure it points within the message.
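A sketch of the kind of check described (illustrative only; the real implementation validates Cap'n Proto's actual pointer encoding, which this simplifies to a bare offset and length):

```python
# A "pointer" in the message is just an offset into the buffer;
# it must be validated against the message bounds before every use.
def deref(buf, offset, nbytes):
    """Return the nbytes at buf[offset:] iff they lie inside the message."""
    if offset < 0 or nbytes < 0 or offset + nbytes > len(buf):
        raise ValueError("pointer out of bounds: malformed or malicious message")
    return buf[offset:offset + nbytes]

msg = bytes(range(16))
assert deref(msg, 8, 4) == bytes([8, 9, 10, 11])

try:
    deref(msg, 12, 8)                # claims data past the end of the message
    assert False, "should have been rejected"
except ValueError:
    pass
```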

I am a security wonk myself and have been thinking carefully about these things, but I also hope to get a formal security review from a third party before any production release.


Doesn't this mean you are just pushing the decoding overhead into the runtime instead of upon receiving?


To some extent, yes. But the total overhead of bounds-checking pointers in a Cap'n Proto message is orders of magnitude smaller than the overhead of decoding an equivalent protobuf message. (You can see this in my benchmarks. Look for "object manipulation time". It's usually a bit slower for Cap'n Proto, but the total time is dwarfed by I/O time.)


Will you be able to disable bounds checking in a production version and/or move it to a once only for the entire object model?


Yes! If you really, truly trust your input data, and it is in a single flat array, you can call capnproto::readRootTrusted<MyType>(ptr) instead of setting up a MessageReader. All bounds checks and other validation will be skipped.

But, IMO this is only a good idea for, like, static constants embedded into the source code. Anything you read dynamically could be corrupted if not malicious, and you don't want that to kill your server.

That's up to you, though.


Agreed that bounds checking is the correct default here, but it's nice to know that in true C++ fashion we can shoot our leg off if we so choose.


I'm not much of a security person, just thinking out loud. Cap'n Proto looks great though.

I am also looking forward to saying "Cap'n Proto" in a work context.


Big fan of protobuf here. Awesome work.

We did something along those lines for our product except we used Boost.Fusion to generate the serialization code which enables us to skip the interface compilation phase altogether.

Basically you have any structure, you Boost.Fusion adapt it and bam!, cross-platform binary serialization at compile-time.

The best is that this approach works for everything. Want JSON? Bam! You got JSON.

One trick we used is to zero-copy the big chunks of memory. Don't know if that's the case here. It's tricky though because if you do asynchronous I/O later you have to make sure the original buffer stays alive long enough.

(ps: getters and setters, this is soooo last century ;) )


Thanks!

Yeah, getters and setters are old school, but still necessary in C++, especially in this case since the underlying data isn't necessarily in native format. In more modern languages I'd hope to use properties. :)

I do hope to write a JSON transcoder as well, so that for those cases where you really need to send/receive JSON, you can do that, while still using the same protocol spec.


An important gotcha for JSON: it's not enough just to follow the JSON spec. If you want the data to be usable in a browser, you also have to take into account JavaScript's limitations.

In particular, the JSON spec says that arbitrary-precision numbers are allowed, but if you want to round-trip a 64-bit integer without losing precision, you'd better put quotes around it.
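A quick Python demonstration of the problem: JavaScript stores every number as an IEEE-754 double, which holds only 53 bits of integer precision, so a 64-bit value round-tripped through a JS number silently changes.

```python
import json

big = 2**63 - 1                      # max int64
as_js_number = float(big)            # what a JS engine would store
assert int(as_js_number) != big      # precision already lost

# Safe: serialize 64-bit integers as quoted strings.
wire = json.dumps({"id": str(big)})
assert json.loads(wire)["id"] == str(big)
```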


I wasn't thinking about properties. We've used a C++er approach (I feel the stench of Java in you ;) ): objects are stored natively in memory, and transcoding is applied only when serializing, if needed, on the fly, on local copies. Since the objects whose representation differs are small (actually, only integers), it's insanely fast and almost no memory is wasted.

It's not quite "0 ms transcoding", but close, and most importantly it's 0 ms read/write access time (you spend more time accessing the object in memory than serializing it).


> Yeah, getters and setters are old school, but still necessary in C++

Why isn't operator= usable in this case?


Because you need an lvalue to assign to, which turns out to be pretty painful when the underlying data is not (quite) in native format. The lvalue can't just be a reference to the underlying data, so instead you need to set up some parallel set of objects somewhere that wrap the data, and each of these objects probably needs to contain a pointer, so they may end up being bigger than the actual data they are wrapping. It really isn't worth the effort.


I have to question the value of avoiding deserialization. With no type tag space reserved in structures or pointers, are there any strongly-typed language runtimes (i.e., not C) which can actually use this as the definitive in-memory representation? If you have to wrap proxy objects around a Cap'n Proto blob, the cost of continual indirections will quickly exceed the cost of just deserializing into native objects.

I'm glad to see efforts to move away from encoding the schema (ids and types) into every field of every message, though. I can't remember the last time I found a use for self-describing data of a type that was completely unknown when I wrote the code. And the suggested packing format looks feasible, where too many protocol designers think everyone can afford to gzip everything by default (and make a general-purpose library re-deduce where the redundancy is in your schema, when certain bits are inevitably constant for every message you could ever send).


This will tend to depend on the details of the language.

In Java, there is java.nio.ByteBuffer which (I think) can basically do everything that is done with pointer arithmetic in the C++ implementation.

In, say, Python, it may make more sense to decode the message to Py objects upfront. You could use Python's struct packing library to do this fairly efficiently, and you could lazily decode sub-objects. If you are using Python, you probably aren't going for blazing speed anyway.
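For example, a lazy reader over the `Date` struct from elsewhere in the thread might look roughly like this (hypothetical layout and API, just to illustrate the approach):

```python
import struct

class DateReader:
    """Decodes fixed-offset fields on demand instead of eagerly
    building Python objects. Assumed layout: year:int16, month:uint8,
    day:uint8, little-endian."""
    _fmt = struct.Struct("<hBB")

    def __init__(self, buf, offset=0):
        self.buf, self.offset = buf, offset   # no decoding happens here

    @property
    def year(self):                  # decoded lazily, only when accessed
        return self._fmt.unpack_from(self.buf, self.offset)[0]

    @property
    def day(self):
        return self._fmt.unpack_from(self.buf, self.offset)[2]

msg = struct.pack("<hBB", 2013, 4, 2)
d = DateReader(msg)
assert (d.year, d.day) == (2013, 2)
```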

Another possibility in many languages is to write bindings that call into C++, though that obviously has a bunch of trade-offs.

Finally, if you're communicating with browser-side Javascript, you probably just want to transcode the object to/from JSON on the server side. I intend to provide libraries that make that easy.


Avoiding serialization is a big win for short-lived object graphs like those seen when processing an http request. About 6 years ago, I developed a similar technique used in a .net app server to minimize server load for relatively chatty AJAX apps. It's good to see the same idea polished up further coming around again.


Overall, I love how this looks -- it's a nice evolution of protobufs.

I had a couple issues/questions while reading through the site, so I'll just throw them in here:

  > struct Date {
  >   year @0 :Int16;
  >   month @1 :UInt8;
  >   day @2 :UInt8;
  > }
That's a fairly awkward syntax, even more than .proto files. I notice your compiler is written in Haskell, so it should be pretty easy to define a human-friendly syntax and parse it with Attoparsec and friends. In particular, the use of colons and semicolons here seems unnecessary.

Are you set on that syntax, or would you accept patches to change it? I don't know how much breaking change you'd be willing to accept at this point.

  > Fields (and enum values, and interface methods) must
  > be numbered consecutively starting from zero in the
  > order in which they were added.
I disagree with this decision. Being able to set tag numbers in arbitrary order makes reading the text format of a proto message (e.g. during debugging) much nicer, because fields can be grouped according to their purpose rather than whenever they happened to be added to a message.

Speaking of which, I don't see any specification for a text-based serialization format on this site. If I were you, I'd just steal Protobuf's because it's simple and easy to read/write.

  > Unlike Protobufs, you cannot skip numbers when defining
  > fields – but there was never any reason to do so anyway.
For me, the reason to skip field numbers is when some older version of the message defined some field, but new messages removed it and clients should not know about it. Typically, I indicate this by replacing the field definition with a comment to ensure the field number isn't accidentally re-used. This reduces the amount of legacy cruft in .proto files.

  > GCC 4.7 Needed
  > 
  > If you are using GCC, you MUST use at least version
  > 4.7 as Cap’n Proto uses recently-implemented C++11
  > features. If you are using some other compiler…
  > good luck.
One of the best features of protobufs is that they are easily portable between many platforms and languages. If the C++ implementation requires advanced modern compiler features, that will severely reduce the number of developers who are able to use it. It would be better to write the reference implementation such that it's supported even by ancient compilers, so that users don't have to worry about upgrading lots of other packages just to try out CAPNP.

  > For example, exceptions are thrown on assertion
  > failures (indicating bugs in the code), network
  > failures, and invalid input. Exceptions thrown by
  > Cap’n Proto are never part of the interface and
  > never need to be caught in correct usage.
I'm fine with making assertion failures throw exceptions that "never need to be caught", but network failures and invalid input are very much things that should be treated as common events.

Error reporting (via return values or out-parameters) will make the interface slightly less elegant. This is fine. The benefit of having proper error handling by far outweighs minor verbosity of error checking.

  > Text: Like Data, but the content must be valid
  > UTF-8, the last byte of the content must be zero,
  > and no other byte of the content can be zero.
If there's a length prefix, why bother with a terminating NUL? D-Bus does the same thing, and it serves no purpose except to add another possible error condition.

  > So when Directory.open is called remotely, the
  > content of a FileInfo (including values for name
  > and size) is transmitted back, but for the file field,
  > only the address of some remote File object is sent.
It's not obvious how such address-passing can be implemented, without falling into the black hole that is CORBA. I don't see anything on the encoding page about it, either. Is an address some sort of 64-bit cookie, a string, or something more complex? How does using addresses work when the server is written in a language that doesn't have the concept of addressable values?

  > The Cap’n Proto RPC protocol is not yet defined.
  > See the language spec’s section on interfaces for a
  > hint of what it will do.
Roughly what is the timeline for defining the RPC protocol? I've been working on a Protobuf-based RPC protocol for a while (it'll replace my use of D-Bus for inter-machine and low-overhead inter-process communication), but it probably won't be in even a beta state for at least another two or three months. If you don't have any immediate plans, I'd love to merge my work into CAPNP.


All I want to comment on is the terminating null even in the presence of a length prefix: it means C and similar languages can treat it as a (read-only) string without the need to make a copy.


Thanks for the feedback!

  > Are you set on that syntax, or would you accept patches to change it?
I'm not wedded to the syntax. Maybe start a thread on the discussion group and see what people think?

  > Being able to set tag numbers in arbitrary order makes reading the
  > text format of a proto message (e.g. during debugging) much nicer,
  > because fields can be grouped according to their purpose rather than
  > whenever they happened to be added to a message.
I think a text format should write fields in the order in which they appear, not the order of the numbers. These don't have to be the same order. The purpose of the numbers is to allow you to insert a new field in an arbitrary position in your struct def without breaking wire-compatibility, rather than requiring you to add all new fields at the end.

Maybe the wording in the docs is confusing?

  > Speaking of which, I don't see any specification for a text-based
  > serialization format on this site. If I were you, I'd just steal
  > Protobuf's because it's simple and easy to read/write.
I intend to support JSON transcoding. The protobuf format is kind of quirky, although it is nice that it doesn't require as much quoting. Maybe the JSON transcoder could have a "less quoted" mode that breaks standards somewhat.

  > For me, the reason to skip field numbers is when some older version
  > of the message defined some field, but new messages removed it and
  > clients should not know about it.
In practice, you actually shouldn't do that even with protobufs, because if you accidentally reuse that field number, you might misinterpret old data. Hence the convention with protobufs is usually to rename fields to "OBSOLETE_N".

Cap'n Proto can't allow you to remove fields because that would change the offsets of subsequent fields. But you can certainly rename fields.

  > If the C++ implementation requires advanced modern compiler features,
  > that will severely reduce the number of developers who are able to
  > use it.
This is true, but C++11 features are really useful, and the implementation (and even its interface) simply would not be as good without them. If this is a big problem, perhaps someone will start maintaining a back-port to C++98, as if it were another language.

  > I'm fine with making assertion failures throw exceptions that "never
  > need to be caught", but network failures and invalid input are very
  > much things that should be treated as common events.
I think my wording in the docs may be a little off, in that it isn't completely clear what I mean by "correct usage". Yes, clearly you should catch network errors in a distributed environment. But typically the only things you really want to know about the error are:

* Should I retry?

* What error message should I log / display?

Cap'n Proto exceptions can be caught and will carry this information. The point is more that these are logistical issues, not correctness issues. The "correct" output of your program never depends on what exceptions were thrown -- if any exception is thrown, either you don't produce output at all, or you solve the problem and produce the same output that you would have otherwise.

Exceptions are somewhat necessary in Cap'n Proto because validation happens as you walk the message tree. Having every getter possibly return an error code would be pretty painful. You can, however, configure it so that getters flip an error flag and then return a default value on error. As long as you check that error flag before outputting garbage, you should be fine.

  > If there's a length prefix, why bother with a terminating NUL?
It's helpful if the recipient intends to use the strings in C/C++ code -- they can avoid a copy. For example, if the message contains a file name and the recipient wants to open() it, they need a NUL-terminated string, and malloc()/copying one is kind of sad. Although I want to support many languages, I expect Cap'n Proto users to be biased towards C/C++ since those are the people who care deeply about speed.
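A simplified sketch of the layout under discussion. (The framing here is invented for illustration; Cap'n Proto's real list-pointer encoding is different, but the idea is the same: a byte length that includes the NUL, followed by UTF-8 content ending in NUL.)

```python
import struct

def encode_text(s):
    """Length-prefixed, NUL-terminated UTF-8, so C code can use the
    bytes in place as a C string without copying."""
    body = s.encode("utf-8")
    if b"\x00" in body:
        raise ValueError("Text may not contain NUL bytes")
    return struct.pack("<I", len(body) + 1) + body + b"\x00"

wire = encode_text("open.me")
length = struct.unpack_from("<I", wire)[0]
assert length == 8                        # 7 content bytes + NUL
assert wire[4:4 + length] == b"open.me\x00"
```

A C recipient could pass a pointer to byte 4 straight to open() without a malloc()/copy.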

  > It's not obvious how such address-passing can be implemented...
This is certainly a complicated problem, but solvable. I wrote some comments elsewhere:

https://news.ycombinator.com/item?id=5482782

I look to things like the E programming language and Tahoe-LAFS for inspiration here.

  > Roughly what is the timeline for defining the RPC protocol?
I'm working full-time on Cap'n Proto right now. It's always hard to predict coding timelines, but I am hoping to have something within a month or two.

  > it'll replace my use of D-Bus for inter-machine and low-overhead
  > inter-process communication
Hey, that sounds like exactly the main use case I'm interested in. I want to pass messages in shared memory, so making an IPC is basically no more expensive than posting to a semaphore. This makes it more plausible to have a sandboxed process which essentially makes all its system calls via IPC to the sandbox, without as much of a speed hit.

Please do join the discussion group and tell us about your work! I'd love to have your help, or at the very least exchange notes. :)


As a user of the excellent Protocol Buffers this is great news. I was especially interested in the Inter-Process communication and random access. Hopefully we'll see some ports to other languages soon.


Thanks! Anyone who wants to own the implementation in their favorite language should let me know. :D


Awesome! I'm a huge fan of binary protocols like this. Couple questions:

1. It appears from the encoding docs that lengths for TEXT fields are in bytes, not characters... is that correct? Hessian made this awful flaw, which meant writing an efficient parser was painful because the entire string couldn't be copied into a buffer in one shot.

2. I'm unclear on first reading how references are supported. One of my gripes with formats like Java's Serialization is that it's effectively un-streamable since a parser needs to hold every object in memory for the life of the stream in case they're referenced later. I read the 'far pointers' and 'serialization over a stream' stuff.. Am I to understand that the idea is that your stream would be batched up into these frame-chunks of multiple segments and the far-pointers would be valid only within those frames?


Thanks for the feedback!

> 1. It appears from the encoding docs that length for TEXT fields are in bytes, not in characters..

Correct. Text is UTF-8 encoded, and the length sent on the wire is bytes.

You make a good point that this can be frustrating in languages where UTF-8 isn't the standard representation for text. Though, is it really a character count that you want, or is it a UTF-16 length? Depends on the language, again. Hmm, this could get ugly...
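The three candidate "lengths" diverge as soon as the text leaves ASCII, e.g. in Python:

```python
s = "naïve"                          # 5 characters (code points)
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")

assert len(s) == 5                   # character count
assert len(utf8) == 6                # UTF-8 byte count ("ï" takes 2 bytes)
assert len(utf16) // 2 == 5          # UTF-16 code units (Java/JS "length")
```

The wire carries the UTF-8 byte count; Java and JavaScript's notion of string length is UTF-16 code units, and neither is the character count once you leave the BMP or ASCII.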

I think we can optimize around this problem by taking advantage of the fact that Cap'n Proto implementations will generally use arena allocation (since message objects must be allocated in contiguous segments). Just start by attempting to encode the string into whatever space is left in the current segment. If it turns out to be enough, which it usually will, great! Mark the space as allocated. If not, you have to back-track and try a slower path.

> 2. I'm unclear on first reading how references are supported.

Usually you send a whole "message" (composed of N "segments") at a time, and that message cannot have pointers to any data objects outside the message (but can have pointers to interfaces). A sophisticated protocol could send one segment at a time, and the receiver could request additional segments when the far pointers pointing into them are first accessed (this could all be transparent to the application). In general, though, if you want your message to reference things outside the message, you probably want to use interfaces.

http://kentonv.github.com/capnproto/language.html#interfaces

These let the receiver call back at some arbitrary time in the future. (The details of how these will be implemented is not specified yet, but it'll be something like what E does.)


To be clear, I think a character count isn't what you want. That's easy enough to do post-parse. A byte count is most useful so that you can make a single call to pull the value into memory. A buddy and I wrote Python bindings for Hessian, and the fact that they encoded character counts and not byte counts caused us pain.

Makes sense regarding pointers within messages, and interfaces.


Ohhhhhh, yeah, I misread what you were saying. Yes, the count on the wire is a byte count. :)


Why would you want characters for the field length? Maybe I am missing something obvious or I misunderstood.


Best commit message:

Day 1: Learn Haskell, write a parser.


There's a huge opportunity for serialization protocols between PC based hosts and embedded systems. Protocols that are able to serialize without dynamic allocation are really hard to find, so I'll definitely be looking into this.

The dependency on C++ and a rather recent version of GCC is a deal-breaker for some of the embedded platforms I work with, however.


Hmm, I think you'd probably want to treat embedded C/C++ as if it were a different programming language -- so, a separate code generator and runtime library would need to be written. Then you could really make sure everything works in an appropriate way for the platform, without fighting with the desktop/server devs.

But yes, Cap'n Proto would probably be a great fit for that environment, and I'd love to see someone take on this target. :)


How large will interface pointers be? It seems like they'd have to be at least as large as an auth cookie since that's basically what they are.

Also, it seems like in a cluster, you'd need the ability for any server to generate many auth cookies that will work on any other server when the client makes its next request.


> How large will interface pointers be? It seems like they'd have to be at least as large as an auth cookie since that's basically what they are.

That's one possible implementation. 128 crypto-random bits should be enough.

Another possible implementation is to say that the interface can only be accessed through the same connection over which the pointer was sent. In this case you don't need any crypto (other than SSL on the connection, perhaps). The interface pointer can just be an integer, sequentially counting from zero.

The latter approach of course means that in case of network failure you lose all your pointers. Depending on the use case this may be fine -- perhaps you just need to fetch them again on re-connect. If it's not fine, then yes, you need some sort of unguessable capability string.
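A minimal sketch of that second, connection-scoped approach (all names here are hypothetical — this is not the actual Cap'n Proto RPC API): each connection keeps a table of exported objects, and the "pointer" on the wire is just an index into that table, so it's meaningless on any other connection and dies when the connection does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-connection capability table. Wire-visible interface
 * "pointers" are just indices into this table, sequentially counting
 * from zero, so no crypto is needed -- an ID is only meaningful on the
 * connection that issued it. */
#define MAX_CAPS 64

struct conn {
    void *caps[MAX_CAPS]; /* live interface objects exported on this conn */
    int next_id;          /* next sequential ID to hand out */
};

/* Export an object over this connection; returns its wire ID, or -1 if
 * the table is full. */
int conn_export(struct conn *c, void *obj) {
    if (c->next_id >= MAX_CAPS) return -1;
    c->caps[c->next_id] = obj;
    return c->next_id++;
}

/* Resolve a wire ID back to the object; NULL if the ID was never issued. */
void *conn_resolve(struct conn *c, int id) {
    if (id < 0 || id >= c->next_id) return NULL;
    return c->caps[id];
}
```

When the connection closes, the table is simply dropped, which is exactly the "all pointers break" behavior described above.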

> Also, it seems like in a cluster, you'd need the ability for any server to generate many auth cookies that will work on any other server when the client makes its next request.

For long-lived capabilities, yes, you'll probably want them to be backed by data in a database so that they can easily relocate to a different machine.

But for the all-pointers-break-when-the-connection-closes approach, there's no issue.

I expect the first iteration of the RPC system will only support this latter approach. The former approach clearly requires help from the application, so will take longer.


And it's (partly) written in Haskell! swoon


Haskell was a lot of fun to learn, but I'm a n00b. My Haskell code probably sucks. :( Luckily since it's just a parser/code generator it probably doesn't matter too much. :)


Have you tried running hlint?


Yep, it was pretty helpful for learning.


I love this tongue-in-cheek speed comparison.


It was especially funny because I originally announced the project on Google+ (where my former Google coworkers follow me) on April 1st. Half of them thought it was a joke. :)


I saw it today (the 2nd) and still thought it was a joke.


There's still a serialization step, so this is not an infinite improvement.

The real "zero serialization" is writing stuff 1:1 from memory to disk (or where you want it to go). This works great in languages like C where you can make assumptions for memory layout and use mostly POD-types [1]. But even then, you still have to fixup pointers.

[1]: I guess this is the reason why fread has split the amount of bytes to read into a "size of item" and "items" parameter: http://linux.die.net/man/3/fread
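For the POD case, the round-trip really is just a byte copy — a sketch in C, with the usual caveat that it only works when reader and writer share endianness, padding, and struct layout:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A POD type: fixed layout, no pointers to fix up. */
struct point { int32_t x, y; double z; };

/* "Zero serialization": the in-memory bytes ARE the wire format.
 * Only valid when reader and writer agree on endianness, padding,
 * and struct layout (e.g. same compiler, same architecture). */
size_t point_write(const struct point *p, unsigned char *buf) {
    memcpy(buf, p, sizeof *p);
    return sizeof *p;
}

size_t point_read(struct point *p, const unsigned char *buf) {
    memcpy(p, buf, sizeof *p);
    return sizeof *p;
}
```

With fread/fwrite you'd pass `sizeof(struct point)` as the item size and the element count as nmemb — the split parameters mentioned in [1].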


> There's still a serialization step, so this is not an infinite improvement.

There is? What happens during the serialization step?

> The real "zero serialization" is writing stuff 1:1 from memory to disk

Is that not exactly what Cap'n Proto does?


It's unfortunate that in 2013, we're still developing protocols and formats that have weird Unicode support. Cap'n Proto only seems to support Modified UTF-8, which doesn't include NULL bytes[1]. Though I believe support for full UTF-8 would be a trivial change, requiring no changes to code.

[1] http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8


You're right. Someone brought this up on the mailing list, and I agreed that the no-NUL-in-content restriction could be dropped. A NUL byte will still be required at the end, and it will be up to apps to decide if they want to add further restrictions.


I agree with keeping the terminating null byte. It eases cross-language compatibility, and it usually costs nothing because it fits into the padding bytes anyway. Btw, that's exactly what OCaml does (strings have a length and can include nulls, but still end with a null byte, so they can be passed to C functions).
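A tiny illustration of that convention (hypothetical helper, not from any of these libraries): the length field is authoritative, so the content may contain NULs, but the extra terminating byte keeps the buffer usable with C APIs expecting a plain char pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Length-prefixed string that also keeps a trailing NUL. The length
 * is authoritative (content may contain embedded NULs); the extra
 * terminator costs one byte, often absorbed by padding anyway. */
struct str {
    size_t len;
    char *data; /* len bytes of content + 1 NUL terminator */
};

/* Error handling omitted for brevity in this sketch. */
struct str str_make(const char *bytes, size_t len) {
    struct str s;
    s.len = len;
    s.data = malloc(len + 1);
    memcpy(s.data, bytes, len);
    s.data[len] = '\0'; /* safe to hand to C APIs, up to the first NUL */
    return s;
}
```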


Sorry, I thought this was an April Fools!

I saw "Infinitely Faster", and though uhh huh, of course. Then I clicked to the next page and "Note: Cap'n Proto is not ready yet" made me even more sure... only when I came here was it clear that the technical discussion wasn't a continuation of the joke!


Can somebody comment about this vs MessagePack please?


Well, I'm obviously biased, but:

MessagePack is a binary encoding of JSON, which actually makes it pretty different from both Cap'n Proto and Protocol Buffers, both of which require you to explicitly specify your type in an IDL (interface definition language).

There are trade-offs. The IDL is extra work. But when you encode an object in MessagePack, the full names of every field must be encoded on the wire. In Cap'n Proto (or Protobufs), those names are elided, saving space.

MessagePack is also primarily a sequential format (like protobufs): you have to parse the whole message before you can operate on it. Cap'n Proto is random-access, which is why it needs no parse step -- it can just operate on the raw bytes in memory as you go.

The down side of a random-access format is that it will tend to be fatter since all the widths are fixed. E.g. Protobufs or MessagePack will encode an int32 up to 127 in 1 byte, but Cap'n Proto will always use 4. That said, Cap'n Proto mitigates this by offering a simple "packing" scheme that compacts zero-valued bytes, and that tends to bring it to parity with protobufs for message size (haven't compared with MessagePack). The down side of this packing is that it of course means the protocol is back to being sequential, since you have to unpack it first. The packing is very fast, though, as it can be implemented in a tight loop with almost no branches.
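For illustration, here's a simplified sketch of the tag-byte idea (not the exact Cap'n Proto packed format, which also has special cases for runs of all-zero and all-nonzero words): each 8-byte word becomes one tag byte, with bit i set when byte i is nonzero, followed by only the nonzero bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified zero-byte packing sketch. Input length must be a multiple
 * of 8; the output buffer must hold up to len + len/8 bytes (worst case:
 * one tag byte per word, no zero bytes to drop). */
size_t pack(const unsigned char *in, size_t len, unsigned char *out) {
    size_t o = 0;
    for (size_t w = 0; w < len; w += 8) {
        size_t tagpos = o++;       /* reserve slot for the tag byte */
        unsigned char tag = 0;
        for (int i = 0; i < 8; i++) {
            if (in[w + i] != 0) {
                tag |= (unsigned char)(1u << i);
                out[o++] = in[w + i]; /* emit only nonzero bytes */
            }
        }
        out[tagpos] = tag;
    }
    return o;
}

/* Inverse: expand each tag byte back into a full 8-byte word. */
size_t unpack(const unsigned char *in, size_t len, unsigned char *out) {
    size_t o = 0, i = 0;
    while (i < len) {
        unsigned char tag = in[i++];
        for (int b = 0; b < 8; b++)
            out[o++] = (tag & (1u << b)) ? in[i++] : 0;
    }
    return o;
}
```

Note the tight inner loop with a single data-dependent branch per byte — the reason the real packing can be made very fast.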


hey kenton, any public thoughts on russ smith's similar project several years back?


I'm not familiar. Do you have a link?


will respond off-thread


"Note: Cap'n Proto is not ready yet"

Okey doke. Closes window

(i wonder if you'd put half the effort into the code that you did into this web site, whether you'd have a finished product...)


Measurably less effort has been put into the docs than the code. You can verify this for yourself by checking the commits: https://github.com/kentonv/capnproto/commits/master?page=1

Also, wow, your attitude is horrifically bad.


Also, wow, your attitude is horrifically bad.

You have an intense feeling of fear, shock or disgust because I don't want to use an unready product that I feel has an unnecessarily polished web site?

I'd hate to see your face once I review the code.


Oo, and you like to do the thing where you look up the dictionary definitions of words used to criticize you for their literal meaning. Cool.

My problem is not with your choice to use or not use. It's with the words you use to dismiss something that is well-within the norms of what's posted on here. Unfinished products go on HN all the time.

And, the website's not even that polished! It's a bunch of markdown design documents with two or three graphics attached. So, you have a bad attitude and you're wrong. That's off-putting, and apparently not just to me.

Oh, big bad Peter's going to review some pre-alpha open source code and dislike it. That is sure something that's going to make my face scrunch up.


I'm glad you think I'm cool. Sorry I don't toe the party line and blow every dev who releases a lack of a product so they can build interest based on promises (though I realize there is code available in this instance). I'm also sorry you don't like the website. And thank you for mentioning my name, I so love to see it in print. And sorry about your scrunchy face.


I'm fairly sure your sexual services would not be required by "every dev" who posts their unfinished open source projects on HN. Courtesy, on the other hand, or barring that, silence, could be useful.

http://ycombinator.com/newsguidelines.html


Yeah I don't feel great about publicizing an incomplete product, but as they say, release early, release often. Hopefully getting feedback now means the product will be better once it's production-ready.


This is exactly the sort of product to publicize early. Once too much investment has been made in the details of the protocol, things will necessarily tighten up. Makes loads of sense to open the dialog for the protocol early.


As a systems administrator, I'm relatively annoyed that you identify as a 'professional sysadmin' in your profile and then leave completely pointless remarks like this. There's a reason that we have a BOFH reputation in the aggregate and, typically, users are afraid to ask us for things. It's attitudes like this among us.


Considering that my comment has nothing at all to do with my job, i'm relatively sure I don't give a shit.

The reason systems admins have a bad reputation is: they earned it.


  my comment has nothing at all to do with my job
Correct, but your comment has everything to do with your attitude.


I really hope you get over whatever has made you feel entitled to talk about others' work in this manner. It will only hurt your intellectual development down the road. I'm not agreeing or disagreeing with you or your comment (although I do disagree - that's not the point). I disagree with your attitude in general. When I read this comment, I think of the guy who gets really angry at the waiter/waitress when they bring out a food order that the chef messed up/misheard. I'm not asking you to stop being opinionated, I'm asking you to treat others with some level of decency.


Also, I'm glad you like the site! My friend Amy threw together the design last night and I'm pretty happy with it. :) (I personally couldn't design a decent web page if my life depended on it... :( )


I actually do like it, though I have to say, for technical docs I like more compact designs like this: https://developers.google.com/protocol-buffers/docs/overview (I don't know why, but I feel like I'm scrolling more on Cap'n Proto pages to read the same amount of information as on the protocol buffers page)



