Storing C++ Objects in Distributed Memory (berkeley.edu)
122 points by xiii1408 41 days ago | 36 comments

I think stuff like this is why C/C++ are so popular. You can push the language really far to do interesting things. Obviously you can shoot yourself in the foot, but you can also do pretty much whatever you come up with.

Actually, most of the stuff described here is implementable in any language, since its public interface is just a container.

Dynamic languages for sure - you just dynamically dispatch your method call based on the type of the object in the container.

But what C++ is doing (and static languages with metaprogramming in general) is specializing and statically inlining everything at compile time so the assembly just takes the fastest path every time; no runtime cost.

I'm unsure which static languages do and do not support partial specialization. My understanding is that the current iteration of Rust doesn't have partial specialization, but maybe it has other features that could do what this post is doing?
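For anyone who hasn't seen the C++ side of this, here is a minimal sketch of compile-time selection via partial specialization. The `serializer` trait and its `to_bytes`/`from_bytes` members are hypothetical names made up for illustration, not the interface from the post. The idea is that trivially copyable types get a byte-copy path, `std::string` gets its own path, and the choice is made by the compiler rather than at runtime.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical trait for illustration only (not the article's interface).
// Primary template is declared but not defined, so unsupported types fail
// at compile time instead of at runtime.
template <typename T, typename Enable = void>
struct serializer;

// Partial specialization: matches every trivially copyable T and copies
// the object representation byte for byte.
template <typename T>
struct serializer<T, std::enable_if_t<std::is_trivially_copyable<T>::value>> {
  static std::vector<unsigned char> to_bytes(const T& value) {
    std::vector<unsigned char> out(sizeof(T));
    std::memcpy(out.data(), &value, sizeof(T));
    return out;
  }
  static T from_bytes(const std::vector<unsigned char>& in) {
    assert(in.size() == sizeof(T));
    T value;  // assumes T is also default constructible
    std::memcpy(&value, in.data(), sizeof(T));
    return value;
  }
};

// Full specialization: std::string owns heap memory, so its characters are
// copied out instead of its object representation.
template <>
struct serializer<std::string> {
  static std::vector<unsigned char> to_bytes(const std::string& value) {
    return std::vector<unsigned char>(value.begin(), value.end());
  }
  static std::string from_bytes(const std::vector<unsigned char>& in) {
    return std::string(in.begin(), in.end());
  }
};
```

A call such as `serializer<int>::to_bytes(42)` resolves to the byte-copy specialization at compile time, so there is no virtual call for the optimizer to see through.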

I'd be interested in hearing people's perspectives on the metaprogramming power of different statically typed languages - people have expressed interest in me building libraries like this in Rust and a few other languages.

I don't fully grok all of the details in the post, but yes, Rust does not have specialization. However, given the way traits work, I don't think it would be needed in Rust. Serde is the most popular serialization/deserialization framework. You get a Serialize trait that you implement for a given type, and calls already resolve to the specific implementation for that type.

Buuuut I'm probably missing something.

He specifically asked about partial specialization, which is akin to currying for templates.

Yeah, this is what I meant when I said I don't 100% understand all the nuances. I understand what that feature does, but given that Rust's system works quite differently, I'm not 100% sure what an exact translation would be.

Just have a generic abstract class with the required public interface, then have two implementations of it: a generic one, and a special one for trivially copyable types. Then you can have a factory that creates an instance based on the element type. This can be done in Java and C# alike.

The post gives an example of using an interface with static dispatch. This is in contrast with a regular interface (an abstract class) which implies the use of dynamic dispatch.

I don't think it matters in the scenario though. The cost of dynamic dispatch will be negligible in comparison to remote call costs, even into a different process on the same machine.

Besides, there are techniques to avoid dynamic dispatch for generic specializations, at least in C#.

And even more techniques in F#, with its type system and inlining.

This was already broadly used at least 10 years ago. Interestingly, modern C++ techniques tend to be ignored by HN folks, while old stuff[0][1][2][3] generally gains more points.

[0] this thread

[1] https://news.ycombinator.com/item?id=18650902

[2] https://news.ycombinator.com/item?id=16257216

[3] https://news.ycombinator.com/item?id=18281574

Perhaps this is because modern language designers gradually solve more niche problems as time progresses, while the rest of the world is still catching up to pre-2000 concepts. The '90s were a great time for new strides in language design and memory management, but now we're only making marginal improvements. People will vote for things that teach them something new.

We were doing this stuff in 1993 with OS/2, Rogue Wave libs and SmartHeap. It was a LOT of work and exposed many a buggy motherboard.

We had to build everything ourselves except for the STL-like features of RW Tools.h++ and a few of their other libs.

1990s C++ compilers were ... interesting ... - especially for templates.

Distributed Objects in Obj-C was all about this in the '90s, I believe.

Yeah, writing portable C++ code across UNIX flavours, OS/2, and Windows was indeed interesting, even when using plain old C.

Have you used SOM as well?

>> Have you used SOM as well?

We tried a few things with it. It was a good idea that took too much manual work. Much like CraptiveX/COM but without the toolworks.

That's super cool. I'd love to see any of your papers/code.

I wish I still had it. There was a lot of my blood, sweat and frustration in that code. :)

I lost everything I had archived in a series of hardware failures and stupidities...

I've repeatedly asked the co-founders to let me have some of the stuff and they still refuse. They didn't think I should have had it to start with.

It really was simple in concept though. Basically a queue server on each machine, memory mapped files and containers/string class/bignum class etc. using placement new. The hard part was finding machines that weren't steaming piles of crap. Compaq was the only company we found that had server class machines we could rely on. We also had to stick to IBM's C++ compiler due to all the bugs in Borland and Watcom at the time.

Distributed data structures are an interesting idea, but I’m not sure this particular implementation is the right abstraction level for many problems.

These sorts of things form a spectrum from low-level message passing all the way to a full-fledged database. In between are abstractions like MapReduce, which still give a lot of control but also give you fault tolerance and require less cluster configuration to run different applications.

Other abstractions in this space, just a bit higher than ordinary message passing, are systems like Kafka and RabbitMQ, which give additional guarantees over plain message passing over sockets and need less configuration to add or remove machines.

One thing a lot of the more successful abstractions in this space seem to have is a pretty clear line between which parts of the system are local and which are distributed. It seems like this system doesn’t have as clear a demarcation, which would make understanding performance much more difficult. Databases can also have this problem since they do a lot of low-level performance tuning automatically, but at least they provide a very high level of abstraction for applications to work with. This seems like it could end up being confusing to tune without providing a nice high-level interface like databases do.

This post is really just about an abstraction for storing objects and doesn't explain how you'd make data structures that distribute objects across nodes using remote pointers.

Remote pointers end up being pretty nice for locality, since you can explicitly see what process a remote pointer is pointing to.

None of the backends that my remote pointer implementations depend on really offer fault tolerance, though. So you're right: if you're in a situation where you have nodes entering or leaving your cluster before your program ends, this model needs extending before it works for you. That's true for MPI-style programs, generally, though.

Another option for variable size types not explored in the article is to set aside some contiguous part of memory for these objects. The main idea is to store pointers as relative addresses from the start of the memory region.

You can then avoid doing any copying or serialization by directly sending the entire memory region as a unit.

The main difficulties with this approach are deciding how large to make these regions and handling frees within a region. The first part is often not too bad depending on the application, but handling deallocation and expanding/shrinking memory regions can be tricky. The allocation problem is much harder to dismiss, since it has to be solved to avoid making the user decide how much memory to allocate up front, which is impossible for many applications. For instance, when dealing with strings, it may be impossible to get a good bound on the size of these strings.
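To make the relative-address idea concrete, here is a rough sketch, with every name (`Region`, `Node`, `rel_off`, `kNullOffset`) invented for illustration. A linked list lives inside one contiguous byte buffer, and each link is an offset from the start of the buffer rather than a raw pointer. "Sending" is simulated by copying the bytes into a second buffer. The sketch only ever appends, so it sidesteps the deallocation and sizing problems described above rather than solving them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical names throughout; a sketch of the idea, not a real library.
using rel_off = std::uint32_t;
constexpr rel_off kNullOffset = 0xFFFFFFFFu;  // plays the role of nullptr

struct Node {
  int value;
  rel_off next;  // offset of the next node from the region start
};

struct Region {
  std::vector<unsigned char> bytes;  // the contiguous memory region

  // Append a node that links to `head`; returns the new node's offset.
  // Growing the vector may move the bytes, which is harmless here because
  // no absolute addresses are stored.
  rel_off push_front(int value, rel_off head) {
    const Node node{value, head};
    const rel_off at = static_cast<rel_off>(bytes.size());
    bytes.resize(bytes.size() + sizeof(Node));
    std::memcpy(bytes.data() + at, &node, sizeof(Node));
    return at;
  }

  // Read the node stored at offset `at`.
  Node load(rel_off at) const {
    assert(at + sizeof(Node) <= bytes.size());
    Node node;
    std::memcpy(&node, bytes.data() + at, sizeof(Node));
    return node;
  }

  // Walk the list starting at `head` and add up the values.
  int sum(rel_off head) const {
    int total = 0;
    for (rel_off at = head; at != kNullOffset; at = load(at).next) {
      total += load(at).value;
    }
    return total;
  }
};
```

After building a list in one `Region`, assigning its `bytes` to another `Region` yields a list that can be walked from the same head offset with no fix-up pass, which is the property that lets the whole region be shipped as a unit.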

> The main idea is to store pointers as relative addresses from the start of the memory region.

One problem is that languages don't support the use of standard container types inside a region with a variable base address (variable because it will change the next time you call mmap).

I think modern languages should support this concept, especially the ones that aim to be "systems" programming languages.

> mmap

Funny you should mention that. Something has kept me thinking that modern operating systems should have built-in support for cross-node sharing of virtual memory via mmap.

Some cool people at VMware wrote some software that lets you manage RDMA segments as files in the file system [1]. And you can mmap those RDMA-resident files if you wish.

[1] https://256fd102-a-62cb3a1a-s-sites.googlegroups.com/site/mk...

Boost.Interprocess does. Pointers to variable base addresses can be implemented as offset pointers (the trick is storing an offset from this, the pointer's own address, instead of from the mmap start address).

It's a weird article in that it does not mention at all that for remote containers it's more beneficial to expose and use range queries. E.g. you don't ever want to access a remote list element by element, due to latency. You will be much better off with chunked enumeration.

Sure. Most of the time you'd be better off with an explicit endpoint and protocol, using something like protobufs or Cap'n Proto (which provide both serialization of objects and a relatively easy to use RPC interface for remote calls).
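A stripped-down illustration of the self-relative trick, under loud assumptions: `Record`, `point_at`, and `resolve` are hypothetical helpers, not Boost's API, and null handling is left out. Boost.Interprocess wraps the same idea in a pointer-like class, `offset_ptr`. The field stores the distance from its own address to the target, so the link survives when the enclosing bytes are copied or mapped somewhere else.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; not Boost's offset_ptr.

// Store in `field` the signed distance from the field itself to `target`.
inline void point_at(std::ptrdiff_t& field, const void* target) {
  assert(target != nullptr);  // this sketch has no null representation
  field = static_cast<std::ptrdiff_t>(
      reinterpret_cast<std::intptr_t>(target) -
      reinterpret_cast<std::intptr_t>(&field));
}

// Recover the target address from wherever the field currently lives.
template <typename T>
T* resolve(const std::ptrdiff_t& field) {
  return reinterpret_cast<T*>(
      reinterpret_cast<std::intptr_t>(&field) + field);
}

// Trivially copyable, so copying its bytes with memcpy is well defined.
struct Record {
  std::ptrdiff_t payload_link;  // self-relative link to `payload`
  int payload;
};
```

If `a.payload_link` is pointed at `a.payload` and the whole `Record` is then copied with `std::memcpy` into `b`, resolving `b.payload_link` yields `&b.payload`, not `&a.payload`. A raw pointer would still refer to the old location.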
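A toy way to see the difference, with a made-up `FakeRemoteList` standing in for a container on another node. Nothing here is from the article, and "round trips" are just a counter, not real network calls. Element-by-element access pays one round trip per element, while a range query pays one per chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a remote container; each call that would hit
// the network just bumps a counter.
class FakeRemoteList {
 public:
  explicit FakeRemoteList(std::vector<int> data) : data_(std::move(data)) {}

  // One simulated round trip per element.
  int get(std::size_t i) {
    ++round_trips_;
    return data_.at(i);
  }

  // One simulated round trip per chunk; the range is clamped to the list.
  std::vector<int> get_range(std::size_t first, std::size_t count) {
    ++round_trips_;
    first = std::min(first, data_.size());
    count = std::min(count, data_.size() - first);
    return std::vector<int>(data_.begin() + first,
                            data_.begin() + first + count);
  }

  std::size_t size() const { return data_.size(); }
  std::size_t round_trips() const { return round_trips_; }

 private:
  std::vector<int> data_;
  std::size_t round_trips_ = 0;
};

// Sum the list one element at a time.
long sum_one_by_one(FakeRemoteList& list) {
  long total = 0;
  for (std::size_t i = 0; i < list.size(); ++i) total += list.get(i);
  return total;
}

// Sum the list by fetching `chunk` elements per request.
long sum_chunked(FakeRemoteList& list, std::size_t chunk) {
  assert(chunk > 0);
  long total = 0;
  for (std::size_t i = 0; i < list.size(); i += chunk) {
    for (int v : list.get_range(i, chunk)) total += v;
  }
  return total;
}
```

With ten elements and a chunk size of four, the chunked loop makes three simulated round trips where the element-wise loop makes ten.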

Making remote accesses explicit rather than implicit makes it a little more obvious to code readers how grossly expensive they will be.

To be fair, the collection can internally do prefetching and write accumulation.

I'm not sure adding black boxes really helps design reliable/performant distributed systems :-).

Everyone uses the semi-transparent caches in CPUs. If the behavior is predictable, I don't see a problem.

CPU caches are good about correct cache invalidation (and must be). Software caches aren't always. Software caches don't always do a great job of bounding cache size, either.

Today's multi-core systems are really distributed systems, and the cpu cache does not hide that well enough that OS software doesn't have to work around it.

For example, kernels have to send IPIs (interprocessor interrupts) to other CPU cores to get them to perform some actions; cores sharing a specific cache line, or memory addresses that happen to alias to the same line in cache, can create a 'ping-pong' effect that can severely degrade performance; L3 cache latency can be non-uniform depending on how far away the cache is from that particular core. The black box is breaking down as core counts increase.
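The cache-line sharing case is the one application code usually works around by hand. A small layout sketch, assuming a 64-byte line (the real size is hardware-dependent) and using made-up struct names: two counters updated by different threads either share a line or are padded onto separate ones.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kAssumedLineSize = 64;

// Both counters are likely to land on one cache line, so two cores updating
// them independently would still contend for that line.
struct PackedCounters {
  std::atomic<long> a{0};
  std::atomic<long> b{0};
};

// Each counter is aligned to its own (assumed) cache line.
struct PaddedCounters {
  alignas(kAssumedLineSize) std::atomic<long> a{0};
  alignas(kAssumedLineSize) std::atomic<long> b{0};
};

// Byte distance between the two counters inside one object.
template <typename Counters>
std::size_t distance_between_counters(const Counters& c) {
  const char* pa = reinterpret_cast<const char*>(&c.a);
  const char* pb = reinterpret_cast<const char*>(&c.b);
  assert(pb >= pa);
  return static_cast<std::size_t>(pb - pa);
}
```

This only shows the layout. Whether padding actually helps depends on the access pattern and the machine, so it is worth measuring before committing to it.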

  template <>
  struct serialize<int> {
    int serialize(const int& value) {
      return value;
    }

    int deserialize(const int& value) {
      return value;
    }
  };

  error: return type specification for constructor invalid

Yep, you caught my typo. Notice the Godbolt version has serialize_().

Well off topic, but does anyone know what blog engine/style this is done in?

It's just CSS based on the LaTeX defaults, written by Andrew Belt as a style for Wikipedia [1,2].

[1] https://github.com/AndrewBelt/WiTeX

[2] https://andrewbelt.name


Please don't use throwaways to post throwaway comments. Eventually we ban the main account.

