SpaceBase is a distributed in-memory chunk store. Each chunk is referenced by an "id" and stores unstructured data (as far as the store knows). Each chunk is "owned" by a single node, but other nodes may keep a read-only copy. If another node wants to write to a chunk, it first requests an ownership transfer, and the new owner is broadcast. This design is inspired by how L1 caches in CPUs work. The theory is that with an MMO, the data is geo-localized and each machine takes care of a part of the whole world. By keeping data local to the machine, you avoid network I/O.
There are two things I didn't see in the article. The first is how read caches are invalidated: maybe data is eventually consistent, or cache invalidation is broadcast on write. The second is how data from a stale node is recovered. That's the difference I thought of when I saw the CPU reference.
I will write another post in the coming weeks, explaining the cache-coherence protocol in detail, but let me just say now that Galaxy is always consistent (i.e. not eventually-consistency). Some more details: read-caches are invalidated when a node wishes to write (just like L1 caches), but the writer doesn't need to wait for acknowledgements in order to proceed - this helps with latency. Also, the invalidation doesn't need to be broadcast, because the current owner of a cache-line maintains a list of all readers.
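The protocol described above can be sketched in a few lines. This is my own toy model, not Galaxy's actual code: the owner of a cache line tracks its readers, so invalidation is a targeted message to each known reader rather than a cluster-wide broadcast, and the writer proceeds without waiting for acknowledgements.

```python
class CacheLine:
    def __init__(self, data):
        self.data = data
        self.version = 0
        self.readers = set()          # node ids holding a read-only copy

class OwnerNode:
    def __init__(self):
        self.lines = {}
        self.sent_invalidations = []  # stand-in for the network

    def read(self, line_id, reader_id):
        line = self.lines[line_id]
        line.readers.add(reader_id)   # remember who holds a copy
        return line.data

    def write(self, line_id, data):
        line = self.lines[line_id]
        # invalidate only the known readers -- no cluster-wide broadcast
        for r in line.readers:
            self.sent_invalidations.append((r, line_id))
        line.readers.clear()
        line.data = data              # proceed without waiting for acks
        line.version += 1

owner = OwnerNode()
owner.lines["x"] = CacheLine(b"v0")
owner.read("x", reader_id=2)
owner.read("x", reader_id=3)
owner.write("x", b"v1")
print(sorted(owner.sent_invalidations))  # invalidations went to nodes 2 and 3
```

The interesting property is in the last two comments: the reader set makes invalidation O(number of sharers), and the write completes locally while the invalidations are still in flight.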
Any hints you want to give on how to handle node failures? For me, that's the potential weak point.
Fault tolerance is really the hardest part of the implementation. The documentation contains a detailed explanation of the features and how to use them. A future blog post will explain all the algorithms in detail (it's simply too long to post in a comment). In the meantime, you can take a look at the code on GitHub if you feel like it.
Great project. Congratulations on your decision to do real open development from the beginning. Looking forward to your next post.
I'll admit that I'm thinking of a wide-area application. Maybe another version of the question is: what are you targeting as a max number of nodes (30? 30,000? 30M?) What is your goal for invalidation latency? And, not to quibble, but how much invalidation latency can you have before being considered eventually consistent? I assume you do guarantee the order of the invalidations?
Perfectly cool if the answer to these questions is "maybe later" or "maybe not".
Can't wait to see your deeper explanation. I really love this project because it's a whole different take on distributed memory. I'm embarrassed that I had settled into thinking that a DHT was -the- way to distribute shared memory, and I hope this project provokes lots of us to think more broadly. I don't know for sure that it will have a large number of applications, but I am really glad that we'll all be able to think more broadly about distributed systems as a result of the open work you are doing.
Most communication among nodes is peer-to-peer and uses UDP, so there are a few things that can limit you. First is your data structure itself and the amount of contention for items. Second, the first time a node requests an item, it has to find it (there is no well-known hashing scheme), so it can either multicast to the cluster or ask a central server. Multicasting with 30,000 nodes is probably not the right way to go, so you'll need a server, and that puts a practical limit. Third, Galaxy uses either ZooKeeper or JGroups for cluster management (peer detection, fault detection, leader election, etc.), so either of these solutions might also have some limitations. Fourth, if your cluster is huge you might hit network bandwidth issues, though that seems unlikely.
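To illustrate the second point, here is a toy version of the central-server option (the names are invented for this sketch, not Galaxy's API): because there is no well-known hashing scheme, an item's owner can be any node, so a requester that has never seen the item consults a directory instead of multicasting to the whole cluster.

```python
class Directory:
    """Central server mapping item id -> current owner node."""
    def __init__(self):
        self.owners = {}

    def register(self, item_id, node_id):
        # called when a node takes ownership of an item
        self.owners[item_id] = node_id

    def lookup(self, item_id):
        # returns the owning node, or None if the item is unknown
        return self.owners.get(item_id)

directory = Directory()
directory.register("chunk-42", node_id=7)

# A node that has never seen "chunk-42" asks the directory rather than
# multicasting the cluster:
print(directory.lookup("chunk-42"))  # 7
```

The practical limit mentioned above follows directly: every first-time access funnels through this one server, so its throughput caps the cluster size.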
Forgot to address the latency question. Galaxy is consistent (and not eventually consistent), so to get a new version of an item from a node you'll need to wait for the invalidation acks. However, the amortized cost is O(M/N log N) no matter what - the larger and taller the tree, the more nodes you'll need and the more nodes sharing the root, but updates to the root will be less frequent (because the tree is taller).
What I find the most interesting thing about Galaxy is the distributed data-structures people will come up with. Some will be better suited to Galaxy and some will not. But because Galaxy's architecture mimics CPU architecture, I think that modern data-structures designed around caches and NUMA would be excellent candidates for distribution with Galaxy.
The novel quality of this system is that node data-locality is determined by how you access the data.
Let's say you have 10 server nodes and 10^6 clients. That means that, in the best case, each node deals with 10^5 clients. Presumably clients (or players) can move through the simulated world, hopping from node to node.
Here things get hazy in my mind. For example, how does the client know which server to connect to? Does it just make a guess and get redirected if it guessed wrong? When a player crosses a node threshold, how does that work?
I'm thinking there must be a central character store - basically a traditional database, that handles initial node affinity. Position in the world is part of character state, and when you login, you'll be handed off to the correct node based on that state.
But if this is how things work, when would you ever need your "cache lines" moved from node to node? The world-state is spatial so why move it from node to node? I guess that's the crux - I can't think of any other data that would need to move between nodes other than player data, and in that case I don't see what data it would need to take with it.
> how does the client know which server to connect to?
Galaxy doesn't handle any client connections, only connections between cluster nodes, but if you were to build something on top of it that connects to clients, then, yeah, starting with a guess and redirecting is ok. An initial node that simply directs connections might work, too. And if players move from one place to another, having your communication layer tell them to connect to a different node is pretty much what we had in mind.
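Guess-and-redirect is easy to picture with a sketch. Everything here is hypothetical, layered above Galaxy (which handles no client connections): a client connects to any node, and if that node doesn't own the player's region, it answers with the right node instead of the data.

```python
WORLD_WIDTH = 300.0

def node_for(x, num_nodes=3):
    """Toy placement: carve the world into equal vertical strips."""
    return int(x // (WORLD_WIDTH / num_nodes))

def handle_connect(receiving_node, player_x):
    correct = node_for(player_x)
    if correct == receiving_node:
        return ("accepted", receiving_node)
    # wrong guess: tell the client which node to connect to instead
    return ("redirect", correct)

print(handle_connect(0, player_x=250.0))  # ('redirect', 2)
print(handle_connect(2, player_x=250.0))  # ('accepted', 2)
```

The same message works when a player crosses a node boundary mid-game: the communication layer sends a redirect and the client reconnects.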
> when would you ever need your "cache lines" moved from node to node?
Yes, player data, NPC data, vehicle data - anything that moves. BUT, another big reason for data migration is load-balancing. Continuing with your game example, if a lot of players congregate in one area, handled by one machine, you may decide to split it to two machines, and migrate half of the information there.
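The load-balancing case can be made concrete with a toy split (invented code, not a Galaxy API): when one machine's region gets too crowded, split the objects at the median coordinate and migrate half of them to a second machine.

```python
def split_region(objects, max_load):
    """objects: list of (obj_id, x). Returns (kept, migrated)."""
    if len(objects) <= max_load:
        return objects, []          # under the limit: nothing moves
    objects = sorted(objects, key=lambda o: o[1])  # order by position
    mid = len(objects) // 2
    # the western half stays put; the eastern half migrates to a new node
    return objects[:mid], objects[mid:]

crowd = [("p%d" % i, float(i)) for i in range(10)]
kept, migrated = split_region(crowd, max_load=6)
print(len(kept), len(migrated))  # 5 5
```

Note that the split is spatial, not arbitrary: the migrated objects are the ones that were already near each other, so their interactions stay node-local after the move.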
If you were to use Galaxy for a graph database (forgetting the MMO use-case for now), then while the graph vertices don't "move", changes in the edges might make you decide on a better distribution of the vertices over the cluster.
The best way to experiment with distributed systems design is to build a real distributed system - or at least be able to sketch one out.
Frankly, this seemed like a solution looking for a problem, and your vague responses support that impression. The dynamic of being able to move data between nodes as a side effect of access patterns is interesting, but it's not clear how an MMO could really use it to good effect. Indeed, even in the fault-tolerant case, it's not clear how this dynamic would help failover: would you need to duplicate access patterns prior to node failure to ensure dual-local data?
Frankly, I think you should focus on one use-case (MMO, graph database, something) and hand-wave a complete solution that really leverages the novelty of your approach. Get specific and talk about what actually happens when "things move".
My post was meant to be an introduction to a series of very technical blog posts discussing theoretical and practical aspects of distributed systems. The post was not meant to serve a clear commercial purpose, so I was trying to steer the discussion away from commercial uses and more to its CS aspects. You know, we really find this stuff interesting.

Some of my future posts will discuss the more theoretical sides of Galaxy and will drill very deeply into its design and algorithms, while others will discuss how SpaceBase will make use of Galaxy to help MMOs build huge, rich worlds, and LBSs track lots of moving objects in real-time.

To be more concrete and give just a taste, I'll say this: when SpaceBase runs on top of Galaxy, objects are transferred from one node to another to create a dynamic area-of-responsibility for each node. This means that each node will be responsible for processing all objects in some region of the game world (or real world for LBSs). But the regions are dynamic - namely, they shrink and grow to accommodate non-uniform load, so that small busy areas will be split over several nodes, while large, relatively vacant ones will be handled by just one.
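One way to picture those dynamic regions is a recursive split driven by load. This is my own hedged sketch of the idea, not SpaceBase code, and it only shows the "grow" direction: a 1-D region splits in two when its object count crosses a high watermark (merging under-loaded siblings back together would use the same bookkeeping in reverse).

```python
HIGH = 8  # split a region once it holds more than this many objects

class Region:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.objects = []        # positions inside [lo, hi)
        self.children = None     # (left, right) once split

    def mid(self):
        return (self.lo + self.hi) / 2

    def insert(self, x):
        if self.children:        # interior region: route to a child
            child = self.children[0] if x < self.mid() else self.children[1]
            child.insert(x)
            return
        self.objects.append(x)
        if len(self.objects) > HIGH:   # busy area: split responsibility
            left = Region(self.lo, self.mid())
            right = Region(self.mid(), self.hi)
            for o in self.objects:
                (left if o < self.mid() else right).objects.append(o)
            self.objects, self.children = [], (left, right)

    def leaf_count(self):        # leaves ~ areas of responsibility
        if not self.children:
            return 1
        return sum(c.leaf_count() for c in self.children)

world = Region(0.0, 100.0)
for i in range(12):              # players congregate below x = 50
    world.insert(i * 4.0)
print(world.leaf_count())  # 3
```

The crowded half of the world ends up covered by more regions (and hence, in a cluster, more nodes) than the empty half, which is exactly the non-uniform-load behavior described above.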
The node could be experiencing a soft failure—excessive ECC errors, power supply failure, kernel security update needed, etc.
Isn't the only real bottleneck for large-scale real-time MMOs the network bandwidth needed between the server and client? While this tech would improve the efficiency of handling data on the server, it wouldn't be able to solve the inherent problem of network limitations.
Major props for making it open source though, I'll enjoy looking at the code.
EVE has the nice property where the universe is already split up into these nice partitions connected by a network of jump gates. It maps very nicely onto a server farm.
Sorry if my previous post came off as dismissive; as a game development enthusiast, I was focused only on the application to games rather than the technology itself. The point I was trying to make is that even with this technology, we are unfortunately not closer to being able to create real-time action MMOs. For an advancement in this area, we need a leap in network capabilities, rather than server efficiency.
This doesn't solve any new problems; what it does do is provide a licensable technology that many studios would otherwise have to re-invent themselves, and there is value in that.
Then, instead of node ownership based on the space grid, you'd want to somehow have node ownership based on clustering/density.
So maybe constantly iterate a K-means clustering algorithm, where servers are cluster centers and every player/client in the cluster belongs to that server.
That would be my back of the envelope approach. It probably has lots of terrible flaws that I haven't thought of yet.
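To make the back-of-the-envelope idea above concrete, here is one Lloyd iteration of k-means over 2-D player positions (my sketch, not a tested design): each server is a cluster center, each player belongs to the nearest server, and each server's center moves to the mean of its players.

```python
def kmeans_step(centers, players):
    """One Lloyd iteration. centers, players: lists of (x, y)."""
    assignment = []
    for px, py in players:
        d = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers]
        assignment.append(d.index(min(d)))    # nearest server wins
    new_centers = []
    for c in range(len(centers)):
        members = [p for p, a in zip(players, assignment) if a == c]
        if members:  # move the server's center to its players' mean
            new_centers.append((sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members)))
        else:
            new_centers.append(centers[c])    # keep an empty server put
    return new_centers, assignment

players = [(0, 0), (1, 1), (10, 10), (11, 9)]
centers = [(0, 5), (10, 5)]                   # two servers
for _ in range(5):                            # re-iterate as players move
    centers, assignment = kmeans_step(centers, players)
print(assignment)  # [0, 0, 1, 1]
```

"Constantly iterate" then just means re-running the step as positions change, so the centers drift with the crowds. The obvious flaw to watch for is the handoff churn for players sitting on a cluster boundary; some hysteresis would probably be needed.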
One of my favorite articles is on this very subject: http://www.gamasutra.com/view/news/114192/Opinion_Designing_...
- Shard: Encompasses all parts of a single game world
- Zone: Area of the game which can't be further broken down; the players in this area are all connected to the same node and performing actions there.
- Instance: Area of the game which is unique for a player or group of players. This is done so that players can be together and not be interrupted by others, e.g. someone killing a boss before the group gets to it and having to wait for it to respawn. These are created on demand by players and go away after the players leave.
- Node: Single machine that runs parts of a shard. This could be one zone with a high average player count, or a bunch of zones with fairly low player counts (e.g. large open areas) and maybe some instances thrown in.
Your end-to-end latency, from a user clicking something to seeing the result, is 2x network latency + simulation time.
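As a worked example of that formula (the numbers are illustrative, not measured): with 50 ms one-way network latency and a 16 ms simulation tick, the click travels to the server, gets simulated, and the result travels back.

```python
def end_to_end_ms(one_way_latency_ms, simulation_ms):
    # request to the server + simulation + response back to the client
    return 2 * one_way_latency_ms + simulation_ms

print(end_to_end_ms(50, 16))  # 116
```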
While I like software implementations of hardware optimizations as much as the next guy, what kind of replication distribution do you expect? You're splitting up your data-set in RAM across many machines, yet you've got them replicating data from each other. How much of your data-set do you expect to be accessing frequently?
This is all for latency reasons. Fault-tolerance is a different story.
I was part of a team that invented "spacebase" back in the mid-1990s! We were able to support millions of simultaneous MMO players in the age when most people were using dialup modems and had vastly less bandwidth and vastly higher latency than they do now. This technology ended up being acquired by Sony, and used as part of the playstation network.
Like this company, our work grew out of simulation programming done originally for the military (in this case, the DoD), and like this company, we provided an API and a solution to rapidly partition the space so that the game client would only need to know about objects located near it according to in-game geometry. Like this product, ours was fully distributed, etc.
Alas, we were ahead of the age of MMOs; while World of Warcraft didn't yet exist, Ultima Online did, and there were a lot of other attempts at MMOs.
Nowadays, if there were less temporal distance, people would say "They ripped us off!", but I can totally believe this company had the same idea... and they saw a green field because there were no competitors.
The problem is, there were no competitors because (at least back then) game developers were not interested in solutions they didn't invent themselves. Maybe that has changed.
I'd say that if you disregard indie development, that is still pretty much the case. There is certainly a bigger market for middleware, but I think it's generally met with skepticism.