Hacker News new | comments | show | ask | jobs | submit login

Each node will usually be hosted on a single machine (the only reason to host more than one node per machine is for testing - it isn't require for load balancing like with DHT solutions), so 30M is probably not feasible.

Most communication among nodes is peer-to-peer and uses UDP, so there are a few things that can limit you. First, is your data-structure itself and the amount of contention for items. Second, the first time a node requests an item, it has to find it (no well-known hashing scheme), so it can either multicast the cluster, or ask a central server. Multicasting with 30,000 nodes is probably not the right way to go, so you'll need a server, and that puts a practical limit. Third, Galaxy uses either ZooKeeper or JGroups for cluster management (peer detection, fault detection, leader election etc.), so either of these solutions might also have some limitations. Fourth, if your cluster is huge you might hit network bandwidth issues, though that seems unlikely.

EDIT: Forgot to address the latency question. Galaxy is consistent (and not eventually consistent), so to get a new version of an item from a node you'll need to wait for the invalidation acks. However, the amortized cost is O(M/N log N) no matter what - the larger and taller the tree, the more nodes you'll need and the more nodes sharing the root, but updates to the root will be less frequent (because the tree is taller).

What I find the most interesting thing about Galaxy is the distributed data-structures people will come up with. Some will be better suited to Galaxy and some will not. But because Galaxy's architecture mimics CPU architecture, I think that modern data-structures designed around caches and NUMA would be excellent candidates for distribution with Galaxy.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact