
Centralized Control, Distributed Data Architectures - amaks
http://highscalability.com/blog/2014/4/7/google-finds-centralized-control-distributed-data-architectu.html
======
askQ
A CS major in college here. This article suggests that centralized
architectures seem to be winning and that Google has learnt this through
experience. I have a question, and people on HN might be the right people to
ask it.

Why is it that you need experience to learn whether a centralized model is
better or a decentralized one? When a team is considering alternative models
(centralized vs. distributed), can't you just compute the parameters of each
model (complexity, network usage, availability, reliability, etc.) and pick
the one that is better for your objective? Why do they need trial and
experimentation?

I come from a college world where we are learning algorithms and systems, and
we can easily pick an algorithm for a problem based on complexity, etc.,
without having to implement the multiple algorithms we are considering. I find
that industry is a lot more trial and error. Why is this? An explanation with
an example would be great.

~~~
noelwelsh
The real world is too complex to analyse. Theory works by analysing a
simplified model of the world, with the hope that essential characteristics
are maintained and only irrelevant details are ignored.

Take the analysis of algorithms. I expect most of the analysis you've seen
assumes memory access takes constant time. This hasn't been the case for a
long time, due to caching. Most people don't really care about performance,
so it's a reasonable simplification. However, some people really do care about
performance and thus need a more complex model; see cache-oblivious algorithms
for some work that addresses this. When you get to the level of assembly-
language optimisations, no model is going to help you. You just need to
benchmark the code on a real system.
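A toy benchmark (my own sketch, not from the article; absolute timings will vary wildly by machine and interpreter) makes the point: summing the same values sequentially versus in a randomized order does identical work under the constant-time-memory model, yet the access pattern alone can change the measured running time.

```python
import random
import time
from array import array

N = 2_000_000
data = array("d", (float(i) for i in range(N)))

# Same indices, two orders: sequential vs. shuffled.
seq_idx = list(range(N))
rand_idx = seq_idx[:]
random.shuffle(rand_idx)

def timed_sum(indices):
    """Sum data[i] for each i in indices, returning (total, elapsed seconds)."""
    start = time.perf_counter()
    total = 0.0
    for i in indices:
        total += data[i]
    return total, time.perf_counter() - start

seq_total, seq_t = timed_sum(seq_idx)
rand_total, rand_t = timed_sum(rand_idx)

# Both runs perform exactly N additions -- the same big-O cost -- yet the
# randomized order defeats the cache's prefetching, which a constant-time
# memory model cannot express.
print(f"sequential: {seq_t:.3f}s  randomized: {rand_t:.3f}s")
```

All values and partial sums here are integers below 2^53, so both orders produce exactly the same total; only the timing differs.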

Similarly in distributed systems we work from simplified models. This is fine
for most of us, but Google has one of the most complex systems in the world.
Naturally their requirements are a bit off the beaten track, and perhaps not
adequately addressed by theory.

~~~
calibraxis
Yes, and oftentimes theory consciously sweeps essential things under the rug.
Take, for example, physics, which postulates frictionless surfaces. That
must've seemed like a ridiculous loss of explanatory power. But it turns out
that humans can get _principles_ that way, if we blatantly take a huge amount
of reality and dismiss it as "friction".

Science really is a different enterprise from engineering. I hear it's fairly
recent that scientists had anything to say to engineers and those in medicine.
(Now, of course, science is vital to these enterprises.)

------
jude-
Disclaimer: I'm a CS PhD student who works on distributed service
architectures.

It's worth pointing out that this design pattern only makes sense when the
entire system lives under one administrative domain. Google owns all of the
servers that make up GoogleFS; a cloud provider owns all of the Hadoop nodes
in its datacenters; a PaaS provider owns all of its NoSQL datastore nodes;
etc. We see a similar pattern at work in Puppet, Chef, Ansible, Func,
certmanager, etc. as well.

Under these circumstances, it's desirable to maintain the authoritative state
in a logically centralized place for two reasons. First, doing so makes it
easy for the rest of the system to discover and query it. Second, it makes it
easier to keep authoritative state consistent with updates. Centralizing
control and distributing data lets you address control-plane concerns
separately and independently of data-plane concerns.
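A minimal sketch of this split (entirely hypothetical names, not Google's actual design): a single master holds the authoritative placement metadata (control plane), while the bytes live on many data servers (data plane), so clients ask the master where data is and then move bytes directly.

```python
# Toy "centralized control, distributed data" sketch. The master is the one
# logically centralized place holding authoritative state, which makes it easy
# to discover, query, and keep consistent; data servers only hold bytes.

class DataServer:
    def __init__(self, name):
        self.name = name
        self.chunks = {}          # chunk_id -> bytes (data plane)

    def write(self, chunk_id, data):
        self.chunks[chunk_id] = data

    def read(self, chunk_id):
        return self.chunks[chunk_id]

class Master:
    """Logically centralized controller holding the authoritative placement map."""
    def __init__(self, servers):
        self.servers = servers
        self.placement = {}       # chunk_id -> DataServer (control plane)

    def place(self, chunk_id):
        # Trivial placement policy; the point is that only the master decides.
        server = self.servers[hash(chunk_id) % len(self.servers)]
        self.placement[chunk_id] = server
        return server

    def locate(self, chunk_id):
        return self.placement[chunk_id]

servers = [DataServer(f"ds{i}") for i in range(3)]
master = Master(servers)

# Client: consult the control plane for placement, then move bytes directly
# to the data plane -- the master never touches the data itself.
master.place("chunk-42").write("chunk-42", b"hello")
print(master.locate("chunk-42").read("chunk-42"))  # -> b'hello'
```

The design payoff is that the master can change placement policy, rebalance, or be made highly available without the data path changing at all.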

However, it stops making sense to centralize the authoritative state (control)
once you build a system that spans multiple administrative domains. Which
domain gets to host the authoritative state? How do you get the other domains
to act on it? Centralization won't work here, unless you can first get the
domains to agree on who's the controller (sacrificing their autonomy to decide
the state of the system).

We have addressed these concerns instead by distributing responsibility for
the authoritative state across domains, and devising a way for them to reach
consensus on it. DNS does this by delegating authority for name bindings
hierarchically. The Internet maintains routing state by having each AS learn
and advertise routes to each other AS via BGP. Bitcoin maintains the
blockchain (its authoritative state) by having a majority of nodes agree on
the sequence of blocks added to it. DHTs work by sharding the key space AND
routing state across their participants.
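The DHT case can be sketched with consistent hashing (a common technique in DHTs such as Chord; this toy version is my own, not any particular system's): every participant owns an arc of a shared identifier ring, so the key space is sharded with no central coordinator.

```python
import bisect
import hashlib

def ring_hash(value):
    """Map a string onto a fixed 32-bit identifier ring."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** 32)

class ToyDHT:
    """Shard the key space across nodes via consistent hashing: each key is
    owned by the first node whose ring position follows the key's, so
    responsibility for state is split among autonomous participants."""
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def owner(self, key):
        points = [p for p, _ in self.ring]
        # First node clockwise from the key's position, wrapping around.
        i = bisect.bisect_right(points, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

dht = ToyDHT(["node-a", "node-b", "node-c"])
for key in ["alice", "bob", "carol"]:
    print(key, "->", dht.owner(key))
```

Because every node computes ownership from the same hash function, any participant can route a key without consulting a controller, and adding or removing a node only moves the keys on its arc.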

It's harder to achieve consensus (and react to changes) in these multi-domain
settings than in the single-domain setting, since you can't force every
domain's replicas to agree. However, this is a _feature_ -- no one but the
computer's owner should have the final say on the state it hosts. Naturally,
multi-domain systems must account for this in their design, something that
Google's internal systems can safely ignore.
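A deliberately simplified quorum sketch (my own illustration, not any real protocol) shows the shape of the problem: in a multi-domain system you can only *propose* state, each autonomous domain votes on its own terms, and the value becomes authoritative only when a majority agrees.

```python
# Hypothetical multi-domain agreement: each domain keeps final say over its
# own replica, so a proposal is adopted only by majority quorum, never forced.

class Domain:
    def __init__(self, name, policy):
        self.name = name
        self.policy = policy      # each owner decides for itself

    def vote(self, proposal):
        return self.policy(proposal)

def propose(domains, proposal):
    """Adopt the proposal only if a strict majority of domains accepts it."""
    votes = sum(1 for d in domains if d.vote(proposal))
    return votes > len(domains) // 2

domains = [
    Domain("a", lambda p: True),
    Domain("b", lambda p: p != "hostile-update"),
    Domain("c", lambda p: p != "hostile-update"),
]
print(propose(domains, "benign-update"))    # all three accept
print(propose(domains, "hostile-update"))   # only one vote, quorum fails
```

In a single administrative domain the controller simply overwrites every replica; here the dissenting domains' refusal is respected by construction.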

~~~
bjornsing
I think you're going to see multiple systems with centralized control exert a
partial influence over systems (as opposed to complete control, like
Puppet/Chef). DNS is a pretty good example: nobody thinks of configuring a DNS
server on a system as setting up a "controller", but in some sense it controls
an important aspect of the system's behavior. So does NTP.

Another example is the SDN-inspired "Wi-Fi sharing" platform Anyfi.net that
I'm working on. It allows you to configure an "anyfi controller" e.g. on your
home Wi-Fi router, but that "controller" only has a say in how the spare
bandwidth and "extra SSIDs" on your router are used. It can set up Wi-Fi
networks and tunnel out the raw 802.11 frames to an endpoint anywhere on the
Internet, but it can't do anything that impacts your security or steals
significant portions of your bandwidth. In that sense it's somewhat like DNS,
but with even fewer security implications for the "controlled" system.

------
KaiserPro
Decentralization is hard.

To put it in different terms, it's like trying to control a plate of marbles
with a single pencil.

You can only manipulate a small portion of the marbles, and you hope that the
commands you give them will propagate and not run out of control.

Nowadays it's perfectly feasible to control 10,000 servers through one system
running on two or three servers. With some work, that could reasonably be
pushed to half a million.

That'll basically take care of the needs of 99.99% of all companies.

~~~
EGreg
In a decentralized system, it's not about control; it's about open-source
tools that interoperate and let people get things done, while giving them
power over their own experience if they want to exercise it. I've spent the
last 3 years thinking about how to implement social networking in a
distributed manner, and I'll just say I've figured it out and implemented it.

[http://platform.qbix.com/features/distributed](http://platform.qbix.com/features/distributed)

[http://www.faqs.org/patents/app/20120110469#b](http://www.faqs.org/patents/app/20120110469#b)

------
ninthfrank07
Check out [https://tent.io](https://tent.io) if you're interested in
decentralized services.

~~~
dfc
Looks kind of neat. Do you know of any projects using it?

~~~
ninthfrank07
The only project actively using it at the moment is
[https://cupcake.io](https://cupcake.io) (for example, my profile is
[https://frabrunelle.cupcake.is](https://frabrunelle.cupcake.is)).

This is because Tent app developers are waiting for Tent 0.4, which contains
several new features
([https://github.com/tent/tent.io/issues?direction=desc&labels...](https://github.com/tent/tent.io/issues?direction=desc&labels=v0.4&page=1&sort=updated&state=open)).
It should come out in a few months. The team behind Tent is currently focusing
on [https://flynn.io](https://flynn.io), which they are about to launch. Flynn
is important for Tent because it facilitates the deployment of Tent servers.

------
EGreg
Counterpoint:
[http://myownstream.com/blog#2011-05-21](http://myownstream.com/blog#2011-05-21)

------
outside1234
See also: Google benefits from centralized control.

