
Distributed is not necessarily more scalable than centralized - mad44
http://muratbuffalo.blogspot.com/2014/07/distributed-is-not-necessarily-more.html
======
MCRed
Dropbox is distributed. According to the article, it uses AWS, which is a
Dynamo based system. Among its other features, Dynamo allows you to distribute
data across many servers, using a hash of the data's key in order to look it
up (each server gets some of the keyspace).

Riak is a similar type of system.

Dropbox is "centralized" in the sense that it is one service, but it's not the
opposite of distributed which would mean "running all on one computer."

Edit: I said "hash of the data's key" but really it's a hash of the key plus
the bucket.
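
To make the keyspace idea concrete, here is a minimal sketch of a consistent-hash ring. It is only an illustration, not how Dynamo, S3, or Riak actually implement it: the server names, the `bucket/key` string format, the use of MD5, and the `HashRing` class are all assumptions made up for the example.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a position on the hash ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: each server owns the arc that ends at its position."""

    def __init__(self, servers):
        # Sort servers by their position on the ring.
        self._ring = sorted((ring_position(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._ring]

    def server_for(self, bucket: str, key: str) -> str:
        # Hash the bucket together with the key (hypothetical "bucket/key" format).
        pos = ring_position(f"{bucket}/{key}")
        # Pick the first server clockwise from that position, wrapping at the end.
        idx = bisect.bisect_right(self._positions, pos) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical server names, purely for illustration.
ring = HashRing(["server-a", "server-b", "server-c"])
owner = ring.server_for("photos", "cat.jpg")
print(owner)
```

Any client that knows the server list can compute the owner locally, so no central lookup table is needed, and removing a server that does not own a key leaves that key where it was.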

~~~
asgard1024
I think you missed the point. The big question is, what is "one computer"? It
surely isn't just one atom (one point in space). Is something running on a CPU
with several ALUs a distributed system? Is something running on a multicore
server a distributed system?

So it seems, distributed systems are not defined in terms of spatial or
logical distribution of "stuff being done" (because everything is distributed
in this sense), but rather by assumption of (un)reliability of the links
between the components, and the choice of components. And this may depend on
your vantage point, too.

So if that reliability is good enough, you don't need the theory of distributed
systems, and it all makes the system more efficient. In that sense, Dropbox is
more efficient than a P2P solution because you have a hidden assumption (from
the user's perspective) that Dropbox servers will always be available.

Edit: The reliability of components also plays a role. Thinking about it more,
it really seems to be a question of hierarchy. Distributed systems are less
hierarchical than their centralized counterparts; there are fewer bottlenecks
(that may fail) but more coordination is required. So the OP is basically
arguing that some hierarchy scales better than no hierarchy.

------
match
It seems the author is conflating centralized/decentralized with
distributed/monolithic. Dropbox is obviously a distributed system.

------
menzoic
> The student persisted and kept repeating that "Dropbox has a bottleneck
> because it is a centralized storage solution, and the distributed solution
> doesn't have that bottleneck". I couldn't believe my ears.

The student is correct. Let's ignore the fact that Dropbox is actually
distributed and say it is centralized because all nodes of the system belong
to one provider. The only way Dropbox could have scaled to 200m users was tons
of cash. In a distributed solution where each node is a provider themselves,
each additional user could potentially increase the performance of the system.
The distributed alternative scales much more gracefully without running into
the bottleneck of needing more cash to buy more machines/storage/bandwidth. In
this particular frame, distributed is most definitely always more scalable
than centralized unless you have unlimited cash.

~~~
jrochkind1
> In a distributed solution where each node is a provider themselves, each
> additional user could potentially increase the performance of the system.
> The distributed alternative scales much more gracefully without running into
> the bottleneck of needing more cash to buy more machines/storage/bandwidth.

Wouldn't each additional user have to potentially buy more hardware to
increase the performance/capacity? There's still a cost, there's still a need
to buy more capacity to increase capacity -- you've just "distributed" the
cost to the (organizational) nodes. Which, okay, can be useful sometimes, but
clearly there's a huge market of people who would rather pay someone else to
take care of it, than spend that same money (or likely more) on being a
"provider themselves".

~~~
menzoic
@jrochkind1

The type of distributed storage system that I'm speaking of relies on each
node/user being a provider themselves.

For an example, check out [http://storj.io/](http://storj.io/)

~~~
jrochkind1
I understand that, and I believe what I said holds.

~~~
menzoic
> clearly there's a huge market of people who would rather pay someone else to
> take care of it, than spend that same money (or likely more) on being a
> "provider themselves".

There's a point that you might be missing. Similar to how you have people
purchasing and maintaining powerful machines to mine bitcoin, you also have
people using their spare HDD space or potentially buying dedicated storage to
support the "distributed dropbox" system. The reason is that there is an
incentive for doing so. You get paid for the storage/resources you provide.
This adds to its scalability because any addition to the system is not only
paid for but profitable.

------
rco8786
1) Dropbox is distributed.

2) This article doesn't actually make any argument about why a centralized
system can scale as well as a distributed one.

~~~
mad44
Related to 2), the answer is hierarchies. Related to 1), the linked article
writes: (For those who want to nitpick, I know Dropbox is not fully
centralized; it uses AWS S3 for storage and Dropbox-company servers for
metadata management. Also, it employs data parallelism in the backend for
scalability, but, on the spectrum, it is closer to a centralized architecture
than a fully decentralized one.)

The point is to compare a more centralized architecture to a more/fully
decentralized architecture.

------
rdtsc
> You can employ Paxos to replicate the centralized server. In contrast, it is
> often much harder to design and add fault-tolerance to a distributed system.

OK, am I missing anything? So we are employing Paxos to replicate the
centralized server. Are we replicating it to itself? Because if we are not, we
have got ourselves a "distributed" system.
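
A quick way to see this is to count replicas. The sketch below assumes a majority-quorum protocol in the style of Paxos and crash failures only; the function names are made up for the example.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Replicas that can crash while a majority is still reachable."""
    return n - majority(n)


for n in (1, 2, 3, 5):
    print(n, majority(n), tolerated_failures(n))
```

With a single replica the tolerated failure count is zero, so replicating "the centralized server" for fault tolerance already implies several machines that must agree with each other.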

------
cbhl
My hunch is that the student is frustrated because Dropbox sync speeds are
sometimes less than the network line speed (maybe due to the agent having to
scan the filesystem to look for changes, or because the agent is syncing many
small files, or because Dropbox or the ISP or anyone in the middle is
throttling the connection). This is particularly noticeable if you sync a new
computer on a different network from the rest of your Dropbox machines (say, an
EC2 VPS, or on a university network away from home) because when you're on the
same network, LAN sync is often used for a large portion of the initial sync.

I suspect the student thinks that distributing his/her files among his/her
friends and/or multiple services (BitTorrent-style) will allow him/her to
increase throughput -- however, I suspect it will merely increase complexity
(and possibly also cost) without actually making syncing/back-up faster.

------
contingencies
Dropbox is centralized at the organizational, jurisdictional and other levels
whilst technically it may employ distributed resources. It's not incorrect to
point to this centralization as a risk, both in terms of availability and
scalability.

This is really an industry-wide problem begging for a neat solution. Software
eats middle management! (Devops => Devmangops? Mmm... mangoes...) Perhaps the
world needs an open source tool in the organizational management/risk space
that models business-level risk based upon commercial as well as technical
infrastructure.

Perhaps the best model for developing such a capacity is a generic exchange
protocol with plugins for risk management? My initial brainstorming is at
[http://www.ifex-project.org/our-proposals/ifex](http://www.ifex-project.org/our-proposals/ifex)

------
aba_sababa
I think the actual confusion is about a centralized distributed system vs a
peer-to-peer distributed system, which is probably what the (still totally
wrong) PhD student meant.

------
Illniyar
It's not really clear to me what part of Dropbox isn't distributed (in the
sense that it's hosted on multiple computers). The data is distributed and the
processing is distributed. Do they mean it has a central controller/router or
something of that kind?

------
setori88
Chatty software able to synchronize state over the open internet using
declarative concurrency is a distributed system. A high performance cluster
running something like Erlang's distributed message-passing concurrency is a
distributed system. A single program written with the complexity of shared
state concurrency executing over multiple cores is a distributed system. The
concept of concurrency is vital for this, particularly what type of
concurrency is used. When this person talks about distribution, what kind of
concurrency is he referring to? I'd like to see this professor reimplement
Dropbox for sequential execution on a single CPU to serve the world (you can
only use shared state, or any other form of concurrency if you do it on the
same CPU). This centralized system then should be fault tolerant. Which it
absolutely will not be, as you need at least two machines for fault tolerance.
This article was a waste of time.

------
zmanian
Distributed is often much more difficult to scale than centralized, especially
because you need n^2 messages for the system to reach consensus.

Distributed tends to produce higher availability than centralized systems and
often that is worth the cost.
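
As a rough illustration of the n^2 point, the sketch below just counts messages for one round of communication. It assumes two simplified patterns (every node messages every other node, versus one leader messaging each follower and getting a reply); real consensus protocols vary, and the function names are made up.

```python
def all_to_all_messages(n: int) -> int:
    """Messages in one round where each of n nodes sends to every other node."""
    return n * (n - 1)


def leader_messages(n: int) -> int:
    """Messages in one round where a leader sends to n - 1 followers and each replies."""
    return 2 * (n - 1)


for n in (3, 10, 100):
    print(n, all_to_all_messages(n), leader_messages(n))
```

At 100 nodes that is 9900 messages for all-to-all against 198 through a leader, which is the quadratic versus linear gap, paid for by making the leader a potential bottleneck.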

------
alexnewman
Yeah, AWS is not Dynamo-based. Dropbox uses a bunch of MySQL and S3. It is
hugely distributed and they have to spend a lot of human resources keeping it
up.

~~~
mryan
Actually S3 is internally based on Dynamo, although this is not exposed to the
end user. Straight from the horse's mouth:

[http://www.allthingsdistributed.com/2007/10/amazons_dynamo.h...](http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html)

------
tzakrajs
I sense much confusion in the force...

