
System design implications of declining network latency vs. disk latency - liotier
http://danluu.com/infinite-disk/
======
chollida1
> Since 1995, we’ve seen an increase in datacenter NIC speeds go from 10Mb to
> 40Gb, with 50Gb and 100Gb just around the corner.

This has been a huge factor for our firm.

Trading systems used to be, by necessity, monolithic in design. Your risk
system, feed handlers, portfolio management, strategies, execution management,
and drop copy/logging systems were all built into one process due to the poor
speed of interprocess communication, even on the same VLAN. You just couldn't
get the required performance and certainty with a distributed system. I mean,
it's nice that your distributed database is eventually consistent over the
course of 500 milliseconds, but that's a couple of orders of magnitude too
slow.

Now, networks are sooooooo fast and bandwidth is so plentiful that we can
afford to split many of the major subsystems onto different machines.

It's important to note that we don't split these subsystems up for redundancy;
I mean, we aren't Netflix, we can't pull out major subsystems and keep trading.
If one goes down we intentionally take down the entire system. You can laugh
at me here if you'd like....

Rather, we split the systems up so that we can test and upgrade them more
easily. Before the split, each upgrade was a major change that might have
happened 2x a month; now we can upgrade individual systems 2-3 times a week.
Again, you may laugh now, but I put a very big premium on stability and
certainty.

I remember talking to someone at Google who told me that their low latency RPC
framework was considered to be one of their key technological advantages.
Having made the move over the course of a couple of years, I'd be inclined to
agree, a robust and fast RPC framework is an incredibly freeing piece of
technology.
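
To make the shape of that split concrete, here's a minimal sketch using
Python's stdlib xmlrpc; the service name and position limit are hypothetical,
and a real trading shop would use a much faster binary RPC framework over a
fast NIC.

    # A toy "risk check" subsystem exposed over RPC, using Python's stdlib
    # xmlrpc purely to illustrate the decomposition. (A latency-sensitive
    # shop would use a binary framework such as gRPC over a fast NIC.)
    # The service name and position limit here are hypothetical.
    from xmlrpc.server import SimpleXMLRPCServer

    POSITION_LIMIT = 10_000  # hypothetical per-order quantity limit

    def check_order(symbol: str, qty: int) -> bool:
        """Return True if the order passes the (toy) risk check."""
        return 0 < qty <= POSITION_LIMIT

    server = SimpleXMLRPCServer(("127.0.0.1", 8000), logRequests=False)
    server.register_function(check_order, "check_order")
    server.serve_forever()

The execution side then calls it like any other subsystem, and each piece can
be tested and upgraded on its own:

    # In another process (or on another machine): call the risk service.
    from xmlrpc.client import ServerProxy

    risk = ServerProxy("http://127.0.0.1:8000")
    if risk.check_order("XYZ", 500):
        print("order approved, routing to venue")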

~~~
footiethrowa
> I remember talking to someone at Google who told me that their low latency
> RPC framework was considered to be one of their key technological
> advantages. Having made the move over the course of a couple of years, I'd
> be inclined to agree, a robust and fast RPC framework is an incredibly
> freeing piece of technology.

Have you found Google's RPC framework, or something else, to be robust and
fast? If so, would you mind sharing what it is? Thanks!

~~~
chollida1
We've experimented with Protocol Buffers, Thrift, and Cap'n Proto.

We settled on protobuf, but Cap'n Proto was super fast. I just didn't have the
time to play around with it, or the fortitude to put it into production with
it being so new.

I should point out that's my own hang-up and not meant to disparage Cap'n
Proto at all; call it the web 2.0 version of "Nobody ever got fired for buying
IBM".
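
For a sense of what the protobuf round trip looks like, here's a minimal
Python sketch; the `order.proto` message and the generated `order_pb2` module
are hypothetical, assumed to have been produced by `protoc`.

    # Assumes a hypothetical order.proto compiled with:
    #   protoc --python_out=. order.proto
    # containing something like:
    #   message Order { string symbol = 1; int64 qty = 2; double price = 3; }
    import order_pb2

    order = order_pb2.Order(symbol="XYZ", qty=500, price=101.25)
    wire = order.SerializeToString()   # compact binary encoding for the wire
    parsed = order_pb2.Order()
    parsed.ParseFromString(wire)       # decode on the receiving subsystem
    assert parsed.symbol == "XYZ"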

~~~
kanslanf
I really like what I've read of CapnProto and would try it for my next
distributed system project.

~~~
corysama
CapnProto has good competition from Google's Flatbuffers. Which one is better
comes down to the specifics of your data.

[https://google.github.io/flatbuffers/](https://google.github.io/flatbuffers/)

[https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-...](https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html)

[https://news.ycombinator.com/item?id=7901991](https://news.ycombinator.com/item?id=7901991)

------
elipsey
Problem: most regular users have horrible asymmetric internet connections, and
it certainly has not improved at 50x/decade as stated. As I recall, I had
2Mbit/512kbit cable in 2000, and now I have 10Mbit/1Mbit.

"Youtube, Netflix, and a lot of other services put a very large number of
boxes close to consumers to provide high-bandwidth low-latency connections. A
side effect of this is that any company that owns one of these services has
the capability of providing consumers with infinite disk that’s only slightly
slower than normal disk.[...]The most common counter argument to disaggregated
disk, both inside and outside of the datacenter, is bandwidth costs. But
bandwidth costs have been declining exponentially for decades and continue to
do so."

I have, at best, 1 Mbit upstream at my house. The author states:

"A typical consumer 3TB HD has an average throughput of 155MB/s, making the
time to read the entire drive 3e12 / 155e6 seconds = 1.9e4 seconds = 5 hours
and 22 minutes."

but my upstream connection is three orders of magnitude slower (in spite of
costing the same amount). How am I supposed to get my data to the provider?
Just emailing an attached photo is already painful...
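
A quick back-of-the-envelope check of that gap, using the article's 3 TB /
155 MB/s figures and a 1 Mbit/s upstream:

    # Time to push a 3 TB drive's contents over a 1 Mbit/s upstream link,
    # versus reading it locally at 155 MB/s (figures from the article/comment).
    drive_bytes = 3e12
    local_read_bps = 155e6 * 8          # 155 MB/s in bits per second
    upstream_bps = 1e6                  # 1 Mbit/s consumer upload

    local_hours = drive_bytes * 8 / local_read_bps / 3600
    upload_days = drive_bytes * 8 / upstream_bps / 86400

    print(f"local read: {local_hours:.1f} hours")   # ~5.4 hours
    print(f"upload:     {upload_days:.0f} days")    # ~278 days, roughly 9 months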

These are really interesting ideas for business and the data center, but I
think the author is out of touch with how bad consumer internet service is for
us regular chumps.

~~~
hga
" _Never underestimate the bandwidth of a station wagon full of tapes hurtling
down the highway._ " \- Andy Tanenbaum

Amazon has formalized their Import/Export service with their own hardware, the
"Snowball":
[https://aws.amazon.com/importexport/](https://aws.amazon.com/importexport/)

$200 for 10 days in your possession, up to 50 TB, no extra charge if you're
importing to Amazon. Useful to get your media collection backed up in the
cloud, if you don't want it to take a few years, as I'm currently planning
over my AT&T DSL soda straw. The latency, though, is slower than a round trip
to the New Horizons space probe, which is currently ~9 hours.
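
Roughly how the station-wagon math works out here; the 50 TB and 10-day
figures are from above, while the 768 kbit/s upstream is an assumed DSL rate:

    # Effective bandwidth of shipping 50 TB on a Snowball vs. uploading over
    # a slow DSL line. 50 TB / 10 days is from the comment above; 768 kbit/s
    # upstream is an assumed DSL rate.
    snowball_bytes = 50e12
    possession_secs = 10 * 86400
    dsl_upstream_bps = 768e3

    snowball_bps = snowball_bytes * 8 / possession_secs
    dsl_years = snowball_bytes * 8 / dsl_upstream_bps / (365 * 86400)

    print(f"Snowball effective rate: {snowball_bps / 1e6:.0f} Mbit/s")  # ~463 Mbit/s
    print(f"Same data over DSL: {dsl_years:.1f} years")                 # ~16.5 years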

------
AnthonyMouse
You have to be careful with articles like this. Most system design people work
at large scale companies because they're the ones who can afford them. But
it's a mistake to think that because something makes sense for Google or
Amazon, it also makes sense for your small/medium business.

For example, cloud providers pay less for disk than you. They may even _charge
you less_ than it would cost for you to buy it yourself. But how much storage
do you actually use? For most companies your entire storage budget is probably
a rounding error in your cost of doing business. Amdahl's law says if you're
looking to save money there you're wasting your time, particularly if it comes
at the cost of anything important.

Which it kind of does. Even putting aside the entire privacy debate, cloud
storage creates third party dependencies. Put the only copy of your data on
cloud storage and now you're dependent not only on the cloud provider, but
also their ISP and your ISP. If any of them has downtime, so do you. If the
cloud provider loses all your data, they say "oops" and have bad PR for a
while; you go out of business.

And as the article mentions, these numbers are idealized. It is possible to
build a network that has a 2ms latency to cloud storage, but you have little
control over that. I currently have a 28ms ping to youtube.com. And it's not
as if you can measure the latency and then make assumptions as if it will
never change. If your ISP gets into a peering dispute you may go from 5ms
today to 50ms tomorrow. If your ISP oversubscribes its network you may start
seeing three-digit round-trip latencies whenever it exceeds capacity. And it's
not as if you have much choice in ISPs if this happens; if you have any choice
at all, the alternative could be as bad or worse.

But probably the most important thing to understand is that the question is
not even relevant to most people. With the amount of memory supported by even
fairly old servers, your entire working set may fit in memory and make disk
access latency quite irrelevant.

------
pjc50
I'd look at this the other way round: everyone should now be aware that a
computer is a series of processing units connected by network links whose
latency is many times that of a register or cache access.

Some of those network links are labelled "PCIe" or "SATA", and some are
labelled "Ethernet" or "DisplayPort", but they're all high-speed serial links.
A PCIe 3.0 lane is only 8 Gbit/s, comparable to 10 Gbit Ethernet.

This means that your computer "backplane" design can be de-constrained; you
don't have to draw a hard distinction between elements in the same case and
those in the case of the machine next to you in the rack.

A rotating disk is roughly able to saturate a 1Gb ethernet link. Maybe we
should just go full SAN and put Ethernet connectors on the drives.
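
A quick sanity check on those figures, using the article's 155 MB/s disk
number and PCIe 3.0's 128b/130b encoding:

    # Compare a rotating disk's sequential throughput with common link speeds.
    disk_bps = 155e6 * 8               # ~155 MB/s sequential, per the article
    gige_bps = 1e9                     # 1 Gbit Ethernet
    pcie3_lane_bps = 8e9 * 128 / 130   # PCIe 3.0: 8 GT/s with 128b/130b encoding

    print(f"disk:     {disk_bps / 1e9:.2f} Gbit/s")        # ~1.24 Gbit/s
    print(f"1GbE:     {gige_bps / 1e9:.2f} Gbit/s")        # 1.00 Gbit/s
    print(f"PCIe3 x1: {pcie3_lane_bps / 1e9:.2f} Gbit/s")  # ~7.88 Gbit/s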

~~~
jo909
Very true. Most servers are multiple processors (sockets) interconnected via a
high speed network (QPI, HT), and each processor has links to its own memory
and some PCIe lanes. You could almost see that as multiple distinct computers
on one board and in one case.

Hard disks with direct Ethernet connections exist (e.g. Seagate Kinetic).

~~~
pjc50
Seagate Kinetic looks like an excellent idea, although horribly expensive.

------
ChuckMcM
Back in 2002 when I was with NetApp we would show customers that NFS (Network
File System) storage was faster than direct-attached storage. Oddly it took a
while to convince people of that. But it was very successful running Oracle
over NFS in their "cloud" data center in Texas.

That said, as network speeds surpass what were previously internal system
interconnect speeds, it does make new design configurations possible. However,
the lack of predictability of those interconnects makes things a bit more
complex. If you're running stuff in AWS, for example, and you are disk bound,
it's probably worth your while to benchmark using a multidisk system running
ZFS as a "local" NAS point for your disk storage. You might find that you can
run a dozen or more instances off that single storage box for less money than
a dozen machines with chunks of disk.
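
As a sketch of how you might get those numbers: point a crude random-read
latency probe at a file on instance-local disk and at the same-sized file over
an NFS mount of the shared ZFS box. The paths and counts below are
placeholders.

    # Crude random-read latency probe. Run it against a file on instance-local
    # storage and against the same-sized file on an NFS mount of the shared
    # ZFS box, then compare the distributions. Paths/sizes are placeholders,
    # and the OS page cache will still color the results.
    import os, random, time

    def probe(path, reads=1000, block=4096):
        size = os.path.getsize(path)
        latencies = []
        with open(path, "rb", buffering=0) as f:   # unbuffered at the Python level
            for _ in range(reads):
                f.seek(random.randrange(0, size - block))
                t0 = time.perf_counter()
                f.read(block)
                latencies.append(time.perf_counter() - t0)
        latencies.sort()
        return latencies[len(latencies) // 2], latencies[int(len(latencies) * 0.99)]

    for path in ("/local/testfile", "/mnt/zfs-nas/testfile"):   # hypothetical paths
        p50, p99 = probe(path)
        print(f"{path}: p50={p50*1e6:.0f}us p99={p99*1e6:.0f}us")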

~~~
e12e
Is that another way of saying that the NetApp boxes had battery-backed RAM
that'd promise to really, for sure, absolutely, persist data written to said
RAM? (And, also, that two boxes with twice the RAM have, well, twice the RAM)?

------
djfergus
"But bandwidth costs have been declining exponentially for decades and
continue to do so. "

I wonder how pronounced this effect will be in places where bandwidth costs
are not following the same downward trend as in North America. e.g. Australia:

[https://blog.cloudflare.com/the-relative-cost-of-bandwidth-a...](https://blog.cloudflare.com/the-relative-cost-of-bandwidth-around-the-world/)

e.g. Will data centers be designed differently?

~~~
mcbridematt
There is plenty of dark fibre around in Australia (driven by a need to bypass
monopoly Telcos) that the bandwidth costs cited by Cloudflare have no real
effect on datacenter design.

(The 'true' cost of transit in AU, ignoring Telstra is closer to the Asia
region average of $30 IIRC)

I think higher energy and real estate costs in Australia have more of a
bearing - it is quite possibly cheaper to 'import' bits into Australia than to
store/process/etc locally (any data sovereignty issues notwithstanding)

~~~
CountSessine
If it's dark fibre, how is it having any effect at all on transit cost?

~~~
mcbridematt
I didn't say it did. It was a side comment that the situation was not as dire
as Cloudflare said - if you can avoid buying transit directly from one
specific provider.

~~~
CountSessine
#include <pedantic.h>

Yes, you did:

A: _There is plenty of dark fibre around in Australia (driven by a need to
bypass monopoly Telcos)_

B: _the bandwidth costs cited by Cloudflare have no real effect on datacenter
design_

and

C: _(The 'true' cost of transit in AU, ignoring Telstra is closer to the Asia
region average of $30 IIRC)_

Based on simple sentential logic, you are saying A->B and we're provided with
C as additional information. Unless your word _that_ isn't being used as an
implicative connective, which would be silly.

How does fallow, unused, dark fibre have any effect on transit cost? Is there
a 'transit cost' options market or something?

------
mindslight
I understand where the author is coming from - high speed serial transceivers
really have undermined our traditional concept of a "machine". And in the
uniform administrative domain of a data center, hardware refactorings are
straightforward. So it's pretty nifty to think about these changes to
fundamental constants (although the continual rewriting of software systems
because they're not abstracted enough to treat it all as one cache hierarchy
is disheartening).

But when moving across administrative boundaries, especially to a more ad-hoc
environment, serious problems crop up. Some assumptions the author is
handwaving away:

1. That 2ms to Youtube is to a single specific Youtube _cache_. If you're
about to get on a plane, are you supposed to inform Google so that they can
move your files across the country?

2\. "The network is reliable" \- what do you do while you're _on_ that plane?
DSL is still our national infrastructure policy so in many places that is the
best that is available, especially if you want a choice of providers to avoid
anti-consumer behavior.

2a. Or wifi. What is the latency of modern best-of-class consumer gear? I'm
getting 16ms with a simple TL-WR703N. Never mind mobile...

3. Access patterns being so long-tail points to the continuing utility of
end-user SSDs as non-volatile cache. The devices that would be most likely to
drop them (cost, space, power) are generally on the worst network links.

4. The datacenter containing the notion of what data is authoritative
(otherwise you wouldn't need to hit it for every access as outlined). What is
the need for this?

5. The overhead of implementing oblivious transfer (or the alternative of
being surveilled and having decreased economic power and freedom).

When you think in terms of datacenters, it's tempting to think that the next
advancement is end-user computers becoming part of that system, but that is a
skewed perspective. The pendulum swings back and forth, and I think we're
pretty close to Peak Butt. The next deluge will be user-centric software that
utilizes the butt for its strengths, but respects privacy and tracks the
authoritative copy on user owned machines.

------
dave_ops
When I internalized this trend ~3 years ago, it's what spurred my interest in
distributed storage systems across machines, racks, and datacenters that are
only memory-backed.

~~~
mprovost
Any examples of such a system?

~~~
nostrademons
All the major webscale systems - Google, Facebook, Microsoft's cloud services.

At a lower scale, there are also services like Hacker News, Plenty of Fish,
and Mailinator that serve out of RAM on a single box.

------
chubot
Besides speed, there is another difference between local and remote storage:
remote storage is on a different failure domain. A machine is still a single
failure domain since the storage/RAM/CPU share a power supply / battery
backup.

If you change an application to use remote storage rather than local storage,
you may not get severe performance degradation in some cases, but your
application will experience more failure modes than it did before.

------
w_t_payne
This all seems to be driven by latency figures, predicated on the application
accessing data in a random pattern.

What if you are doing some data-intensive computations where the data access
pattern is fixed and computation parameters change?

This particular application wouldn't care about latency, since you can
linearize or otherwise arrange your data on disk so it is in the optimal
position for your application. Instead, throughput would be the limiting
factor.

Has anybody done a similar analysis on this basis?
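
As a rough sketch of that trade-off, using the article's 155 MB/s sequential
figure and an assumed ~8 ms average seek time for a commodity drive:

    # Sequential scan vs. random access for a fixed access pattern.
    # 155 MB/s sequential is the article's figure; 8 ms/seek and 4 KiB
    # records are assumptions for illustration.
    data_bytes = 1e12                   # 1 TB working set
    seq_bps = 155e6                     # sequential throughput, bytes/s
    seek_s = 8e-3                       # average seek + rotational latency
    record = 4096

    seq_hours = data_bytes / seq_bps / 3600
    rand_hours = (data_bytes / record) * seek_s / 3600

    print(f"linearized scan: {seq_hours:.1f} hours")    # ~1.8 hours
    print(f"random 4K reads: {rand_hours:.0f} hours")   # ~540 hours

Once the data is laid out sequentially, latency drops out of the picture and
throughput is the only thing left to buy.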

------
MichaelGG
Somewhat related: Is there any solid open source (or commercial?) software
that can just join several servers into a redundant, SPOF-less, RAM-based
block storage device? Perhaps exposed via iSCSI or something so it'd work for
any client?

Idea would be that I could set up two separate racks with UPSes, specify a
safety factor, then have super-fast shared disk/block available. (DB
clustering or whatever.)

~~~
mdk754
I could be wrong and haven't explored the idea myself, but couldn't you
accomplish this with something like Ceph or ScaleIO and tmpfs/ramdisk for your
block devices?

------
amelius
The problem with network latency is that it is bound by the speed of light.
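
For a sense of scale (the New York-London distance is an example; light in
fiber travels at roughly c divided by the glass's refractive index of about
1.47):

    # Lower bound on round-trip latency from the speed of light in fiber.
    # Distance is the approximate New York-London great-circle path.
    c = 299_792.458            # km/s in vacuum
    fiber_speed = c / 1.47     # ~204,000 km/s in glass fiber
    one_way_km = 5_570         # ~NYC to London, great circle

    min_rtt_ms = 2 * one_way_km / fiber_speed * 1000
    print(f"theoretical minimum RTT: {min_rtt_ms:.0f} ms")  # ~55 ms; real routes are longer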

