
Non-volatile Storage: CPUs no longer more performant than I/O devices - matt_d
https://queue.acm.org/detail.cfm?id=2874238
======
Animats
I've made that point before on YC.[1] We need to view fast storage as
something other than a disk accessed through the OS, and other than slow RAM
accessed as raw memory. Access through the OS is too slow, and access as raw
memory is too risky. What's probably needed is something like a GPU sitting
between the CPU and the fast persistent storage. Call this an SPU, or "storage
processing unit."

What would such a device do? Manage indices, do data transformations, and
protect data. Database-type indices would be maintained by the SPU, so
applications couldn't mess up the database structure. The SPU would manage
locking, so that many non-conflicting requests could be serviced
simultaneously. The SPU would have tools for doing searches. Regular
expression hardware (this exists) would be useful. Record protection
management (app can read/write part but not all of a record) would allow
implementation of database-type data access rules. Encryption and compression
might be provided in the SPU.
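
To make that concrete, here is a minimal sketch of what a user-facing SPU interface might look like. Everything here is hypothetical -- the names, types, and division of labor are illustrative only:

```c
/* Hypothetical command interface for an SPU ("storage processing unit").
 * All names and structures here are illustrative, not a real device API. */
#include <stdint.h>
#include <stddef.h>

typedef uint64_t spu_index_t;   /* handle to an SPU-maintained index   */
typedef uint64_t spu_txn_t;     /* handle to an SPU-managed lock scope */

/* The SPU, not the application, owns the index structure, so a buggy
 * app can't corrupt the database structure. */
int spu_index_put(spu_index_t idx, const void *key, size_t klen,
                  const void *val, size_t vlen);
int spu_index_get(spu_index_t idx, const void *key, size_t klen,
                  void *val_out, size_t *vlen_inout);

/* Locking is managed device-side, so many non-conflicting requests
 * can be serviced simultaneously. */
spu_txn_t spu_lock_records(spu_index_t idx, const void *key_lo,
                           const void *key_hi, size_t klen);
int spu_unlock(spu_txn_t txn);

/* Search offload, e.g. backed by regular-expression hardware. */
int spu_scan_regex(spu_index_t idx, const char *pattern,
                   void (*on_match)(const void *rec, size_t len));

/* Record protection: the app may read/write part of a record
 * but not all of it. */
int spu_set_record_acl(spu_index_t idx, uint32_t field_id, uint32_t perms);
```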

There have been smart disk controllers before, but they haven't been that
useful, since they couldn't make the disk go any faster. Now, it's time to
look at that layer again. Some of the technology can be borrowed from GPUs,
but existing GPU architecture isn't quite right for the job. An SPU will be
doing many unrelated tasks simultaneously. GPUs usually aren't used that way.

[1]
[https://news.ycombinator.com/item?id=9964319](https://news.ycombinator.com/item?id=9964319)

~~~
mikehollinger
Being up front: this is what I work on for IBM Systems. A buddy wrote this
blog
([https://www.ibm.com/developerworks/community/blogs/fe313521-...](https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/power8_capi_flash_in_memory_expansion_to_speed_data_access?lang=en))
with a little more info.

What we have is an IO offload accelerator that knows how to drive high-
bandwidth IOs to some external storage device. A user app doesn't interact
with the device directly; it makes shared library calls to read or write data
from a particular buffer, and the accelerator (because it's cache coherent)
can read/write the virtual address space of the user-space program to satisfy
the request as needed. This means the IOs bypass the entire OS driver stack,
since everything is a shared library call from user space.

So yep! That exists. :-) There are other classes of accelerators out there too
(and more coming in the future). Adding additional functions like compression
or some form of indexing or search is stuff that we've talked about.

(edit) - [https://github.com/open-power/capiflash](https://github.com/open-power/capiflash) has the code for the shared libs, the APIs, and some examples.
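
To give a feel for the call pattern (the function names below are illustrative stand-ins, not the exact capiflash API -- see the repo above for the real one), the whole round trip is ordinary user-space code:

```c
/* Sketch of the user-space call pattern described above. Function names
 * are hypothetical. The point: no syscall per I/O; the library drives
 * the cache-coherent accelerator directly, and the accelerator DMAs
 * into the app's own virtual memory. */
#include <stdlib.h>

int  chunk_open(const char *dev);                          /* hypothetical */
int  chunk_read(int chunk, long lba, void *buf, int n);    /* hypothetical */
long chunk_aread(int chunk, long lba, void *buf, int n);   /* hypothetical */
int  chunk_wait(int chunk, long tag);                      /* hypothetical */
void chunk_close(int chunk);                               /* hypothetical */

int main(void) {
    int chunk = chunk_open("/dev/cxl/afu0.0");  /* attach to accelerator */
    void *buf = aligned_alloc(4096, 4096);

    chunk_read(chunk, 42, buf, 1);         /* synchronous: read block 42 */

    long tag = chunk_aread(chunk, 43, buf, 1);  /* async: queue now...   */
    chunk_wait(chunk, tag);                     /* ...reap completion    */

    chunk_close(chunk);
    free(buf);
    return 0;
}
```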

~~~
matt_d
Interesting!

Incidentally (since this may be somewhat related), I'm wondering, what are
your thoughts on the Persistent Memory Manager approach, as in the following:

Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu: "A
Case for Efficient Hardware/Software Cooperative Management of Storage and
Memory." Workshop on Energy-Efficient Design, 2013.

Context: "emerging high-performance NVM technologies enable a renewed focus on
the unification of storage and memory: a hardware-accelerated single-level
store, or persistent memory, which exposes a large, persistent virtual address
space supported by hardware-accelerated management of heterogeneous storage
and memory devices. The implications of such an interface for system
efficiency are immense: A persistent memory can provide a unified load/store-
like interface to access all data in a system without the overhead of
software-managed metadata storage and retrieval and with hardware-assisted
data persistence guarantees."

The stated goals/benefits include eliminating operating system calls for file
operations, eliminating file system operations, and efficient data mapping.

Paper:
[http://justinmeza.com/bin/meza_weed13.pdf](http://justinmeza.com/bin/meza_weed13.pdf)

Presentation:
[https://users.ece.cmu.edu/~omutlu/pub/mutlu_weed13_talk.pdf](https://users.ece.cmu.edu/~omutlu/pub/mutlu_weed13_talk.pdf)

~~~
Animats
One giant flat address space is not the answer. Hardware people tend to come
up with approaches like that because flat address spaces and caching are well
understood hardware. It's the same thinking that leads to "storing into device
registers" as an approach to I/O control, even when the interface is really
packets over a serial cable as in FireWire or USB or PCI Express.

File systems and databases are useful abstractions, from an ease of use,
security, and robustness perspective. The challenge is to make them go faster.
Pushing the machinery behind them out to special-purpose hardware can do that.

The straightforward thing to do first is to take some FPGA part and use it
to implement a large key/value store using non-volatile solid state memory.
That's been done at Stanford[1], Berkeley[2], and MIT[3], and was suggested on
YC about six years ago.[4] One could go further, and implement more of an SQL
database back end. It's an interesting data structure problem; the optimal
data structures are different when you don't have to wait for disk rotation,
but do need persistence and reliability.

[1]
[http://csl.stanford.edu/~christos/publications/2014.hwkvs.nv...](http://csl.stanford.edu/~christos/publications/2014.hwkvs.nvmw.slides.pdf)
[2]
[https://www.cs.berkeley.edu/~kubitron/courses/cs262a-F14/pro...](https://www.cs.berkeley.edu/~kubitron/courses/cs262a-F14/projects/reports/project13_report.pdf)
[3]
[https://dspace.mit.edu/handle/1721.1/91829](https://dspace.mit.edu/handle/1721.1/91829)
[4]
[https://news.ycombinator.com/item?id=1628550](https://news.ycombinator.com/item?id=1628550)

~~~
dunkelheit
OK, I find it easier to follow these ideas when thinking about how loads/stores
to volatile memory are organized. Memory is not accessed via a syscall.
Instead the OS sets up some data structures in the MMU and lets the
application run. A fault happens whenever control must be transferred back to
the OS.

Going back to non-volatile memory the question is what kind of abstraction
should be implemented in hardware? Presumably something simple that the OS and
applications can then use to implement higher level abstractions like file
systems and databases. Pushing parts of an SQL database engine into the
hardware does not intuitively seem like the right solution.
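
That division of labor already exists for persistent memory in prototype form: map once through the OS, then use plain loads and stores. A minimal sketch using standard POSIX calls (the path is hypothetical):

```c
/* Minimal sketch: one syscall to set up the mapping, then ordinary
 * loads and stores go straight to the (persistent) memory; the kernel
 * is only involved again on a fault. Path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/mnt/pmem/data", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    ftruncate(fd, 1 << 20);                        /* 1 MiB region */

    char *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello");          /* plain store; first touch faults once */
    msync(p, 1 << 20, MS_SYNC);  /* explicit persistence point           */

    munmap(p, 1 << 20);
    close(fd);
    return 0;
}
```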

------
ChuckMcM
This is so true; the world has changed greatly and not everyone has gotten the
memo. I saw a really cool device made by Texas Memory Systems which was a "RAM
disk": all RAM with disk backing, and when it lost power it flushed to disk. I
wanted something that worked better as a storage paradigm and
designed/invented a network-accessible memory appliance[1]. Basically, using
ethernet packets you could store 8K integrity-protected chunks right there on
the network. Initially I wanted to use a typical low-power CPU with a bunch of
DRAM attached, but the CPU bottleneck got in the way, so we redesigned/rebuilt
it out of FPGAs so that it had a couple of terabytes of RAID-protected RAM in
an appliance with a very simple network protocol for storing and fetching 8K
blocks out of what was essentially a linear address space. Two of these on
different power subsystems provided all the fault tolerance you needed, and
you could have a terabyte of 'structured' data live from the moment your
computer booted (made for very fast recovery from reboot).

[1]
[https://www.google.com/patents/US8316074](https://www.google.com/patents/US8316074)
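
For flavor, here is a hypothetical wire layout in the spirit of what's described above -- field names and sizes are illustrative, not the actual patented protocol:

```c
/* Hypothetical wire layout for a "store/fetch an 8K block at a linear
 * address" protocol, in the spirit of the appliance described above.
 * Field names and sizes are illustrative only. */
#include <stdint.h>

enum nam_op { NAM_READ = 0, NAM_WRITE = 1 };

struct nam_request {
    uint8_t  op;             /* NAM_READ or NAM_WRITE                  */
    uint64_t block_addr;     /* index into a flat linear address space */
    uint32_t crc32;          /* integrity check over the payload       */
    uint8_t  payload[8192];  /* 8K chunk (unused on reads); needs      */
} __attribute__((packed));   /* jumbo ethernet frames to fit on wire   */
```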

~~~
AndrewKemendo
That is fascinating. Could this reliably reduce the hardware footprint on any
device?

~~~
ChuckMcM
If I understand the question, then yes. Consider the amount of cache memory in
clustered systems, all holding the same stuff in every independent machine.
Using this simply as a victim cache for a block storage device penciled out to
a pretty significant improvement.

It gets even better with 64 bit address spaces and a bit of kernel code to
'fault in' from the device.

------
erichocean
My sense is this is only true today because OS kernels are ridiculously slow
relative to what the hardware can achieve.

Most of my recent designs treat RAM as if it were (what we used to consider)
disk, i.e. all computation and in-process data live in cache exclusively, and
"going to RAM" requires the use of a B-tree-like structure to amortize the
cost.

For example, once you've opened a DRAM page on a normal four-channel Xeon
server, you can read the entire 4KB page in about the same time it takes to
read one byte, switch pages, and then read another byte. (Of course, you can't
read just one byte either, since an entire cache line will be filled, but the
overall point still stands.)
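
The gap is easy to demonstrate. A rough sketch (timing with clock() is crude, but the order-of-magnitude difference between streaming and dependent pointer-chasing comes through):

```c
/* Rough sketch: streaming a big array vs. chasing dependent pointers
 * through it. The chase turns every access into a likely cache miss,
 * which is what makes "treat RAM like a disk" designs pay off. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)             /* 16M elements, far larger than cache */

int main(void) {
    uint32_t *next = malloc(N * sizeof *next);
    for (uint32_t i = 0; i < N; i++) next[i] = i;
    /* Sattolo's algorithm: one random cycle, defeating the prefetcher. */
    for (uint32_t i = N - 1; i > 0; i--) {
        uint32_t j = rand() % i;
        uint32_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    clock_t t0 = clock();
    uint64_t sum = 0;                   /* sequential: prefetch-friendly */
    for (uint32_t i = 0; i < N; i++) sum += next[i];

    clock_t t1 = clock();
    uint32_t p = 0;                     /* dependent chase: miss-bound   */
    for (uint32_t i = 0; i < N; i++) p = next[p];

    clock_t t2 = clock();
    printf("sequential %.2fs, chase %.2fs (%llu %u)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC,
           (unsigned long long)sum, p);
    free(next);
    return 0;
}
```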

The situation we're in today with RAM is pretty much the identical situation
with the disks of yore. Anyway…interesting article nonetheless.

~~~
thescriptkiddie
Now _that_ is some interesting thinking. You got a blog post elaborating on
this?

~~~
hadagribble
Not sure exactly what OP is referring to, but CSS-trees [1] are a classic
example of cache-aware indexing structures that fetch entire pages into cache
and arrange data so that most of the comparisons happen on cached data. In
most cases, they significantly outperform binary trees. Masstree [2] is a more
recent example of this.

[1]
[http://www.vldb.org/conf/1999/P7.pdf](http://www.vldb.org/conf/1999/P7.pdf)

[2]
[https://pdos.csail.mit.edu/papers/masstree:eurosys12.pdf](https://pdos.csail.mit.edu/papers/masstree:eurosys12.pdf)
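
For a feel of the trick, a minimal sketch of a cache-line-sized node in the CSS-tree spirit (the papers' actual layouts are more involved):

```c
/* Sketch of the core CSS-tree idea: pack as many keys as fit into one
 * cache line per node, so each level of the search costs one memory
 * access instead of one per comparison. Layout is implicit (no child
 * pointers), heap-style, as in the paper. */
#include <stdint.h>

#define LINE   64
#define FANOUT (LINE / sizeof(uint32_t))   /* 16 keys per 64B node */

struct css_node {
    uint32_t keys[FANOUT];
} __attribute__((aligned(LINE)));

/* Children of node i live at i*(FANOUT+1)+1 .. i*(FANOUT+1)+FANOUT+1,
 * like a binary heap's array layout, so no pointers are stored. */
static inline uint64_t css_child(uint64_t i, unsigned slot) {
    return i * (FANOUT + 1) + 1 + slot;
}
```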

------
rdtsc
Yeah, per-packet processing at 40Gbps and higher is problematic on regular
kernels, OS stacks, and CPUs. A lot of it really can be cache misses --
hundreds of nanoseconds. The article mentions that too:

> To put these numbers in context, acquiring a single uncontested lock on
> today's systems takes approximately 20ns, while a non-blocking cache
> invalidation can cost up to 100ns, only 25x less than an I/O operation.

It also depends on whether the workload is throughput-sensitive or
latency-sensitive. If it's latency, you can do things like tie processes and
interrupts to cores, isolate those cores, etc. For throughput, you can perhaps
process more than one packet at a time.

Then there are DPDK and even unikernels.
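
The process-pinning half of that is a couple of lines on Linux (interrupt affinity is set separately via /proc/irq/<n>/smp_affinity, and isolcpus= on the kernel command line keeps the scheduler off those cores):

```c
/* Pin the calling process to core 3; pairing this with isolcpus= and
 * the NIC's interrupt affinity keeps both the work and the wakeups on
 * dedicated cores. Core number chosen arbitrarily. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... latency-sensitive work runs here, undisturbed ... */
    return 0;
}
```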

> CPU has responsibilities beyond simply servicing a device—at the very least,
> it must process a request and act as either a source or a sink for the data
> linked to it. In the case of data parallel frameworks such as Hadoop and
> Spark, the CPU [...]

That's why you get more CPUs and explicitly isolate them if you can. But then,
depending on how they share data with other CPUs, there will be invalidated
cache lines, so you'll pay that way as well.

In general, if you run on RHEL / CentOS (a lot of banks, military and
enterprise deployments do), there is this helpful guide as an overview:

[https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide/sect-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-CPU-Configuration_suggestions.html)

~~~
vvanders
Cache miss latency was the first thing that popped into my mind as well when I
saw the title.

It seems like they don't make a clear distinction between latency and
bandwidth. From the little I know about SSDs (I don't claim to be an expert
here), sequential reads are below or on par with high-spindle-speed disks.

A better takeaway would be that sequencing your reads isn't nearly as
important as it used to be. Back in games we'd duplicate data across a DVD so
that we could do "seekfree" loading, where duplicating 5-10MB of data meant
just a single big call to read() and a massive load-time win.

~~~
frankchn
SSDs are still faster than hard disks even with sequential reads. 15,000 rpm
enterprise spinning disks read at about 260 MB/s [1], while NVMe SSDs (like
those in a 2015 MacBook Pro) read at >1300 MB/s [2].

[1]: [http://www.tomshardware.com/charts/enterprise-hdd-charts/-02...](http://www.tomshardware.com/charts/enterprise-hdd-charts/-02-Read-Throughput-Maximum-h2benchw-3.16,3372.html)

[2]: [http://www.computerworld.com/article/2900330/apple-mac/holy-...](http://www.computerworld.com/article/2900330/apple-mac/holy-smoke-the-new-macbook-literally-is-twice-as-fast.html)

~~~
rdtsc
> NVMe SSDs (like those in a 2015 MacBook Pro) reads at >1300 MB/s [2].

No joke. Just got a new MBP for work. Before, I had a spinning disk (well, a
hybrid). I was running some silly benchmarks that I'd run before and had
clocked my disk throughput at about 100MB/s. On the MBP I got 800MB/s. I
thought something was broken (hitting the page cache or some trickery like
that) or that I hadn't compiled things right. But no, I tried other tools,
looked online, and it seemed correct. It really surprised me.

------
emcq
I like the article, but not your title. It implies that this trend of I/O
becoming highly performant has occurred recently, when in fact it has been
observed and studied for quite some time [0, 1]. Even before SSDs, Gigabit
Ethernet was saturating CPUs that needed to do more than DMA a packet, and I'm
sure this trend has continued since. The original title, "Implications of the
Datacenter's Shifting Center", seems more accurate, and the article covers the
existing trends insightfully.

[0]
[http://ucsdnews.ucsd.edu/archive/newsrel/supercomputer/11-09...](http://ucsdnews.ucsd.edu/archive/newsrel/supercomputer/11-09Gordon.asp)

[1]
[http://nvsl.ucsd.edu/index.php?path=pubs](http://nvsl.ucsd.edu/index.php?path=pubs)

------
dogma1138
So, NVDIMMs... Is anyone actually making those except Viking, and is anyone
actually supporting them in servers except SuperMicro?

These are basically DDR3/DDR4 DIMMs with onboard flash and a supercap/battery
pack to provide persistence in case of system reboots and power failures.

They are also a bit odd, as they ignore various system event calls from the
BIOS/UEFI and then have to be specifically managed by software hacks that
create RAM drives and access the memory directly rather than working with OS
virtual memory. Since NVDIMMs are basically treated as system memory by both
the server and the OS, they pretty much only work for very, very boutique
applications.

It's a bit odd that these are presented as the next step in storage evolution
while being effectively an overpriced hack. I've only seen them actually used
in weird server setups like the overclocked, watercooled servers used for HFT,
where they strip out everything possible, even the OS, bypass anything that
adds even a few ns of latency, and don't mind running their own code for
everything from a bastardized TCP stack (not even remotely compliant, but it
works) to their own custom in-memory database.

~~~
wmf
Intel will be pushing them hard starting with Skylake-EP, so some people are
getting themselves ready.

~~~
dogma1138
Did Intel create a new interface for NVDIMMs? Because to work with the ones
Viking makes, you pretty much need to hack your Linux kernel to ensure that it
doesn't use physical memory over a certain address range, and I don't even
know if or how you can use them from Windows-based applications.

~~~
wmf
Yes, Intel basically controls UEFI/ACPI and has been submitting patches to
Linux for a while. [http://pmem.io/](http://pmem.io/)
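
For reference, the pmem.io libraries expose this as plain load/store programming. A minimal sketch using libpmem (API names as currently documented -- exact signatures have shifted between versions, and the path is hypothetical):

```c
/* Minimal sketch using pmem.io's libpmem. Loads and stores go straight
 * to the DIMM; pmem_persist() flushes CPU caches to make a store
 * durable without any syscall. /mnt/pmem/buf is a file on a
 * DAX-mounted filesystem (hypothetical path). */
#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    size_t mapped_len;
    int is_pmem;
    char *p = pmem_map_file("/mnt/pmem/buf", 4096, PMEM_FILE_CREATE,
                            0644, &mapped_len, &is_pmem);
    if (p == NULL) { perror("pmem_map_file"); return 1; }

    strcpy(p, "durable hello");
    if (is_pmem)
        pmem_persist(p, mapped_len);   /* user-space cache flush */
    else
        pmem_msync(p, mapped_len);     /* fall back to msync     */

    pmem_unmap(p, mapped_len);
    return 0;
}
```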

------
zinxq
This trend has been clear for a while. Interestingly, it will put performance
pressure back on programming and languages as they become the "new"
bottleneck.

I'd expect an implicit migration away from slower languages toward faster
ones.

~~~
andrewvc
To some extent, yes, but 'slow' languages usually delegate batch work over
large datasets to optimized libraries.

Even more likely, as I see it, is this contributing to the continuing rise of
tools like Spark, Hadoop, etc. Slow languages will remain popular as
orchestration around these tools.

------
pmehra
Both Tandem DP2 and the IBM Coupling Facility on zSeries Sysplexes worked
exactly the way you envisage an SPU working. Therefore, when we developed
RDMA-attached persistent memory at Tandem in 2002, we put it under the control
of DP2/ADP process pairs. Later, we ported it to HP-UX and InfiniBand RDMA.
There is one paper at IPDPS'04 and several published patents you can look up
on my Google Scholar page. The pmem.io crowd is reinventing some of this
wheel. If any of you work at HPE, you can find much more detailed internal
papers, source code, drivers, firmware, and other stuff that the outside world
cannot get to.

------
marcosdumay
> and the performance of an SCM (hundreds of thousands of I/O operations per
> second) is such that one or more entire many-core CPUs are required to
> saturate it.

So, we are getting a lot of data, but latency is still a killer. (Even more
so, taking into account that this thing has a few pipeline stages inside.)

Anyway, our CPUs are getting distributed nearer to I/O and memory. We are
going to get NUMA machines; everything points to it.

~~~
CyberDildonics
Knights Landing from Intel is already NUMA (I think). I'm not sure if it can
be bought yet, but it should be very close to release.

~~~
thescriptkiddie
Aren't AMD parts already (cache-coherent) NUMA?

~~~
marcosdumay
Well, a bit, but not radical enough to be visible to its software.

------
rsp1984
Please excuse my ignorance on this matter, but will this technology have any
impact on the hierarchy levels below disk (i.e. RAM and CPU caches)? Compared
to register, L1, and L2 access, RAM access is still really slow. Will
non-volatile storage latencies rival or exceed those of standard RAM? From how
I understand the article, it's primarily disk I/O speed that's affected,
correct?

~~~
hadagribble
Yes. Even the persistent memories that attach to the memory bus are currently
quite a bit slower than DRAM (5-7x, from estimates I've seen), and the
difference for PCIe-attached ones is even larger.

I'm not sure what the future holds in terms of latencies for non-volatile
storage but sub-DRAM levels aren't within reach yet.

~~~
matt_d
On a side note, it's interesting to me that emerging memory technologies
currently seem to be mainly focused on addressing the "from-DRAM-to-disk" part
of the memory hierarchy.

That is, as you mentioned, not directly competing with DRAM, and consistently
on the same side of the 1 microsecond dividing line between memory and
storage; as in:

[http://www.rambusblog.com/2015/10/15/mid-when-memory-and-sto...](http://www.rambusblog.com/2015/10/15/mid-when-memory-and-storage-converge/) (note SCM placed between DRAM and SSD)

[http://semiengineering.com/the-memory-and-storage-hierarchy/](http://semiengineering.com/the-memory-and-storage-hierarchy/)

As far as the other side of the line is concerned, I think I've only seen
proposals for hybrid-cache architectures (HCA) -- other than
[http://link.springer.com/chapter/10.1007%2F978-1-4419-9551-3...](http://link.springer.com/chapter/10.1007%2F978-1-4419-9551-3_7)
-- with a hybrid approach (e.g., combining SRAM/eDRAM/STT-RAM/PCRAM) probably
making sense due to latency/endurance/bandwidth trade-offs.

If anything, there seems to be more development on the DRAM interface itself
-- with multiple candidates for the (or a) DDR4 successor, so far involving
Wide I/O (Samsung), Hybrid Memory Cube (Intel, Micron), and High Bandwidth
Memory (SK Hynix, AMD, Nvidia):
[http://www.extremetech.com/computing/197720-beyond-ddr4-unde...](http://www.extremetech.com/computing/197720-beyond-ddr4-understand-the-differences-between-wide-io-hbm-and-hybrid-memory-cube)

(Latency and bandwidth improvements seem promising:
[http://semiengineering.com/which-memory-type-should-you-use/](http://semiengineering.com/which-memory-type-should-you-use/))

One interesting development I've seen involves reducing SRAM's footprint by
moving from a 6T (six-transistor) cell to a 1T (one-transistor) one:
[http://www.eetimes.com/document.asp?doc_id=1328453](http://www.eetimes.com/document.asp?doc_id=1328453)

It's a fairly recent development, though, and it remains to be seen how it is
going to fare.

Other than the above, there doesn't really seem to be much progress around
competing with/improving SRAM. However, this may become increasingly
important, since some of the technological process scaling issues apply to
SRAM, too.

------
teraflop
For anyone else who was momentarily confused: on figure 2, the y-axis scale is
incorrectly labeled "ns" when it should be "ms".

~~~
rasz_pl
Anyone old enough will remember the time hard drivers were correctly marketed
with access time in milliseconds as the main speed indicator. This ended
around 1994( _) when pretty much all the drives reached ~10ms access time.

[https://en.wikipedia.org/wiki/Hard_disk_drive_performance_ch...](https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics)

_I did a quick scan of old computer magazines (infoworld, pc mag etc).

------
peter303
HP was designing a "flat memory" OS based on vast amounts of cheap memristor
memory:
[http://www.technologyreview.com/featuredstory/536786/machine...](http://www.technologyreview.com/featuredstory/536786/machine-dreams/).
But when I googled for this article, I saw the project has been delayed.

~~~
david927
I'm very interested in working on a "flat memory" OS that doesn't use any RAM
or file system but simply registers and a distributed database of key/value
stores.

If you're interested in talking about this more (especially if you're in the
SF Bay Area), my email is in my profile.

------
mozumder
I'm seeing this on a new Postgres database server: my cache miss and cache hit
queries take virtually the same time!

This is with a Skylake Xeon E3-1275, 64GB ECC UDIMM, and Intel 750 PCIe SSD
(probably the fastest setup you can get).

It looks like I have to figure out how to tune Postgres to account for the
fact that disk lookups are nearly free.
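
For what it's worth, the usual planner knobs for this look like the following; the values are illustrative starting points for fast SSDs, not gospel:

```
# postgresql.conf -- telling the planner that random reads are cheap.
# Illustrative starting points for fast SSDs, not gospel.
seq_page_cost = 1.0
random_page_cost = 1.1          # default of 4.0 assumes spinning disks
effective_io_concurrency = 32   # SSDs handle many in-flight requests
effective_cache_size = 48GB     # roughly 3/4 of the 64GB in this box
```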

~~~
anarazel
> I'm seeing this on a new Postgres database server: my cache miss and cache
> hit queries take virtually the same time!

That's likely because it's hitting the OS's page cache.

------
brudgers
Serious but naive question: does a bottleneck curve trending toward CPU
subsystems suggest microkernel-based approaches replacing spinning up virtual
machines as a future trend, due to the possibility of reduced overhead at the
CPU?

tl;dr: Does increasing use of Storage Class Memory imply increasing use of
microkernels?

------
jsolson
The numbers in this are daunting, but I personally believe massively
multi-core systems make the problem a lot less daunting than the article makes
out. Core counts in big servers can get up over 100 per server for Intel (see
Amazon's new EC2 offerings for public evidence of this), and Intel's Xeon Phi
series of processors offers core counts approaching ~300. Going 300-wide takes
the per-request latency budget from microseconds up to the millisecond range.
POWER systems can go even higher.

Moreover, for many workloads that actually leverage this sort of compute, you
can do something horrifying with the new DRAM-addressable persistent storage:
DMA directly from the NIC into block storage. Some (many?) high-performance
network adapters offer the ability to filter packets to distinct Rx queues;
buffers can be posted with addresses in the storage-mapped region, allowing
direct NIC->storage transfer. If you bake more intelligence into the NIC, you
can even do things like Mellanox's NVMe over Fabrics:

[http://www.mellanox.com/blog/2015/04/mangstor-mellanox-show-...](http://www.mellanox.com/blog/2015/04/mangstor-mellanox-show-nvme-over-fabrics-solution-to-reduce-latency-tax/)

This is particularly relevant to the JBOD example.
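
A sketch of the buffer-posting idea, with an entirely made-up NIC API (real implementations would go through a specific vendor's queue/verbs interfaces):

```c
/* Entirely hypothetical NIC API, sketching the idea: post receive
 * buffers whose addresses fall inside a persistent-memory mapping, so
 * packet payloads land in durable storage with no CPU copy. */
#include <stddef.h>
#include <stdint.h>

struct nic_rxq;  /* opaque per-flow receive queue (hypothetical) */

void *pmem_region_map(const char *path, size_t len);       /* maps SCM/NVDIMM */
struct nic_rxq *nic_rxq_for_flow(uint16_t udp_port);       /* flow steering   */
int nic_post_rx(struct nic_rxq *q, void *buf, size_t len); /* hand buf to NIC */

int main(void) {
    /* Carve receive buffers straight out of the storage mapping. */
    char *store = pmem_region_map("/mnt/pmem/log", 1ULL << 30);
    struct nic_rxq *q = nic_rxq_for_flow(9000);

    for (size_t off = 0; off < (1ULL << 30); off += 4096)
        nic_post_rx(q, store + off, 4096);   /* NIC DMAs into storage */

    /* Completions would then just record which offsets hold valid data. */
    return 0;
}
```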

Now, there's the question of what you're actually going to _do_ with all of
that data, but in a lot of cases it's likely a durable, read-mostly cache
that's effectively a materialized view of some (hopefully much lower
write-rate) transactional store (say, product data on Amazon -- detail pages
served up at some absurdly high rate, but with a relatively low mutation
rate).

Other workloads I can think of fall into a category I tend to think of as log
processing -- a high-rate series of streaming writes which are slurped up and
batch processed/reconciled to some (much smaller) state (which of course may
then be exploded back out to large materialized views as above). In these
scenarios, presuming the log entries have low contention over the underlying
state, CPUs like those I called out above are more than up to the task of
streaming over the input and optimistically updating the backing state.

Finally, in terms of real workloads, there is almost always going to be a
bottleneck limiting your ability to fully utilize your resources. Either
you're CPU bound and leaving network bandwidth on the table or you're network
bound and are leaving CPUs/storage devices under-utilized. Massively improved
storage performance local to a node is fantastic in terms of computation you
can do locally, but if each network fabric upgrade costs you 10x what the
previous one did to keep up with the storage/CPU available per-node, you're
going to have a bad time. Amin Vahdat talked a bit about our (Google's)
historical network fabric evolution:
[https://www.youtube.com/watch?v=FaAZAII2x0w](https://www.youtube.com/watch?v=FaAZAII2x0w)

If I were betting on an annoying bottleneck to full resource utilization
coming up in the near future, I'd put my money on network before CPU :)

------
crudbug
There's an increasing industry trend to expose the CPU / GPU / SPU / NPU
directly to applications for more efficient data handling.

------
thrownaway2424
Performant: still not a word.

~~~
pklausler
Seymour Cray never said "performant". Engineers say "fast" or "fast enough".
Marketing types and nontechnical management seem to prefer this neologism. But
it might also be a generational thing.

A new coinage that I noticed in the past year that also grates on my ears:
"learning" as a substitute for "lesson", as in "what were your learnings from
the hackathon?" Anyone else caught this one?

~~~
windowsworkstoo
Past year? Past decade, champ.

~~~
pshc
Maybe in Microsoft-land or certain circles, but anecdotally I've only started
hearing "learnings" this year too.

