
NVMe, the fast future for SSDs - ryan_j_naughton
http://www.pcworld.com/article/2899351/everything-you-need-to-know-about-nvme.html
======
ziedaniel1
The numbers in the article are all wrong. SATA's ceiling is 600 MBps
(megabytes per second), not 600 Gbps (gigabits per second). SAS goes up to 12
Gbps, not 12 GBps; 12 Gbps is the same thing as 1.5 GBps. At least the PCIe
numbers look right.

~~~
ChuckMcM
Yes, they are all wrong.

SATA III is 6 Gbps, which works out to 600 MBps (the capital B indicates
'bytes' versus 'bits'); the encoding is 8b/10b, i.e. 10 bauds per 8-bit byte.

PCIe 1.0 has 2.5 Gbps "lanes" and PCIe 2.0 has 5 Gbps "lanes"; lanes can be
ganged together for additional bandwidth (x1, x2, x4, x8, x16). Both are also
8b/10b encoded, so you divide by 10 to get bytes per second (250 MBps and 500
MBps per lane, respectively).
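
A quick sanity check of those conversions (a minimal Python sketch; the only
trick is that 8b/10b spends 10 line bits per data byte):

    def line_rate_to_MBps(gbps, line_bits_per_byte=10):
        """Convert a raw line rate in Gbps to usable MBps.

        With 8b/10b encoding every data byte costs 10 bits on the
        wire, so divide the bit rate by 10 instead of 8.
        """
        return gbps * 1000 / line_bits_per_byte

    print(line_rate_to_MBps(6.0))  # SATA III: 600.0 MBps
    print(line_rate_to_MBps(2.5))  # PCIe 1.0 lane: 250.0 MBps
    print(line_rate_to_MBps(5.0))  # PCIe 2.0 lane: 500.0 MBps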

Both SATA and PCIe have a 'transaction limit', which is a function of the
controller and caps the total number of operations per second (IOPs). The
product of the IOPs and the size of each transaction can never exceed the
bandwidth of the channel, but it is often well under it. For example, a
typical SATA disk controller (prior to the popularity of SSDs) would do about
25,000 IOPs; with 512-byte (0.5 KB) block reads and writes, you could read and
write 25,000 * 0.5 = 12,500 KBps, or about 12.5 MBps, which was much lower
than the theoretical bandwidth of 200 MBps on 2 Gbps SATA channels. Optimizing
channel utilization requires figuring out how many IOPs your OS/controller can
initiate and then sizing the payload to consume the max bandwidth. With large
payloads you'll push the IOPs rate down; with small payloads you won't use all
the bandwidth.
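
To put that arithmetic in one place (a small sketch; the figures are just the
example numbers from this comment, not measurements):

    def achievable_MBps(iops_limit, io_size_kb, channel_MBps):
        """Throughput is capped by whichever binds first: the
        controller's IOPs limit or the channel bandwidth."""
        iops_bound = iops_limit * io_size_kb / 1000.0  # KBps -> MBps
        return min(iops_bound, channel_MBps)

    # 25,000 IOPs at 512 bytes per I/O: IOPs-bound at ~12.5 MBps
    print(achievable_MBps(25_000, 0.5, 200))
    # Same controller at 64 KB per I/O: channel-bound at 200 MBps
    print(achievable_MBps(25_000, 64, 200))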

One of the nicer aspects of ATM was that it was designed and specified for
full channel utilization with small fixed-size cells (53 bytes), which made it
possible to reason about the performance and latency of an arbitrary number of
streams of data moving through it.

~~~
JohnBooty

      > For example, a typical SATA disk controller (prior to the
      > popularity of SSDs) would do about 25,000 IOPs; with 512-byte
      > (0.5 KB) block reads and writes, you could read and write
      > 25,000 * 0.5 = 12,500 KBps, or about 12.5 MBps, which was much
      > lower than the theoretical bandwidth of 200 MBps on 2 Gbps
      > SATA channels.

That doesn't seem to jibe with reality. What am I missing here? SATA HDDs
would regularly hit 100 MBps in sequential transfers.

To pick a 2009-era HDD review/benchmark at random that illustrates this:
[http://www.storagereview.com/western_digital_scorpio_black_5...](http://www.storagereview.com/western_digital_scorpio_black_500gb_review_wd5000bekt)

~~~
ChuckMcM
Oh, you can get faster throughput with longer reads; to get 100 MBps on a 2
Gbps channel you simply increase the read size until you've maxed out the
bandwidth you can get.

So when characterizing a typical SATA drive you would start with 4K sequential
reads and work up until your bandwidth either hit the channel bandwidth or
stopped going up (the latter being the disk's own bandwidth). Unless you ran
across a reallocated sector, many SATA drives could return data at a rate of
100 MBps with 1 MB reads, or even smaller read sizes if you had command
queueing available. Random r/w was an issue, of course, because of head
movement (it burns your IOPs rate while waiting for the heads to change
tracks).
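
A sketch of that characterization loop, in Python (this reads through the OS
page cache, which will flatter the numbers; real tools like iometer bypass the
cache, so treat this as an illustration of the method rather than a
benchmark):

    import os, time

    def sweep_sequential_reads(path, sizes_kb=(4, 16, 64, 256, 1024),
                               total_mb=64):
        """Read total_mb sequentially at each block size and print MBps.

        Throughput climbs with block size until it hits either the
        channel bandwidth or the drive's media bandwidth, then flattens.
        """
        for kb in sizes_kb:
            block = kb * 1024
            fd = os.open(path, os.O_RDONLY)
            start = time.perf_counter()
            for _ in range((total_mb * 1024 * 1024) // block):
                if not os.read(fd, block):
                    break  # ran off the end of the file/device
            elapsed = time.perf_counter() - start
            os.close(fd)
            print(f"{kb:5d} KB reads: {total_mb / elapsed:8.1f} MBps")

    # e.g. sweep_sequential_reads("/dev/sda")  -- needs root; a large
    # file works too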

You can do these experiments with iometer [1]; there was a great paper out of
CMU which talked about illuminating the inner workings of a drive by varying
the workload [2]. Both are well worth playing with if you're ever trying to
get the absolute most I/O out of a disk drive.

[1] [http://www.iometer.org/](http://www.iometer.org/)

[2]
[http://repository.cmu.edu/cgi/viewcontent.cgi?article=1136&c...](http://repository.cmu.edu/cgi/viewcontent.cgi?article=1136&context=pdl)

------
mrmondo
NVMe is one of the most important changes to storage over the past decade.

I'm currently replacing all our SANs with storage servers filled with NVMe
SSDs as the tier 1 storage, with commodity SATA SSDs for the second tier. I've
posted
the link to my first blog post about it in the comments on another NVMe post
in the past: [http://smcleod.net/building-a-high-performance-ssd-san/](http://smcleod.net/building-a-high-performance-ssd-san/)

I'm close to writing the next post about the actual build, my findings,
benchmarks, etc. Hopefully I'll have that done next week - but the system
comes first.

I'm a little disappointed with this article as I think it could do with a)
some technical review and b) some more detailed information.

~~~
zsmith928
awesome read, can't wait to see your results on performance, etc.

------
jccalhoun
Here's a good review with lots of pics and benchmarks:
[http://www.pcper.com/reviews/Storage/Intel-SSD-750-Series-12...](http://www.pcper.com/reviews/Storage/Intel-SSD-750-Series-12TB-PCIe-and-25-SFF-Review-NVMe-Consumer)

------
nqzero
One of the fundamental problems with SSDs is the impedance mismatch introduced
by emulating HDDs. NVMe doesn't appear to help with that at all.

We need an interface that allows us to bypass the FTL (flash translation
layer) and access the underlying erase blocks.

~~~
skrause
Is there actually any evidence that this would improve performance
significantly? Removing the translation layer means that every OS and its file
systems would have to do good wear leveling, because otherwise you'd destroy
blocks quickly.

~~~
chongli
That's pretty easy to do. Just use a log-structured filesystem[0]. The
abstraction we use now is antiquated. It's very much reminiscent of the
impedance mismatch in graphics APIs (such as OpenGL) which is now being solved
(with Vulkan).

[0] [https://en.wikipedia.org/wiki/Log-structured_file_system](https://en.wikipedia.org/wiki/Log-structured_file_system)
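
For a sense of why flash likes this model, here's a toy append-only key/value
store (a minimal sketch, not how any real log-structured filesystem is
implemented): every write is an append, so the device only ever sees
sequential writes, and cleaning can be aligned to erase blocks.

    import os

    class LogStore:
        """Toy log-structured store: all writes append to one log."""

        def __init__(self, path):
            self.log = open(path, "ab+")
            self.index = {}  # key -> (offset, length) of latest value

        def put(self, key, value: bytes):
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            self.log.write(value)  # append only, never overwrite
            self.index[key] = (offset, len(value))

        def get(self, key) -> bytes:
            offset, length = self.index[key]
            self.log.seek(offset)
            return self.log.read(length)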

------
ccleve
So the important question is: how does this affect how we write applications?

Is the api different, or are we still reading/writing disk files?

Should we do memory mapping of the files or not?

Should we parallelize access to different sections of big files? Or write a
ton of small files?

How does this affect database design? Current big data apps emphasize large
append-only writes and large sequential reads (think LSM trees). Does this
make sense any more?

What does disk caching mean in the context of these new drives?

~~~
pjc50
API is the same. Memory-map if you prefer that access style and it suits your
OS/language preferences. The OS will almost certainly not let you map straight
across into the device's PCIe memory-mapped window, so you'll incur a
copy-to-userspace penalty either way. Benchmark. There is probably no longer
any advantage to a sequential write, but you still have per-syscall and
per-IOP overhead, so one large write will be faster than N small ones. Disk
caching is still there and still decreases latency, but it isn't as critical.
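
The per-syscall overhead is easy to see for yourself. A minimal sketch (the
file names and sizes here are arbitrary):

    import os, tempfile, time

    def time_writes(chunk_size, total=16 * 1024 * 1024):
        """Write `total` bytes in chunks of `chunk_size`; return seconds."""
        buf = b"x" * chunk_size
        fd, path = tempfile.mkstemp()
        start = time.perf_counter()
        for _ in range(total // chunk_size):
            os.write(fd, buf)  # one syscall per chunk
        os.fsync(fd)
        elapsed = time.perf_counter() - start
        os.close(fd)
        os.unlink(path)
        return elapsed

    print("4 KB chunks:", time_writes(4 * 1024))         # many syscalls
    print("4 MB chunks:", time_writes(4 * 1024 * 1024))  # few syscalls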

------
seatonist
I can remember putting ISA "hardcards" in my 286s and similar. Full circle!

~~~
ChuckMcM
But did they conform to the LIM[1] spec ? :-)

They also had RAM drives you could buy. The point, then as now, is that
increasing the "high performance" working set space of a program increases the
amount of transactional data that can be "in flight" during an operation, and
that increases the overall size of the data set you can work with.

I've been waiting for these boards to come down in price for about 4 years
now. I started talking with Intel about them early on (we used their X25-M
SSDs because it was a price point for flash that was "enough" better than
spinning rust that it made sense), and they insisted on trying to sell us the
same flash chips on a PCIe card for 10x the dollars. I (and many others,
apparently) refused to pay that. Sure, if you have a 'cost is no object'
database or something, but for a large internet working set where revenue
differences are measured in cents per thousand transactions? Not so much. I
know one company that went so far as to design and build their own PCIe flash
card. I have heard it did great stuff for them.

[1] LIM - Lotus-Intel-Microsoft spec for expanded memory on IBM PC compatible
machines.

~~~
yuhong
Hardcards have nothing to do with LIM.

~~~
ChuckMcM
Wow, so much for relying on my memory.

[http://en.wikipedia.org/wiki/Expanded_memory](http://en.wikipedia.org/wiki/Expanded_memory)
vs.
[https://books.google.com/books?id=KjwEAAAAMBAJ&pg=PA61&lpg=P...](https://books.google.com/books?id=KjwEAAAAMBAJ&pg=PA61&lpg=PA61&dq=286+hardcard&source=bl&ots=tUPGkGS8mi&sig=_7k1A79zs_A5nNFSmqmR2YTFXYY&hl=en&sa=X&ei=jBAiVYH1OMitogTfmIHwAQ&ved=0CB8Q6AEwAA#v=onepage&q=286%20hardcard&f=false)

I was thinking of the plug-in expanded-memory cards rather than the plug-in
hard drive cards.

------
mojoe
Another cool thing about NVMe is the smaller command set -- only about 10
commands (excluding admin commands), vs the 200+ that SCSI has grown to over
the years. It's fairly quick to learn.
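
For reference, here are the core NVM I/O opcodes as I recall them from the
NVMe 1.x spec, sketched as a Python enum (double-check the spec before relying
on the exact values):

    from enum import IntEnum

    class NvmCmd(IntEnum):
        """Core NVMe I/O command opcodes (admin commands excluded)."""
        FLUSH               = 0x00
        WRITE               = 0x01
        READ                = 0x02
        WRITE_UNCORRECTABLE = 0x04
        COMPARE             = 0x05
        WRITE_ZEROES        = 0x08
        DATASET_MANAGEMENT  = 0x09  # includes deallocate (TRIM)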

------
higherpurpose
Wasn't someone here saying these SSDs consume an abnormal amount of power to
achieve those (2x?) faster speeds compared to current PCIe SSDs? Or was that
just applicable to Intel's SSDs?

~~~
petrbela
Samsung claims the opposite: "NVMe SSD provides low energy consumption to help
data centers and enterprises operate more efficiently and reduce expenses.
Power-related costs typically represent 31% of total data center costs, with
the memory and storage portion of the power (including cooling) consuming 32%
of the total data center power. NVMe SSD requires lower power (less than 6W
active power) with energy efficiency (IOPS/Watt) that is 2.5x that of SATA."

[http://www.samsung.com/global/business/semiconductor/product...](http://www.samsung.com/global/business/semiconductor/product/flash-ssd/nvmessd)
[http://www.pcworld.com/article/2866912/samsungs-ludicrously-...](http://www.pcworld.com/article/2866912/samsungs-ludicrously-fast-pcie-ssd-uses-almost-no-power-in-standby-mode.html)

~~~
petrbela
And as far as I understand, isn't an NVMe SSD just a "different name" for a
PCIe SSD? PCIe being the bus protocol (the one already used for graphics
cards), and NVMe the standard by which SSDs speak over that protocol.

~~~
wtallis
There's more to it than that. NVMe is a higher layer technology than PCIe.

SATA drives connect to the host system over a SATA PHY link to a SATA HBA that
itself is connected to the host via PCIe. The OS uses AHCI to talk to the HBA
to pass ATA commands to the drive(s).

PCIe SSDs that don't use NVMe exist, and work by unifying the drive and the
HBA. This removes the speed limitation of the SATA PHY, but doesn't change
anything else. The OS can't even directly know that there's no SATA link
behind the HBA; it can only observe the higher speeds and 1:1 mapping of HBAs
to drives. Some PCIe SSDs have been implemented using a RAID HBA, so the speed
limitation has been circumvented by having multiple SATA links internally,
presented to the OS as a single drive.

NVMe standardizes a new protocol that operates over PCIe, where the HBA is
permanently part of the drive, and there's a new command set to replace ATA.
New drivers are needed, and NVMe removes many bottlenecks and limitations of
the AHCI+ATA protocol stack.
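
To make the layering concrete, a rough sketch of the three stacks described
above, from the OS driver down to the flash (the labels are mine, not spec
terminology):

    stacks = {
        "SATA SSD": ["AHCI driver", "PCIe", "SATA HBA", "SATA PHY",
                     "ATA commands", "drive"],
        "PCIe SSD (pre-NVMe)": ["AHCI driver", "PCIe",
                                "HBA integrated on drive",
                                "ATA commands", "flash"],
        "NVMe SSD": ["NVMe driver", "PCIe", "NVMe controller on drive",
                     "NVMe commands", "flash"],
    }

    for name, layers in stacks.items():
        print(name + ": " + " -> ".join(layers))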

------
exabrial
SFF-8639 cables look pretty cool! Reminds me of what Apple/Intel was trying to
do with Thunderbolt. I can imagine some creative people will find other uses
for this connector.

------
MCRed
Why not simply make an SSD controller that has a Thunderbolt port? Since Intel
is building Thunderbolt into its support chips now, this seems like a good way
to get quick performance and plenty of bandwidth without having to come up
with a new standard, separate drivers, etc. Thunderbolt ports could be put on
motherboards fairly easily.

Is there something I'm missing?

Plus this would have the advantage of driving down prices for Thunderbolt and
increasing adoption.

~~~
eurleif
Isn't Thunderbolt just externalized PCI Express?

~~~
TazeTSchnitzel
Thunderbolt is PCIe and DisplayPort.

------
Derbasti
Is this relevant for consumers? SSDs were a huge improvement for everyday
computing; would NVMe be a similar jump?

