What we learned about SSDs in 2015 (zdnet.com)
179 points by NN88 on Dec 20, 2015 | 109 comments

The most exciting recent development in SSDs (until 3DXpoint is released), is bypassing the SATA interface, connecting drives straight into the PCIe bus (no more expensive raid controllers). Just a shame hardly any server motherboards come with M.2 slots right now.

The 4x speed increase and lower CPU overhead mean it is now possible to move RAM-only applications (for instance in-memory databases) to SSDs, keeping only the indexes in memory. Yeah, we've been going that way for a while; it just seems we've come a long way in just over a decade from the expensive Sun E6500s I was working with.

M.2 slots don't make much sense for servers, at least if you're trying to take advantage of the performance benefits possible with a PCIe interface. Current M.2 drives aren't even close to saturating the PCIe 3.0 x4 link but they're severely thermally limited for sustained use and they're restricted in capacity due to lack of PCB area. Server SSDs should stick with the traditional half-height add-in card form factor with nice big heatsinks.

Most of the NVMe backplanes I've seen give full 'enterprise' 2.5" drive clearance to the thing, so if they are actually as thick as consumer SSDs, as most current SATA 'enterprise SSDs' are, there's plenty of room for a heatsink without expanding the slot. The supermicro chassis (and I've only explored the NVMe backplanes from supermicro) usually put a lot of effort into drawing air through the drives, so assuming you put in blanks and stuff, the airflow should be there, if the SSDs are set up to take advantage of it.

You need to be more precise with your terminology. There is no such thing as a NVMe backplane. NVMe is a software protocol. The backplanes for 2.5" SSDs to which you refer would be PCIe backplanes using the SFF-8639 aka U.2 connector. None of the above is synonymous with the M.2 connector/form factor standard, which is what I was talking about.

edit: okay, I re-read what you said and yes, these won't support M.2 drives, if I understand what's going on here, and it's possible I still don't. (I have yet to buy any non-sata SSD, though I will soon be making experimental purchases.)

I was talking about these:


Note, though, it looks like if you are willing to pay for a U.2 connected drive, you can get 'em with the giant heatsinks you want:


further edit:


available; not super cheap, but I'm not sure you'd want the super cheap consumer grade stuff in a server, anyhow.

further edit:

but I object to the idea of putting SSDs on PCI-e cards for any but disposable "cloud" type servers (unless they are massively more reliable than any that I've seen, which I don't think is the case here.) just because with a U.2 connected hard drive in a U.2 backplane, I can swap a bad drive like I would swap a bad sata drive; an alert goes off and I head off to the co-lo as soon as convenient and I can swap the drive without disturbing users, whereas with a low-profile PCI-e card, I've pretty much gotta shut down the server, de-rack it, then make the swap, which causes downtime that must be scheduled, even if I have enough redundancy that there isn't any data loss.

Take a look at how much stricter the temperature and airflow requirements are for Intel's 2.5" U.2 drives compared to their add-in card counterparts. (And note that the U.2 drives are twice as thick as most SATA drives.)

M.2 has almost no place in the server market. U.2 does and will for the foreseeable future, but I'm not sure that it can serve the high-performance segment for long. It's not clear whether it will reach the limits on capacity, heat, or link speed first, but all of those limits are clearly much closer than for add-in cards.

>M.2 has almost no place in the server market. U.2 does and will for the foreseeable future, but I'm not sure that it can serve the high-performance segment for long. It's not clear whether it will reach the limits on capacity, heat, or link speed first, but all of those limits are clearly much closer than for add-in cards.

No argument on m.2 - it's a consumer grade technology. No doubt, someone in the "cloud" space will try it... I mean, if you rely on "ephemeral disk" - well, this is just "ephemeral disk" that goes funny sooner than spinning disk.

But the problem remains: if your servers aren't disposable, if your servers can't just go away at a moment's notice, the form factor of add-in cards is going to be a problem for you, unless the add-in cards are massively more reliable than I think they are. Taking down a whole server to replace a failed disk is a no-go on most non-cloud applications...

You're probably tired of hearing this from me, but if you distribute the storage, you can evacuate all the VMs off a host, down it, do whatever, bring it back up, and then unevacuate.

> You're probably tired of hearing this from me, but if you distribute the storage, you can evacuate all the VMs off a host, down it, do whatever, bring it back up, and then unevacuate.

Yes. The same conversation is happening right now in the living room, because security has forced three reboots in the last year, after having several years where simply by being xen and pv and using pvgrub (rather than loading a user kernel from the dom0) we haven't been vulnerable to the known privilege escalations. This is a lot more labor (and customer pain) when you can't move people.

No progress on that front yet, though.

And if uptime is that important, you can just buy a server that supports PCIe hotswap.

can you point me at a chassis designed for that?

OTOH if you can fit 24 U.2s but only 6 AICs in a server, maybe RAIDed U.2 will still win.

Only if you postulate that your drives won't be able to saturate their respective number of PCIe lanes. The total number of lanes in the system is fixed by the choice of CPU, and having drives with a total lane count in excess of that only helps when your drives can't fully use the lanes they're given or if you're trying to maximize capacity rather than performance.

2.5" NVMe is fine and more scalable than add-in cards (more slots) with increased serviceability. The equivalent add-in card is usually much more expensive than the 2.5" version.

I thought this was the point of SATA Express, but from what I last read, SATA Express is on par with M.2 at best and sometimes even falls behind it.

How exactly does connecting to the PCIe bus obviate the need for RAID? RAID isn't about connecting drives, it's about not losing data. If you don't need RAID for your application today, you can purchase a non-RAID SAS adapter for a couple hundred bucks, or just use the onboard ports that are sure to be there.

> RAID isn't about connecting drives, it's about not losing data.

RAID-0 is used as a way to get faster performance from spinning disk drives, as you can return parts of each read request from different (striped) drives.

You also get better write performance, as your writes are split across the drives.
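To make the striping idea concrete, here is a minimal Python sketch of how a RAID-0 layout might spread blocks across drives. The names `stripe_location` and `drives_touched` are made up for illustration, and the stripe unit of exactly one block is an assumption; real controllers use larger, configurable stripe sizes.

```python
def stripe_location(logical_block: int, num_drives: int) -> tuple:
    """Map a logical block to (drive index, block offset on that drive).

    Assumes a stripe unit of exactly one block, purely for illustration.
    """
    if num_drives < 1:
        raise ValueError("need at least one drive")
    if logical_block < 0:
        raise ValueError("logical block must be non-negative")
    return logical_block % num_drives, logical_block // num_drives


def drives_touched(first_block: int, count: int, num_drives: int) -> set:
    """Return the set of drives that a sequential request of `count` blocks hits."""
    return {
        stripe_location(block, num_drives)[0]
        for block in range(first_block, first_block + count)
    }
```

With four drives, an eight-block sequential read touches all four drives, which is where the parallel speedup comes from; a single-block request still lands on just one drive.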

uh, this shouldn't be voted down. Now, I don't use RAID for performance, I use raid for reliability, like parent said, and I've never actually been in a position where it would make sense, but people do use raid0 to increase performance. It happens, even if it's not nearly as common as using raid to prevent data loss.

Back in the day, folks would RAID0 AWS EBS volumes to get the level of performance they needed (this was long before SSD EBS volumes).

Now you're making me feel old - when "back in the day" and "AWS" get used in the same sentence... :-)

This is still (AFAIK) the recommended way to get higher IOPS out of Azure SSDs.

I like RAID10. It's almost as reliable as RAID1, and you get about half the capacity and performance boost of RAID0. With four drives, RAID10 gives you the same capacity as RAID6. It's only ~30% slower, and it rebuilds much faster after drive failure and replacement. With more drives, you get 100% of added drive capacity for RAID6 vs 50% for RAID10, but rebuild time gets crazy huge.
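The capacity comparison can be checked with a small sketch. The helper `usable_drives` is hypothetical and assumes identical drives, RAID10 built from two-way mirrors, and RAID6 with two drives' worth of parity; it says nothing about performance or rebuild time.

```python
def usable_drives(level: str, n: int) -> int:
    """Usable capacity, in units of one drive, for n identical drives."""
    if level == "raid0":
        return n
    if level == "raid10":
        if n < 4 or n % 2:
            raise ValueError("RAID10 sketch assumes an even number of drives, at least 4")
        return n // 2          # every drive is mirrored once
    if level == "raid6":
        if n < 4:
            raise ValueError("RAID6 needs at least 4 drives")
        return n - 2           # two drives' worth of parity
    raise ValueError(f"unknown level: {level}")
```

At four drives both layouts yield two drives of usable space; going from four to eight drives adds four usable drives under RAID6 but only two under RAID10.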

High-end video servers used in the entertainment industry typically use RAID 0. My last project, for example, used 10 250GB SSDs in RAID 0.

What kind of video servers are you talking about? For ingesting from a feed/camera in? Streaming out video? Curious.

While these servers can ingest video feeds, they wouldn't (typically) be saving the incoming feeds. These servers play out video over many projectors blended into a single seamless image. If you've watched the past few Olympics opening ceremonies, you most certainly saw the projections.

My recent project, which was relatively small, played back a single video over 4 blended projectors. Each video frame was a 50MB uncompressed TGA file, 30 times a second. On a more complex show, you could be trying to play back multiple video streams simultaneously.

D3 - http://www.d3technologies.com/ and Pandora - http://www.coolux.de/products/pandoras-box-server/

are two of the big players in the industry.
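For a sense of scale, the playback figures above work out as follows. This is only back-of-the-envelope arithmetic: it assumes decimal megabytes, perfectly even striping with no overhead, and that the 10-drive RAID 0 array mentioned earlier is the one serving this stream. The function name is made up for illustration.

```python
def required_mb_per_s(frame_mb: float, fps: float, streams: int = 1) -> float:
    """Sustained read rate (MB/s) needed to play back uncompressed frames."""
    return frame_mb * fps * streams


total = required_mb_per_s(50, 30)  # 50 MB frames at 30 frames per second, one stream
per_drive = total / 10             # assumption: evenly striped over 10 drives
```

That is 1500 MB/s sustained for a single stream, or about 150 MB/s per drive under those assumptions, before any second stream is added.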

> Each video frame was a 50MB uncompressed TGA file, 30 times a second.

Ouch! Not even a PNG?

Uncompressed TGA is used for two reasons: Quality (this particular image was going on a screen three stories tall) and, as the other commenter mentioned, the time required to decompress images.

I'm not an expert on the exact architecture, but I'm guessing that with uncompressed TGA you just throw the bits at the GPU and they get displayed, while if you have to uncompress images that first gets handled by the CPU (?).

I'd guess the time to do compression from TGA to PNG may be more than 1/30th of a second. Or at least reliably/consistently so.

A friend of mine does a lot of programming and installations using Dataton Watchout - where they use 4 or more 4K projectors stitched together into one big image. He's regularly got war stories about problems with the latest batch of SSDs not playing well in RAID0 causing stutters on simultaneous playback of multiple 4K video streams.

Is there an advantage compared to putting each stream on a different drive?

It's possible, and probably advantageous to a point. Eventually you'll hit bottlenecks somewhere, at which point you throw more servers at it.

Sure. My first purchase of SSDs were 3 80GB Intel SSDs which I put into RAID-0 and used for my primary system drive on my gaming machine. RAID-0 provides very-near 100% performance boost per drive added. Of course, it also provides the same bonuses to the likelihood of total data loss... but that was a risk I was OK with taking (and which never bit me!).
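The risk side of that trade can be sketched too. Assuming drive failures are independent and each drive fails with probability `p_drive` over some period (both assumptions, and the function name is invented here), a RAID-0 array is lost if any one member fails:

```python
def raid0_loss_probability(p_drive: float, n: int) -> float:
    """Probability that at least one of n independent drives fails."""
    if not 0.0 <= p_drive <= 1.0:
        raise ValueError("p_drive must be a probability")
    return 1.0 - (1.0 - p_drive) ** n
```

For small per-drive failure rates this is close to `n * p_drive`, so three drives roughly triple the chance of losing everything, in line with the near-linear performance gain.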

Mine was the same drive, my first SSD was the first 80GB Intel one, and just over a year of use, it started reporting itself as an 8MB drive. You never know for sure what will remain good, or be good consistently.

RAID 0 does not have any level of redundancy, so you might as well remove the 'R' in RAID and replace it by 'S' for striping or something. However, people would probably get confused if you start calling it SAID.

Complaining about industry-standard terminology like this is mostly a waste of time.

The sort of people who need to know that but don't aren't going to learn it from a post on HN, and the sort of people who'll read it on HN mostly don't need to be told.

Sad but true. I gather that RAID0 has been the default for notebooks with two drives. And for the first 1TB MyBooks.

0 == no redundancy...

> RAID [...] it's about not losing data

It's a pretty ineffective way to guard against data loss, you need backups.

RAID [1-6] is an availability solution, saving the downtime of restoring from backup in the case of the most predictable disk failures. It doesn't help with all the other cases of data loss.

If we can connect SSDs directly to the PCIe bus... why are there no cute little Thunderbolt flash drives that run at full NVMe speed?

Windows has had astoundingly bad Thunderbolt hot-plug support given that ExpressCard and CardBus existed. NVMe support for non-Apple SSDs under OS X can be problematic. Thunderbolt PHYs are expensive. USB 3.0 is fast enough for now, and USB 3.1 exists.

This is a bit of a question from my lack of understanding of disk IO:

But would this in practice play well with the CPU prefetcher? If you're crunching sequential data can you expect the data in the L1 cache after the initial stall?

SSDs are still grossly slower than RAM. The fastest SSD I know of is the Intel 750, which is like ~2.0 GigaBYTES/second (or for what is more typical for storage benchmarks: about 16Gbps or so over PCIe x4).

Main DDR3 RAM is something like 32 GigaBYTES per second, and L1 cache is faster still.

What I think the poster was talking about, is moving from disk-based databases to SSD-based databases. SSDs are much faster than hard drives.

L1 Cache, L2 Cache, L3 Cache, and main memory are all orders of magnitude faster than even the fastest SSDs today. Thinking about the "CPU prefetcher" when we're talking about SSDs or Hard Drives is almost irrelevant due to the magnitudes of speed difference.

"2015 was the beginning of the end for SSDs in the data center." is quite a bold statement, especially when not discussing any alternative. I do not see us going back to magnetic disk, and most new storage technologies are some kind of SSD...

My thoughts exactly. The article is quite inflammatory and tosses out some bold statements without really deep diving into them. My favorite:

"Finally, the unpredictable latency of SSD-based arrays - often called all-flash arrays - is gaining mind share. The problem: if there are too many writes for an SSD to keep up with, reads have to wait for writes to complete - which can be many milliseconds. Reads taking as long as writes? That's not the performance customers think they are buying."

This is completely false in a properly designed server system. Use the deadline scheduler with SSD's so that reads aren't starved from bulk I/O operations. This is fairly common knowledge. Also, if you're throwing too much I/O load at any storage system, things are going to slow down. This should not be a surprise. SSD's are sorta magical (Artur), but they're not pure magic. They can't fix everything.
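For anyone who wants to check which scheduler a device is using, here is a small Python sketch. On Linux the list lives in `/sys/block/<device>/queue/scheduler`, with the active entry in square brackets (for example `noop deadline [cfq]`). The helper names are invented for illustration, and the set of available schedulers depends on the kernel.

```python
from pathlib import Path


def parse_schedulers(text):
    """Parse sysfs scheduler text such as 'noop deadline [cfq]'.

    Returns (active, available); the active scheduler is the bracketed one.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_schedulers(device):
    """Read the scheduler list for a block device such as 'sda' (Linux only)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())
```

Switching is normally done by writing the scheduler name to that same file as root; on newer multi-queue kernels the closest equivalent may be listed as `mq-deadline` instead.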

While Facebook started out with Fusion-io, they very quickly transitioned to their own home-designed and home-grown flash storage. I'd be wary of using any of their facts or findings and applying them to all flash storage. In short, these things could just be Facebook problems because they decided to go build their own.

He also talks about the "unpredictability of all flash arrays" like the fault is 100% due to the flash. In my experience, it's usually the RAID/proprietary controller doing something unpredictable and wonky. Sometimes the drive and controller do something dumb in concert, but it's usually the controller.

EDIT: It was 2-3 years ago that flash controller designers started to focus on uniform latency and performance rather than concentrating on peak performance. You can see this in the maturation of I/O latency graphs from the various Anandtech reviews.

There is unpredictability in SSDs; however, it's more like whether an IOP will take 1 ns or 1 ms, instead of 10 ms or 100 ms with an HD.

The variability is an order of magnitude greater, but the worst case is several orders of magnitude better. Quite simply, no one cares that you might get 10,000 IOPS or 200,000 IOPS from an SSD when all you're going to get from a 15K drive is 500 IOPS.

Best-case for a SSD is more like 10µs, and the worst-case is still tens of milliseconds. Average case and 90th percentile are the kind of measures responsible for the most important improvements.

And the difference between a fast SSD and a slow SSD is pretty big: for the same workload a fast PCIe SSD can show an average latency of 208µs with 846µs standard deviation, while a low-end SATA drive shows average latency of 1782µs and standard deviation of 4155µs (both are recent consumer drives).

Where does one find 10us reads? NAND usually has a Tread of 50 to 100 us, so just the NAND operation itself takes more than 10us.

Tprog is around 1ms and Terase can be upwards of 2ms.

All in all this means a large variability in read performance depending on what other actions are done on the SSD and how well the SSD manages the writes and erase operations in the background.

This doesn't even change with the interface (SAS/SATA/PCIe), those add their own queues, link errors and thus variability.

Then you have the differences in over-provisioning, which allow high-OP drives to better mask the programming and erase processes.

One can see tens of ms from an SSD if you run it long enough and hard enough. It is also possible to get hundreds of ms and even seconds if the SSD is getting bad.

It's true that 99% of your IOs will see service time of below 1ms but it's the other 1% that really matters to avoid getting late night calls or even a mid-day crisis.
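To see why the tail matters more than the average, here is a minimal sketch that summarizes latency samples with a nearest-rank percentile. The sample values are entirely hypothetical (98 fast I/Os and 2 slow ones, in microseconds) and the function names are made up for illustration.

```python
import statistics


def percentile(sorted_samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    rank = -(-p * len(sorted_samples) // 100)  # ceiling division
    return sorted_samples[max(rank, 1) - 1]


def latency_summary(samples_us):
    """Summarize I/O service times given in microseconds."""
    ordered = sorted(samples_us)
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.pstdev(ordered),
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
    }


# Hypothetical workload: 98 I/Os at 100 us, 2 stalled behind erases at 20 ms.
summary = latency_summary([100] * 98 + [20_000] * 2)
```

In this made-up sample the median stays at 100 us while the 99th percentile jumps to 20 ms, which is the kind of gap an average alone hides.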

deadline seems less useful now that a bunch of edge cases were fixed in cfq (the graphs used to look way worse than this): http://blog.pgaddict.com/posts/postgresql-io-schedulers-cfq-...

Yeah I thought that too. I think the author is specifically referring to 2.5" form factor SATA SSD's though.

At one of my previous employers, they built a massive "cloud" storage system. The underlying file system was ZFS, which was configured to put its write logs onto an SSD. With the write load on the system, the servers burnt through an SSD in about a year, i.e. most SSDs started failing after about a year. The hardware vendors knew how far you could push SSDs, and thus refused to give any warranty. All the major SSD vendors told us SSDs are strictly considered wearing parts. That was back in 2012 or 2013.

Just a note ... we use SSDs as write cache in ZFS at rsync.net and although you should be able to withstand a SLOG failure, we don't want to deal with it so we mirror them.

My personal insight, and I think this should be a best practice, is that if you mirror something like an SLOG, you should source two entirely different SSD models - either the newest intel and the newest samsung, or perhaps previous generation intel and current generation intel.

The point is, if you put the two SSDs into operation at the exact same time, they will experience the exact same lifecycle and (in my opinion) could potentially fail exactly simultaneously. There's no "jitter" - they're not failing for physical reasons, they are failing for logical reasons ... and the logic could be identical for both members of the mirror...

This is good advice, and fwiw the problem it addresses can happen in spinning drives too. We had a particular kind of WD drive that had a firmware bug where the drive would reset after 2^N seconds of power-up time (where that duration was some number of months).

FWIW I bricked a very cheap consumer SSD by using it as write log for my ZFS array. This was my experiment machine, not a production server.

Fortunately I had followed accepted practice of mirroring the write cache. (I'd also used dedicated, separate host controllers for each of these write-cache SSDs, but for this cheap experiment that probably didn't help.)

So yes this really happens.

We ship voice recording and conferencing appliances based on Supermicro hardware, a RAID controller and 4x disks on RAID 10.

We tried to mitigate the failure interval on the drives by mixing brands. Our Supermicro distributor tried to really dissuade us from using mixed batches and brands of SAS drives in our servers. Really had to dig in our heels to get them to listen.

Even when you buy a NAS fully loaded like a Synology it comes with the same brand, model and batch of drives. In one case we saw 6 drive failures in two months for the same Synology NAS.

Wonder whether NetApp or EMC try mixing brands or at least batches on the appliances they ship?

I can tell you that EMC and IBM both use the same drives from the same batch in an entire system of tens to hundreds of drives. While I can't speak for every case, I did oversee a large number of systems and drives, and we never had a double disk failure that completely took out two drives. With a proper background media scan procedure you also reduce the risk of a media problem in two different drives.

Of course, the SSDs we use are properly vetted for design issues and bugs in the firmware actually get fixed for us in a relatively timely manner. You get that level of service with the associated large volume.

In a sense, so is all storage hardware. It's just a matter of how long it takes before it fails. This goes for SSDs, HDDs, Flash cards, etc.

If your SSDs were wearing out after a year, and were warrantied for a year, I'm guessing you weren't using "enterprise" SSDs?

Even enterprise SSDs like Samsung's have a guarantee like "10 years or up to x TB of written data". So if you write a lot you can lose the guarantee after a year even with enterprise SSDs.

They were enterprise models (i.e. not cheap), but they had no warranty in the first place. Every single hardware supplier simply refused to give any. I _guess_ because of the expected wear and tear.

That's interesting. I don't even know how to buy these things without a warranty. Were they direct from the manufacturer?

Logging to flash storage is just asking for issues. We bought a recent model of LARGE_FIREWALL_VENDOR's SoHo product, and enabling the logging features will destroy the small (16GB?) flash storage module in a few weeks (!).

The day before we requested the 3rd RMA, the vendor put a notice on their support site that using certain features would cause drastically shortened life of the storage drive, and patched the OS in attempt to reduce the amount of writes.

Logging to poorly specified Flash storage is the real problem. They were likely using cheap eMMC flash, which is notorious for extremely poor wear leveling.

Unfortunately the jump to good flash is quite expensive, and often hard to find in the eMMC form factor which is dominated by low cost parts.

This article has a lot of good information, but its weirdly sensationalistic tone detracts from it. I appreciate learning more about 3D Xpoint and Nantero, but SSDs are not a "transitional technology" in any real sense of the word, and they won't be displaced by anything in 2016, if nothing else because it takes multiple years from volume capability to stand up a product pipeline on a new memory technology, and more years to convince the enterprise market to start deploying it. The most solid point the article makes is that the workload-specific performance of SSD-based storage is still being explored, and we need better tools for it.

I got the sense that it was a PR hit for Nantero, bought and paid for. Notice the arc of the article: it says "[Popular hot technology] is dead. [Big vendors] have recently introduced [exciting new product], but [here are problems and doubts about those]. There's also [small startup you've never heard of] which has [alternative product] featuring [this list of features straight from their landing page]."

Usually these types of articles are designed to lead people directly to the product that's paying for the article. Sensationalistic is good for this; it gets people to click, it gets people to disagree, and then the controversy spreads the article across the net. Seems like it's working, in this case.

Robin Harris has been advocating for the abolishment of block abstraction layer for a couple of years now and this piece is consistent with his usual rhetoric

Eh, it listed it as "promises" and talked the same way about Adesto. It's reasonable to say "this is the basic claim of the product; we'll see if they get there" without it being PR.

FWIW, the top-end HPE SSD models are rated for up to 25 writes of the entire SSD drive, per day, for five years.

The entry-level SSDs are rated for ~two whole-drive writes per week.

Wear gage, et al.


Also maybe of interest, the Techreport The SSD Endurance Experiment. Their assorted drives lasted about 2,000 - 10,000 whole disk writes.
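Those ratings are easier to compare as total whole-drive writes. A rough sketch, assuming the five-year period applies to both tiers (the comment only states it for the top-end models) and using a hypothetical 0.4 TB capacity for the terabytes-written figure:

```python
def total_drive_writes(writes_per_day: float, years: float) -> float:
    """Whole-drive writes accumulated at a steady daily rate (365-day years)."""
    return writes_per_day * 365 * years


def terabytes_written(writes_per_day: float, years: float, capacity_tb: float) -> float:
    """Total data written, given a drive capacity in TB (capacity is an assumption)."""
    return total_drive_writes(writes_per_day, years) * capacity_tb


top_end = total_drive_writes(25, 5)         # 25 drive writes per day for five years
entry_level = total_drive_writes(2 / 7, 5)  # about two drive writes per week
```

Under those assumptions the top-end rating comes to 45,625 whole-drive writes and the entry-level one to roughly 520, which brackets the 2,000 to 10,000 figure from the endurance experiment.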


Kind of stupid to end with "Since CPUs aren't getting faster, making storage faster is a big help."

CPUs and storage exist for completely disjoint purposes, and the fastest CPU in the world can't make up for a slow disk (or vice versa). Anyway, CPUs are still "faster" than SSDs, whatever that means, if you wish to somehow compare apples to oranges. That's why even with NVMe if you are dealing with compressible data enabling block compression in your FS can speed up your IO workflow.

Storage and CPU cycles aren't completely disjoint. While this is true for plain old data archival, a lot of storage in reality is just used as cache. You could argue even your customer data is a cache, because you can always go back to the customer for most of it. Most data can be recomposed from external sources given enough computation.

Ever tried to play a modern computer game? You never have enough RAM for stuff; a lot of content gets dumped onto the hard drive sooner or later (virtual memory), or is streamed from the drive in the first place. Having faster access helps tremendously.

From my observation, actually most personal and business use machines are IO-bound - it often takes just the web browser itself - with webdevs pumping out sites filled with superfluous bullshit - to fill out your RAM completely, and then you have swapping back and forth.

I don't think I've touched a game on PC where you can't fit all the levels into RAM, let alone just the current level. Sometimes you can't fit music and videos into ram, but you can stream that off the slowest clunker in the world. A game that preloads assets will do just fine on a bad drive with a moderate amount of RAM. Loading time might be higher, but the ingame experience shouldn't be affected.

As far as swapping, you do want a fast swap device, but it has nothing to do with "Since CPUs aren't getting faster". You're right that it's IO-bound. It's so IO-bound that you could underclock your CPU to 1/4 speed and not even notice.

So in short: Games in theory could use a faster drive to better saturate the CPU, but they're not bigger than RAM so they don't. Swapping is so utterly IO-bound that no matter what you do you cannot help it saturate the CPU.

The statement "Since CPUs aren't getting faster, making storage faster is a big help." is not true. A is not a contributing factor to B.

> I don't think I've touched a game on PC where you can't fit all the levels into RAM, let alone just the current level.

I know, right?! I would rather like it if more game devs could get the time required to detect that they're running on a machine with 16+GB of RAM and -in the background, with low CPU and IO priority- decode and load all of the game into RAM, rather than just the selected level + incidental data. :)

For years, total available RAM could easily exceed the install size of an entire game, thanks to consoles holding everything back.

And by exceeded I mean the games were 32-bit binaries, so the RAM left over was enough to cache the entire game data set even in light of RAM used by the game.

Recently install size seems to have grown quite a bit.

Your process spends some amount of time waiting on various resources; some of those may be CPU-bound, some may be disk-bound. Speeding either of those up will make your process faster. In fact, you can even trade one for the other if you use compression.
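The CPU-for-I/O trade is easy to demonstrate with Python's standard `zlib` module. This is only a sketch with a made-up helper name and made-up sample data; filesystem block compression works differently in detail, and real ratios depend entirely on the data.

```python
import zlib


def compressed_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (lower means less I/O)."""
    if not data:
        raise ValueError("need some data to compress")
    compressed = zlib.compress(data, level)
    # Round-trip check: the CPU work must give back exactly the original bytes.
    if zlib.decompress(compressed) != data:
        raise RuntimeError("round-trip mismatch")
    return len(compressed) / len(data)


# Hypothetical, highly repetitive log-style data.
log_like = b"timestamp=2015-12-20 level=INFO msg=ok\n" * 1000
ratio = compressed_ratio(log_like)
```

Repetitive data like this shrinks to a small fraction of its size, so far fewer bytes cross the storage link at the cost of some CPU time; already-compressed or random data would see little benefit.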

Storage is definitely the bottleneck for many applications and at one point in the past, it used to be the CPU that was the main bottleneck for these same applications so I can understand their point too.

Can't wait for the inevitable discovery that NAND chips are being price-fixed. You would think after the exact same companies price-fixed RAM and LCD panels that people's radars would go off faster. You expect me to believe that the metals and neodymium magnets and now helium-containing drives are cheaper to manufacture than an array of identical NAND chips? NAND chips which are present in a significant percentage of all electronic products purchased by anyone anywhere? When a component is that widely used, it becomes commoditized. Which means its price to manufacture drops exponentially, not linearly like the price of NAND has done. This happened with RAM and LCDs as well. When you can look around and see a technology being utilized in dozens of products within your eyesight no matter where you are, and those products are still more expensive than the traditional technology they replace, price-fixing is afoot.

I am open to being wrong on this, but I don't think I am. Can anyone give a plausible explanation why 4TB of NAND storage should cost more to manufacture than a 4TB mechanical hard drive does, given the materials, widespread demand for the component, etc?

"Apples do not cost the same as oranges, therefore oranges are being price-fixed" is not a convincing line of reasoning. The two technologies are very different and NAND storage is much newer, and has always been much more expensive than disk storage.

The correct thing to compare NAND prices to is other chips that are being fabbed at the same process node, by die area.

Maybe SSDs should add something like a "raw mode", where the controller just reports everything it knows about the disk and the operating system takes control of it, so firmware won't cause unexpected pauses. After all, the operating system knows more: which files are unlikely to be touched, which files change often, etc.

The industry is moving toward a mid-point of having the flash translation layer still implemented on the drive so that it can present a normal block device interface, but exposing enough details that the OS can have a better idea of whether garbage collection is urgently needed: http://anandtech.com/show/9720/ocz-announces-first-sata-host...

Moving the FTL entirely onto the CPU throws compatibility out the window; you can no longer access the drive from more than one operating system, and UEFI counts. You'll also need to frequently re-write the FTL to support new flash interfaces.

OS compatibility is important for laptops/desktops, but not in at least some database / server applications, and those are the applications that would benefit most from raw access

And now we must rely on each OS to implement their own version of a virtual firmware, and do comparisons between different implementations, etc. etc. etc.

This has existed for a long time, see the Linux kernel's mtd infrastructure, and the filesystems designed to run on top of it (jffs, yaffs). It used to be used in a lot of embedded devices before eMMC became cheap, and is still used in things like home routers.

I am not sure how well mtd's abstraction fits with modern Flash chips, though.

Not so sure. There are plenty of benefits to SSD too. I suspect system designers will just add more RAM to act as cache to offset some of these performance issues. Not to mention further improve temperature control.

More RAM means more reserve power needed to flush it to permanent storage when the main power is cut.

What's more likely to happen is exposing the low level storage and software to kernel drivers.

I think transfire was referring to RAM in the system to act as a pagefile read cache, not RAM on the SSD to act as a cache there. There's no power risk to an OS-level read cache.

>log-structured I/O management built into SSDs is seriously sub-optimal for databases and apps that use log-structured I/O as well

This assertion piqued my interest given that my hands-on experience with HBase speaks to the contrary. The paper by SanDisk they refer to https://www.usenix.org/system/files/conference/inflow14/infl... seems to suggest that most of the issues are related to sub-optimal defragmentation by the disk driver itself. More specifically, the fact that some of the defragmentation is unnecessary. Hardly a reason to blame the databases and can be addressed down the road. After all, GC in Java is still an evolving subject.

Article is garbage. Basically "I told you so" by someone who never got up-to-date after the first SSDs came out and found some numbers to cherry pick that seemed to support his false beliefs.

I need an external disk for my laptop that I leave plugged in all the time.

What is the most reliable external hard drive type? I thought SSDs were more reliable than spinning disks, especially to leave plugged in constantly, but now I'm not as sure.

This article, and the majority of the comments here, are about using SSDs in server environments, where permanently high load and zero downtime is the norm. And it doesn't even seem to be about SSDs vs HDDs, it is about SSDs vs future technologies.

For personal use, SSDs outperform HDDs in just about every aspect. If you can afford the cost, an SSD is the better choice. And there is nothing mentioned here about downsides of leaving a drive plugged in and powered on at all times.

I still don't trust SSDs as much as I do spinning disks. While neither kind of drive should be trusted with the only copy of important data, I would say that drives used for backup, or for access to large amounts of data that can be recovered or recreated if lost and where the performance requirements do not demand an SSD, might as well be HDDs -- they're cheaper, and arguably still more reliable. If the workload is write-heavy, HDDs are definitely preferred as they will last much longer.

While all disks can fail, HDDs are less likely to fail completely; usually they just start to develop bad sectors, so you may still be able to recover much of their contents. When an SSD goes, it generally goes completely (at least, so I've read).

So it depends on your needs and how you plan to use the drive. For light use, it probably doesn't matter much either way. For important data, you need to keep it backed up in either case. SSDs use less power, particularly when idle, so if you're running on battery a lot, that would be a consideration as well.

Anecdotal evidence: a 2.5" SATA HDD failed on me suddenly just last Tuesday. SMART was fine beforehand, both the attributes and a long self-test (complete surface scan) that I ran a few weeks ago, after I got this lightly used notebook from a colleague (I only needed it for tests).

I think what people experience is that sudden death doesn't occur more often on SSDs than on HDDs. But with the mechanical issues and the slow build-up of bad sectors gone, sudden death is probably the only visible issue left.

(Just my personal opinion.)

Does NVMe solve the garbage collection problem?

NVMe is just the communications protocol between the host and device. Garbage collection exists because the drive hides the non-ideal properties of the NAND from the host. Primary among them are endurance limits and the size difference between write and erase units. You could, for example, move the handling of some of these GC details to the OS filesystem level, but then that becomes more complicated and has to deal with the differences between NAND technology generations. You couldn't just copy a file system image from one drive to another, for example.

> You could for example move the handling of some of these GC details to the OS filesystem level

Isn't this basically the Linux 'discard' mount flag? Most distros seem to be recommending periodic fstrim for consumer uses, what's the best practice in the data center?

Trim is kind of like the free counterpart to malloc. When the drive has been fully written, there is a lot less free space, which constrains the GC and makes it work much harder. Trim tells the drive to free up space, allowing for less GC work.
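
A toy simulation can make the GC and trim points above concrete. This is only a sketch in Python: the geometry (4 pages per erase block, 16 blocks), the greedy victim choice, and the names ToyFTL and write_amplification are all assumptions made for illustration, not how any real controller works. It counts how many flash page writes the drive performs per host write, with and without the host trimming data it has deleted.

```python
import random

PAGES_PER_BLOCK = 4  # assumed toy geometry: write unit = page, erase unit = block


class ToyFTL:
    """Toy log-structured FTL; illustrative only, not any vendor's design."""

    def __init__(self, num_blocks):
        # Each block lists the LBA held by each programmed page (None = stale).
        self.blocks = [[] for _ in range(num_blocks)]
        self.map = {}                       # LBA -> (block, page index)
        self.active = 0                     # block currently being filled
        self.free = list(range(1, num_blocks))
        self.host_writes = 0
        self.flash_writes = 0

    def _invalidate(self, lba):
        if lba in self.map:
            block, page = self.map.pop(lba)
            self.blocks[block][page] = None

    def _place(self, lba):
        self.blocks[self.active].append(lba)
        self.map[lba] = (self.active, len(self.blocks[self.active]) - 1)
        self.flash_writes += 1

    def _gc(self):
        # Greedy victim choice: the block with the fewest still-valid pages.
        victim = min(range(len(self.blocks)),
                     key=lambda b: sum(x is not None for x in self.blocks[b]))
        survivors = [x for x in self.blocks[victim] if x is not None]
        if len(survivors) == PAGES_PER_BLOCK:
            raise RuntimeError("drive is full: no stale pages to reclaim")
        self.blocks[victim] = []            # erase the whole block at once
        self.active = victim
        for lba in survivors:               # relocating valid data costs extra writes
            self._place(lba)

    def write(self, lba):
        self.host_writes += 1
        self._invalidate(lba)               # the old copy becomes stale
        if len(self.blocks[self.active]) == PAGES_PER_BLOCK:
            if self.free:
                self.active = self.free.pop(0)
            else:
                self._gc()
        self._place(lba)

    def trim(self, lba):
        self._invalidate(lba)               # host says this LBA no longer matters


def write_amplification(use_trim, seed=0):
    rng = random.Random(seed)
    ftl = ToyFTL(num_blocks=16)             # 64 physical pages
    for lba in range(56):                   # fill most of the drive
        ftl.write(lba)
    if use_trim:
        for lba in range(28, 56):           # host deleted half of its data
            ftl.trim(lba)
    for _ in range(2000):                   # host keeps rewriting the live half
        ftl.write(rng.randrange(28))
    return ftl.flash_writes / ftl.host_writes


if __name__ == "__main__":
    print("write amplification without trim:", round(write_amplification(False), 2))
    print("write amplification with trim:   ", round(write_amplification(True), 2))
```

Without trim, the toy drive has to treat the deleted half as valid data forever, so GC has fewer stale pages to reclaim and does more copying per host write.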

A big difference between client and enterprise drives is the amount of over-provisioning. A simple trick if you can't use trim is to leave 5-10% of the drive unused to improve the effective over-provisioning and improve worst case performance.
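
As a rough worked example of the "leave 5-10% unused" trick, here is a small Python sketch. It uses the common definition of over-provisioning as spare capacity divided by user-visible capacity; the 256/240 figures are hypothetical, and it assumes the unused region has never been written (or has been trimmed), since otherwise the drive cannot know it is free.

```python
def over_provisioning(raw_capacity, user_capacity):
    """Spare flash as a fraction of the capacity the host actually uses."""
    return (raw_capacity - user_capacity) / user_capacity


def effective_over_provisioning(raw_capacity, advertised_capacity, unused_fraction):
    """Over-provisioning when the host leaves part of the advertised space untouched.

    Assumes the untouched region was never written, so the drive sees it as free.
    """
    used_capacity = advertised_capacity * (1.0 - unused_fraction)
    return over_provisioning(raw_capacity, used_capacity)


if __name__ == "__main__":
    raw, advertised = 256, 240  # hypothetical drive, same unit for both numbers
    print("factory OP:          {:.1%}".format(over_provisioning(raw, advertised)))
    print("OP with 10% unused:  {:.1%}".format(
        effective_over_provisioning(raw, advertised, 0.10)))
```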

The question of using trim in data centers might come down to the interaction between trim and RAID configurations. Trim is sometimes implemented as nondeterministic (as a hint to the drive; persistent trim can cause performance drops unless you are willing to throw in extra hardware to optimize it well), which causes parity calculation issues when recovering from a failed drive.
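
To illustrate the parity problem, here is a minimal Python sketch of single-parity (XOR) reconstruction. The scenario is an assumption made up for the example: a two-data-drive stripe where the array trims one chunk and stores parity as if that chunk now reads back as zeros. If the drive instead returns stale data after the trim, rebuilding the other drive's chunk from parity gives the wrong answer.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length chunks, as single-parity RAID does."""
    if len(a) != len(b):
        raise ValueError("chunks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def rebuild_lost_chunk(parity: bytes, surviving: bytes) -> bytes:
    """Recover the chunk of a failed drive from parity and the surviving chunk."""
    return xor_bytes(parity, surviving)


live = b"live data"            # chunk on drive 0, still in use
stale = b"old file!"           # chunk on drive 1, about to be trimmed
zeros = bytes(len(stale))

# Hypothetical array behaviour: after trimming drive 1's chunk, parity is
# stored as if that chunk reads back as all zeros.
parity = xor_bytes(live, zeros)

# Drive 0 fails. What drive 1 returns for the trimmed chunk decides the outcome.
rebuilt_if_zeros = rebuild_lost_chunk(parity, zeros)   # deterministic zeros
rebuilt_if_stale = rebuild_lost_chunk(parity, stale)   # drive returned old data

if __name__ == "__main__":
    print("trimmed chunk reads as zeros:", rebuilt_if_zeros == live)
    print("trimmed chunk reads as stale:", rebuilt_if_stale == live)
```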

Discard operations only tell the drive that an LBA is eligible for GC. It's basically a delete operation that explicitly can be deferred. It does not give the OS any input into when or how GC is done, and it doesn't give the OS any way to observe any details about the GC process.

I think the recommendation of a periodic fstrim of free space is due to filesystems usually not taking the time to issue a large number of discard operations when you delete a bunch of data. Even though discards should be faster than a synchronous erase command, not issuing any command to the drive is faster still.

Until recently, SATA drives didn't have queued trims, so if you did an occasional trim between reads and writes you would have to flush the queue. Queued trims were added later on but have been slow to be adopted, because it can be difficult to get them working quickly, efficiently, and correctly when intermingled with reads and writes. I know at least one recent drive with queued trim had some bugs in the implementation.

Yeah, SATA/AHCI's limitations and bugs have affected filesystem design and defaults, to the detriment of other storage technologies. NVMe for example requires the host to take care of ordering requirements, so basically every command sent to the drive can be queued.

Given the number of IOPS SSDs produce they are a win even if you have to chuck them every 6 months.

...unless you are Fusion-io, in which case, most of these problems don't affect you.

Why? Isn't Fusion-io based on SSDs?

They have a special OS driver which moves the FTL closer to the OS. So one of the things they need is multiple GB of memory from the OS to keep the FTL mapping tables. Also, sudden power loss requires OS intervention to piece the FTL structure back together (I seem to recall the original product taking 5 minutes to recover). This also means you couldn't boot from a Fusion-io drive. I'm not sure if they fixed these issues on a more recent drive.
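
A back-of-the-envelope calculation shows why host-side mapping tables can reach multiple GB. The Python sketch below assumes a flat table with one fixed-size entry per mapping unit; the capacities, mapping-unit sizes, and 4-byte entry size are illustrative guesses, not Fusion-io's actual figures.

```python
def mapping_table_bytes(capacity_bytes, mapping_unit_bytes, entry_bytes):
    """Size of a flat logical-to-physical table with one entry per mapping unit."""
    entries = capacity_bytes // mapping_unit_bytes
    return entries * entry_bytes


if __name__ == "__main__":
    TIB = 2 ** 40
    GIB = 2 ** 30
    # Assumed parameters: 1 TiB of flash, 4-byte entries.
    for unit in (4096, 512):
        size = mapping_table_bytes(TIB, unit, 4)
        print("mapping unit {:>4} B -> table of {} GiB".format(unit, size // GIB))
```

Under these assumptions the table grows in proportion to capacity and shrinks as the mapping unit grows, which is one reason a host-resident FTL has a noticeable RAM cost.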

They use flash like all other SSDs, but don't use a disk controller or any community defined protocols for mass compatibility. They use their own software and an FPGA to control the flash.

Fusion-io (I believe, but please verify online) uses a spinning drive for frequent writes and an SSD for frequent reads, with software deciding what goes where, lessening the write traffic to the SSD and thus the wear on it.

No, Fusion-io has nothing to do with spinning drives. They make PCIe flash drives with an FPGA as the "controller" for the flash. There are multiple communication channels, so you can simultaneously read from and write to the drives, and there are various tunings available to control how garbage collection works. They are the only "SSD" that doesn't hide all the flash behind some kind of legacy disk controller or group protocol like NVMe.

I think what you're trying to describe is Apple's Fusion Drive.

Great thread.
