The 4x speed increase and lower CPU overhead mean it is now possible to move RAM-only applications (for instance, in-memory databases) to SSDs, keeping only the indexes in memory. Yeah, we've been going that way for a while; it just seems we've come a long way from the expensive Sun E6500s I was working with just over a decade ago.
I was talking about these:
Note, though, it looks like if you are willing to pay for a U.2-connected drive, you can get 'em with the giant heatsinks you want:
available; not super cheap, but I'm not sure you'd want the super cheap consumer grade stuff in a server, anyhow.
but I object to the idea of putting SSDs on PCIe cards for any but disposable "cloud"-type servers (unless they are massively more reliable than any that I've seen, which I don't think is the case here). With a U.2-connected drive in a U.2 backplane, I can swap a bad drive like I would swap a bad SATA drive: an alert goes off, I head to the co-lo as soon as convenient, and I swap the drive without disturbing users. With a low-profile PCIe card, I've pretty much got to shut down the server, de-rack it, and then make the swap, which causes downtime that must be scheduled, even if I have enough redundancy that there isn't any data loss.
M.2 has almost no place in the server market. U.2 does and will for the foreseeable future, but I'm not sure that it can serve the high-performance segment for long. It's not clear whether it will reach the limits on capacity, heat, or link speed first, but all of those limits are clearly much closer than for add-in cards.
No argument on M.2 - it's a consumer-grade technology. No doubt someone in the "cloud" space will try it... I mean, if you rely on "ephemeral disk" - well, this is just "ephemeral disk" that goes funny sooner than spinning disk does.
But the problem remains: if your servers aren't disposable, if your servers can't just go away at a moment's notice, the form factor of add-in cards is going to be a problem for you, unless the add-in cards are massively more reliable than I think they are. Taking down a whole server to replace a failed disk is a no-go for most non-cloud applications...
Yes. The same conversation is happening right now in the living room, because security has forced three reboots in the last year, after several years where, simply by being Xen PV guests and using pvgrub (rather than loading a user kernel from the dom0), we weren't vulnerable to the known privilege escalations. This is a lot more labor (and customer pain) when you can't move people.
No progress on that front yet, though.
RAID-0 is used as a way to get faster performance from spinning disk drives, as you can return parts of each read request from different (striped) drives.
You also get better write performance, as your writes are split across the drives.
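To make the striping concrete, here's a toy sketch of RAID-0 address mapping in Python (the 64 KiB stripe unit and two-drive layout are assumptions on my part; real controllers vary):

    # Toy RAID-0 mapping: logical offset -> (drive index, offset on that drive).
    # Hypothetical 64 KiB stripe unit across 2 drives.
    STRIPE_UNIT = 64 * 1024
    NUM_DRIVES = 2

    def raid0_map(logical_offset):
        stripe = logical_offset // STRIPE_UNIT            # which stripe unit
        drive = stripe % NUM_DRIVES                       # round-robin across drives
        offset = (stripe // NUM_DRIVES) * STRIPE_UNIT + logical_offset % STRIPE_UNIT
        return drive, offset

    # A 256 KiB sequential read touches both drives, so they can work in parallel:
    for off in range(0, 256 * 1024, STRIPE_UNIT):
        print(off, raid0_map(off))

Because consecutive stripe units land on alternating drives, a large sequential read (or write) keeps both spindles busy at once, which is where the speedup comes from.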
My recent project, which was relatively small, played back a single video over 4 blended projectors. Each video frame was a 50MB uncompressed TGA file, 30 times a second. On a more complex show, you could be trying to play back multiple video streams simultaneously.
D3 - http://www.d3technologies.com/
and Pandora - http://www.coolux.de/products/pandoras-box-server/
are two of the big players in the industry.
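For scale, the playback case above works out to roughly 1.5 GB/s for a single stream. A back-of-the-envelope check (treating 50 MB as 50 x 10^6 bytes, which is an assumption about how the frames were sized):

    # Rough throughput needed for 50 MB uncompressed TGA frames at 30 fps.
    frame_bytes = 50 * 1000 * 1000
    fps = 30
    gb_per_s = frame_bytes * fps / 1e9
    print(f"single stream: ~{gb_per_s:.1f} GB/s")   # ~1.5 GB/s
    # a show playing N streams simultaneously needs roughly N times that

That's well beyond what a single SATA SSD can sustain, which is presumably why this niche reaches for PCIe/NVMe storage or striped arrays.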
Ouch! Not even a PNG?
I'm not an expert on the exact architecture, but I'm guessing that with uncompressed TGA you just throw the bits at the GPU and they get displayed, while if you have to uncompress images that first gets handled by the CPU (?).
The sort of people who need to know that but don't aren't going to learn it from a post on HN, and the sort of people who'll read it on HN mostly don't need to be told.
It's a pretty ineffective way to guard against data loss; you need backups.
RAID [1-6] is an availability solution, saving the downtime of restoring from backup in the case of the most predictable disk failures. It doesn't help with all the other cases of data loss.
But would this in practice play well with the CPU prefetcher? If you're crunching sequential data can you expect the data in the L1 cache after the initial stall?
Main DDR3 RAM is something like 32 GigaBYTES per second, and L1 cache is faster still.
What I think the poster was talking about, is moving from disk-based databases to SSD-based databases. SSDs are much faster than hard drives.
L1 Cache, L2 Cache, L3 Cache, and main memory are all orders of magnitude faster than even the fastest SSDs today. Thinking about the "CPU prefetcher" when we're talking about SSDs or Hard Drives is almost irrelevant due to the magnitudes of speed difference.
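For a rough sense of scale, here are the classic "latency numbers" figures (approximate, order-of-magnitude values, not measurements from any particular system):

    # Rough, order-of-magnitude access latencies (approximate figures only):
    latency_ns = {
        "L1 cache": 1,
        "L2 cache": 4,
        "main memory": 100,
        "NVMe SSD read": 100_000,          # ~100 microseconds
        "SATA SSD read": 500_000,          # hundreds of microseconds
        "7200 rpm disk seek": 10_000_000,  # ~10 ms
    }
    for tier, ns in latency_ns.items():
        print(f"{tier:>20}: ~{ns:,} ns ({ns / latency_ns['L1 cache']:.0f}x L1)")

Even the fastest SSD is around five orders of magnitude slower than L1, so prefetcher behavior is a rounding error in any workload that actually touches the drive.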
"Finally, the unpredictable latency of SSD-based arrays - often called all-flash arrays - is gaining mind share. The problem: if there are too many writes for an SSD to keep up with, reads have to wait for writes to complete - which can be many milliseconds. Reads taking as long as writes? That's not the performance customers think they are buying."
This is completely false in a properly designed server system. Use the deadline scheduler with SSDs so that reads aren't starved by bulk I/O operations. This is fairly common knowledge. Also, if you're throwing too much I/O load at any storage system, things are going to slow down. This should not be a surprise. SSDs are sorta magical (Artur), but they're not pure magic. They can't fix everything.
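For reference, on Linux the scheduler is picked per-device through sysfs. A minimal sketch (needs root; "sda" is a placeholder for your device, and on newer blk-mq kernels the scheduler is named "mq-deadline" instead of "deadline"):

    # Minimal sketch: select the deadline I/O scheduler for a device via sysfs.
    DEV = "sda"  # placeholder device name
    path = f"/sys/block/{DEV}/queue/scheduler"
    with open(path) as f:
        print("available:", f.read().strip())  # e.g. "noop [deadline] cfq"
    with open(path, "w") as f:
        f.write("deadline")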
While Facebook started out with Fusion-io, they very quickly transitioned to their own home-designed and home-grown flash storage. I'd be wary of using any of their facts or findings and applying them to all flash storage. In short, these things could just be Facebook problems because they decided to go build their own.
He also talks about the "unpredictability of all flash arrays" like the fault is 100% due to the flash. In my experience, it's usually the RAID/proprietary controller doing something unpredictable and wonky. Sometimes the drive and controller do something dumb in concert, but it's usually the controller.
EDIT: It was 2-3 years ago that flash controller designers started to focus on uniform latency and performance rather than concentrating on peak performance. You can see this in the maturation of I/O latency graphs from the various Anandtech reviews.
The variability is an order of magnitude greater, but the worst case is several orders of magnitude better. Quite simply, no one cares whether you might get 10,000 IOPS or 200,000 IOPS from an SSD when all you're going to get from a 15K drive is 500 IOPS.
And the difference between a fast SSD and a slow SSD is pretty big: for the same workload a fast PCIe SSD can show an average latency of 208µs with 846µs standard deviation, while a low-end SATA drive shows average latency of 1782µs and standard deviation of 4155µs (both are recent consumer drives).
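Using just those quoted figures, the relative jitter (stddev over mean) is actually higher on the PCIe drive; it's the absolute numbers that are an order of magnitude better:

    # Relative jitter from the figures quoted above (latencies in microseconds).
    drives = {
        "fast PCIe SSD": (208, 846),
        "low-end SATA SSD": (1782, 4155),
    }
    for name, (mean, stddev) in drives.items():
        print(f"{name}: mean {mean} us, stddev/mean = {stddev / mean:.1f}")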
Tprog (the page program time) is around 1 ms, and Terase (the block erase time) can be upwards of 2 ms.
All in all, this means large variability in read performance, depending on what other actions are hitting the SSD and how well the SSD manages the write and erase operations in the background.
This doesn't even change with the interface (SAS/SATA/PCIe); those add their own queues and link errors, and thus more variability.
Then you have the differences in over-provisioning, which allow high-OP drives to better mask the program and erase operations.
It's true that 99% of your I/Os will see a service time below 1 ms, but it's the other 1% that really matters if you want to avoid late-night calls or even a mid-day crisis.
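A toy illustration of why the mean hides that 1% (entirely made-up numbers, just to show the shape of the problem):

    # Synthetic latency distribution: the mean looks fine, the tail pages you.
    import random

    random.seed(0)
    # 99% of I/Os complete fast; 1% hit a program/erase stall (made-up figures).
    samples = [random.uniform(0.1, 1.0) if random.random() < 0.99
               else random.uniform(10, 50) for _ in range(100_000)]
    samples.sort()

    def pct(p):
        return samples[int(len(samples) * p)]

    print(f"mean: {sum(samples) / len(samples):.2f} ms")
    print(f"p50: {pct(0.50):.2f} ms, p99: {pct(0.99):.2f} ms, p999: {pct(0.999):.2f} ms")

This is why storage folks track p99/p999 latency rather than averages.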
My personal insight, and I think this should be a best practice, is that if you mirror something like an SLOG, you should source two entirely different SSD models - either the newest Intel and the newest Samsung, or perhaps previous-generation Intel and current-generation Intel.
The point is, if you put the two SSDs into operation at the exact same time, they will experience the exact same lifecycle and (in my opinion) could potentially fail exactly simultaneously. There's no "jitter" - they're not failing for physical reasons, they are failing for logical reasons ... and the logic could be identical for both members of the mirror...
Fortunately I had followed accepted practice of mirroring the write cache. (I'd also used dedicated, separate host controllers for each of these write-cache SSDs, but for this cheap experiment that probably didn't help.)
So yes this really happens.
We tried to stagger drive failures by mixing brands. Our Supermicro distributor tried really hard to dissuade us from using mixed batches and brands of SAS drives in our servers. We really had to dig in our heels to get them to listen.
Even when you buy a NAS fully loaded like a Synology it comes with the same brand, model and batch of drives. In one case we saw 6 drive failures in two months for the same Synology NAS.
Wonder whether NetApp or EMC try mixing brands or at least batches on the appliances they ship?
Of course, the SSDs we use are properly vetted for design issues, and bugs in the firmware actually get fixed for us in a relatively timely manner. You get that level of service with the associated large volume.
The day before we requested the 3rd RMA, the vendor put a notice on their support site that using certain features would cause drastically shortened life of the storage drive, and patched the OS in attempt to reduce the amount of writes.
Unfortunately the jump to good flash is quite expensive, and often hard to find in the eMMC form factor which is dominated by low cost parts.
Usually these types of articles are designed to lead people directly to the product that's paying for the article. Sensationalistic is good for this; it gets people to click, it gets people to disagree, and then the controversy spreads the article across the net. Seems like it's working, in this case.
The entry-level SSDs are rated for ~two whole-drive writes per week.
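Converting that into the more usual endurance units, assuming a hypothetical 1 TB drive and a 5-year service life (both numbers are assumptions; adjust for the actual drive):

    # What "~2 whole-drive writes per week" implies for a hypothetical 1 TB drive.
    capacity_tb = 1.0
    writes_per_week = 2
    years = 5
    dwpd = writes_per_week / 7                        # ~0.29 drive writes per day
    tbw = capacity_tb * writes_per_week * 52 * years  # ~520 TB written over 5 years
    print(f"DWPD: {dwpd:.2f}, total endurance: ~{tbw:.0f} TBW")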
Wear gauge, et al.
CPUs and storage exist for completely disjoint purposes, and the fastest CPU in the world can't make up for a slow disk (or vice versa). Anyway, CPUs are still "faster" than SSDs, whatever that means, if you wish to somehow compare apples to oranges. That's why, even with NVMe, if you are dealing with compressible data, enabling block compression in your FS can speed up your IO workflow.
From my observation, most personal and business machines are actually IO-bound - it often takes just the web browser itself - with webdevs pumping out sites filled with superfluous bullshit - to fill up your RAM completely, and then you have swapping back and forth.
As far as swapping, you do want a fast swap device, but it has nothing to do with "Since CPUs aren't getting faster". You're right that it's IO-bound. It's so IO-bound that you could underclock your CPU to 1/4 speed and not even notice.
So in short: Games in theory could use a faster drive to better saturate the CPU, but they're not bigger than RAM so they don't. Swapping is so utterly IO-bound that no matter what you do you cannot help it saturate the CPU.
The statement "Since CPUs aren't getting faster, making storage faster is a big help." is not true. A is not a contributing factor to B.
I know, right?! I would rather like it if more game devs could get the time required to detect that they're running on a machine with 16+GB of RAM and -in the background, with low CPU and IO priority- decode and load all of the game into RAM, rather than just the selected level + incidental data. :)
And by "exceeded" I mean the games were 32-bit binaries, so the RAM left over was enough to cache the entire game data set, even in light of the RAM used by the game itself.
Recently, install sizes seem to have grown quite a bit.
I am open to being wrong on this, but I don't think I am. Can anyone give a plausible explanation why 4TB of NAND storage should cost more to manufacture than a 4TB mechanical hard drive does, given the materials, widespread demand for the component, etc?
The correct thing to compare NAND prices to is other chips that are being fabbed at the same process node, by die area.
Moving the FTL entirely onto the CPU throws compatibility out the window; you can no longer access the drive from more than one operating system, and UEFI counts. You'll also need to frequently re-write the FTL to support new flash interfaces.
I am not sure how well mtd's abstraction fits with modern Flash chips, though.
What's more likely to happen is exposing the low-level storage and moving the FTL software into kernel drivers.
This assertion piqued my interest, given that my hands-on experience with HBase speaks to the contrary. The SanDisk paper they refer to (https://www.usenix.org/system/files/conference/inflow14/infl...) seems to suggest that most of the issues are related to sub-optimal defragmentation by the disk driver itself - more specifically, the fact that some of the defragmentation is unnecessary. Hardly a reason to blame the databases, and it can be addressed down the road. After all, GC in Java is still an evolving subject.
What is the most reliable external hard drive type? I thought SSDs were more reliable than spinning disks, especially to leave plugged in constantly, but now I'm not as sure.
For personal use, SSDs outperform HDDs in just about every aspect; if you can afford the cost, an SSD is the better choice.
And there is nothing mentioned here about downsides of leaving a drive plugged in and powered on at all times.
While all disks can fail, HDDs are less likely to fail completely; usually they just start to develop bad sectors, so you may still be able to recover much of their contents. When an SSD goes, it generally goes completely (at least, so I've read).
So it depends on your needs and how you plan to use the drive. For light use, it probably doesn't matter much either way. For important data, you need to keep it backed up in either case. SSDs use less power, particularly when idle, so if you're running on battery a lot, that would be a consideration as well.
I think what people experience is that sudden death of SSDs doesn't occur more often than it does on HDDs. But with the mechanical issues and the slow buildup of bad sectors gone, sudden death is probably the only visible failure mode left.
(Just my personal opinion.)
Isn't this basically the Linux 'discard' mount flag? Most distros seem to be recommending periodic fstrim for consumer uses, what's the best practice in the data center?
A big difference between client and enterprise drives is the amount of over-provisioning. A simple trick if you can't use trim is to leave 5-10% of the drive unused to improve the effective over-provisioning and improve worst case performance.
The question of using trim in data centers might be due to the interaction between trim and RAID configurations: trim is sometimes implemented as nondeterministic (as a hint to the drive; persistent trim can cause performance drops unless you are willing to throw in extra hardware to optimize it well), which causes parity calculation issues when recovering from a failed drive.
I think the recommendation for periodic fstrim of free space is due to filesystems usually not taking the time to issue a large number of discard operations when you delete a bunch of data. Even though discards should be faster than a synchronous erase command, not issuing any command to the drive is faster still.
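For what it's worth, a periodic trim pass is just a scheduled fstrim -a run. A minimal sketch of what the distro-shipped fstrim.timer units do (requires root and util-linux's fstrim):

    # Sketch of a periodic trim job, the manual equivalent of fstrim.timer.
    import subprocess

    def trim_all_mounted():
        # -a: all mounted filesystems that support discard, -v: report bytes trimmed
        out = subprocess.run(["fstrim", "-av"], capture_output=True, text=True)
        print(out.stdout or out.stderr)

    trim_all_mounted()

Batching it this way amortizes the discard cost and keeps it out of the hot path of normal deletes.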