In other words, assuming good wear leveling, each cell can only be rewritten ~2K times, which is actually quite low endurance... the capacity is high and they're using that to hide how low it is, but if this were e.g. SLC flash with 100K-cycle endurance, you'd be able to write 50x more.
No mention of retention either, which makes me think SSDs today are more for temporary nonvolatile not-quite-RAM storage and not for more permanent applications.
To reach a full rewrite a day, that's roughly 180 MB/s sustained over the span of a 24-hour day. If your load looks like that, you're probably expecting your drives to fail and are ready to replace them.
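For anyone checking the arithmetic (assuming the advertised 15.36 TB capacity and the usual 5-year warranty, both assumptions on my part):

$ python3 -c "print(round(15.36e12 / 86400 / 1e6))"

That prints 178, i.e. ~178 MB/s sustained; and one full drive write per day for 5 years is ~1825 rewrites of every cell, which is where the ~2K endurance figure falls out.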
Seriously, what do you think a guarantee is?
If these O-Rings rupture, we will replace them under warranty.
Ah, so you're thinking of a guarantee like a politician would use the word. "I personally guarantee that..."
Every time I hear them say that on TV I'm thinking, "Really? So, what are you going to give me if it doesn't work out? What's that? Nothing?! Then it's not a guarantee, is it! You might as well have said you pinky-promise, it would be worth just as damn much!"
And then my wife tells me to calm down and stop ranting at the TV.
> If these O-Rings rupture, we will replace them under warranty.
See, that's a guarantee.
(which are only $79.50 on Amazon)
And it will still come out in favor of microSD, which can be read with a tiny connector + tiny controller in a smartphone.
Their EVO drives suffer from read degradation slowly over long periods of time. See the EVO 840: after about a year it was down to 20 MiB/s of continuous read speed. And they call that thing an SSD. They refused warranty claims and released a "Speed up flash warez tool now.exe" as a fix, which did nothing but shuffle the data around on the SSD in the background to keep up the charade that their SSDs don't suffer from a manufacturing/design error.
They marketed it as warranty so-and-so, speed such-and-such, and in fact it's a piece of shit, and Samsung does not stand by its own product.
Even their newest Samsung EVO 850 is marketed as ~500 MB/s write speed, but that is only true for the first 3 GB, due to their use of a "TurboWriteCache", which means it has some kind of DRAM or faster real flash in front; the rest of the disk is below 100 MB/s in write speed.
Perhaps they should either not offer a warranty on their "low-end consumer" drives, or stop lying in their marketing/product specifications?
So that we as low-end consumers can make informed decisions? I bought SanDisk Extreme Pro drives instead; they are very fine disks and did not cost much more than the EVO. I strongly recommend SanDisk.
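If you want to see the cache cliff for yourself, here's a minimal sketch (the mount path and sizes are placeholders; a serious benchmark would use O_DIRECT to take the page cache out of the picture):

    import os, time

    # Write ~10 GiB in 256 MiB chunks and print per-chunk throughput.
    # On a drive with a TurboWrite-style cache, the first few GiB are
    # fast and then throughput falls off a cliff.
    path = "/mnt/testdrive/bench.bin"   # placeholder path
    chunk = b"\0" * (256 * 1024 * 1024)

    with open(path, "wb") as f:
        for i in range(40):
            t0 = time.time()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())        # force the bytes to the device
            mb_s = len(chunk) / (time.time() - t0) / 1e6
            print(f"{(i + 1) * 256:5d} MiB written: {mb_s:7.1f} MB/s")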
It's a perfectly fine tradeoff for a consumer drive. Just because it doesn't fit your use case doesn't make it junk.
Samsung withheld/lied about the write speed, and only mentioned TurboWriteCache in hard-to-find support forums, because they know it matters to a meaningfully large segment of the market.
It's easy to state "about 250 MB/s write speed, up to 500 MB/s burst write speed" in marketing materials; if it didn't impact sales for consumers as you claim, why wouldn't they?
Juniper is nice, ExtremeNetworks, Pluribus, etc. It's not like I have to eat shit from a supplier just because they've labeled it "low end", "consumer" grade and so on.
Is this solely aimed at replacing drives in existing enclosures? Not much good there, as most of them will be SAS6. They'll still work, but not nearly as fast as SAS12 or NVMe.
Similarly, SAS supports multipath for HA, while PCI-E does not appear to offer that ability. If it does, I have yet to hear of hardware implementing it.
Lastly, the idea is likely to appeal to those that care about density. For that purpose, it is better to lower costs, and requiring a multitude of PCI Express slots is not a great way of doing that. Motherboards with more PCI-E slots are expensive due to the extra board layers needed to route their traces. Assuming JBODs for such things exist (using a switch chip to share lanes among multiple devices), they would be expensive and add further cost.
There is also a decent write up on this here:
NVMe scales just fine. As for the livelihood of those selling these storage appliances, well, check out NetApp's stock price lately.
Also, I wouldn't want to run thousands of these SAS drives in today's storage appliances. Even with high performance, the control plane and rebuild/rebalance work would be a nightmare. Trying to imagine a 50-node Isilon cluster with these things... not pretty.
As for being different than virtio because it needs PCI-E lanes, what keeps QEMU from emulating however many are needed?
The easiest problems are the ones that go away if you throw money at them, and NVMe drastically expands the set of such problems. Most small-to-mid-sized companies have DBs in the 10 GB-1 TB range. If you have a single table that's 100 GB in size, you can parse through every single row in just under a minute! This means you can actually use an easy-to-implement O(n) algorithm instead of trying to make O(1) or O(log n) fit your problem. NVMe SSDs are not that advantageous for companies that are built to scale horizontally on AWS. They are amazing when you have a monolithic DB that you can't partition/shard/cluster easily.
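The back-of-the-envelope version, in case anyone wants to check it (the ~2 GB/s sequential read figure is my assumption, roughly what these drives do):

$ python3 -c "print(100e9 / 2e9, 'seconds')"

50 seconds, so "just under a minute" for a full scan of a 100 GB table.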
You're forgetting seek latency. It's orders of magnitude better with SSD, but it's still not necessarily zero. Depending on how the data is laid out and queried you can pay the seek cost per row, which multiplied by the number of rows (100GB+) isn't trivial.
Nearly all of the tick (>4 GB/day) databases I've used aren't laid out row-oriented.
Even in the absence of variable-width fields, the presence of nullable fields causes the majority of database tables to have variable-width rows. In any case, neither of these are reasons why common databases do or do not lay rows out sequentially on disk (some do, some don't).
Even if the DB server selectively read columns of each row (none of the common open-source SQL databases do), it would do so via the OS, which works in terms of pages. Reading a single byte of a page will cause a minimum of 4 KiB of IO to the disk.
Now, unless you're using a DB server that uses O_DIRECT or POSIX_FADV_RANDOM (I just checked, and Postgres doesn't), Linux will aggressively read ahead at least 128 KiB (it's tunable) for any random read by default, so even if you issue a one-byte read to the kernel, device IO will still only occur in a minimum of 128 KiB chunks, with the remainder living in the page cache until userspace requests it.
Database servers additionally are very likely to have their own larger-than-a-byte buffers to avoid system call latency, so the requests they make are never going to be quite so small.
The logic being that in the days of spinning media, evicting 124 KiB of cold page cache in favour of avoiding a seek a few microseconds later was definitely worth it (a seek being a ~14ms stall on rotating disks).
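For what it's worth, the knob in question is reachable from Python too; a minimal sketch (the path is a placeholder, Linux only):

    import os

    # Tell the kernel access will be random, which disables readahead
    # for this file descriptor (the path is a placeholder).
    fd = os.open("/var/lib/db/table.dat", os.O_RDONLY)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)

    # A tiny read now costs one 4 KiB page of device IO instead of a
    # 128 KiB readahead window.
    data = os.pread(fd, 1, 4096)   # read 1 byte at offset 4096
    os.close(fd)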
This is why I said it was high level, but hopefully it illustrated the point. In addition to the disk page size, you also have all the various metadata associated with the file(s). So reading a byte from a page can imply reading even more data than the block size (currently 4 KiB).
> Now, unless you're using a DB server that uses O_DIRECT or POSIX_FADV_RANDOM (I just checked, and Postgres doesn't), Linux will aggressively read ahead at least 128 KiB (it's tunable) for any random read by default, so even if you issue a one-byte read to the kernel, device IO will still only occur in a minimum of 128 KiB chunks, with the remainder living in the page cache until userspace requests it.
AFAIK, Linux only reads ahead if it detects a sequential pattern, or if you specify POSIX_FADV_SEQUENTIAL (which doubles the normal readahead). But as far as the query is concerned, all of the data read that isn't necessary is effectively subtracted from the overall throughput.
I was trying to illustrate the importance of seek latency (~80us vs. ~9-14ms), but yes, there are a myriad of other concerns when you're trying to maximize disk throughput.
Have you tried to buy one in the past year? If you're not in the Fortune 100, fat effing chance! Here's how it goes: Samsung announces an NVMe product, companies beat down their door screaming "take my money!", and Samsung conveniently "cancels" the product for sale. They made it. They just sold their entire production run to a handful of customers. Maybe, just maybe, you can get a couple hundred units if you're willing to wait 3-4 months and someone returns some or changes an order after delivery (and you get the returns).
Even second-tier (non-Intel, non-Samsung) suppliers are sold out. About the only thing you can buy right now is HGST, because no one wanted their stuff in the first place, and they've jacked up their prices in response to other vendors' product shortages.
Yes, NVMe is on fire right now. Everyone wants it. I wouldn't put new tech like this into a 3-4 year old system just out of sunk-cost thinking. NVMe is also not exactly "new": the spec is already at version 1.2 (or 1.3), and Intel has gone through two major NVMe product revisions (with the third out in 3 months).
Also, Samsung isn't exactly breaking 3D NAND ground here. Novachips did it last year, and with an NVMe interface, too.
But, you know, I'm not a Fortune 100 company, and this seems to have a reasonable shipping time:
SFF-8639 (or U.2 as the branding is) is the way forward.
...it would be more like a Redundant Array of Prohibitively Expensive Drives.
(I'll show myself out)
I happen to have a 950 Pro, and normal usage keeps the temperature at reasonable levels (though I stuck heatsinks on mine), but this is much more dense (512 vs 2 on the 950 Pro).
I get about 2 GB/s sustained. It's fun :)
To give a specific example, let's say you have a 10-node Ceph cluster with 10 of these SSDs in each node filled up to 75% capacity. During a rebuild you throttle down to 50 MBps per SSD. After a single drive failure, the remaining 99 SSDs will work together to redistribute the data in about 40 minutes:
$ python3 -c "print(round((15.36 * 1024 * 1024 *.75) / (50.0 * 99) / 60.0, 2))"
The 50 MBps per drive may seem like a low number, but that actually means each node has to move data at roughly 5 Gbps over the cluster network.
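The per-node network math, in the same spirit (dividing the aggregate rebuild traffic across the 10 nodes):

$ python3 -c "print(round(50.0 * 99 * 8 / 10 / 1000, 1))"

That's ~4 Gbps of raw data per node; replication overhead pushes it toward the 5 Gbps figure.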
~4 hours to read/write a whole drive ain't too bad. Of course there might be other bottlenecks in your RAID, or you may only reserve so much speed for rebuild, but the drive's size is not the problem.
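That assumes roughly 1 GB/s sustained, which is my ballpark for these drives:

$ python3 -c "print(round(15.36e12 / 1e9 / 3600, 1), 'hours')"

comes out to ~4.3 hours.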
I would've expected Samsung to be able to best that. They make the chips, after all.
Has the thinking changed?
This is a very good thought and you should be thinking it.
However, it is really only applicable to RAID mirrors. RAID stripes will not have identical wear lifespans.
All of our mirrored boot devices are either:
- current intel enterprise SSD paired with the previous generation intel enterprise SSD
- current intel enterprise SSD paired with current samsung enterprise SSD
That way, if there is a usage-related failure (or some weird firmware bug triggered by a use pattern), they don't seize up identically.
Well, there are different RAID settings. My product is an indexing engine (it indexes Git repositories), and I've personally found RAID 0 reduces indexing time by about 30% compared to a single drive. I have a machine with 4 SSDs running RAID 0, and I've found the performance gain after 3 SSDs is negligible.
It seems like 3 SSDs in RAID 0 is the best combination, given my very limited sample size.
You mean due to exhausting their endurance? This is something you monitor; you should have plenty of time to replace the drives before it becomes a concern.
For other failures, how's it going to be any different from normal HDDs? There's always the risk that having the same models/batches in the same conditions might lead to a cluster of failures, but RAID is still likely to save you from plenty of other failure modes.
For instance, the recent Google study suggests that uncorrectable errors are actually more common with SSDs than with HDDs: "More than 20% of flash drives develop uncorrectable errors in a four year period... significantly higher rates .. than hard disk drives".
I've certainly seen enough IO errors from SSDs to know I want to defend against them. Not to mention the silent data corruption I've seen (and been protected from, yay ZFS).
He's talking about something else. Spinning drives fail for physical reasons, but SSDs can also fail for logical reasons.
A weird firmware bug, or an unexpected usage or wear pattern (not just burning up the entire thing's write lifetime), can cause an SSD to die.
If you have a mirror, and you have two identical SSDs, and you produce such a condition ... then no more mirror. They both die identically.
See my response to the parent about how you can easily address this by mixing SSDs in mirrors.
HDDs also have a long and distinguished history of nasty firmware bugs, and this is only going to get worse as things like drive-managed SMR and hybrid flash caches become more common and their internal complexity ramps up.
Both also fail due to electrical reasons. Chips fail due to manufacturing defects, solder degrades, capacitors dry out, etc.
And I'll reiterate the uncorrected bit error rate. Transient errors are more common with SSDs, and are unlikely to happen identically with multiple units.
> If you have a mirror, and you have two identical SSDs, and you produce such a condition ... then no more mirror. They both die identically
I have a pair of mirrored SanDisk Extreme Pro's in my main server. Both suffered from a firmware bug that caused data corruption. ZFS was able to repair all damage, because they didn't fail identically.
Also thanks to the mirror I was able to upgrade the firmware without taking the server down.
Mixing different SSDs might be a good idea, but you can make much the same argument for doing the same with HDDs, and as with HDDs, it's still better than nothing to have redundancy with "identical" drives.
But I read that like 3-4 years ago. We should have more experience with SSDs in production now, so I was wondering if that thinking still applied.
Not really; you'd expect them to diverge over time, as the cells aren't going to be all 100% identical - they'll have different failures, different error rates, and different read patterns will result in different read-disturb errors, all of which will affect even completely deterministic block allocation.
Regardless, as I said, you monitor endurance. Drives are unlikely to just wear out out of the blue unless you've been completely ignoring their SMART readings, in which case, more fool you.
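A minimal monitoring sketch, assuming smartmontools is installed; the wear attribute's name varies by vendor, so "Wear_Leveling_Count" here is a placeholder:

    import subprocess

    # Dump SMART attributes and pull out the wear counter. The attribute
    # name differs per vendor ("Wear_Leveling_Count" on Samsung,
    # "Media_Wearout_Indicator" on Intel, ...), so adjust as needed.
    out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Wear_Leveling_Count" in line:
            print(line)   # normalized value counts down toward the threshold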
You know where this is going, right?
I was a sysadmin at the time, and one of the things I was responsible for was Oracle Financials. It was at the core of everything that mattered at that company -- and I was absolutely convinced that, short of a water balloon fight in the machine room, my RAID configuration was 100% reliable.
The first disk died late on Friday. I had two on-hand so I casually replaced it -- and before I got back to my desk another had died.
I placed an order for more drives and went back to the machine room and replaced it.
By Monday morning, things were starting to get serious -- I had to drop back to RAID 5 because a few more had died over the weekend and my replacements wouldn't arrive till Tuesday.
You see, all the drives in all our RAID enclosures had come from the same batch -- and they all -- every single one of them -- died within 90 days of the first one.
The chances may have been low, but reality has sharp teeth and loves the taste of overconfident sysadmin ass.
That happens very, very often. It has nothing to do with a bad batch.
The second disk had actually failed a while before, but no one noticed because no one read from that part of it.
When you did the rebuild, you read from the failed area and woke up the failure.
When you set up RAID you MUST read the entire array at least monthly, so that any errors are detected early! This is absolutely critical. Without that, you have not set up the RAID correctly. mdadm on Debian does this by default. Linux has a built-in way to do it, but you must have a tool that will alert you on failure, or it's worthless.
You should also run a weekly full-disk read of each hard disk in the array using its onboard long self-test feature. You can use smartd to schedule that automatically. And more importantly: to notify you on failure.
Not using both tools means you have not set up the RAID correctly.
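Concretely, the built-in way on Linux is the md "sync_action" knob; a minimal sketch (run as root; /dev/md0 is a placeholder):

    # Kick off a full read/verify pass ("scrub") of an md array.
    with open("/sys/block/md0/md/sync_action", "w") as f:
        f.write("check")

    # ...and once the check finishes, see how many mismatches it found:
    with open("/sys/block/md0/md/mismatch_cnt") as f:
        print("mismatches:", f.read().strip())   # should be 0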
Those scrubs do nothing to catch errors that the drives do not report, such as misdirected writes. Consequently, there is no correct way to set up RAID that makes data safe. A checksumming filesystem such as ZFS would handle this without a problem, though.
He made no such claim.
> while you seem to claim that it means an error caught by low level formatting.
A: There is no such thing as low-level formatting in a modern drive, and B: No, I don't. I said he should do a full disk read. Not a format.
The SMART built-in self-test does a full read of the drive, not a write.
> Those scrubs do nothing to catch errors that the drives do not report, such as misdirected writes.
That's only true of RAID 5. Every other RAID level can compare disks and check that the data matches exactly. Linux md software RAID does that automatically if you ask it to check, and it will then report how many mismatches it found.
If you look, he wrote: "I had to drop back to RAID 5". He had a better RAID level before, with multiple-disk redundancy, which allows the RAID to check for mismatches and even correct them.
But because he never scheduled full disk reads, the RAID never detected that many of the drives had problems.
> Consequently, there is no correct way to set up RAID that makes data safe.
That is not correct. The only advice I would give is to avoid RAID 5. The other levels let you check for correctness.
> A checksumming filesystem such as ZFS would handle this without a problem, though.
Only if A: you actually run disk checks, and B: ZFS handles the RAID itself!!! ZFS on top of RAID will NOT detect such errors 50% of the time (randomly, depending on which disk is read from).
Doing a full read causes every sector's ECC in the low-level formatting to be checked. If something is wrong, you get a read error that can be corrected by RAID, ZFS, or whatever else you are running on top of it, provided it has redundancy. Without the ECC, the self-test mechanism would be pointless, as it would have no way to tell whether the magnetic signals being interpreted are right or wrong.
As for other RAID levels catching things: with RAID 1 and only two mirrors, there is no way to tell which copy is right either. The same goes for RAID 10 with two-way mirrors and RAID 0+1 with two-way mirrors. You might be able to tell with RAID 6, but such things are assumed by users rather than guaranteed. RAID was designed around the idea that uncorrectable bit errors and drive failures are the only failure modes. It is incapable of handling silent corruption in general, and in the few cases where it might be able to, whether it does is implementation-dependent. RAID 6 also degrades to RAID 5 when a disk fails, and there is no way for a patrol scrub to catch a problem that occurs after it and before the next patrol scrub. RAID will happily return incorrect data, especially since only one mirror member is read at a given time (for performance) and only the data blocks in RAID 5/6 are read (again, for performance) unless there is a disk failure.
There is no reason to use RAID with ZFS. However, ZFS will always detect silent corruption in what it reads, even if it is on top of RAID. It just is not guaranteed to be able to correct it. Maybe you got the idea that "ZFS on top of RAID will NOT detect such errors 50% of the time" from thinking of a two-disk mirror. If you are using ZFS on RAID instead of letting ZFS have the disks and that happens, you really only have yourself to blame.
1. Install mdadm and configure it to run /usr/share/mdadm/checkarray every month. (The default on Debian.)
2. Have it run as a daemon, constantly monitoring for problems. (Also the default on Debian.)
3. Test that it actually works by setting one of your RAID devices faulty and making sure you get an instant email. A tool that detects a problem but can't tell you is quite useless.
4. Install smartmontools and configure /etc/smartd.conf to run nightly short self-tests and weekly long self-tests. Something like: /dev/sda -a -o on -S on -m firstname.lastname@example.org -s (S/../.././02|L/../../6/03)
5. Test smartmontools by adding -M test to the line above to make sure it is able to contact you.
This way you will find out about problems with the disk before they grow large.
There are other settings for smartmontools to monitor all the SMART attributes and you can tell it when to contact you.
6. Extra credit: install Munin or another system-graphing tool and graph all the SMART attributes. Check it quarterly and look for anomalies. Everything should be flat except Power_On_Hours.
(Bookmarking this entire thread for much RAID wisdom.)
Drives have on-condition monitoring via SMART so you can predict age related failure.
Scheduled replacement is almost the worst maintenance policy.
TRIM is well supported by modern host-based controllers designed for SSDs.
I've got pretty much the same setup (albeit with several more spinning disks) in a bunch of servers and have yet to have any problems (* crosses fingers *).
A first-gen 2.4 GHz i5 can generate XOR parity for RAID 5 at 4 GB/s. Newer generations at faster clock speeds should be able to beat that significantly.
I went and tested this on an Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz:
[1053854.445872] avx : 17860.000 MB/sec
[1053854.475878] raid6: sse2x1 gen() 6554 MB/s
[1053854.492857] raid6: sse2x2 gen() 9070 MB/s
[1053854.509853] raid6: sse2x4 gen() 11273 MB/s
[1053854.526848] raid6: avx2x1 gen() 21554 MB/s
[1053854.543844] raid6: avx2x2 gen() 25152 MB/s
[1053854.560838] raid6: avx2x4 gen() 29167 MB/s
[1053854.560839] raid6: using algorithm avx2x4 gen() (29167 MB/s)
[1053854.560840] raid6: using avx2x2 recovery algorithm
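For reference, the parity those benchmarks are generating is just XOR across the stripe (RAID 6 adds a second, Galois-field syndrome); a toy sketch with numpy:

    import numpy as np

    # Toy RAID 5 parity: the parity block is the XOR of the data blocks,
    # so any single lost block can be rebuilt by XOR-ing the survivors.
    blocks = [np.random.randint(0, 256, 1 << 20, dtype=np.uint8)
              for _ in range(3)]                  # three 1 MiB data blocks
    parity = blocks[0] ^ blocks[1] ^ blocks[2]

    # Simulate losing blocks[1] and rebuilding it:
    rebuilt = blocks[0] ^ blocks[2] ^ parity
    assert np.array_equal(rebuilt, blocks[1])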
How many millions of pixels are there on my two LCD screens? Each one is individually controllable, and none has failed in 8 years.
I disagree, however, that nobody alive today will live "forever". The technology that might allow us to fix aging is on an exponential curve; look, for example, at the cost of genome sequencing. So the time is ripe for breakthrough results in aging research that actually address some causes of aging.
The second thing I got was this: http://www2.technologyreview.com/sens/docs/estepetal.pdf
The tl;dr summary is:
However, given the recent successes and highly emotional nature of life extension research, Aubrey de Grey is not the first, nor will he be the last, to promote a hopelessly insufficient but ably camouflaged pipe-dream to the hopeful many. With this in mind, we hope our list provides a general line of demarcation between increasingly sophisticated life extension pretense, and real science and engineering, so that we can focus honestly on the significant challenges before us.
The improvements have mostly come from violence reduction, public health, and dealing with disease, though. I agree that those things aren't really part of a life extension technology trajectory.
If you solve all of those things, then lifespan asymptotically approaches something like 80 years. That is plain as day. We might come up with ways to increase lifespan, but we haven't yet, and none of the data Kurzweil shows is the result of increases in lifespan for people who die of old age.
"Longevity" is the term in the social sciences for average age at death for only people that die of old age. Kurzweil used that word, he said "In the eighteenth century, we added a few days every year to human longevity; during the nineteenth century we added a couple of weeks each year; and now we’re adding almost a half a year every year."
Life expectancy increased. Longevity did not. It's somewhat common to confuse them, so he could be forgiven for a mistake, except that it wasn't a mistake: he knows the difference, and his graph is evidence. Including numbers from the 1920s and then skipping to 2000 is impossible to do accidentally; if you include 1940-1990, you see that lifespan flatlines. But then he'd have to admit neither life expectancy nor longevity is on an exponential growth curve. They're not linear or even sublinear either. Longevity has always been flat, and lifespan has historical dips and bumps.