Samsung ships the world's highest capacity SSD, with 15TB of storage (computerworld.com)
266 points by elorant on Mar 3, 2016 | 140 comments



The 15.36TB PM1633a drive supports one full drive write per day, which means 15.36TB of data can be written every day on a single drive without failure over its five-year warranty.

In other words, assuming good wear leveling, each cell can only be rewritten ~2K times, which is actually quite low for endurance... the capacity is high and they're using that to hide the fact that it's low, but if this were e.g. SLC flash with 100K endurance, you'd be able to write 50x more.

No mention of retention either, which makes me think SSDs today are more for temporary nonvolatile not-quite-RAM storage and not for more permanent applications.
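
A quick back-of-the-envelope check of that figure (assuming perfect wear leveling and exactly one full drive write per day over the 5-year warranty):

  $ python3 -c "print(1 * 365 * 5)"
  1825

So roughly 1.8K program/erase cycles per cell, which matches the ~2K estimate above.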


SLC and MLC endurance aren't much different: http://hothardware.com/news/google-data-center-ssd-research-...

To reach a full rewrite a day, that's like 200MB/s consistently over the span of a 24 hour day. If your load/usage looks like this, you're probably expecting your drives to fail and are ready to replace them.
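
For reference, the sustained rate implied by one full 15.36TB write per day (a rough estimate, not a spec):

  $ python3 -c "print(round(15.36e12 / 86400 / 1e6, 1), 'MB/s')"
  177.8 MB/s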


That link says they're not more reliable, which is not at all the same thing.


Given how SSDs work, this is actually pretty good reliability. No one is going to be writing the full disk of 15 TB of data every day.


This disk looks just right for fatcache usage.


They're not guaranteeing failure in 5 years, but are pretty sure it won't fail. If "pretty sure" is 98%, then that's 100K (or something like it, probability maths is hard).


They aren't guaranteeing anything. They are saying they will replace it if it fails inside 5 years.


That is a guarantee. That's what "guarantee" means!


But in five years the cost of fulfilling it is low given falling prices.


and that's how companies can afford to offer guarantees.


Outside tech/Moore's Law type companies they offer guarantees by making the products actually durable. A very different model.


No, they offer a warranty. OP used the word guarantee.


A warranty is a guarantee. I used the word "guarantee" as well. It's the right word to use.

Seriously, what do you think a guarantee is?


We guarantee these O-Rings will not rupture.

If these O-Rings rupture, we will replace them under warranty.


> We guarantee these O-Rings will not rupture.

Ah, so you're thinking of a guarantee like a politician would use the word. "I personally guarantee that..."

Every time I hear them say that on TV I'm thinking, "Really? So, what are you going to give me if it doesn't work out? What's that? Nothing?! Then it's not a guarantee, is it! You might as well have said you pinky-promise, it would be worth just as damn much!"

And then my wife tells me to calm down and stop ranting at the TV.

> If these O-Rings rupture, we will replace them under warranty.

See, that's a guarantee.


On Amazon UK, the highest capacity Samsung SSD I can find is 2TB for 620GBP. So I am wondering when this 15TB drive will actually hit stores and what the sweet spot is price-wise between 1 & 15TB.


Samsung is expected to release a 4TB consumer SSD this summer. Might be a better bargain than this enterprise SSD.

http://www.anandtech.com/show/9652/samsung-details-3rd-gen-v...


It's a SAS device which isn't meant for retail sale and is probably only available to select enterprise customers and OEMs.


It's awe inspiring to see fifteen terabytes of solid state storage in that little box.


It's still less dense than 200GB MicroSD cards.

(which are only $79.50 on Amazon)


Does that density include the connectors and buses that would be required to access those cards?


Then you have to include the connectors and buses to access the SSD as well.

And it will still be in favor of microsd, which can be read by a tiny connector + tiny controller in a smartphone.


I think parent was asking if you can fit 75 microSD cards (75 * 200 GB = 15 TB) and the required connections in the form factor of this SSD.


If you're just counting the flash chips on circuit boards, I would think that you could fit many more SD innards and wire them together with some type of controller in this space. I don't doubt however that given the use-case, it has some room allocated for chips going bad, heat dissipation, etc inside. If they could have made a 20TB drive instead for the enterprise market in this form factor, they would have.


Yep, that's exactly what I was asking. Remember also that both solutions have to use the SAS connector to interface with the main board. It's a little disingenuous to say that Samsung could have glued a bunch of SD cards together because it's not really comparing apples to apples.


What's more amazing is that it's still a long way to go compared to biological storage: https://news.ycombinator.com/item?id=4396931


But biological storage also doesn't have the random access characteristics and read/write speeds of this storage.


Does it need them? My memory seems to hold a lot of things and can look them up almost instantly (not counting the decompression delay, AKA tip-of-the-tongue syndrome, where you know you know something but can't recall it until it suddenly comes to you).


If it comes to you. Sometimes biomemory just fails. In fact, it fails a whole lot, but the layer that does the access often just makes up something plausible to cover that fact. :)


It also writes every time it reads.


Your memory is not as accurate as you think it is. Your processor fills in the missing pieces so you think you remember it accurately.


Samsung can try to sell me a 100TB SSD, and I'll decline, thanks.

Their EVO drives suffer from read degradation slowly over long periods of time. See the EVO 840: after about a year it was down to 20MiB/s of continuous read speed. An SSD they call that thing. They refused warranty and released a "Speed up flash warez tool now.exe" as help, which did nothing but move the data around on the SSD in the background so as to keep up the charade that their SSDs don't suffer from a manufacturing/design error.


This is a completely different category of product than their low end consumer Evo drives, though.


It's the same company.

They marketed it as warranty so-and-so, speed such-and-such, and in fact it's a piece of shit, and Samsung does not stand by its own product.

Even their newest Samsung EVO 850 is marketed as ~500MB/s write speed, but that is only true for the first 3GB, due to their use of "TurboWriteCache", which means it has some kind of DRAM or faster real flash in front, and the rest of the disk is below 100MB/s in write speed.

Perhaps they should not offer a warranty on their "low-end consumer" drives, or lie in marketing/product specifications?

So that we as low-end consumers can make informed decisions? I bought a SanDisk Extreme Pro instead; they are very fine disks, and did not cost much more than the EVO. Recommend SanDisk strongly.


It's only good for 3GB bursts. Which is a ton of data to write to a consumer drive at one time. It's also 6GB/12GB for the 500GB/1TB drives.

It's a perfectly fine tradeoff for a consumer drive. Just because it doesn't fit your use case doesn't make it junk.


What you define as a ton of data for a consumer drive is your opinion only; there are many customers who do video editing work and really want/need the disks they buy to fulfill their stated write-speed specifications.

Samsung withheld/lied about the write speed, and only mentioned TurboWriteCache in hard-to-find support forums, because they know it matters for a meaningfully large segment of the market.

It's easy to state "about 250MB/s write speed, up to 500MB/s burst write speed" in marketing materials; if it didn't impact sales for consumers as you state, why wouldn't they?


So you wouldn't use a high end Cisco router because they make the crappy "cisco small business" line?


No I would not. What's so weird about that?

Juniper is nice, ExtremeNetworks, Pluribus etc; it's not like I have to eat shit from a supplier just because they have labeled it "low end", "consumer" grade and so on.


I don't know about routers but I've had very good experiences with Cisco's SMB switches like the SG500.


Their Pro series drives have not been affected by any of the EVO issues to my knowledge.


The Evo Pro drive was the last one standing in a number of SSD burnout tests: http://techreport.com/review/27909/the-ssd-endurance-experim...


What likely happened is that the NAND slowly degrades its contents over time and an error-correction algorithm kicks in, but that can take multiple iterations of reads, which slows the whole thing down. What the controller should be doing is periodically refreshing the data to prevent this from becoming overwhelming.


Why SAS and not NVMe? You can buy all NVMe enclosures/servers now.

Is this solely aimed at replacing drives in existing enclosures? Not much good there as most of them will be SAS6. They'll still work, but not nearly as fast as SAS12 or NVMe.


NVMe likely cannot scale to thousands of attached drives simultaneously due to limitations of PCI-E, while SAS can. Scaling to thousands of attached disks was the motivation for QEMU's virtio-scsi, because virtio-blk's one block device per PCI-E device meant that virtual machines using virtio-blk could not scale past 32 drives. NVMe's one drive per PCI-E device model should suffer from the same limitation.

Similarly, SAS supports multipath for HA while PCI-E does not appear to offer that ability. If it does, then I have yet to hear of an example of hardware implementing it.

Lastly, the idea is likely to appeal to those that care about density. For that purpose, it is better to lower costs and requiring a multitude of PCI express slots is not a great way of doing that. Motherboards with more PCI-E slots are expensive due to the layers needed to accommodate the circuits required by them. Assuming JBODs for such things exist (using a chip to share lanes among multiple devices), they would be expensive and also add to cost.

There is also a decent write up on this here:

http://www.datacenterjournal.com/nvme-connected-ssds/


NVMe doesn't require a PCIe slot. It just requires some lanes. You can use the 2.5" form factor with the same performance. It works really well.

NVMe scales just fine. As for the livelihood of those selling these storage appliances, well, check out NetApp's stock price lately.

Also, I wouldn't want to run thousands of these SAS drives in today's storage appliances. Even with high performance, the control plane and rebuild/rebalance stuff would be a nightmare. Trying to imagine a 50-node Isilon cluster with these things... not pretty.


People are building ZFS systems with thousands of drives.

As for being different than virtio because it needs PCI-E lanes, what keeps QEMU from emulating however many are needed?


You're asking for 2 major tech steppings in a single product. We don't even have many mass market NVMe drives yet, and that aside, they're undoubtedly pushing the limits of their existing controller tech just to handle the capacity. Has NVMe seen much enterprise adoption yet? No surprise they chose an interface to match their intended customers' existing product lines


We put a couple of Intel 750s in our primary DB server and so many of our issues just went away instantly. Reading at 2GB/s is amazing. On a 10Gb network, our network backups now happen at 1+GB/s. Of course we optimize our DB queries as much as possible, but sometimes you just hit a brick wall and can't speed things up because of how the data is structured. Instead of spending 100 developer/dbadmin hours on reorganizing our tables for some query that runs once a week, we put in $2,500 of drives and solved more problems than I imagined.

The easiest problems are the ones that go away if you throw money at them, and NVMe drastically expands the set of such problems. Most small-to-mid-sized companies have DBs in the 10GB-1TB range. If you have a single table that's 100GB in size, you can parse through every single row in just under a minute! This means you can actually use an easy-to-implement O(n) algorithm instead of trying to make O(1) or O(log n) fit your problem. NVMe SSDs are not that advantageous for companies that are built to scale horizontally on AWS. They are amazing when you have a monolithic DB that you can't partition/shard/cluster easily.
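
A rough check of the "under a minute" claim, assuming the table is scanned sequentially at the ~2GB/s read speed quoted above:

  $ python3 -c "print(100 / 2.0, 'seconds')"
  50.0 seconds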


Just wanted to say that this kind of real-world experience is why I read Hacker News.


Me as well. Discovering HN was such a refreshing respite from the thinly-veiled political soapbox that is Reddit.


> If you have a single table that's 100GB in size, you can parse through every single row in just under a minute!

You're forgetting seek latency. It's orders of magnitude better with SSD, but it's still not necessarily zero. Depending on how the data is laid out and queried you can pay the seek cost per row, which multiplied by the number of rows (100GB+) isn't trivial.
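
To put a rough number on that, suppose (purely hypothetically) the 100GB table were fetched as 4KB random page reads at ~80us each, with no queuing or parallelism:

  $ python3 -c "print(round(100e9 / 4096 * 80e-6 / 60, 1), 'minutes')"
  32.6 minutes

versus under a minute for a sequential scan, so layout and access pattern still matter even on flash.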


It's a pretty bad database that can't queue up enough reads to keep the drive constantly active during a full-table scan.


It's a throughput problem, not an activity problem. Normally, since the rows are a fixed size, the DBMS will lay them out sequentially on disk. So when the database reads from disk, unless you read every column in the row, the DBMS has to skip over data. This is a high-level picture, but hopefully it illustrates the point.

Nearly all of the tick (>4GB/day) databases I've used aren't laid out row-oriented.


This is so far off the mark..

Even in the absence of variable-width fields, the presence of nullable fields causes the majority of database tables to have variable-width rows. In any case, neither of these are reasons why common databases do or do not lay rows out sequentially on disk (some do, some don't).

Even if the DB server selectively read columns of each row (none of the common open source SQL databases do), they do so via the OS, which works in terms of pages. Reading a single byte of a page will cause a minimum of 4kb of IO to be made to the disk.

Now, unless you're using a DB server that uses O_DIRECT or POSIX_FADV_RANDOM (I just checked and Postgres doesn't), Linux will aggressively readahead at least (it's tunable) 128kb for any random read by default, so even issuing a one byte read to the kernel, device IO will still only occur in a minimum of 128kb chunks, with the remainder living in the page cache until userspace requests it.

Database servers additionally are very likely to have their own larger-than-a-byte-sized buffers in order to avoid system call latency, so the requests they make are never going to be quite so small.

The logic being that in the days of spinning media, evicting 124kb of cold page cache in favour of avoiding a seek a few microseconds later was definitely worth it (a seek being a ~14ms stall on rotating disks)
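
For anyone curious what the fadvise knob mentioned above looks like in practice, here is a minimal sketch (assuming Linux and Python 3.3+; os.posix_fadvise is a thin wrapper around the same posix_fadvise() call a database written in C would make on its data files):

  import os, tempfile

  # Scratch file so the sketch is self-contained.
  fd, path = tempfile.mkstemp()
  os.write(fd, os.urandom(1 << 20))  # 1 MiB of data
  os.close(fd)

  fd = os.open(path, os.O_RDONLY)
  try:
      # Hint that access will be random; this disables the kernel's default
      # readahead window for this descriptor, so a small read no longer drags
      # in a full 128kb chunk.
      os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
      os.lseek(fd, 4096, os.SEEK_SET)
      data = os.read(fd, 1)  # device I/O still happens in page-sized units
  finally:
      os.close(fd)
      os.unlink(path)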


> Even if the DB server selectively read columns of each row (none of the common open source SQL databases do), they do so via the OS, which works in terms of pages. Reading a single byte of a page will cause a minimum of 4kb of IO to be made to the disk.

This is why I said it was high level, but hopefully illustrated the point. In addition to the disk page size, you also have all the various metadata associated with the file(s). So, reading a byte from a page can imply reading even more data than the block size (4KiB current).

> Now, unless you're using a DB server that uses O_DIRECT or POSIX_FADV_RANDOM (I just checked and Postgres doesn't), Linux will aggressively readahead at least (it's tunable) 128kb for any random read by default, so even issuing a one byte read to the kernel, device IO will still only occur in a minimum of 128kb chunks, with the remainder living in the page cache until userspace requests it.

AFAIK, Linux only reads ahead if it detects a sequential pattern, or if you specify POSIX_FADV_SEQUENTIAL (which doubles the normal readahead window). But as far as the query is concerned, all of the data read that isn't necessary is effectively subtracted from the overall throughput.

I was trying to illustrate the importance of seek latency (~80us vs. ~9-14ms), but yes there are a myriad of other concerns when you're trying to maximize disk throughput.


Not trying to nitpick, but don't most people running MySQL with innodb set O_DIRECT?


It doesn't have to skip over data if that would slow it down. I would expect your typical database to have some kind of index or bitmap that can tell it what to grab fast enough to saturate the disk while avoiding unused data, but if it has to fall back to vacuuming up 1GB at a time then so be it.


If you want to go even faster simply add as much RAM to your machine as your tables will consume when they're all in cache. That too is one of those tricks that makes problems just 'go away', it does still require a periodic flush but that can happen in a totally transparent fashion.


A blog post with some details and graphs would be appreciated by the HN readership.


> Has NVMe seen much enterprise adoption yet?

Have you tried to buy one in the past year? If you're not in the Fortune 100, fat effing chance! Here's how it goes, Samsung announces NVMe product, companies beat down their door screaming "take my money!" and Samsung conveniently "cancels" the product for sale. They made it. They just sold their entire production run to a handful of customers. Maybe, just maybe you can get a couple hundred units if you're willing to wait 3-4 months and someone returns some or changes an order after delivery (and you get the returns).

Even second tier (non Intel, non Samsung) suppliers are sold out. About the only thing you can buy now is HGST because no one wanted their stuff in the first place, and they jacked up their prices in response to other vendors' product shortages.

Yes, NVMe is on fire right now. Everyone wants it. I wouldn't put new tech like this into a 3-4 year old system because of a sunk cost fallacy. NVMe also isn't exactly "new". They're already on version 1.2 (or 1.3) of the spec. Intel has gone through 2 major NVMe product revisions (with the third out in 3 months).

Also, Samsung isn't exactly breaking 3D NAND ground here. Novachips did it last year, and in a NVMe interface, too.


Prices seem to be pretty volatile, indicating some difficulty filling demand.

But, you know, I'm not a fortune 100 company, and this seems to have a reasonable shipping time:

http://www.newegg.com/Product/Product.aspx?Item=N82E16820167...


That's a consumer model - and I'd guess the 2.5in versions are selling faster than the PCIe add-on cards.


Yep, exactly this. You can generally find the add-on cards available, but most don't like them for replacement and quantity in a box reasons. They also don't fit in some of today's high density, multi node server chassis.

SFF-8639 (or U.2 as the branding is) is the way forward.


Impressive. First question that pops into mind is how long the rebuild time would take if one of these failed in a RAID. I can imagine it'll take a while.


It takes 8 hours 20 minutes to write 15TB over 500MBps. Why would you want to stick one of these in a RAID though?

..it would be more like Redundant Array of Prohibitively Expensive Drives.

(I'll show myself out)


article says 1200MBps.. so more like 3.5 hours.
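
For the arithmetic (assuming the drive actually sustains the quoted sequential rate during a rebuild):

  $ python3 -c "print(round(15.36e12 / 1.2e9 / 3600, 2), 'hours')"
  3.56 hours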


I wonder how that 1200MBps will hold up given the thermal throttling seen on their other V-NAND products (SM951/950 Pro), especially when doing a RAID rebuild?

I happen to have a 950 Pro, and normal usage keeps the temperature at reasonable levels (though I stuck heatsinks on mine), but this is much more dense (512 vs 2 on the 950 Pro).


FusionIO drives (PCIe attached flash storage) have serious heat sinks. The heat sinks are not for show.

I get about 2GByte/second sustained. It's fun :)


i remember the times when RAM of this speed was something to brag about... guess i'm getting old.


RAPD is a catchy acronym.


You mean RAPED?


If you use a fully distributed storage system, like Ceph (or 3Par?), then rebuild times during a drive failure would be significantly lower. Also, any self-respecting storage system (RAID or not) will lower rebuild priority over normal IO.

To give a specific example, let's say you have a 10-node Ceph cluster with 10 of these SSDs in each node filled up to 75% capacity. During a rebuild you throttle down to 50 MBps per SSD. After a single drive failure, the remaining 99 SSDs will work together to redistribute the data in about 40 minutes:

  $ python3 -c "print(round((15.36 * 1024 * 1024 *.75) / (50.0 * 99) / 60.0, 2))"
  40.67

The 50 MBps per drive may seem like a low number, but that actually means each node has to move data at roughly 5 Gbps over the cluster network.

* Edited to fix the python line where the multiplication characters converted to italics formatting.


Isn't 50MBps closer to 0.5Gbps than 5Gbps? Where does the other factor of ten come from?


> 15.36TB
> sequential read and write speeds of up to 1200MBps

~4 hours to read/write a whole drive ain't too bad. Of course there might be other bottlenecks in your raid, or you only reserve so much speed for rebuild, but the drive's size is not the problem.


I'm kind of shocked that this drive is so slow. That's an impressive benchmark compared to your run-of-the-mill consumer drives, but the new PCIe drives Apple has been shipping are already beyond 2000MB/s.

http://blog.macsales.com/30725-owc-tests-speed-of-ssd-in-201...

I would've expected Samsung to be able to best that. They make the chips, after all.


I thought RAID and SSD were a terrible combination. As if both drives do exactly the same writes at the same time, chances are they will fail at the same time. Plus most RAID controllers don't support TRIM.

Has the thinking changed?


"I thought RAID and SSD were a terrible combination. As if both drives do exactly the same writes at the same time, chances are they will fail at the same time."

This is a very good thought and you should be thinking it.

However, it is really only applicable to raid mirrors. Raid stripes will not have identical wear lifespans.

All of our mirrored boot devices are either:

- current intel enterprise SSD paired with the previous generation intel enterprise SSD

or:

- current intel enterprise SSD paired with current samsung enterprise SSD

That way if there is a usage related failure (or some weird firmware bug triggered by use pattern) they don't seize up identically.


> I thought RAID and SSD were a terrible combination

Well there are different RAID settings. My product is an indexing engine (indexes Git repositories) and I've personally found RAID 0 reduces indexing time by about 30% compared to a single drive. I have a machine that has 4 SSDs running RAID 0 and I've found the performance gain after 3 SSDs is negligible.

It seems like 3 SSDs running RAID 0 is the best combination given my very limited sampling size.


An SSD is implemented a little like a RAID array internally. Higher capacity SSDs have more flash chips and tend to be faster. So one would expect 2 x 128GB drives in RAID 0 to perform similarly to a single 256GB drive.


Wouldn't being able to offload to multiple SSD controllers potentially help too if that's the bottleneck?


It would - you'd have the aggregate bandwidth of the multiple controllers to use. Similarly, RAID-1 is better for random reads than RAID-0.


> As if both drives do exactly the same writes at the same time, chances are they will fail at the same time.

You mean due to exhausting their endurance? This is something you monitor, you should have plenty of time to replace the drives before it becomes a concern.

For other failures, how's it going to be any different from normal HD's? There's always the risk that having the same models/batches in the same conditions might lead to a cluster of failures, but RAID's still likely to save you from plenty of other failure modes.

For instance, the recent Google study[1] suggests that uncorrectable errors with SSDs are actually more common than HD's: "More than 20% of flash drives develop uncorrectable errors in a four year period... significantly higher rates .. than hard disk drives".

I've certainly seen enough IO errors from SSDs to know I want to defend against them. Not to mention the silent data corruption I've seen (and been protected from, yay ZFS).

1: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f...


"You mean due to exhausting their endurance? This is something you monitor, you should have plenty of time to replace the drives before it becomes a concern. For other failures, how's it going to be any different from normal HD's? There's always the risk that having the same models/batches in the same conditions might lead to a cluster of failures, but RAID's still likely to save you from plenty of other failure modes."

He's talking about something else. Physical drives fail due to physical reasons, but SSDs fail for logical reasons.

A weird firmware bug or an unexpected usage pattern or wear pattern (not just burning the entire things lifetime up) can cause the SSD to die.

If you have a mirror, and you have two identical SSDs, and you produce such a condition ... then no more mirror. They both die identically.

See my response to the parent about how you can easily address this by mixing SSDs in mirrors.


> Physical drives fail due to physical reasons, but SSDs fail for logical reasons.

HDD's also have a long and distinguished history of nasty firmware bugs, and this is only going to get worse as things like drive-managed SMR and hybrid flash caches get more common and their internal complexity ramps up.

Both also fail due to electrical reasons. Chips fail due to manufacturing defects, solder degrades, capacitors dry out, etc.

And I'll reiterate the uncorrected bit error rate. Transient errors are more common with SSDs, and are unlikely to happen identically with multiple units.

> If you have a mirror, and you have two identical SSDs, and you produce such a condition ... then no more mirror. They both die identically

I have a pair of mirrored SanDisk Extreme Pro's in my main server. Both suffered from a firmware bug that caused data corruption. ZFS was able to repair all damage, because they didn't fail identically.

Also thanks to the mirror I was able to upgrade the firmware without taking the server down.

Mixing different SSDs might be a good idea, but you can make much the same arguments for doing the same with HDD's, and like with HDD's it's still better than nothing to have redundancy with "identical" drives.


I am actually referring to endurance. Irrespective of the RAID mode, I would expect each drive to have exactly the same amount of data written to it at the same time. If all models are identical and have been purchased at the same time, one would expect that the same firmware will allocate the same writes to the same cells (in RAID 1 it will be the same data; in RAID 5 the data will be slightly different, but the size will be the same). So in theory the wear should be identical. Obviously this only applies to a RAID mode where you have parity.

But I read that like 3-4y ago. We should have more experience with SSDs in production now so I was wondering if that thinking still applied.


> If all models are identical and have been purchased at the same time, one would expect that the same firmware will allocate the same writes to the same cells

Not really, you'd expect them to diverge over time as the cells aren't going to be all 100% identical - they'll have different failures, different error rates, and different read patterns will result in different read-disturb errors, all of which will affect even completely deterministic block allocation.

Regardless, as I said, you monitor endurance. Drives are unlikely to just silently wear out out of the blue unless you've been completely ignoring their SMART readings, in which case, more fool you.


Yeah, the statistical chance of the same thing happening on both drives is the same, but the chance of it happening at the same time is pretty low.


That sounds perfectly reasonable, and 20 years ago I'd have said the same thing. It's even fairly likely that I did say it.

You know where this is going, right?

I was a SysAdmin at the time and one of the things I was responsible for was Oracle Financials. It was at the core of everything that mattered at that company -- and I was absolutely convinced that short of a water balloon fight in the machine room, my raid configuration was 100% reliable.

The first disk died late on Friday. I had two on-hand so I casually replaced it -- and before I got back to my desk another had died.

I placed an order for more drives and went back to the machine room and replaced it.

By Monday morning, things were starting to get serious -- I had to drop back to raid 5 because a few more had died over the weekend and my replacements wouldn't arrive till Tuesday.

You see, all the drives in all our raid enclosures had come from the same batch -- and they all -- every single one of them -- died within 90 days of the first one.

The chances may have been low, but reality has sharp teeth and loves the taste of overconfident sysadmin ass.


> The first disk died late on Friday. I had two on-hand so I casually replaced it -- and before I got back to my desk another had died.

That happens very very very often. It has nothing to do with a bad batch.

The second disk had actually failed a while earlier, but no one noticed because no one read from that part of it.

When you did the rebuild you read from the failed area and woke up the failure.

When you set up raid you MUST read the entire raid at least monthly, so that any errors are detected early! This is absolutely critical. Without that you have not installed the raid correctly. mdadm on debian does that by default. Linux has a builtin way to do that, but you must have a tool that will alert you on failure, or it's worthless.

You should also run a weekly full disk read of each hard disk in the array using its onboard long self test feature. You can use smartd to schedule that automatically. And more importantly: notify you on failure.

Not using both tools is not setting up the raid correctly.


I think you have different definitions of failure. The previous person seems to claim it meant the disk was dead (i.e. no read works) while you seem to claim that it means an error caught by low level formatting.

Those scrubs do nothing to catch errors that the drives do not report, such as misdirected writes. Consequently, there is no correct way to set up RAID in a way that makes data safe. A checksumming filesystem such as ZFS would handle this without a problem though.


> The previous person seems to claim it meant the disk was dead (i.e. no read works)

He made no such claim.

> while you seem to claim that it means an error caught by low level formatting.

A: There is no such thing as low level formatting in a modern drive, and B: No I don't. I said he should do a full disk read. Not format.

The SMART built in self-test does a full read of the drive, not write.

> Those scrubs do nothing to catch errors that the drives do not report, such as misdirected writes.

That's only true with RAID 5. Every other RAID level can compare disks and check that the data matches exactly. The Linux md software RAID does that automatically if you ask it to check things, and then it will report how many mismatches it found.

If you look he wrote: "I had to drop back to raid 5". He had a better level of RAID before, with multiple disk redundancy, that allows the RAID to check for mismatches and even correct them.

But because he never scheduled full disk reads the RAID never detected that many of the drives had problems.

> Consequently, there is no correct way to set up RAID in a way that makes data safe.

That is not correct. The only advice I would give is avoid RAID 5. The other levels let you check for correctness.

> A checksumming filesystem such as ZFS would handle this without a problem though.

Only if A: you actually run disk checks, B: and only if ZFS handles the RAID!!! ZFS on top of RAID will NOT detect such errors 50% of the time (randomly depending on which disk is read from).


Low level formatting exists on modern disks. You just are not able to reformat it. There is a diagram showing it here:

http://www.anandtech.com/show/2888

Doing a full read causes every sector's ECC in the low level formatting to be checked. If something is wrong, you get a read error that can be corrected by RAID, ZFS or whatever else you are running on top of it provides redundancy. Without the ECC, the self test mechanism would be pointless as it would have no way to tell if the magnetic signals being interpreted are right or wrong.

As for other RAID levels catching things. With RAID 1 and only two mirrors, there is no way to tell which is right either. The same goes for RAID 10 with two mirrors and RAID 0+1 with two mirrors. You might be able to tell with RAID 6, but such things are assumed by users rather than guaranteed. RAID was designed around the idea that uncorrectable bit errors and drive failures are the only failure states. It is incapable of handling silent corruption in general and in the few cases where it might be able to handle it, whether it does is implementation dependent. RAID 6 also degrades to RAID 5 when a disk fails and there is no way for a patrol scrub to catch a problem that occurs after it and before the next patrol scrub. RAID will happily return incorrect data, especially since only 1 mirror member is read at a given time (for performance) and only the data blocks in RAID 5/6 are read (again for performance) unless there is a disk failure.

There is no reason to use RAID with ZFS. However, ZFS will always detect silent corruption in what it reads even if it is on top of RAID. It just is not guaranteed to be able to correct it. Maybe you got the idea of "ZFS on top of RAID will NOT detect such errors 50% of the time" from thinking of a two-disk mirror. If you are using ZFS on RAID instead of letting ZFS have the disks and that happens, you really only have yourself to blame.


Is there a well known tool to check for this?


If you are using software raid on linux then:

1. install mdadm and configure it to run /usr/share/mdadm/checkarray every month. (The default on debian.)

2. have it run as a daemon constantly monitoring for problems. (Also the default on debian.)

3. test that it actually works by setting one of your raid devices faulty and making sure you get an instant email. A tool that detects a problem and can't tell you is quite useless.

4. install smartmontools and configure /etc/smartd.conf to run nightly short self tests, and weekly long self tests. Something like: /dev/sda -a -o on -S on -m email@domain.com -s (S/../.././02|L/../../6/03)

5. do a test of smartmontools by adding -M test to the line above to make sure it is able to contact you

This way you will find out about problems with the disk before they grow large.

There are other settings for smartmontools to monitor all the SMART attributes and you can tell it when to contact you.

6. Extra credit: Install munin or other system graphing tools and graph all the SMART attributes. Check it quarterly and look for anomalies. Everything should be flat except for Power_On_Hours.


For btrfs, you'll want to schedule a scrub on a regular basis. This will also detect read errors and try to fix them.
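
For example (the mount point is just a placeholder; the scrub subcommand comes with btrfs-progs):

  # start a scrub of the filesystem mounted at /data
  btrfs scrub start /data
  # check progress and results, including corrected/uncorrectable error counts
  btrfs scrub status /data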


If you use 3ware/LSI/Avago (now all Broadcom) you install the Megaraid tools and it has a scheduler that does a patrol read whenever you schedule it to. ReadyNAS devices have a similar setting that will check the disks.


I wonder, for such mission critical data, whether a good strategy would be to start rotating out old drives at set periods. Maybe replace one drive of your RAID5 every 6 months, no matter what the health. Once you've replaced all the drives, they will all be staggered in age by 6 months. Hopefully then the chance of multiple failures is greatly reduced.


Replacing a drive in a RAID5 would mean voluntarily placing yourself in a high risk position every 6 months. If you're going to do that, you better make sure you use a raid level that gives you at least two drive redundancy.


In the GGP post he mentions being "down to just RAID5", so presumably he was at RAID6.

(Bookmarking this entire thread for much RAID wisdom.)


That would make your failure rate higher (from increased infant mortality) and your maintenance more expensive (more spares, more transaction costs, more work, replacing perfectly fine working parts too soon).

Drives have on-condition monitoring via SMART so you can predict age related failure.

Scheduled replacement is almost the worst maintenance policy.


For mission critical data, you should have backups and use at least triple mirrors or raidz2. Quadruple mirrors or raidz3 are even better for the extra paranoid.


That might have been the case with HDDs, and the rationale for using RAID. But I thought that with SSDs, if you have drives in RAID 1 or RAID 5, each drive makes the same amount of writes at the same time, and if they are the same model/age, the firmware will allocate the writes to the same cells. And you would end up with exactly the same wear on all cells, and therefore all drives failing simultaneously.


Software RAID or a host-based controller is the way to go with SSDs. Otherwise the raid controller becomes your bottleneck as whatever onboard ASIC or RAID-on-a-chip hits its throughput limit (and potentially overheats and dies if it's some crappy product designed for HDDs).

TRIM is well supported by modern host-based controllers designed for SSDs.


I just (like, about the time you were writing your comment) replaced a pair of spinning disks with a pair of 120 GB SSDs in one of my machines here at home. It's not technically RAID but a ZFS mirror instead. The operating system and related files will live on it while my data still lives on some spinning disks (simply because of the amount of data).

I've got pretty much the same setup (albeit with several more spinning disks) in a bunch of servers and have yet to have any problems (* crosses fingers *).


A Raid 5 could work, since you are not writing the same data. You would need a hardware coprocessor to handle generating the checksums at 4GB/s though.


>You would need a hardware coprocessor to handle generating the checksums at 4GB/s though.

A first gen i5 2.4GHz can generate XOR for RAID5 at 4GB/s. Newer generations at faster core speeds should be able to beat that significantly.

...

I went and tested this on an Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz:

  [1053854.445872] avx : 17860.000 MB/sec
  [1053854.475878] raid6: sse2x1 gen() 6554 MB/s
  [1053854.492857] raid6: sse2x2 gen() 9070 MB/s
  [1053854.509853] raid6: sse2x4 gen() 11273 MB/s
  [1053854.526848] raid6: avx2x1 gen() 21554 MB/s
  [1053854.543844] raid6: avx2x2 gen() 25152 MB/s
  [1053854.560838] raid6: avx2x4 gen() 29167 MB/s
  [1053854.560839] raid6: using algorithm avx2x4 gen() (29167 MB/s)
  [1053854.560840] raid6: using avx2x2 recovery algorithm


RAID 5 is terrible with 4K random IO. I know guys who measured >1 million IOPS with MD RAID 0 and then switched to MD RAID 5 only to have performance drop from read-modify-write overhead. They could not believe that RAID 5 was the cause.


They'll have the same lifespan, but they won't fail at exactly the same time. Also, a lot of RAID now is done in software, and most software raid types support TRIM.
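
For what it's worth, on Linux you can check and exercise TRIM through the filesystem layer regardless of the RAID setup (the device name here is just a placeholder):

  # non-zero DISC-GRAN/DISC-MAX columns mean discard is supported down the stack
  lsblk --discard /dev/md0
  # trim free space on all mounted filesystems that support it
  fstrim --all --verbose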


Anyone care to take a guess how much the 15TB version will cost?


The projected cost when it was announced was $5000.


10,000 USD?


I don't see any SAS SSD's going for much less than $1,000/TB. Some of them are twice that! And I'm sure the density will let it command a premium by targeting users who really need more storage in tight spaces. My guess is $50,000.


On the other hand, I still remember when hard drives (the spinning platter ones) reached the $1/GB mark ~10 years ago. Now high-end SSDs are there. A billion bytes of storage costing $1 is pretty amazing, I think.


Hell, I remember reading PC Mag and being SHOCKED about a 4GB HD that cost a few hundred bucks. (Also shocked that I subscribed to PC Mag.)


I was shocked when 200GB drives were called drivezilla and cost $400.


And I was shocked when Seagate announced a 8TB HDD for $200..


I'm shocked every time I open Newegg. Having hardware with at least 256MB of RAM, 100MHz clock speed CPUs, and 10GB hard drives that works with such high reliability is amazing. You have to think, even a single screwed-up instruction or corrupted bit could bring down the OS.

How many millions of pixels are on my two LCD screens? They are each individually controllable and none has failed in 8 years.


Our brains think linearly; technology grows exponentially.

http://www.kurzweilai.net/the-law-of-accelerating-returns


Kurzweil also claims, as the capstone final paragraph in this piece, that human life expectancy is on an exponential curve, which is false. Not just a little wrong, it's all the way wrong, and intentionally misleading, there's no possible way he doesn't know it's false. It's very easy to check the basic facts: https://en.m.wikipedia.org/wiki/Life_expectancy. Lifespan for people who live past 20 (discounting infant mortality) has been around 70 for at least a thousand years. It's gone up a little in developed countries recently, but he intentionally uses only a small window of time, and includes infant mortality in his stats. If we eliminate all infant mortality, still nobody will live forever, and average lifespan will not be on an exponential curve. This intentionally misleading falsehood alone makes me doubt all the rest of his claims about exponential and double exponential trends. (And I don't doubt Moore, only the additional unsupported implications and loose associations that Kurzweil adds to Moore's law.)


I agree that the claim that lifespan is on an exponential curve is bogus. Modern medicine basically extended extremal lifespan up to the maximum attainable without fixing the underlying causes of aging and then stopped. There is for example some research that shows that the oldest woman's blood was derived from just two stem cells.[1] We currently don't have the technology to fix problems like that.

I disagree however that nobody alive today will live "forever". The technology that might allow us to fix aging is on an exponential curve, look for example at the cost of genome sequencing. So the time is ripe for some breakthrough results in aging research that actually address some causes of aging.

[1] https://www.newscientist.com/article/dn25458-blood-of-worlds...


The tech you're talking about can't do anything for the living, it only has potential for the unborn. Either way, there is exactly zero evidence that it's already happening. Kurzweil's claim is pure fantasy.


This discussion is becoming way too off-topic here. But I was convinced by Aubrey de Grey's arguments; perhaps you will be too. He is easy enough to Google.


So I took your advice and did Google Aubrey; I got his WP page first. It says he's into and wants to fund cryonics. I'm not a biologist, but that tickles my spidey sense. Any remaining hope I had for cryonics was firmly buried after listening to this: http://www.thisamericanlife.org/radio-archives/episode/354/m...

The second thing I got was this: http://www2.technologyreview.com/sens/docs/estepetal.pdf

The tl;dr summary is:

However, given the recent successes and highly emotional nature of life extension research, Aubrey de Grey is not the first, nor will he be the last, to promote a hopelessly insufficient but ably camouflaged pipe-dream to the hopeful many. With this in mind, we hope our list provides a general line of demarcation between increasingly sophisticated life extension pretense, and real science and engineering, so that we can focus honestly on the significant challenges before us.


I think you are understating the improvements in adult life expectancy. 1000 years ago, the very wealthy (the aristocrats) were getting to their 60s. The population wasn't.

The improvements have mostly come from violence reduction and public health and dealing with disease though, I agree that those things aren't really part of a life extension technology trajectory.


I think you misunderstand my point. If you take away all early mortality factors, lifespan is long, and always has been. Separating the bourgeoisie is actually an important piece of this argument, and it reinforces what I'm saying. Healthy adults that don't die of diseases or violence often lived as long as healthy adults today. Aristocrats a thousand years ago on average lived to 65, but bunches of them made it to 80 and beyond. The same factors you cited lowered the lifespan a little for the entire population, even the adults. If I had data on people that only died of old age, I would use that instead, and that would (and probably still does) favor the rich.

If you solve all of these things, then lifespan is asymptotically something like 80 years. That is plain as day. We might come up with ways to increase lifespan, but we haven't yet, and none of the data Kurzweil shows are a result of increases in lifespan for people who die of old age.


I just didn't read carefully enough (missed that you were mostly speaking of potential life span).


All good, my beef is with Kurzweil's unscientific intentional misuse of data.

"Longevity" is the term in the social sciences for average age at death for only people that die of old age. Kurzweil used that word, he said "In the eighteenth century, we added a few days every year to human longevity; during the nineteenth century we added a couple of weeks each year; and now we’re adding almost a half a year every year."

Life expectancy increased. Longevity did not. It's somewhat common to confuse them, so he could be forgiven for a mistake, except that it wasn't a mistake, he knows the difference, and his graph is evidence. That he included numbers from the 1920s and then skipped to 2000, is impossible to do accidentally, and if you include 1940-1990, you'd see lifespan flatlines. But then he'd have to admit neither life expectancy nor longevity are on exponential growth curves. They're not linear or even sublinear either. Longevity has always been flat, and lifespan has historical dips and bumps.


And I regret not linking to WP on longevity earlier, it directly contradicts Kurzweil's statements. https://en.m.wikipedia.org/wiki/Longevity


About 20 years ago, I worked on a project replacing/rewriting servers and their management software for a manufacturer. They had an already aging *nix system (circa 1990). Story went that not long before we came in for the project they had to replace a HDD that was out of production to the tune of about $50k.


Samsung's 2TB SSD is $860 so maybe something like that.

http://www.amazon.com/Samsung-2-5-Inch-SATA-Internal-MZ-7KE2...


That's a consumer SSD. The 16TB looks like an enterprise SSD (much higher endurance).


You could extrapolate from the PM863 series which seems to be the closest to this: http://www.amazon.com/Samsung-Pm863-Internal-Solid-State/dp/...



Well, I doubt this study will make Samsung charge less for enterprise drives


I hope I can get a 1TB SSD for ~$100 in the near future. I'd love to replace all my HDD.


And a 2015 MacBook Pro still ships with 128GB.


Not sure what you're talking about; mine shipped with 1TB, and my work has some with 500GB and 1TB as well.


The maximum on the Apple site is 512GB standard, but you can upgrade to 1TB for 500 dollars extra.


If you cheap out and buy the lowest priced one, yes.


Bye bye mechanical ...



