Hacker News
Everything I Know About SSDs (kcall.co.uk)
352 points by classified on Jan 15, 2020 | 185 comments

archive.is links have not been working for me for a while now. I wasn't sure if something was wrong with the site or what, but seeing this link posted here convinced me to do a bit of digging.

I have news: 1.1.1.1, aka CloudFlare's DNS, resolves archive.is to a bogus address.

Live proof: https://digwebinterface.com/?hostnames=archive.is&useresolve...

Archived proof: http://archive.is/utJfW

Of course the moment I changed to a different resolver in resolv.conf I was able to access the site again.

It's over a disagreement between Cloudflare and the owner of Archive.is regarding the forwarding of EDNS metadata.

See here: https://news.ycombinator.com/item?id=19828317

Yes, there was some discussion here a while ago. CloudFlare says archive.is won't reply to its resolver with correct DNS answers, and the author of archive.is says CloudFlare isn't passing the correct information to his server, and he refuses to change it.

I don't remember details, I'm afraid, but it is an unfortunate situation.

Nice find. archive.is stopped working for me recently and I just assumed it had disappeared. Now I know it's because I happened to change my router's resolver to 1.1.1.1.

Edit: Would be cool if archive.is implemented this comment https://news.ycombinator.com/item?id=19832572 (from a previous HN discussion on this).

1.1.1.1 has been acting up for me recently as well, but I haven't had a chance to investigate. Glad to hear it's not just me.

It’s really archive.is acting up here...

The part about filesystems is slightly incorrect.

> The way the file system handles this is incompatible with the workings of NAND flash.

That's true of most conventional filesystems, but log-structured filesystems are much more flash-friendly. That's why there has been a resurgence of interest in them, and also why a typical flash translation layer bears a striking resemblance to a log-structured FS. There are also flash-specific filesystems.

> to an HDD all sectors are the same.

This is not true because of bad blocks. Every disk has a reserve of blocks that can be remapped in place of a detected bad block, transparently, much like flash blocks are remapped. Beyond that, it's also useful for a disk to know which blocks are not in use so it can return all zeroes without actually hitting the media. There are special flags to force media access and commands to physically zero a block, for the cases where those are needed, but often they're not. Trim/discard actually gets pretty complicated, especially when things like RAID and virtual block devices are involved.

> to an HDD all sectors are the same.

Also, I believe some users (and filesystems?) intentionally stored certain data towards the inside/outside of the HDD because the simple cylinder geometry allowed faster reads in those regions. However, I'm not seeing conclusive proof that modern HDDs show performance variation with respect to radius.

> However, I'm not seeing conclusive proof that modern HDDs show performance variation with respect to radius.

They definitely still do; it's fundamental to drives that run at fixed RPM but maintain high areal density across the entire platter. One of the 1TB drives I have lying around does about 183MB/s at the beginning of the disk, 148MB/s in the middle, and 97MB/s at the end.

I'm really interested in how you benchmarked that? Is there a "simple" way to specify the physical location on an HDD that data should be written to?

Hard drives generally use a fairly simple linear mapping between LBAs and physical location. Low LBAs are on the outer edge of the platter where transfer speeds are highest, and high LBAs are on the inner edge where speeds are lowest. Unlike SSDs, hard drives don't need wear leveling, so there's no reason to break from that pattern except in the relatively rare instance of damaged sectors.

At smaller scales, the layout of individual tracks can vary quite a bit, but that doesn't have as much impact on overall sequential transfer speed. See http://blog.stuffedcow.net/2019/09/hard-disk-geometry-microb...

Yep. I found it interesting enough to bookmark, and now I tend to pull it up when somebody wants to know how hard drives really work.

Low LBAs are outside rather than inside?? I had no idea.

It stands to reason. The low LBAs get filled first and used most. The outside of the disk is where the linear speed is fastest, and it's closest to where the heads park.

Yeah I understood that immediately after I read it, but it's not how typical optical discs work, and I'd just assumed they work the same way until now. Hadn't really thought about it before.

What I find a bit bizarre though that some defragmenters have an option to move files to the end of the drive. Wouldn't you want to move files to the beginning in that case?

Maybe if you're optimizing for write performance of new files (e.g. back in the day when sequential disk performance was more of a bottleneck, you'd want lots of contiguous free fast disk space for something like digitizing video)

To build on the sibling comment, the simple way to benchmark this is with an empty hard drive. Make a series of partitions in order:

1) start of drive partition

2) spacer

3) middle of drive partition

4) spacer

5) end of drive partition

Make 1, 3, and 5 the same size, and then run raw disk benchmarks against them. The usual pattern is that, like a record, the data begins at the edge of the platter, where you'll get higher transfer speeds, and the end of the disk is toward the middle, so it's slower there.

There are also dedicated benchmarking tools that will plot disk performance as a function of LBA (logical block address).

Also worth pointing out that sticking to the front of a spinning disk doesn't just help throughput, it also helps latency as the head doesn't have to move as far.
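A minimal software version of this benchmark can be sketched in Python, assuming a Linux raw block device. The device path and sizes below are illustrative, and reads go through the page cache (O_DIRECT would be more rigorous), so treat the numbers as approximate:

```python
import os
import time

def sample_offsets(disk_size, samples, read_size=8 * 1024 * 1024):
    """Evenly spaced, read_size-aligned byte offsets across the device."""
    span = disk_size - read_size
    return [(span * i // (samples - 1)) // read_size * read_size
            for i in range(samples)]

def benchmark(device, samples=32, read_size=8 * 1024 * 1024):
    """Time one large read at each offset; returns (offset, MB/s) pairs."""
    results = []
    with open(device, "rb", buffering=0) as f:
        size = f.seek(0, os.SEEK_END)       # device size in bytes
        for off in sample_offsets(size, samples, read_size):
            f.seek(off)
            t0 = time.perf_counter()
            f.read(read_size)
            elapsed = time.perf_counter() - t0
            results.append((off, read_size / elapsed / 1e6))
    return results

# usage (needs root):
#   for off, mbps in benchmark("/dev/sda"): print(off, round(mbps, 1))
```

On a healthy drive the MB/s column should fall steadily from the first offset to the last, mirroring the outer-to-inner track layout described above.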

Just open /dev/sda and run a while loop doing an fread and time each one.

You'll end up with a graph like: http://broadley.org/disk/consumer-no-vibration.png

Unless of course you put a consumer disk in a server with many high-RPM fans causing substantial vibration, in which case you get: http://broadley.org/disk/consumer.png

A server/enterprise/RAID edition disk handles the vibration MUCH better, around 3x the bandwidth: http://broadley.org/disk/server.png

> I'm really interested in how you benchmarked that?

You can benchmark disks, including seek time and speed at different points, using the tools "bogodisk" and "bogoseek": https://djwong.org/programs/bogodisk/

HDTune is a very simple benchmark tool that shows it off.


I remember reading a blog post ca. 2005, before SSDs were common, in which the author explained how he had identified all the files his system needed to boot and the order in which they were read from disk, and then contrived to put them all in the correct order in a contiguous block at the start of the drive to eliminate seeks during boot entirely. This apparently shaved a couple seconds off his boot time.

There are automated solutions to this.

Windows has had similar tricks built in (at various levels of cleverness, depending on the Windows variant you run) for a while now: https://en.wikipedia.org/wiki/Prefetcher

No doubt there are similar options for Linux and other OSs.

A similar trick I've seen is using a small SSD as a manual cache: copy in the files that need to be fast and mount it with the larger filesystem (on slower drives) as a unioned filesystem. Though just using block-device-based automatic caching may be easier and safer than rolling your own mess. There are a few options for Linux, though some are not currently maintained (see https://serverfault.com/questions/969302/linux-ssd-as-hdd-ca... amongst other places), and some IO controllers support it directly (even some motherboards have this built in) without needing to bother your OS with the details at all.

This is still very much true, to the point that game consoles will reserve areas of rotating media for different purposes, depending on speed (e.g., disk caches on the Xbox are on the outer tracks).

For optical media the layout of assets can be critical to the user experience. You probably want your startup assets located on the outer, faster tracks.

I/O schedulers would also understand the geometry and reorder commands to minimize seek distance (and therefore seek time).

I used to do some of that a long time ago, and it was actually kind of fun. There's a whole different set of optimizations to do on flash, both for performance and (more importantly in many use cases) power efficiency, and I think that would be fun too.

At some point, though (I think around the mid '90s or early 2000s, but maybe earlier), seek times on widely available drives became fast enough that, on average, a random access spent about as much time waiting for the target sector to rotate under the head as it did seeking to the right cylinder.

You could get some decent gains then if you made your scheduler take rotation into account. A long seek that arrived just before the target sector came under the head could be faster than a short seek that would arrive just after the sector passed the head.

On the other hand, taking rotation into account could make the scheduler quite a bit more complex. You needed a model that could predict seek time well, and you needed to know the angular position of each sector in its cylinder.

I don't think that there were any drives that would tell you this. SCSI drives wouldn't even tell you the geometry. IDE drives would tell you a geometry, but it didn't necessarily have anything to do with the actual geometry of the drive.

At the time I worked at a company that was working on disk performance enhancement software (e.g., drivers with better scheduling, utilities that would log disk accesses and then rearrange data on the disk so that the I/O patterns in the logs would be faster [1], and that sort of thing).

We had a program that could get the real disk geometry. It did so by doing a lot of random I/O and looking at the timing of when the results came back. If there were no disk cache, this would be fairly easy. (Well, it didn't necessarily get the real geometry, but rather a purported geometry and seek and rotational characteristics that could predict I/O time well).

For instance, read some random sector T, then read another random sector, then read T again. Look at the time difference between when you started getting data back on the two reads of T. This should be a multiple of the rotation time.

If the disk has caching that can still work but you need to read a lot of random sectors between the two reads of T to try to get the first read out of the cache.

Anyway, we had to give up on that approach because the program to analyze the disk took a few days of constant I/O to finish. Management decided that most consumers would not put up with such a long setup procedure.
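The T-sector timing trick lends itself to a small analysis routine: every measured T-to-T delta should be close to an integer multiple of the rotation period, so the period can be recovered by testing divisors of the smallest delta. A sketch of that analysis step (the function name and tolerances are my own illustration, not the actual product's code):

```python
def estimate_rotation_period(deltas, max_multiple=8, tolerance=0.02):
    """Find the largest period such that every measured delta is (within
    tolerance) an integer multiple of it.  Candidates are divisors of the
    smallest delta; larger candidate periods are tried first so we don't
    collapse to a spuriously tiny period."""
    base = min(deltas)
    for k in range(1, max_multiple + 1):
        period = base / k
        if all(abs(d - round(d / period) * period) <= tolerance * period
               for d in deltas):
            return period
    return None
```

Feeding it deltas collected from a 7200rpm drive should return roughly 8.33ms, from which the spindle speed falls out as 60 divided by the period.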

[1] Yes, that could mean that it would purposefully make files more fragmented. A fairly common pattern was for a program to open a bunch of data files and read a header from each. E.g., some big GUI programs would do that for a large number of font files. Arranging that program and those data files on disk so that you have the program code that gets loaded before the header reads, then the headers of all the files, and then the rest of the file data, could give you a nice speed boost.

The flaw in this method is that, to use the above example, if another big GUI program also uses those same font files, the layout that makes the first program go fast might suck for the second program. If you've got a computer that you mostly only use for one task, though, it can be a viable approach.

> I don't think that there were any drives that would tell you this

Old MFM/RLL drives would not only tell you this, they'd let you alter it during the low-level format procedure.

There was a parameter called "sector interleave" that would let you deliberately stagger the sector spacing, so it would be like 1,14,2,15,3,16,4,17,5,18,6,19,7,20,8,21,9,22,10,23,11,24,12,25,13,26 or something.

This was because controllers didn't do caching yet, and PIO mode and CPUs of the era were so slow they couldn't necessarily keep up with data coming off at the full rotation rate. If you missed the start of the next sector, you had to wait a whole rev for it to come around again, a nearly-26x slowdown. Whereas a 2:1 interleave would virtually guarantee that you'd be ready in time for the next sector, for only a 2x slowdown. (Really crap machines could even need a 3:1 interleave, the horrors!)
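The staggered ordering can be generated with a simple modular walk around the track. A sketch (my own reconstruction of the layout logic, using the 1-based sector numbering those low-level formats used):

```python
def interleave(sectors, factor):
    """Logical-to-physical layout for an N:1 interleave: place consecutive
    logical sectors `factor` physical slots apart, wrapping around the track
    and stepping forward past any slot already taken."""
    layout = [None] * sectors
    pos = 0
    for logical in range(1, sectors + 1):   # MFM sectors numbered from 1
        while layout[pos] is not None:      # wrap collision: advance one slot
            pos = (pos + 1) % sectors
        layout[pos] = logical
        pos = (pos + factor) % sectors
    return layout
```

For a 26-sector track at 2:1 this reproduces exactly the 1, 14, 2, 15, 3, 16, ... pattern described above, and a 1:1 factor degenerates to the plain sequential layout.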

> I don't think that there were any drives that would tell you this. SCSI drives wouldn't even tell you the geometry. IDE drives would tell you a geometry, but it didn't necessarily have anything to do with the actual geometry of the drive.

I thought the point of native command queuing was precisely to enable the drive itself to make these lower-level scheduling decisions, while the OS scheduler would mostly deal with higher-level, coarser heuristics such as "nearby LBA's should be queued together."

BTW, discovering hard drive physical geometry via benchmarking was extensively discussed in an article that's linked in the sibling subthread. I've linked the HN discussion of that as well.

Native command queuing won't help with placement decisions. (And the NCQ queue is rather short anyway for scheduling optimisation.)

For example, even something as trivial and linear as a database log or filesystem log can benefit from placement optimisation.

Each time there's a transaction to commit, instead of writing the next commit record to the next LBA number in the log, increment the LBA number by an amount that gives a sector that is about to arrive under the disk head at the time the commit was requested. That will leave gaps, but those can be filled by later commits.

That reduces the latency of durable commits to HDD by removing rotational delay.

Command queueing doesn't help with that, although it does help with keeping a sustained throughput of them by pipelining.
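The commit-placement idea above can be sketched as a small helper that, given a model of the track, picks the next free sector that will arrive under the head once command overhead has elapsed. Everything here (the names, the linear angular model, the no-seek assumption) is hypothetical illustration, not how any particular drive or database does it:

```python
def next_commit_lba(now, track_start_lba, sectors_per_track, rotation_period,
                    overhead, used):
    """Pick the free sector on the log track that will next pass under the
    head, given that issuing the write costs `overhead` seconds.  Assumes
    sector i sits at angle i/sectors_per_track of a rotation and the head
    is already on the track (no seek)."""
    ready = now + overhead                       # earliest moment we can write
    phase = (ready % rotation_period) / rotation_period
    first = int(phase * sectors_per_track) + 1   # first sector arriving after `ready`
    for i in range(sectors_per_track):           # scan forward for a free slot
        s = (first + i) % sectors_per_track
        if s not in used:
            return track_start_lba + s
    return None  # track full; gaps get filled by later commits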

> (And the NCQ queue is rather short anyway for scheduling optimisation.)

Isn't it 31 or 32 commands in the queue? That's a worst-case of around a quarter second for a 7200rpm drive, which sounds like an awfully long time horizon to me.
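The quarter-second figure checks out if you assume the worst case of one full rotation per queued command and ignore seek time:

```python
# Worst case for a full NCQ queue: every queued command just misses its
# target sector and waits a full revolution at 7200 rpm.
QUEUE_DEPTH = 32
RPM = 7200
rotation = 60.0 / RPM                 # seconds per revolution: ~8.33 ms
worst_case = QUEUE_DEPTH * rotation   # ~0.27 s, the "quarter second" above
print(f"{rotation * 1e3:.2f} ms/rev -> worst case {worst_case:.2f} s")
```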

That's probably why the queue depth is limited in NCQ. It's not intended to do bulk parallel scheduling, and as you imply, you wouldn't want the drive committed to anything for much longer than that.

But for ideal scheduling, you need something to deal with the short timings as well.

For example, if you have 1024 x 512-byte single-sector randomly arriving reads, of which 512 sectors happen to be in contiguous zone A and 512 sectors happen to be in contiguous zone B, all of those reads together will take about 2 seek times and 2 rotation times.

Assuming the generators of those requests are some intensively parallel workload (so there can be perfect scheduling), which is heavily clustered in the two zones (e.g. two database-like files), my back-of-the-envelope math comes to <30ms for 1024 random access reads in that artificial example, on 7200rpm HDD.
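That back-of-the-envelope math can be reproduced with assumed (not measured) figures: with perfect scheduling, each contiguous zone costs roughly one seek plus one rotation, during which its 512 sectors stream past the head.

```python
# Two contiguous zones, each serviced with one seek + one full rotation;
# the 512 single-sector reads per zone stream by as the platter turns.
# Seek time is an assumed typical value, not a measurement.
RPM = 7200
rotation = 60.0 / RPM    # ~8.33 ms per revolution
seek = 0.005             # assume ~5 ms average seek
zones = 2
total = zones * (seek + rotation)
print(f"~{total * 1e3:.0f} ms for all 1024 reads")
```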

Generally that's what the kernel I/O scheduler is for.

> I thought the point of native command queuing was precisely to enable the drive itself to make these lower-level scheduling decisions

The main purpose of NCQ (and SCSI command queuing which came way before it) was to allow higher levels of parallelism at the drive interface. This does allow the drive to do some smart scheduling, but still only within that fairly small queue depth. Scheduling across larger numbers of requests, with more complicated constraints on ordering, deadlines, etc., remains the OS's job. And once it's doing that, the incremental benefit of those on-disk scheduling smarts becomes pretty small.

People reduce the number of cylinders to keep seek time under control. It just happens that if you are going to choose what cylinders to use, you better choose the inner ones that are slightly faster.

Wouldn’t the outer ones be slightly faster in terms of both transfer rate and head travel time?

> but log-structured filesystems are much more flash-friendly. That's why there has been a resurgence of interest in them

I could swear I read such in 2010 -- did that actually happen in the last ten years?

f2fs (https://en.wikipedia.org/wiki/F2FS) is now popular for Android-based mobile devices.

That counts, thanks!

He's also inaccurate in characterizing all cells as holding "1 or 0"; digital electronics have always applied a cut-off to an underlying analogue value (such as voltage).

They explain this in more detail later on. See "SSD Reads" and "interpreting the results".

I briefly searched for a source of (mostly) FTL-less microSD cards (I'd be fine with them doing wear leveling-related remapping of whole blocks, as long as it's power-loss-safe, but I don't want them to expend any effort on emulating 512/4k random write capability).

I didn't find anything but bunnie's blog entry[0] on hacking/re-flashing the firmware on the card's controller.

Given how important a good FTL is, and how naturally log-structured data formats map onto flash, like those based on RocksDB or InfluxDB (both not unlikely for data handled and stored on RPi-like SBCs), it'd be much better for reliability and performance to let these LSM engines deal with NAND flash's block-erase/sector-write/sector-read behavior directly.

[0]: https://www.bunniestudios.com/blog/?p=3554

Things I've learned from using SSDs at prgmr:

Since the firmware is more complicated than hard drives, they are way more likely to brick themselves completely instead of a graceful degradation. Manufacturers can also have nasty firmware bugs like https://www.techpowerup.com/261560/hp-enterprise-ssd-firmwar... . I'd recommend using a mix of SSDs at different lifetimes, and/or different manufacturers, in a RAID configuration.

How different manufacturers deal with running SMART tests under load varies drastically. Samsung tests always take the same amount of time. The length of Intel tests varies depending on load. Micron SMART tests get stuck if they are under constant load. Seagate SMART tests appear to report being at 90% done or complete, but the tests do actually run.

Different SSDs also are more or less tolerant to power changes. Micron SSDs are prone to resetting when a hard disk is inserted in the same backplane power domain, and we have to isolate them accordingly.

Manual overprovisioning is helpful when you aren't able to use TRIM.

What a drive does with secure-erase-enhanced can be different too. Some drives only change the encryption key and then return garbage on read. Some additionally wipe the block mappings so that reads return 0.

>I'd recommend using a mix of SSDs at different lifetimes, and/or different manufacturers, in a RAID configuration.

Oof, that's extremely obvious yet it never crossed my mind. Nice tip!

Have you found any real value to instructing in-service SSDs to run SMART self-tests, vs simply observing and tracking the SMART indicators over time?

It's not as valuable in and of itself as monitoring SMART counters. We've only had a single SSD report failures during a long test, and it also reported an uncorrectable error. However, not finishing the test is a good proxy for if a drive is overloaded and less able to perform routine housekeeping.

"Website is sleeping"

000webhost lives up to its name!

In the meantime:


strange. archive.is is resolving to for me

even if I do

nslookup archive.is
Server: one.one.one.one
Address:

Name: archive.is
Address:

It's some feud between Cloudflare and the archive.is operator. I never really cared enough to figure out who was at fault. Probably both of them.

edit: here you go: https://jarv.is/notes/cloudflare-dns-archive-is-blocked/

Those will stop working some time soon. Long-term mirror: https://web.archive.org/web/20200115163630/http://kcall.co.u...

archive.is is censored by CloudFlare. Use a different DNS resolver.

It's quite the other way around.

There are a few technologies that I’ve tried very earnestly to understand, only to find out that it’s basically black magic and there’s no use in trying to understand it. Those things are modern car transmissions, nuclear reactors, and SSDs.

A very basic nuclear reactor can be explained pretty simply I think. You enrich a bunch of let's say uranium. Pack it together in a rod, and put a bunch of those rods in a pond. Those rods have controlled (ideally) nuclear decay from their being in close proximity to other rods which generates a lot of heat, which is transferred to a separate cooling loop that boils water to make steam which drives an electric turbine.

Now I'm no nuclear scientist so please be forgiving with that description, but that's how I understand them to work :)

I can't even begin to explain how an SSD works, but I know there are no moving parts besides electrons.

edit: moved the "(ideally)"

Ok, I've studied Flash storage (most SSDs these days) technology, and it can be understood like this:

* At the "lowest" level, there's a little cell that is very much like an EEPROM cell (but better, because newer tech). This little cell can hold 1, 2, 3 or 4 bits, depending on gen/tech.

* You group a bunch of those cells together and they form a page. Usually it's 1024 cells a page.

* You group a bunch of pages together and they form a block (don't confuse with "block" as in "block oriented device"). Blocks are usually made of 128 pages.

* You group a bunch (1024 usually) of blocks together and you get a plane.

* You get your massive storage by grouping a lot of planes together. Think of it as small (16-64 MB) storage devices that you connect in a RAID-like manner.

* Operations are restricted because of the technology. On an individual level, cells can only be "programmed", that is, a 1 can be flipped into a 0, but a 0 cannot be made a 1.

* If you need to turn a 0 into a 1, then you must do it on a block level (yep, 128 pages at a time).

* That's where the Flash Translation Layer kicks in: it's a mapping between the (logical) sectors (512b or 4096b) and the underlying mess. The FTL tells you how you form the sectors (which would be the blocks of a "block oriented device", but I'm trying to avoid that word).

* You also have "overprovisioning" at work - that is, if your SSD is 120GB, it's actually 128GB inside, but there's 8GB you don't get access to (not even at the OS level) that the device uses to move things around.

* Wear Leveling/Garbage Collection mechanisms work to prevent individual cells from being used too much. Garbage Collection makes sure (or tries) that there are always enough "ready to program" cells around.

* The firmware makes everything work transparently to the world above it.

That would be a very (very very) simple explanation of how Flash storage works. Things like memory cards and thumb drives usually don't get overprovisioning nor wear leveling.
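The FTL/garbage-collection interplay from the list above can be illustrated with a toy page-mapping FTL. This is a deliberately tiny sketch of my own: the sizes are fake (real NAND uses ~16KB pages and multi-MB erase blocks, per the correction below), and real FTLs also handle wear leveling, ECC, power-loss recovery, and overprovisioning accounting:

```python
class ToyFTL:
    """Toy page-mapping flash translation layer (illustrative sizes only)."""

    def __init__(self, blocks=8, pages_per_block=4):
        self.pages_per_block = pages_per_block
        self.flash = [[None] * pages_per_block for _ in range(blocks)]
        self.map = {}         # logical sector -> (block, page)
        self.stale = set()    # physical pages holding superseded data
        self.cursor = (0, 0)  # next free page to program

    def _advance(self):
        b, p = self.cursor
        p += 1
        if p == self.pages_per_block:
            b, p = b + 1, 0
        self.cursor = (b, p)

    def write(self, sector, data):
        # Flash pages can't be rewritten in place: always program a fresh
        # page and mark the old copy as garbage.
        if sector in self.map:
            self.stale.add(self.map[sector])
        b, p = self.cursor
        self.flash[b][p] = data
        self.map[sector] = (b, p)
        self._advance()

    def read(self, sector):
        b, p = self.map[sector]
        return self.flash[b][p]

    def gc(self, victim):
        # Relocate live pages out of the victim block, then erase it whole:
        # erasing a full block is the only way flash turns 0s back into 1s.
        if self.cursor[0] == victim:           # never relocate into the victim
            self.cursor = (victim + 1, 0)
        for sector, (b, p) in list(self.map.items()):
            if b == victim:
                self.write(sector, self.flash[b][p])
        self.flash[victim] = [None] * self.pages_per_block
        self.stale = {loc for loc in self.stale if loc[0] != victim}
```

The two behaviors it captures are that a rewrite of a logical sector always lands on a fresh page (leaving a stale copy behind), and that space is only reclaimed a whole block at a time.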

Your quantities are way off if you're trying to describe the kind of NAND flash that goes into SSDs. Typical page sizes are ~16kB plus room for ECC, so a page is several thousand physical memory cells, not just one thousand. Erase blocks are several MB, so at least a thousand pages per erase block. A single die of NAND typically has just 2 or 4 planes, each of which is at least 16GB.

Largest sizes I'm finding for erase blocks are 128 and 256KB, not several MB. I am finding larger plane sizes, that probably comes from grouping more blocks together. In general, it's not massively different from what I described, it's just a difference in sizes involved at the higher levels.

It still sounds like you're looking at tiny (≤4Gb) flash chips (or NOR?) for embedded devices, not 256Gb+ 3D NAND as used in SSDs, memory cards and USB flash drives. Micron 32L 3D NAND (released 2016) had 16MB blocks for 2-bit MLC, ~27MB blocks for 3-bit TLC. SK Hynix current 96L TLC has 18MB blocks, and even their last two generations of planar NAND had 4MB and 6MB blocks.

Having only 2 or 4 planes per die with per-die capacities of 32GB or more is a big part of why current SSDs need to be at least 512GB or 1TB in order to make full use of the performance offered by their controllers. 256GB SSDs are now all significantly slower than larger models from the same product line.

I guess what I meant is that the overall concepts are understandable. Nuclear fuel gets hot, boils water, drives a turbine. For transmissions, different sized gears allow things to turn at different rates.

But as soon as I dive into the details, I get lost. How exactly can you control the nuclear decay? How exactly do the gears in the transmission move around and combine with each other to create a specific gear ratio? These concepts are probably pretty simple for a lot of people, but they just make my head spin.

> How exactly can you control the nuclear decay?

That's what the control rods are for. The uranium in one fuel rod in isolation decays at whatever natural rate, which would warm water but not boil it, and placing the rods near each other allows for the decay products (high energy particles) to interact with other fuel rods and induce more rapid decay.

The control rods slot in between the fuel rods, and absorb the decay products without inducing further nuclear decay. Usually these are graphite rods.

> How exactly do the gears in the transmission move around and combine with each other to create a specific gear ratio?

It really depends on the specific transmission. A manual transmission uses the shift lever movement to move the gears into place. An automatic transmission most likely uses solenoids to move things. (A solenoid is basically a coil of wire around a tube with a movable metal rod inside; when you put current through the wire, the metal rod is pulled into the tube. You attach the larger thing you want to move to the end of the rod, sometimes with a pivot or whatnot, and use a spring, another solenoid, gravity, etc. to make the reverse movement.) A solenoid by itself gives you linear movement; if you need rotational movement, one way to get it is to put a pivot on the end of the solenoid rod, then a rod from there to one end of a clamp on a shaft, so that when the solenoid pulls in its rod, the shaft rotates (this is the basic mechanism for pinball flippers).

> The control rods slot in between the fuel rods, and absorb the decay products without inducing further nuclear decay. Usually these are graphite rods.

AFAIU graphite rods increase fission by slowing (not capturing) neutrons which in turn have a better chance of propagating further fission, because .. physics.

Quite nifty actually - without the moderator, the fuel wont burn.

I think you're right; I misinterpreted the term 'graphite-moderated reactor' to mean something it doesn't. Graphite will slow the neutrons so they react more. Also, the Chernobyl reactor design has graphite tips on its control rods, which I misremembered as the primary substance of the rod.

The primary substance of the control rods is (usually) a neutron absorber, and most reactors with control rods have a passive safety system, so gravity and springs will force the control rods in to significantly slow the reaction unless actively opposed by the control system.

The Chernobyl rods had graphite ends so that when fully retracted, the reactor output was higher than if there was simply no neutron absorber present; unfortunately, this also meant that going from fully retracted to fully inserted would increase the reactivity in the bottom of the reactor before it reduced it, and in the disaster, this process overheated the bottom of the reactor, damaging the structure and the control rods got stuck, and then really bad things happened.

Long story short, most control rods don't have graphite. ;)

Start here: https://www.youtube.com/watch?v=pWWjbnAVFKA

Scott Manley explains things so well. Highly recommended channel.

For anyone who hasn't seen it, Scott's video on Chernobyl is also well worth a watch: https://www.youtube.com/watch?v=q3d3rzFTrLg

What you describe is basically a radioisotope thermoelectric generator, like those used by space probes. In such a device, you use the natural decay heat of unstable radioisotopes.

In a nuclear reactor, things are a bit different. You start with uranium that is only slightly radioactive and does not produce any usable quantity of heat. You place it in the correct geometry and start a controlled nuclear chain reaction: you have neutrons split uranium, which produces heat and more neutrons.

Controlling this reaction so that it actually runs, but not so strongly that it melts your reactor, is what, as far as I understand it, makes nuclear reactor design hard.

I think the harder part is making the reactor safe and idiot-proof, limiting its usefulness for producing bomb material, and dealing with the remaining irradiated materials...

Also politics, corruption etc.

You can explain SSD as a controller you send commands to that writes data to and reads from a log structured storage on top of raw flash.

My car doesn't even have a transmission. Seriously though, a long time ago, I tore down a 1990 Mustang Automatic transmission and rebuilt it myself and, even back then, it was an amazing piece of machinery. The planetary gear systems are really cool. I don't think I completely understood it, but it was a lot of fun. It did take me two months, heh.

I assume you have an EV? Even EVs have transmissions (unless it's some weird niche car that's completely direct drive). They're just vastly simpler than modern automatic/manual gearboxes, usually a single gear reduction.

Model 3. Calling it a transmission is technically correct, but it is extremely simple. It is single reduction gear as you suggested: https://cleantechnica.com/2018/10/16/tesla-model-3-motor-gea...

I used to work on SSD firmware for a long time. There are a lot of technologies and individuals involved, but to be honest I find learning the present-day web development stack more daunting at times!

That is because the firmware is all about efficiency and reliability; it's strictly business, no f..ng around. This is at least from my experience.

Web is like: look ma, new shiny stack, gotta use it. The amount of tooling involved in creation of even simple things is often staggering without any real need for it. And often if you do not approach webdev in "politically correct" way you can be laughed out of the door.

> webdev in "politically correct" way you can be laughed out of the door

Indeed there's this pile of people who don't know how computers actually work and only have experience as 'web developers', so they pile on level after level of abstraction without concern for performance or even common sense. Don't get me wrong, some of the abstractions are very good, but most are founded in ignorance of the technology that has gone before, and so they eventually collapse due to the problems inherent in their architecture, mostly things that were discovered in the '80s or earlier.

Planetary gearsets are amazing. Cool animation here https://www.youtube.com/watch?v=7iTn8OWxVFU

> Those things are modern car transmissions, ..

Even though it may be off-topic, but could you elaborate on this a little bit more?

Not OP but in my experience people tend to have trouble comprehending systems of planetary gear sets.

Although modern automatics are probably a bit easier to understand than old ones, especially CVTs? As long as you're ok with "the computer just triggers this solenoid..." rather than understanding a big hydraulic computer.

Modern automatic transmissions are actually manual transmissions with a robot moving through the gears as far as I know. However, they do a bunch of stuff that I don't understand like pre-engage the next gear so the switch is faster -- I have no idea how that works

Yeah they sure are, they are manual transmissions with a solenoid controlled dual-clutch setup. One clutch engages the next gear as the first releases, there is no lost thrust as with a single clutch pedal, and you get to save all the weight of hauling around a huge valvebody and fluid and clutch bands.... Great stuff.

The Howstuffworks article is great. Have fun. https://auto.howstuffworks.com/dual-clutch-transmission.htm

> like pre-engage the next gear so the switch is faster -- I have no idea how that works

I think you're referring to dual-clutch transmissions: odd gears on one clutch, even gears on the other. So the transmission can switch from eg. gear 2 to 4 with the even-numbered clutch disengaged while transferring power through the odd-numbered clutch in gear 3. When it's time to move up to gear 4, one clutch is disengaged as the other is engaged, instead of having to leave a single clutch disengaged while the gear change happens.
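The odd/even clutch swap described above can be sketched as a toy state machine (purely illustrative names, not any real transmission controller):

```python
# Toy model of dual-clutch shift logic: odd gears on one shaft/clutch,
# even gears on the other. The idle shaft pre-selects the next gear,
# so an upshift is just a swap of which clutch carries engine torque.
class DualClutchTransmission:
    def __init__(self, top_gear=7):
        self.top_gear = top_gear
        self.preselected = [1, 2]  # gear currently selected on each shaft
        self.active = 0            # index of the clutch transferring power

    @property
    def current_gear(self):
        return self.preselected[self.active]

    def upshift(self):
        # The idle shaft already holds current_gear + 1, so the shift
        # itself is only a clutch swap -- no interruption of thrust.
        self.active = 1 - self.active
        # Pre-select the next gear on the now-idle shaft for later.
        nxt = self.current_gear + 1
        if nxt <= self.top_gear:
            self.preselected[1 - self.active] = nxt
```

Running three upshifts from a standing start walks the active gear through 1, 2, 3, 4 while the idle shaft is always one gear ahead.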

I've found that building the Lego Technic Porsche 911 set from a couple years ago really helps a lot with understanding dual-clutch transmission systems. https://brickset.com/sets/42056-1/Porsche-911-GT3-RS

There's a pretty succinct (2 minute) non-engineering description of how a nuclear reactor works in the Chernobyl miniseries you might want to check out: https://www.youtube.com/watch?v=BpwU4mtWXAE

Why is that scene so pointlessly edgy? Is the whole show like that?

In context - the scientist had basically said that the politicians and their bureaus were all wrong, that it wasn't a minor incident, and "accused" them of lying, covering up, and/or being willfully ignorant and stupid.

The whole show isn't like that, but it does try to show that so much of the issue was caused by a desire to be seen as infallible (of course the reactor design wasn't flawed, because the people's greatest minds worked on it, etc., etc.) and that was something that had to be dealt with atop the actual disaster.

It's pointedly edgy, because thousands of people died.

Or 30, if you believe the govt. report.

Either way, far more people die prematurely every day from causes due to pollution from coal-fired plants. But if you're a journalist or documentary filmmaker trying to make something look all cool and edgy and scary and shit, an exploding nuclear power plant makes for a much more interesting story than just another day in the pulmonary ward.

If not for some very edgy heroics, it would have been millions, and most of Ukraine down to the Black Sea uninhabitable for centuries. What did happen was not just "because nuke", but because of a whole series of design and management failures, very far from the least of which was denialism.

They did finally fix the cause of the explosion in the other reactors of that design, remarkably many years later.

Nobody here will defend coal, but graphite-moderated reactors are not the tech you want to be defending.

If not for some very edgy heroics, it would have been millions

That seems a tad unlikely.

Nobody here will defend coal, but graphite-moderated reactors are not the tech you want to be defending.

Exactly. And the reason we're stuck with 60-year-old reactor technology is...

There is a reason I mentioned "denial". The Central Committee shared your skepticism while they could. Ignorance is a luxury.

Sorry, I don't have the faintest clue what argument you're making.

Uninformed skepticism is worth no more than the effort that was put into it.

I hoped it would discuss how SSDs cope with sudden power loss, but it doesn't seem to.

I remember this page but I don't know of a modern update: http://lkcl.net/reports/ssd_analysis.html

These days, if I want an SSD for my desktop and want to minimise the chance I have a disk problem and have to restore from backup, would I be better off with one "data centre" drive (eg Intel D3-S4510), or two mirrored "consumer" drives (perhaps from two different manufacturers)?

The prices look similar either way.

The big problem with one data center drive is that if that goes bad, you still lose all your data. You're assuming their marketing MTBF is correct.

They do make NVME raid solutions now -- with the advantage being that NVME can be faster than SATA. And there are various price points for the NVME drives depending upon speed.

This one from 2018 (not sure if it has full raid or uses VROC)(EDIT: it requires software raid)


This one is cheaper but relies upon Intel VROC (which has been hard to get working on some mobo's apparently)


In either case you're looking at max throughput of 11 gigabytes per second, which is roughly 20 times faster than SATA 3's 6 gigabits per second.

Almost all NVMe RAID products—including both that you've linked to—are software RAID schemes. So if you're on Linux and already have access to competent software RAID, you should only concern yourself with what's necessary to get the drives connected to your system. In the case of two drives, most recent desktop motherboards already have the slots you need, and multi-drive riser cards are unnecessary.

PERC HP740 controllers in Dell servers iirc are hardware raid for the flex port U.2 and backplane pcie nvme drives.

Yes, that's one of the cards that use Broadcom/Avago/LSI "tri-mode" HBA chips (SAS3508 in this case). It comes with the somewhat awkward caveat of making your NVMe devices look to the host like they're SCSI drives, and constraining you to 8 lanes of PCIe uplink for however many drives you have behind the controller. Marvell has a more interesting NVMe RAID chip that is fairly transparent to the host, in that it makes your RAID 0/1/10 of NVMe SSDs appear to be a single NVMe SSD. One of the most popular use cases for that chip seems to be transparently mirroring server boot drives.

So stay under 8 physical NVME and it should be fine?

A typical NVMe SSD has a four-lane PCIe link, or 2+2 for some enterprise drives operating in dual-port mode. So it usually only takes 2 or 3 drives to saturate an 8-lane bottleneck. Putting 8 NVMe SSDs behind a PCIe x8 controller would be a severe bottleneck for sequential transfers and usually also for random reads.
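The arithmetic behind that bottleneck claim is simple enough to check (PCIe 3.0 figures; 128b/130b encoding assumed):

```python
# Back-of-the-envelope check of the x8 uplink bottleneck.
# PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding.
GBPS_PER_LANE = 8.0 * 128 / 130 / 8  # ~0.985 GB/s usable per lane

def link_gbps(lanes):
    return lanes * GBPS_PER_LANE

uplink = link_gbps(8)                    # controller's host link: ~7.9 GB/s
per_drive = link_gbps(4)                 # typical x4 NVMe SSD: ~3.9 GB/s
drives_to_saturate = uplink / per_drive  # exactly 2 drives fill the x8 link
```

So a third or fourth x4 drive behind an x8 controller adds capacity but no sequential throughput.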

I need to think about this for a second.

You’re saying the performance gains stop at two drives in raid striping. RAID10 in two strip two mirror would still bottleneck at 8 total lanes?

I also need to see about the PERC being limited to 8 lanes - no offense - but do you have a source for that?

Edit: never mind on source, I think you are exactly right [0] Host bus type 8-lane, PCI Express 3.1 compliant


To be fair, they have 8GB of NV RAM, so it’s not exactly super clear cut how obvious a bottleneck would be.

I can’t edit my other post anymore, but it’s worse than I thought. I’m not sure the PERC 740 supports NVMe at all. Only examples I can find are S140/S150 software RAID.

No idea if 7.68TB RAID1 over two drives with software RAID is much worse than a theoretical RAID10 over 4 3.92TB drives.... apparently all the RAID controllers have a tough time with this many IOPs.

As stated, support for that one is not guaranteed. You first need to figure out whether your motherboard supports lane bifurcation, otherwise it won't work, or only one of the installed keys will.

Cards with PLX switches are a way to fix this if you can't upgrade your whole hardware, but the price point is a multiple of simple bifurcation cards, since you have to integrate a whole PCIe switch on-card.

The architecture diagrams here are quite helpful:


Intel SSDs are known for advertising graceful degradation to read-only mode, when in reality they simply die suddenly during endurance tests.

As I recall, there are at least three problems here:

1. Basically no software is prepared to be "graceful" about their storage suddenly going read-only, especially when the OS is trying to run off that drive.

2. Intel drives at end of life go read-only until the next power cycle, whereupon they turn into bricks.

3. The threshold at which Intel drives go read-only is when the warrantied write endurance runs out, not when the actual error rate becomes problematic. This makes sense if the drive is trying to ensure the flash is still in good enough condition to have long data retention, so that you can reliably and easily recover data from the drive. But (2) already rules that out.

> Intel drives at end of life go read-only until the next power cycle, whereupon they turn into bricks.

That's inconvenient, given that the first instinct of many people at the first sign of read trouble is to power-cycle the drive. Maybe not in enterprise scenarios, but certainly in consumer ones.

I was under the impression that if you do not encrypt an SSD from the first use, then any attempt at overwriting with 0s is futile, as well as any other method to securely delete the files. The files will be easily recovered.

This guy seems to say the opposite, in that the files are "simply not there anymore", contrary to everything I've read: who's right here?

The big problem with erasing SSDs is that when you think you're overwriting something you're actually writing new blocks (because of the translation layer). How many iterations do you have to do before you're certain to have actually hit every physical block? Nobody knows. Maybe infinite. Non-zero data blocks might not be readable through the flash drive's front-end interface, but they're still sitting there on the actual NAND chips. That's why secure erase was added (and even that seems to be less than fully trustworthy).
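The "overwrites land in new blocks" behavior is easy to show with a toy flash translation layer (a deliberately simplified sketch, not any vendor's firmware):

```python
# Toy FTL: every logical overwrite is programmed into a fresh physical
# page, and the logical-to-physical map is updated. The old physical
# page becomes stale but is NOT erased -- its bytes remain on the NAND.
class ToyFTL:
    def __init__(self, num_pages=16):
        self.nand = [None] * num_pages  # physical pages
        self.l2p = {}                   # logical block -> physical page
        self.next_free = 0

    def write(self, lba, data):
        self.nand[self.next_free] = data  # always program a fresh page
        old = self.l2p.get(lba)           # previous page is now stale
        self.l2p[lba] = self.next_free
        self.next_free += 1
        return old                        # stale physical page, if any

ftl = ToyFTL()
ftl.write(0, b"secret")
stale = ftl.write(0, b"\x00" * 6)  # host "overwrites" with zeros
# The host now reads zeros through the map, but b"secret" is still
# sitting untouched in physical page `stale`.
```

That stale copy is exactly what secure erase (or physical destruction) is meant to deal with.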

One iteration of grinding the drive to fine powder/melting it down should be enough. Anything else might be insufficient.

That's a good way to erase a disk, but horrendously expensive if all you want to erase is a single file.

> The files will be easily recovered.

Not really “easily”. At the very least you’ll need a modded flash controller that can bypass the flash translation layer. Also, on SSDs with TRIM support you’re also racing against the garbage collector which will erase any unused (ie. deleted) blocks.

In practice all SSDs are always encrypted because they use the encryption to whiten the data written to them. That's why "Secure Erase" takes less than a second on SSDs, it doesn't erase anything but the key.

Yes, but in practice, this encryption isn't competently implemented. Hence why MS stopped supporting hardware-accelerated BitLocker.


They still support it, it’s just not the default any longer for drives that report the capability.

Thanks for the correction - I expect that support is just a temporary state of affairs though. I used hardware accelerated BitLocker for a few years, and had my systems break multiple times due to BitLocker-related regressions when MS pushed updates. I can't imagine it's going to get more attention now that it's never enabled by default.

With the caveat that this occurs in proprietary firmware which is non-trivial to audit. There have also been vulnerabilities discovered in the encryption features of disk firmware.

But AFAIK this isn't true for all solid state storage, like cheap USB memory keys. These would generally also benefit from log structured filesystems - but really should be encrypted too.

(which is one great promise of zfs on linux/openzfs - cross-platform encrypted removable storage).

Interesting! Do you happen to know which encryption algorithm is used? I would think that, if the only goal is whitening (as opposed to robust security), a fairly weak algorithm would be used, or perhaps a strong algorithm with a reduced number of rounds.

the hardware is going to use AES because their ASIC vendor will have well tested AES IP that they can just throw down on the chip. any other algorithm would require massive development effort for zero benefit.

and by using AES they can probably claim to satisfy some security standards that will make their marketing people happier.

It would need to be a fairly good algorithm to provide good "whitening", but they could be using small or easy-to-guess/derive keys, especially if the primary purpose is just whitening rather than security.

The algorithm probably depends on the drive, but AES-256 is common due to hardware acceleration. Read a SSD spec sheet sometime, it will likely mention it, along with supporting TCG Opal (the self encrypting drive standard).

Depends on the drive. Some set up a password when initialized, and the SATA secure erase command zeroes the password. So you can technically read all the old data (or if the firmware blocked you, you could by directly accessing the chips), but you would end up with encrypted data, not the original bits.

That's why the secure erase takes seconds and not drive size/bandwidth seconds.

>>The files will be easily recovered.

From my limited second-hand experience, it's the complete opposite. Data recovery services can't reassemble data from memory blocks of dead SSDs.

"As for writes, at 1 gb a day - far more than my current rate of data use - it would take the same 114 years to reach 40 tb."

Maybe it works for the author's very specific system or use case, but on my personal MBP laptop with a very occasional usage pattern--some days I do not use it at all during the week--I end up with 10 GB per day of writes on average. That way it would be 11.4 years, not so many. And I do not do anything very disk-expensive on it like torrent downloading or database testing, only general development tasks, watching online videos, web surfing, docs, etc.

I have an NVME SSD as a boot drive in my desktop and three years and change in (say 1200 days) it’s used 40tb of write (out of an advertised 400tb endurance so I’m not too worried). That works out to about 30gb of writes per day, which seems about right for medium to heavy use.

I guess what I’m saying is that for modern SSDs I don’t think write endurance is a binding constraint in most cases.
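The endurance arithmetic in the comments above is worth making explicit (numbers taken from the parent posts; the formula itself is just unit conversion):

```python
# How long until a drive's rated write endurance is exhausted at a
# given daily write rate. endurance_tb in TB, gb_per_day in GB/day.
def years_to_exhaust(endurance_tb, gb_per_day):
    return endurance_tb * 1000 / gb_per_day / 365

# Parent's desktop: 40 TB written over ~1200 days
daily_gb = 40 * 1000 / 1200          # ~33 GB/day
desktop = years_to_exhaust(400, daily_gb)  # ~33 years on a 400 TB rating
laptop = years_to_exhaust(400, 10)         # ~110 years at 10 GB/day
```

Either way, at these rates the rated endurance outlives any plausible service life of the drive.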

For client/consumer SSDs, most vendors seem to view 50GB/day as plenty of write endurance for mainstream, non-enthusiast products. Virtually all retail consumer SSDs have a warranty that covers at least that much, and usually several times more for larger drive capacities (since write endurance is more often specified in drive writes per day).

Coincidentally, 50GB/day is what Chrome craps out to the hard drive during a few hours of usage, doing clever things like "caching" YT videos (the YT player NEVER reuses data, and rewinding always generates a new server fetch with a new randomly generated URL).

How much RAM do you have? And how many Chrome tabs?

16 GB of RAM and I use Safari mostly and Firefox from time to time with less than 20 tabs usually (less than 10 in average, I guess).

I check Data written in the Activity monitor — Disk.

I get that it's meant to be subjective, and the premise makes it clear that it's an ongoing knowledge collection, but for that some sources would be nice.

Currently my troubles with SSDs in the PC/server/NAS environment are somewhat more practical, more about compatibility: NVMe/SATA, M.2 key types, PCIe port bifurcation support vs PLX switches, none of which are even mentioned. Advice on this is notoriously hard to find; resorting to trusting rare and random forum posts is my state of knowledge progress there.

On forums like ServeTheHome all this information is readily available. It's spread between some guides and posts but if you have a specific question those people will be able to answer it most of the time.

> Deleted file recovery on a modern SSD is next to impossible for the end user

OK, he’s talking about SSDs, but I want to mention that I’ve easily recovered many large deleted files from SD or micro-SD cards (formatted as FAT32 or exFAT) using Norton Unerase or an equivalent utility.

Are the controllers for SSDs that different from the controllers for SD cards? Has anyone tried Norton Unerase or an equivalent program on an SSD? I’d like to hear a first hand account to help confirm (or deny) what the author claims.

That's because for FAT, erasing a file just means flagging the file allocation table entry as deleted, which normally makes the filesystem software put the linked list of sectors pointed by this entry back to the free sectors list (I'm pretty sure I am wrong on some details here, but I think the general idea is correct).

In other words, when you "delete" a FAT file, you are not even erasing a whole directory entry, you just merely flip one bit in that entry.

The data blocks get actually "erased" when they are reused by the FS software.

Recovery of basic deletion is therefore pretty much guaranteed as long as you didn't write something else on the disk.

What the author is talking about is more "serious" deletion, sometimes called "shredding" [1]. To actually erase the data from the disk, you use software that overwrites your file with random data before deleting it. This is supposed to work if the filesystem is as simple as FAT - that is, somewhat dumb (but so simple that SoCs such as ARM Cortex Ax can boot from it directly).

SDs and SSDs add another challenge because they constantly lie to the filesystem software; they have their own inner controller mainly to manage bad sectors and do wear-leveling, so when the FS requests to fill a sector with zeros, they might say "ok" but actually just remap the sector internally to another empty sector. So the data is still somewhere on the chip, and an evil scientist with lots of pointy probes can in theory read them back.

[1] https://linux.die.net/man/1/shred
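The "flip one bit in the entry" mechanics above can be shown concretely. In FAT, deletion overwrites the first byte of the 32-byte directory entry with the marker 0xE5; everything else, including the start cluster and file size, survives (simplified 8.3 entry below, padding fields omitted from the illustration):

```python
# FAT-style deletion: only the first byte of the directory entry is
# replaced with the "deleted" marker 0xE5. The rest of the name, the
# size, and the start cluster are untouched, which is why undelete
# tools work -- they just restore a plausible first character.
DELETED = 0xE5

def fat_delete(dirent: bytearray) -> None:
    dirent[0] = DELETED          # data clusters are not touched at all

def fat_undelete(dirent: bytearray, first_char: int) -> None:
    dirent[0] = first_char       # recovery tools guess/ask for this

# Simplified 32-byte entry for "README.TXT" (8.3 name = 11 bytes)
entry = bytearray(b"README  TXT" + bytes(21))
fat_delete(entry)
# entry[0] is now 0xE5; "EADME  TXT" and all metadata remain readable.
```

This is also why the recovered file sometimes comes back with a wrong first letter: that byte is genuinely gone.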

It would likely depend on whether the filesystem uses TRIM to notify the SSD of deleted sectors.

If it doesn't, the deleted files should retain their data until the filesystem reuses the sector, as normal. If trim is used, the SSD doesn't have to retain the data, but it doesn't necessarily make it unreadable immediately, there are many implementation strategies for trim.

Enterprise SSDs for the most part promise that reading a trimmed range of the drive will return zeros. Consumer SSDs usually won't make that strong of a guarantee, so that if you're running enterprise software that requires this behavior you have to pay extra for enterprise drives. As originally formulated, the TRIM command was supposed to be more of a hint that the SSD could ignore if it was too busy or if the TRIMMed block was too small for the drive to do anything useful with.
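The two behaviors described above can be sketched side by side (a toy model; real drives implement this in firmware with many more states):

```python
# Sketch of TRIM determinism: a drive with "deterministic read zero
# after trim" (RZAT) returns zeros for a trimmed LBA immediately; a
# hint-style drive may keep returning stale data until garbage
# collection actually reclaims the space.
class ToyDrive:
    SECTOR = 4  # toy sector size in bytes

    def __init__(self, rzat: bool):
        self.rzat = rzat
        self.data = {}           # lba -> bytes
        self.trimmed = set()

    def write(self, lba, buf):
        self.data[lba] = buf
        self.trimmed.discard(lba)

    def trim(self, lba):
        self.trimmed.add(lba)    # just a hint; nothing is erased yet

    def gc(self):
        for lba in self.trimmed:
            self.data.pop(lba, None)  # reclaimed during garbage collection

    def read(self, lba):
        if lba in self.trimmed and self.rzat:
            return b"\x00" * self.SECTOR
        return self.data.get(lba, b"\x00" * self.SECTOR)
```

With this model, an "enterprise" (RZAT) drive reads back zeros right after a trim, while the "consumer" drive still returns the old data until `gc()` runs.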

For the end user it is difficult. Files can be recovered as long as garbage collection hasn't happened yet. SSDs require the space to be erased before new data can be written to it, so oftentimes the deleted space is already empty. If files are stored in the recycle bin or trash, those can still be easily recovered.

It was once explained to me that writing to a HDD is like painting. Where new data can just be painted over the top of the old discarded data. While writing to an SSD is like writing on a chalk board where the old data has to be removed before you can write over it.

Excellent summary. One thing he left off: some SSDs continue to copy/erase blocks even if there is nothing new to write, because multi-level cell state does degrade over time. There is a concern that some MLC drives will suffer bit corruption over time if not regularly powered up to allow this in the background. Citation needed: I only recall this from when I was interviewing to work for Western Digital many years ago.

This problem was most prominent right before the switch to 3D NAND, when planar NAND dimensions were at their smallest and the consumer market had already mostly switched over to 3 bit per cell TLC rather than 2bpc MLC. In the worst case, we were down to about 8 electrons difference between cell voltage states. That's now been relaxed by 3D NAND allowing for larger cell sizes, and most 3D NAND also switched from floating-gate to charge-trap cell design so leakage is less of an issue. Nowadays, data retention in SSDs is only a concern toward the end of their lifespan (as measured by write endurance), and it's probably inadvisable for the SSD to start doing background data scrubbing until the raw read error rate starts climbing.
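The margin squeeze described above follows directly from the bits-per-cell arithmetic (illustrative numbers; real thresholds depend on the process):

```python
# An n-bit cell must resolve 2**n distinct charge states inside one
# fixed voltage window, so the spacing between adjacent states roughly
# halves with every extra bit per cell.
def states(bits_per_cell: int) -> int:
    return 2 ** bits_per_cell

def relative_margin(bits_per_cell: int) -> float:
    # Window between lowest and highest state split into (states - 1) gaps.
    return 1.0 / (states(bits_per_cell) - 1)

# SLC, MLC, TLC, QLC -> 2, 4, 8, 16 voltage states per cell
levels = {name: states(b)
          for b, name in enumerate(("SLC", "MLC", "TLC", "QLC"), start=1)}
```

With only a handful of electrons separating adjacent TLC states on late planar NAND, even tiny leakage could flip a read, which is what made 3D NAND's larger cells such a relief.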

Thank you for bringing me up to speed!

I wonder if anyone here has experienced something similar: I have a Samsung Evo 860 SSD. Sometimes after powering on my desktop the BIOS either "forgets" the drive (sets some other drive as primary boot drive) or doesn't recognize it at all.

The non recognition issue goes away after I power off and power on.

It's been this way for about 8 months. Happens about 1 in every 15 times I power on. I've heard it may have something to do with "sleep mode" or something like that. I always shutdown via software though.

That sounds like the drive is failing to initialize in time. Have you tried enabling a POST test or Boot Delay? I suspect the problem might magically go away if you do.

I'll try it out

might be an innocent config issue, but this usually means the drive detection timed out during boot, which can be an early sign of drive failure. one of my old ssds behaved this way for a while before never being detected again. make sure you have anything important backed up.

I had an ssd/system combination that would not communicate at sata 3, and either the bios wouldn't try at sata 2 or it would timeout detection too soon after. If your bios lets you set a sata revision, it's worth a try (mine didn't, but I used a sketchy download of the AMI Bios Editor to adjust the default; I couldn't figure out how to use the editor enough to make the setting visible). And also maybe reseat the cables.

Is the drive's firmware current? And what about the logic board firmware?

I found these helpful. The first one has a link to a video which is pretty ELI5



> The OCZ Myth: [...] with one overwrite pass of zeroes [...] A sort of RETRIM before that was invented.

It wasn't a myth, that was the idea all along.

> SSD Defragmentation [...]

An important factor left out is wear leveling, it doesn't make as much sense to arrange data in "file-system-order" when the bits on the drive move around.

Anyone know how TRIM works with Linux?

I find myself copying entire partitions between SSDs from time to time, is there a utility to clear the destination SSD before copy?

Is it possible to do the same for an SD card, so that writing a new Raspi OS to it doesn't do unnecessary garbage collection?

On Linux, it can either get done using fstrim (often run periodically, e.g. via cron), which clears unused blocks, or can be done on file delete on supported file systems (e.g. mounting using the discard option). The periodic trim is supposed to be better, as an SSD doesn't guarantee that a trim returns quickly.

If you want to trim an entire device, see blkdiscard (take care!).

On a systemd based system, you may have fstrim.timer enabled by default. Check: sudo systemctl status fstrim.timer

> I find myself copying entire/ partitions between SSDs from time to time, is there a utility to clear the destination SSD before copy?

The way TRIM works, you don't need to trim blocks right before overwriting them. TRIM is for blocks that you aren't going to care about for quite some time, it essentially returns them to the SSD management layer for use as overprovisioning.

> The way TRIM works, you don't need to trim blocks right before overwriting them.

In my experience "blkdiscard /dev/DESTINATION_DISK" does improve the speed of dd'ing a disk to another one quite a bit though.

And that does make some sense IMHO: If the SSD's internal datastructure which keeps track of pages which hold user-data is empty then each write will consume less time for looking up if the specific sector is contained in the datastructure.

> then each write will consume less time for looking up if the specific sector is contained in the datastructure.

For most SSDs, this lookup is a single DRAM fetch per 4kB of user data, and therefore much faster than even a single NAND flash read, let alone a NAND flash program or erase operation.

The reason that imaging a SSD is faster after it's been trimmed or secure erased is that every erase block is empty and ready to accept new data. When you overwrite data on a used disk, the drive has to free up erase blocks and that generally involves moving old data elsewhere to preserve it—because the drive doesn't know that the commands to overwrite that are also coming soon.

It also avoids wasting time (and write endurance) shuffling around data which is just going to be overwritten anyway. That could happen either as a result of static wear leveling or just because the drive needed to erase a block with a mix of valid and overwritten pages to make room for new data.

I believe all major OSes support trim in a reasonable way now. afaik it needs hardware support though, so idk how it would work for an SD card.

MMC defines trim as an optional command. But maybe to your point I've generally only seen it implemented on eMMC chips, not SD cards.

Unless you have a RAID card in the middle.

I don't mean to say it wouldn't work, just that I don't know enough about sd cards to say whether it would/should.

Sorry, I meant for TRIM being passed down to an SSD.

oh gotcha. not sure why I thought it was reasonable to think you were talking about a bunch of SD cards in a raid array...

If you're trimming an entire device rather than just blkdiscard as mentioned elsewhere you can just issue a secure erase which will restore speed to basically the same as it was out of the box. If you just want to wipe one single partition and not the entire drive then blkdiscard is the best you're going to get.

The author forgot about non-NAND SSDs (e.g. Optane SSDs). There's no garbage collection to worry about, for example.

Optane SSDs do need wear leveling. What makes it much simpler is that there isn't the mismatch between small-ish NAND pages and massive NAND erase blocks, so you don't have to suffer from the really horribly large read-modify-write cycles.

Says the website is sleeping? Did he go over the bandwidth limit?

It lost me at Comic Sans, and made me almost ill with any further font and design choices.

So I finally kicked it into Reader View, only to find a lot of questionable spelling and grammar issues.

These kinds of basic things go a long way to making an article valuable.

site doesn't work - archive link https://archive.is/K9SFI

The CSS of this site manages to make reading this article equally inconvenient and painful on both mobile and desktop.

Indeed. Thankfully, Reader View in both Firefox and mobile Safari remedies it.

Same experience for me. Text seems readable and can be easily followed at 250% zoom (at least for me)

I love a minimal, text-only website as much as the next crotchety HN reader, but a little bit of CSS goes a very long way in terms of readability.

Edit: Upon further inspection I see that this page was designed to be hard to read. Very curious.

Reader view helps a lot, and with such a simple website it is guaranteed to give a good result.

Not just a lot, it actually makes it readable. However I do feel this article needs a few graphs and diagrams. I've written [0] about data recovery and SSD/Flash storage is a massive beast to tackle even with diagrams to give you a vague idea of what's going on.

[0] in old fashioned tree-based paper, unfortunately

Is there an available method to acquire some of these marked-up dead trees?

not if you're reading via Materialistic via Android browser with no responsive design or zoom ~,~

In Materialistic you can hit the red round button that appears bottom right of the screen. That way you can zoom. It won't help you much with this article, but it can be handy sometimes.

If the author is reading this, I’d like to ask him to please run his article through a spell checker. All the effort to write a nice article is spoiled by having dozens of typos and spelling mistakes.

never realized how awful some of my own posts were with such errors until I ran my favorite ones through Grammarly after having paid for it last year to help me with writing a book draft (I abuse the hell out of commas apparently too, I blame Reddit).

Worse, some of them I'd sent to people dozens of times to read, and there were these horrible typos. Cringe.

I found some of his grammatical choices to be questionable as well.

How so? It's a nice single column of text on my mobile.

It's a wide column of text on a larger screen - usually text is centred in the middle.

Use reader mode

thanks for writing.

Is it just me, or has someone else noticed that the font is too tough to read?

I am the author of kcall.co.uk/ssd/index.html and I trust that the site is now back up and running without further 'sleeping' episodes. The reason for the interrupts was, as some have surmised, that the site was mentioned here and in other places and that caused a surge in hits that exceeded the host's usage limit. Whilst I am pleased that so many people have shown an interest in my efforts it does mean that I have had to move the site to a more generous host.

It is apparent from the first sentence and the website itself that this article is, or arises from, the musings of shall we say an amateur. It's not professional because I'm not a professional. Whilst some of the comments in this thread are helpful, some are baffling. To the person who couldn't get past the header because Comic Sans offended him, the following 8058 words were in Century Gothic, which is perhaps not so offensive. I have however accepted the comments on readability and changed the header and all the rest of the text to Calibri, made the font larger, and changed the line spacing to web standards. I hope it is now more soothing and readable. As for the claim that the text is 'spoiled by having dozens of typos and spelling mistakes' I'm not sure what he is reading (or smoking). Apart from the removal of one stray colon, the current text is entirely unchanged - it is the same file. I don't claim to be perfect and I'm sure that in such a large essay there are some typos and other errors, but to say that there are dozens is patently untrue. Just let me know where they are and I will happily correct them.

Apart from all that, I am surprised that there are so few comments about the veracity or otherwise of the contents. I had great difficulty in finding material that was up to date and actually delved into the technicalities of SSDs, without baffling me - as much did.

I shall end by thanking those who found the article interesting and I hope that some of it at least has been of some help.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact