Hacker News

They shape their reads and writes to work well with SSD characteristics. Randomly reading and writing to an SSD (whether directly or through mmap) would be slower and wear out the SSD more quickly.

Modern SSDs simply don't work like this: they all internally use some variant of log-structured storage, so that regardless of the user's write pattern a single continuous stream is generated, and only one mechanism is needed to distribute modified pages across the available flash. This means an infinite loop that rewrites the first 128kb of the device with random data will eventually fill (most of) the underlying flash with random data (128kb because that's a common erase block size).

Write patterns still matter until you get up to something like megablock granularity. Mmap will swap out pages at random (relative to disk layout), and page granularity is far smaller than a megablock. It's certainly possible for controllers to handle this properly, and I don't want to claim they never will, but even the very expensive PCI-E flash we use at fb demonstrated this "bad behavior".

Are there standard practices for securely erasing any random SSD without having to look up its implementation details? Or is this the sort of thing you just use a shredder for?

Encrypt it and store the key anywhere except on the drive. To erase, simply destroy the key. Many motherboards come with a tamper-proof key storage device you can reset on command (the TPM). There's a SATA secure erase command, but it's been shown that multiple vendors have managed to botch its implementation. So if you can't make the encryption approach work, a shredder is probably still your best bet.

Standard practice in government and large enterprise is still physical destruction, for exactly the reason you mention.

http://www.monomachines.com/shop/intimus-crypto-1000-hard-dr... or you can get a service to come out and do it on site.

Does it make any difference for append-only writes vs. in-place modifications?

Can you cite anything for this? Wear leveling is not the same as log-structured storage.

Indirection in a log-structured form is the best way to increase write IOPS and optimize write amplification. More sophisticated SSDs actually have multiple log heads, for data with different life-cycle properties.

You get a write amp of 1 until the drive is filled the first time. After that, it's a function of:

1) how full the drive is (from the drive's point of view; this is why TRIM was invented)
2) the over-provisioning factor
3) usage patterns, such as how much static data there is
4) how good the SSD's algorithms are
5) other (should be) minor factors, such as wear leveling

Source: I used to be a SSD architect.

Just to further illustrate your point, the relevant section from the readme file:

    There have been attempts to use an SSD as a swap layer to implement SSD-backed
    memory. This method degrades write performance and SSD lifetime with many small,
    random writes. Similar issues occur when an SSD is simply mmaped.
    To minimize the number of small, random writes, fatcache treats the SSD as a
    log-structured object store. All writes are aggregated in memory and written to
    the end of the circular log in batches - usually multiples of 1 MB.

It would be trivial to batch writes to the mmap'ed region. Reads would still benefit from OS caching.

How do you batch writes to an mmap'ed region?

    ptr = mmap(..., len, ...);
    /* do stuff with ptr */
    msync(ptr, len, MS_SYNC);
No sane OS will pay attention to an mmapped region while it isn't under memory pressure, so dirty pages are effectively buffered until you explicitly tell the OS to start writeback.

Linux flushes dirty pages in mmapped regions to disk every 30 seconds by default whether you like it or not (see vm.dirty_expire_centisecs). Unlike FreeBSD, Linux doesn't yet have support for the MAP_NOSYNC flag to mmap(2).

Good point, though I think for a machine averaging even 40MB/sec this only amounts to one 'batch' every 1.2GB receiving an extra sync. Linux recently changed so that pdflush doesn't exist at all: when the dirty list fills, a per-device thread is created that sleeps for 5 seconds before starting work, so maybe it's a bit less than 1.2GB.

Does this mean that I have to msync multiple mmap'ed chunks in order to batch write? For example

    msync(ptr1, len1, MS_SYNC);
    msync(ptr2, len2, MS_SYNC);
    msync(ptr3, len3, MS_SYNC);
Where [ptr1, ptr1+len1], [ptr2, ptr2+len2], ... are the chunks within a big mmap'ed region where changes occur and need to be written to disk for persistence.

Or do I just msync the whole region then hope and pray that the OS will do the right thing?

Shouldn't it be possible to make the Linux kernel behave like this with the swap partition on an SSD, to get this benefit for all programs?

The number of write cycles to any block of flash is limited. Once the write counter for a block has hit the manufacturer's hardcoded limit, the SSD will not trust that block to hold data anymore.

The whole point of this piece of software is to be smart about how and when it flushes data so it can minimize impact on the write counter.
