

fatcache - Memcache on SSD - buttscicles
https://github.com/twitter/fatcache

======
tveita
> False positives from SHA-1 hash collisions are detected after object
> retrieval from the disk by comparison with the requested key.

I was curious why they would bother, but it seems this isn't quite accurate.

What happens is they first use 32 bits from the SHA-1 hash to find the hash
bucket, then they scan for the full SHA-1 of the key. They do not check for
actual SHA-1 collisions.
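
In sketch form (illustrative Python with made-up names; fatcache itself is written in C):

    import hashlib

    NBUCKETS = 2 ** 20  # hypothetical bucket count

    def bucket_index(key: bytes) -> int:
        digest = hashlib.sha1(key).digest()         # full 20-byte SHA-1
        prefix = int.from_bytes(digest[:4], "big")  # first 32 bits pick the bucket
        return prefix % NBUCKETS

    def lookup(buckets, key: bytes):
        digest = hashlib.sha1(key).digest()
        # Scan the bucket for a matching *full* digest; the original key
        # string is never compared, so a true SHA-1 collision would go unnoticed.
        for entry_digest, value in buckets[bucket_index(key)]:
            if entry_digest == digest:
                return value
        return None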

edit: Also on the subject of hashes, the readme suggests switching to MD5 as a
possible way to reduce entry size. That is unnecessary; SHA-1 can be truncated
to whatever size you're comfortable with.
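
For example (Python; the 12-byte length is arbitrary):

    import hashlib

    key = b"user:1234"
    digest = hashlib.sha1(key).digest()  # 20 bytes
    short = digest[:12]                  # any prefix you like: still uniformly
                                         # distributed, just with the collision
                                         # resistance of a 12-byte hash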

------
jonemo
If you find this interesting, you might also find this article in
Communications of the ACM interesting:

Michael Cornwell: Anatomy of a Solid-State Drive

Communications of the ACM, Vol. 55 No. 12, Pages 59-63

<http://cacm.acm.org/magazines/2012/12/157869-anatomy-of-a-solid-state-drive/fulltext>

Alternative link in case the first is paywalled for people not on a university
campus: <http://queue.acm.org/detail.cfm?id=2385276>

It goes into quite some detail on how SSD storage works on a system level and
how it differs from hard disks.

------
arielweisberg
If you are willing to give up range scans and constrain yourself to fitting
all keys in memory there is a lot you can do. See SILT
(www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf) and Bitcask
(downloads.basho.com/papers/bitcask-intro.pdf).
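
The Bitcask design in particular boils down to an append-only log plus an in-memory directory from key to (offset, length); a rough Python sketch of the idea (my names, not Bitcask's API):

    import os

    class TinyBitcask:
        """Append-only log on disk; every key lives in the in-memory index."""

        def __init__(self, path):
            self.f = open(path, "ab+")
            self.index = {}  # key -> (offset, length); this is the RAM cost

        def put(self, key: bytes, value: bytes):
            self.f.seek(0, os.SEEK_END)
            offset = self.f.tell()
            self.f.write(value)          # sequential write, SSD-friendly
            self.f.flush()
            self.index[key] = (offset, len(value))

        def get(self, key: bytes):
            if key not in self.index:
                return None
            offset, length = self.index[key]
            self.f.seek(offset)
            return self.f.read(length)   # one random read per lookup

No ordering is kept, so range scans are out, and the index dict is exactly the "all keys in memory" constraint.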

Since this is a cache I really dig skipping any kind of cleanup/compaction
step for deleted/expired keys.

I played around with a similar thing, except as a K/V store, and the performance
and density are pretty amazing. With a 64-byte key and a 1.5k value (compressed
from 2k) I was getting 85k inserts/sec and several hundred thousand reads/sec
with a quad-core Sandy Bridge i5 and a 128-gigabyte Crucial M4 on SATA II.

~~~
justincormack
Fitting all keys in memory is a pretty harsh constraint though. With your
64-byte keys and 1.5k values you are talking about a factor of roughly 20, so
for a 1TB SSD you need around 64GB of RAM, which costs about as much as the SSD.
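
Back-of-the-envelope (Python, using the sizes above; the raw-key number comes out lower, with index overhead making up the difference):

    value_size = 1536        # ~1.5k value stored on flash
    key_size = 64            # bytes of RAM per key, ignoring index overhead
    ssd = 10 ** 12           # 1 TB of SSD

    keys = ssd // value_size           # ~651 million entries
    ram = keys * key_size              # ~39 GiB just for raw keys
    print(keys, ram / 2 ** 30)         # pointers etc. push this toward 64GB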

~~~
efuquen
"so for a 1TB SSD you need 64GB RAM, which costs about as much as the SSD."

In my experience this is not true. A 1 TB SSD, even commodity hardware, is
pretty expensive. And that's not even talking about high performance SSD's
like Fusion IO, where you're now talking about 10K+. Where do you see 64 GB of
RAM costing that much?

~~~
justincormack
Micron just unveiled a 1TB SSD for under $600, which is about what 4x16GB of
ECC RAM costs...

<https://www.computerworld.com/s/article/9235277/Micron_unveils_its_first_1TB_SSD_for_under_600>

Fusion-io is probably as expensive as RAM, so it only makes sense for apps that
really need persistence, i.e. databases.

------
pornel
> False positives from SHA-1 hash collisions are detected after object
> retrieval from the disk by comparison with the requested key

Is that check really necessary?

To have a _1 in a trillion_ chance of an accidental SHA-1 collision they'd
have to store 1.7*10^18 keys, and a mere key index of that would require
54,000 petabytes of RAM.
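
The birthday-bound arithmetic, for anyone who wants to check (Python):

    import math

    bits = 160            # SHA-1 digest length
    p = 1e-12             # target collision probability
    # Birthday approximation: p ~ n^2 / (2 * 2^bits)  =>  n ~ sqrt(2 * p * 2^bits)
    n = math.sqrt(2 * p * 2 ** bits)
    print(f"{n:.2e} keys")            # ~1.71e18

    index = n * 32                    # say 32 bytes of index per key
    print(index / 1e15, "petabytes")  # ~54,700 PB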

~~~
gus_massa
Comparing the SHA-1 is much faster than reading the SSD, so it's almost free.

An accidental SHA-1 collision is probably not a problem, but in a few years [1]
it may become possible to create SHA-1 collisions deliberately and use that as
an attack. It looks difficult, but suppose that with the right string an
attacker could retrieve another user's cached information, for example
sha1("joedoe:creditcard") = sha1("attacker:hc!?!=u?ee&f%g#jo").

I don't know if they are using randomization (see the sketch below), because
otherwise a collision could also be used (in a few years) for a DoS attack [2]

[1] <http://www.schneier.com/blog/archives/2012/10/when_will_we_se.html>

[2] <http://www.gossamer-threads.com/lists/python/dev/959026>
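
For reference, the randomization mentioned above usually means keying the hash with a per-process secret, so collisions can't be precomputed offline. A minimal sketch of one possible construction (not what fatcache does):

    import hashlib
    import hmac
    import os

    SECRET = os.urandom(16)  # chosen at process start, never stored with the data

    def keyed_digest(key: bytes) -> bytes:
        # Collisions precomputed against plain SHA-1 are useless here,
        # since the attacker doesn't know SECRET.
        return hmac.new(SECRET, key, hashlib.sha1).digest()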

~~~
speakr
Your example describes a preimage attack on SHA-1, not a collision attack.
Even with a working collision attack you are probably still far away from
taking "some.other.input" and creating sha1("some.other.input") =
sha1("johndoe:creditcard").

For instance, MD5 collisions are really easy to create, but for preimage
attacks on MD5 there is still no better approach than brute force.

------
sigil
Curious: why not just use an mmapped file for the store, à la Varnish? If the
file lived on the SSD you'd get in-memory caching for free from the OS.
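
Something like this, I mean (a minimal Python sketch; the path and size are made up):

    import mmap

    SIZE = 1 << 30  # e.g. a 1 GB store file living on the SSD

    with open("/ssd/cache.bin", "w+b") as f:
        f.truncate(SIZE)                     # size the backing file
        store = mmap.mmap(f.fileno(), SIZE)  # map it into the address space
        store[0:5] = b"hello"                # OS page cache keeps hot pages in RAM
        print(store[0:5])
        store.close()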

~~~
zellyn
They shape their reads and writes to work well with SSD characteristics.
Randomly reading and writing to an SSD (whether directly or through mmap)
would be slower and wear out the SSD more quickly.

~~~
hosay123
Modern SSDs simply don't work like this: they all internally use some variant
of log-structured storage, so that regardless of the user's write pattern a
single continuous stream is generated, and only one mechanism is needed to
distribute modified pages across the available flash. This means an infinite
loop rewriting the first 128KB of the device with random data will eventually
fill (most of) the underlying flash with random data (128KB because that's a
common erase-block size).

~~~
jlgreco
Are there standard practices for securely erasing any random SSD without
having to look up its implementation details? Or is this the sort of thing
you just use a shredder for?

~~~
hosay123
Encrypt it and store the key anywhere except on the drive. To erase, simply
destroy the key. Many motherboards come with a tamper-proof key storage device
you can reset on command (the TPM). There's a SATA secure-erase command, but
it's been shown that multiple vendors have managed to botch its implementation.
So if you can't make the encryption approach work, the shredder is probably
still your best bet.
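
The idea in miniature (a toy Python illustration of crypto-erase using the third-party cryptography package; a real setup would be dm-crypt/LUKS or a self-encrypting drive):

    from cryptography.fernet import Fernet  # third-party: pip install cryptography

    key = Fernet.generate_key()       # keep this in the TPM, or anywhere off the drive
    token = Fernet(key).encrypt(b"secret user data")

    with open("/ssd/blob.bin", "wb") as f:
        f.write(token)                # only ciphertext ever touches the flash

    # "Secure erase" = destroy the key; every copy the FTL scattered across
    # the flash is now indistinguishable from random bytes.
    key = None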

------
wazoox
So apparently inventing your own SSD caching system is all the rage? We
already had flashcache from Facebook: <https://github.com/facebook/flashcache/>
and bcache: <http://bcache.evilpiepirate.org/>

However, this one isn't kernel-based, so it won't help your NFS server or your
PostgreSQL engine. On the other hand, it's much easier to build.

~~~
wmf
Block devices and memcached are totally different; of course you're going to
need different implementations for different APIs.

~~~
wazoox
However a block layer cache can enhance any kind of IO.

------
cowmix
The README says that memory is 1000s of times faster than SSD.

Given that, how much faster is SSD than disk, and memory than SSD?

~~~
wmf
Latency Numbers Every Programmer Should Know:
<https://gist.github.com/jboner/2841832> discussion:
<http://news.ycombinator.com/item?id=4047623>

DRAM (~100 ns) has very nearly 1000x better latency than SSD (~70 µs), which
in turn is only about 100x faster than disk (~10 ms).
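
Working the ratios out from those numbers:

    dram = 100e-9   # ~100 ns
    ssd  = 70e-6    # ~70 us
    disk = 10e-3    # ~10 ms

    print(ssd / dram)   # 700  -- "very nearly 1000x"
    print(disk / ssd)   # ~143 -- "only 100x"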

~~~
rorrr
Actually, modern SSDs have lower latency than 70 microseconds:

26 microseconds for OCZ Vertex 4 (reads)

<http://thessdreview.com/our-reviews/ocz-vertex-4-128gb-ssd-review-and-1-4rc-fw-comparison-ssd-steroids-for-your-vertex-4/6/>

No more than 20 microseconds for Corsair Neutron GTX and Vertex 4 (writes):

<http://www.storagereview.com/samsung_ssd_840_pro_review>

~~~
bigiain
While that's a significant improvement over the 70µs assumption, it's still
"around three orders of magnitude slower than DRAM and two orders of magnitude
faster than spinning disks", and doesn't really change any of the conclusions
you'd have drawn using the 70µs number.

------
thelarry
Cool idea. I believe Aerospike uses SSDs for a key-value store like this.
Check out <http://www.aerospike.com/performance/architecture/>.

~~~
23david
Yep. Aerospike definitely had the first implementation of this kind of
SSD-based architecture I'd heard of: use RAM only for hot caching and metadata,
use SSDs with custom filesystems for persistent storage. It's a good idea.

It's too bad most cloud providers consider SSDs to be a 'premium' feature. I
guess this would work fine on custom-configured hardware at places like
SoftLayer and ServerBeach.

------
prodigal_erik
"Slab item chunk sizes" suggests they retained memcache's external
fragmentation problem (if you mostly have big objects expiring, those slabs of
memory won't ever be reused for storing small objects, or vice versa). On
memcached with no persistence, you could recover from this by restarting (if
you could withstand the availability hit), but what do you do once you're
relying on long-lived state?

~~~
thinkingfish
This is not an issue for fatcache. Slab item chunk size introduces internal
fragmentation (because item sizes usually don't match "chunk" sizes unless
you use a slab profile, the -z option, to match them up). The kind of external
fragmentation you describe is due to the eviction strategy memcached uses,
and can be avoided with slab-level eviction strategies (Twemcache and the
latest Memcached both support this). fatcache does slab-level eviction based
on write timestamps, which is equivalent to LRC (least recently created)
eviction in Twemcache; you can read about the mechanism here:
<https://github.com/twitter/twemcache/wiki/Eviction-Strategies>
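
A rough illustration of slab-level LRC eviction (Python pseudocode of the general idea, not fatcache's or Twemcache's actual internals):

    import time

    class Slab:
        def __init__(self, chunk_size):
            self.created = time.monotonic()  # write timestamp for the whole slab
            self.chunk_size = chunk_size
            self.items = []

    def evict_one(full_slabs):
        # Reclaim the slab written longest ago, wholesale: no per-item
        # bookkeeping, and the freed slab can be reassigned to any chunk
        # size, which is what sidesteps the external fragmentation above.
        victim = min(full_slabs, key=lambda s: s.created)
        full_slabs.remove(victim)
        victim.items.clear()
        return victim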

------
valyala
I believe go-memcached is faster than fatcache:
<https://github.com/valyala/ybc/tree/master/apps/go/memcached>. Will return
with performance numbers soon :)

------
stcredzero
Cool, but what I was envisioning when I came up with the term "fat cache" over
a year ago was client-side. This seems to be server side, unless I've
misunderstood.

<http://news.ycombinator.com/item?id=3544522>

------
benaiah
This is insane.

Awesome, useful, and cool, but insane.

I like it.

~~~
throwaway54-762
No, it's pretty sane. I work at a well-known NAS storage-appliance vendor and
we're doing something quite similar in-kernel as a caching layer for our
filesystem.

~~~
benaiah
When I said "insane", I meant more "this is not something you would expect or
that is immediately obvious". My phrasing wasn't the finest.

