Samsung Announces Key-Value SSD Prototype (anandtech.com)
441 points by nikhizzle | 237 comments

IBM mainframes had key-indexed hard disks back in the 1960s – CKD (and later ECKD). Each disk record could have a key field at the start, and rather than addressing records by physical location, you could tell the hard disk "give me the data of the record with key 1234", and it would go search for that record and return it to you. (I think you still had to tell it which disk track to search, and it just did a sequential scan of that track...)
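A toy sketch of what that key-addressed read amounts to (names are mine, not real channel-program syntax):

```python
def search_key_equal(track, key):
    """Toy model of a CKD-style keyed read on one track: each record
    carries a key field ahead of its data, and the hardware scans the
    track sequentially until the key matches."""
    for record_key, data in track:  # records pass under the head in order
        if record_key == key:
            return data
    return None  # key not found on this track

track = [(1233, b"alpha"), (1234, b"bravo"), (1235, b"charlie")]
print(search_key_equal(track, 1234))  # b'bravo'
```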

A lot more primitive than this, of course, but funny how old ideas eventually become new again.

It's also interesting that nobody implements CKD on the storage devices themselves. That complexity runs on a massive POWER AIX box in IBM's storage systems, or on a similar commodity powerhouse running UNIX or an RTOS in comparable arrays of the past couple of decades.

You have a comparatively weak CPU (think ARM Cortex M series) on an SSD and drive companies have been notoriously bad at firmware development.

Frankly, I don't trust Samsung to implement a file system. I prefer to not even use their enterprise flash based on past trauma, but those are usually harder fails than the subtle fuckups they can pull off in an opaque FS.

You are already using a translation layer in the SSD; most SSDs have wear levelling, error correction, compression, etc. What your OS thinks it's writing to disk is very different from what ends up stored in the flash chips. I think you are fighting yesterday's war.

That is exactly the reason I am using ZFS for important data -- one should not really trust the storage devices to be fully reliable.

It is a real pity that so few filesystems have integrated data checksumming -- I'd love to use something other than ZFS for this.

SSDs are not the only storage device that lies about these things. Just about the only one that doesn't is tape, and I'm not even sure about those.

Fighting opaque and unsupportable complexity will always be the current war in technology.

See also CAFS (Content addressable file store) [1], developed in the 1970s by ICL, which was at the time the UK's flagship mainframe computer company. It didn't make a huge impact on the market, probably because it was expensive to make and expensive to run. [1] https://en.wikipedia.org/wiki/Content_Addressable_File_Store

In "Moving targets: Elliott-Automation and the dawn of the computer age in Britain" Simon Lavington has documented[1] a computer built by Elliott Brothers (later Elliott Automation, then subsumed into ICL) named OEDIPUS that used content-addressable hardware storage. It went into production at GCHQ in 1954.

The computer also gets a mention in his 2011 book "Alan Turing and his contemporaries".


Thank you for sharing that. Reading material for tonight ;) I googled ECKD and it seems it is still in use and supported by IBM (maybe through emulation).

It really only still exists for legacy reasons.

The original indexed file format under MVS, ISAM, heavily relied on the physical hardware keys on (E)CKD disks. In the 1970s, IBM replaced ISAM with VSAM, which didn't use that physical key feature and handled all the keys in software – and it was actually faster. (I don't know if the reason for it being faster was due to some fundamental flaw with the underlying idea, or just due to limitations of the particular implementation.) (Despite being deprecated, IBM continued to support ISAM until the mid-2000s, when they finally removed it from the operating system.)

IIRC, the legacy PDS (partitioned data set) format also uses it, but not the newer PDSE 1.0 or PDSE 2.0. (PDSE = partitioned data set extended). (PDS is like an archive file format, for object code it functions similar to .a files on Unix, for source code it was used to make up for the fact that originally MVS didn't really have the concept of directories, so keeping all your program's source modules in a single dataset was more convenient.)

I think the mainframe filesystem (VTOC) also uses it a bit.

There's probably a few other random things in z/OS that still need it, but the newer stuff (VSAM, PDSE, HFS/zFS, etc) doesn't really use it. I think if IBM really wanted to, they could add support to z/OS for running on industry-standard FBA disks instead of ECKD. (They did the same thing to VSE all the way back in the 1980s.) I think the main reason they don't do it, is not technical, but commercial – mainframes needing special SANs helps keep their storage business alive, if mainframes could work on industry standard SANs, they'd face more competition in that area.

This involves simply exposing to software how the spinning-rust drive works internally.

On most magnetic drives there is no inherent relationship between the sector number and its physical location on the track; the drive simply waits for the sector with the correct sector number in its header to come under the head.

> most magnetic drives there is no inherent relationship between the sector number and its physical location on the track

This is false. Sectors are approximately in block order. You can write a predictor which can predict the latency to read a block based on the last block read using that information.

You can also take the lid off a drive and watch it with a high speed camera to see where each bit of data gets stored.
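The predictor mentioned above can be sketched as a toy rotational model (the geometry and spindle speed here are assumptions for illustration, not any real drive's parameters):

```python
def rotational_delay(last_block, next_block, sectors_per_track=100, rpm=7200):
    """Estimate the rotational latency between two blocks on the same track,
    assuming sectors are laid out in approximate block order."""
    rev_time = 60.0 / rpm                              # seconds per revolution
    delta = (next_block - last_block) % sectors_per_track
    return delta / sectors_per_track * rev_time

# Reading a block half a track away costs half a revolution:
print(rotational_delay(0, 50))  # ~0.00417 s at 7200 rpm
```

A real predictor would also fold in seek time across cylinders, but even this crude angular-distance model captures the fact that latency is a function of where the last read left the head.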

On Prime minicomputers back in the 80's, there was a disk controller order to format a track. That's what broke it into sectors and put sector ID fields into each sector, in whatever way the disk controller wanted.

A read was in 3 parts: first do a seek over the right cylinder, then select a head, then read a sector. That last part read everything spinning by until it saw a sector with the correct sector ID.

Sectors were not in block order because if you read sector 1 then tried to read sector 2, it would have already spun past the head, meaning sequential sector reads would require a complete revolution of the disk.

So instead, sectors were interleaved with a skip. There were 9 sectors on a track, so the sector IDs might be written as: 0 3 6 1 4 7 2 5 8. You could read 3 sectors per disk revolution instead of 1.
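That skewed layout falls out of one simple rule; a quick sketch reproducing the 9-sector, interleave-3 example above:

```python
def interleave(nsectors, factor):
    """Physical order of logical sector IDs for a given interleave factor:
    write every `factor`-th ID around the track, then start again one
    logical sector later."""
    return [s for start in range(factor) for s in range(start, nsectors, factor)]

print(interleave(9, 3))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]
```

With this layout, logically consecutive sectors sit three physical slots apart, which gives the controller time to finish handling one sector before the next one arrives under the head.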

There usually is some relationship, but does not have to be.

> but funny how old ideas eventually become new again.

I dislike this meme - things are generational, not cyclical.

This continues the trend we see in processors, that in a post-Moore's law environment, rather than trying to push physical limits for performance improvements, we're branching out into hardware optimized for specific purposes. Very neat to see it on the storage side, and something I don't think anyone could fathom back in the spinning disk days.

I think it's happening somewhat in the software space, where we pursue speed improvements by penetrating through sometimes literally decades of accumulated cruft layers to build something more modern. It is ironically harder precisely because it's easier; we've got a lot more software layers because it's so much easier to add another one. Wayland is probably one big example of it, but I often have this feeling in my Go code, too, where instead of a framework on top of a framework on top of a very framework-y core language like Python, we more-or-less just get on with it, both at the level of the code I type, and in the generated assembler as well. I imagine Rust programming feels much the same way when working with Rust-native libraries, I just don't have direct experience.

It's not all Electron apps out there.

Also, when I call something "accumulated cruft", it's not just vague "bloat" concerns I'm referring to; you have the problems of several layers of leaky abstractions, and those leaky abstractions may well date from an era of radically different hardware. It's true this can cause "bloat", but it causes other problems, too.

QUIC/HTTP3 could be another software example of this.

There is too much cruft in the TCP/IP + HTTP stacks, so the solution is to implement a new protocol on top of the lowest abstraction that can be practically used (IP+UDP) in user space; at least until kernels implement it.

I will delegate judging if this is a good idea to more knowledgeable parties though.

Isn't QUIC/HTTP3 the opposite? It's moving up the layers of abstraction, by moving something that would previously have been managed at layer 4 up to a higher layer.

Running everything that used to be done in the network layer up to the application layer and putting everything on top of HTTPS is the opposite of what the parent is advocating for.

If we consider it relative to HTTP1 just to heighten the contrast, it takes the layers of: TCP connection management and negotiation, plain text framing protocols, protocols too limited to deal with the greatly-increased use of or need for multiple streams at once, SSL layering its own negotiation in, rather hacky stories around websockets/SSE (which are usually used as text protocols framed inside of text protocols framed inside of text protocols... not really the best setup), and probably a couple of other things. All of these things were designed for a different era at the very least, and often designed independent of each other. HTTP3 slices through the whole thing and creates nearly an entire protocol stack for its needs, using only the minimum it needs from IP to work over the internet at all.

My software engineering instincts tell me this is all a bad idea and those layers should be kept separate as they were, if indeed they weren't already intermingled more than they should be (e.g., SSL & SNI; "why should SSL have to know anything at all about how HTTP deals with host names?" say my instincts). However, I intellectually realize that my software engineering instincts are not calibrated for the size of the HTTP world, nor the amount of resources that can be thrown at this problem. It's a great example of slicing through many layers that make sense independently, but just never quite went together properly, to make something that's going to perform better without 30 years of mismatching layers stacked on top of each other.

(I'm not saying HTTP3 is perfect, only that it is a good example of someone at least trying to do what I was talking about.)

I guess it's kind of both.

It circumvents the established network and software stack and builds a new protocol on the lowest-level abstraction that can be used practically (IP + UDP) because everything else would require hardware and kernel support, which would take many years to manifest.

Nobody commercially perhaps, but I've been fathoming it for a while. I kept wondering when we would see a RAID controller with enough juice to run PostgreSQL directly. With a key-value store in existence maybe some more people will be looking into such things.

Sun Microsystems' motto was "the network is the computer" but over the last fifteen or more years there's been a steady trend in the opposite direction.

The computer is a network. I sometimes wish this would be embraced more thoroughly, not just in the way you mention, of offloading work to specialized processing units, but of building a network fabric inside the box. It's one of the things we never copied down from classical mainframe design.

>Sun Microsystems' motto was "the network is the computer"

Just a note: that trademark now belongs to Cloudflare [1]

[1] https://blog.cloudflare.com/the-network-is-the-computer/

Well, for one of the world's largest CDNs, I have to say this is pretty appropriate.

> Very neat to see it on the storage side, and something I don't think anyone could fathom back in the spinning disk days.

Seagate had a product called Kinetic that was an Ethernet-connected key-value spinning disk. We had a 1U box with a number of 4000GB drives in our lab for an evaluation.

It was interesting but we didn't find a use for it. Product is dead now as far as I know.

We're nowhere near ‘post-Moore’ in persistent storage, though. Although Dennard scaling for CPUs has faltered (if not Moore proper), storage is still getting faster and denser at the same time... it just does the speed upgrades in discrete jumps. SSDs were massively faster than HDDs, and when Optane DIMMs actually become available, those will be another huge factor faster than SSDs. (And if not Optane, then QuantX or MRAM or NRAM or ReRAM or...)

CISC is back with a vengeance!

Honestly, this is more like the revenge of the mainframes.

Having enough logic in all your storage and peripherals that the hardware itself presented a nice, directly usable API was the thing that differentiated mainframes in the era when minicomputers were cheaper and started getting faster CPUs. (That and "no-one's ever gotten fired for buying IBM".)

This was my impression as well.

I've always wondered why we didn't have memory with lookup operations implemented in hardware.

I think some network chips have associative operations like this in hardware (find route matching ...), it would be interesting to have a generally available solution.

I would be surprised if caches and virtual memory TLBs weren't built on associative memory. The thing with associative memory is that the logic is simple in principle, but it becomes a sprawling array of AND gates when implemented in hardware. That takes it out of the competition for high-density memory chips.

This is a sort of punctuated optimization that is natural in any technology.

When a fundamental technology is evolving rapidly, not much effort is put into optimization. It instead goes into keeping the momentum of evolving the fundamental technology.

When it hits a physical limitation, everyone scrambles to now optimize.

Until the next generation of fundamental tech shows up and evolves rapidly again.

Think about air travel and how every airline is now all about fuel efficiency.

If a single new technology comes along, e.g. a new battery chemistry that is, say, 2x as energy dense as the best right now, electric air propulsion will become a major thing. And airlines will be scrambling to offer faster, quieter options that may not necessarily be the most efficient.

Maybe slightly off-topic, but I've always struggled to understand the appeal of key-value stores like Redis and DynamoDB. I tried using one once as a substitute for shared memory in an embedded device, but you lose a lot of information (like type), and it seems you can't represent or query complex data structures like nested structs without first serializing them to some data format and then storing that in the store (but at that point it starts to get slow/complex, you can't easily examine what's in the database, and so it's easier to keep using shared memory).

But clearly I'm missing something because everybody loves them and uses them. Is this a technology I would only use when building websites at scale or something?

I was in your camp for a long time until it clicked while attending an AWS summit presentation by the head DynamoDB evangelist/wizard Rick Houlihan.

He said (paraphrasing) "a lot of people repeat this mantra that DynamoDB (and by extension, NoSQL in general) is flexible--which couldn't be further from the truth--it's not flexible, but it's extremely efficient."

He went on to elaborate that relational DBs are not going away, and if 1) your data will never be huge and 2) you just can't predict the query patterns in the future of your app--a relational DB is still the way to go.

However, if your data is both huge and you can spend the up-front careful planning of the queries you'll need to support and can afford the risk of future migration efforts that could be needed if your query needs outgrow your model--then DynamoDB is a slam dunk.

If you fall into that category, then you'll need to spend the time learning the "advanced patterns" of DynamoDB which include key overloading, adjacency-list pattern, materialized graph pattern, and hierarchical key pattern--and then compose those patterns into a custom "schema" (in the parlance of relational DBs).

I'm building an app right now and it took me about 30 hours to collectively enumerate the 40-odd queries I'll need to support and the 10 iterations of overhauls of my design. But boy, was it worth it, because the DB is going to easily be the cheapest m'fu* component of my app. Compare that to (taking an extreme alternative) paying for something like MS SQL server which is colossal money sink. Even compared to open-source like Postgres, my setup here will be probably < 1/10th the cost and if, for whatever reason, I see a drop in traffic or I need to shut down for a period, my DB isn't metering for compute--only storage.

In a drawn-out answer to your question of "would (I) only use (this) when building websites at scale or something?" the answer is generally "yep, probably".

Finely put!

I'm not a big fan of KV stores either, but I accept that they exist because there's an origin story for these things that makes a lot of sense. External storage of shared state fixes problems for shared-nothing and shared-everything programming environments.

In Java it gives teeth to the "write once, read many" pattern. By avoiding side effects you can scale your team larger. But it's a constant drain on resources to police this convention. There's always someone who thinks their reason to violate the rules is valid. Pushing the data out of process increases the cost of modifying data that should be read-only, and violations advertise loudly. You can't achieve it by obfuscation, intentional or otherwise.

Practically though, at the time KV stores were coming into existence, you could buy hardware with so much memory that Java couldn't keep up. They hit a GC wall that was causing serious problems. And running multiprocess was just anathema. If you push all of the long-lived objects out to the KV store not only does memory drop like a rock but what's left over is 'young' objects, and GC often optimizes for short-lived objects. In a way it becomes a self-fulfilling prophecy. By making in-process caching expensive they made it unwelcome.

In this same timeframe, latency on network cards became lower than latency to your drive controller. Reading something out of memory on another machine in the same data center was faster than getting it off of your own disk.

Meanwhile, in an environment where all tasks are pretty isolated from each other, in-process caching is ungainly. Either you suffer a low hit ratio, you do some sort of bizarre traffic shaping, or you do caching at the ingress point so you don't make the requests at all. But that shapes your engineering priorities to an extent most people don't like. Still, I see it as one of those "uncomfortable" things that Continuous Delivery advises you to face head-on. Cache invalidation is hard, yes. Wear a helmet.

So the shared-nothing people also liked these tools, which means you have two groups that probably normally wouldn't talk to each other pulling for the same team.

> In this same timeframe, latency on network cards became lower than latency to your drive controller. Reading something out of memory on another machine in the same data center was faster than getting it off of your own disk.

Was this before SSDs?

That era started before SSDs went mainstream, and it's not over. RDMA has much lower latency than NAND flash, which is why NVMe over Fabrics is getting so popular. Fetching data from a locally attached SSD isn't much faster than fetching it from a different server in the same rack.

To put some numbers on that, typical TLC flash is around 50 microseconds to access, optane and specialized SLC flash are under 10, and an infiniband connection between machines is around 1 to 1.5 microseconds.

There's a very easy to understand use-case in storing data where you don't really care about the type or querying capacity. Imagine you've got an API request coming in; your server generates the response, and then you cache that response in a TTL'd KV store where the key is some deterministic hash of the appropriate information from the request (method, path, body, requestor, etc). Future requests just use the cache for a bit.

Here, we don't care what gets stored there, because our API server isn't really "using" the data that gets returned; we just throw it back at the client. We also don't care about complex querying. That's where KV stores shine.
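A minimal sketch of that response-caching pattern (the field list and key prefix are illustrative choices, and the dict-backed class stands in for a real TTL'd KV store like Redis):

```python
import hashlib
import json
import time

def cache_key(method, path, body, requestor):
    """Deterministic key from the request parts we treat as significant."""
    raw = json.dumps([method, path, body, requestor], sort_keys=True)
    return "resp:" + hashlib.sha256(raw.encode()).hexdigest()

class TTLCache:
    """Dict-based stand-in for a TTL'd key-value store."""
    def __init__(self):
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None  # missing or expired

    def set(self, key, value, ttl):
        self._store[key] = (time.monotonic() + ttl, value)
```

On a miss the server generates the response and calls `set(key, response, ttl)`; identical future requests hash to the same key and return the cached bytes untouched.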

It's just a different pattern for different use-cases, at least in DB land. If you need complex querying, then KV probably isn't for you. But if you need ridiculously fast lookups and perfect horizontal scalability, then you could equally say that SQL isn't for you, and KV is.

Key-value stores satisfy a set of use-cases, as do NoSQL, and SQL.

Key-value is great when you have a key-value pair that you'd like to cache. Specifically in web workloads, maybe you'd like to cache the API response (value) to a certain API query (key) such that its returned without processing power from your backend. Or maybe user permissions per action, where they don't really change as they're reliant on the user's role.

I'm not entirely sure what a valid use-case would be for an embedded device, but if there's any kind of SQL query you would make on an embedded device that returns a response that hardly changes, you might also spin up some kind of key-value store such that it's returned without (well, faster than) SQL query time.

It probably doesn't make sense to completely replace a SQL database if you're trying to represent relational data.

I've been looking at Redis recently as a potential solution to some caching I might want to do, and one hurdle I encountered almost immediately is:

Suppose I have some function, get_members_of_group(dt, groupname). It returns a set of strings. There are arbitrarily many groups, so I can't just cache all of them in one key. Someone might make a new group at any time, and groups are never deleted or changed, they just age out.

So far, this seems great. I make get_members_of_group_from_redis() first check redis for a key group_members_{dt}_{name}. If it's there, return it. If it's not, get it, cache it, return it.

But groups can also be empty.

Redis doesn't let me store an empty set. If a group comes back empty, the fact that it's empty can't be cached because you can't tell the difference between "not cached yet" and "cached but empty".

I've googled it a bit and all of the workarounds (sentinel key/value pair, storing "empty" at that key so the SMEMBERS fails with an error, etc) are hacky and make me not want to use Redis.

Is this a me problem and everyone is OK with slamming their db with hits if a cached function returns empty, or is everyone just ignoring all non-str datatypes and storing everything as JSON? Or is it just that weird to have a function that might return an empty set that nobody else has this problem?

Yes, Redis requires more manual tracking of such things. What you have to understand is that it is incredibly fast: you can easily do hundreds of thousands of SISMEMBER lookups in the time it takes to serve a typical request.

> or is everyone just ignoring all non-str datatypes and storing everything as JSON

I doubt it. I've been surprised by how many of the datatypes I have found uses for.

I'd guess that most people don't see a sentinel value to distinguish between empty and unknown as a serious problem.

Supposedly you can use a Lua script - although that’s a somewhat big hammer for such a simple problem...

Isolate the hack in a function with a snarky comment.

Sometimes speed is worth it.
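A sketch of what that isolated hack could look like (dict stand-ins here; with real Redis the marker would be a separate string key given the same TTL as the set and checked via EXISTS):

```python
class GroupCache:
    """Distinguish 'cached but empty' from 'not cached yet' with a
    sentinel marker key, since Redis deletes empty sets."""
    def __init__(self):
        self._markers = {}  # stands in for string keys (SET / EXISTS)
        self._sets = {}     # stands in for set keys (SADD / SMEMBERS)

    def get(self, key):
        if key + ":cached" not in self._markers:
            return None                    # not cached yet -> go hit the DB
        return self._sets.get(key, set())  # may legitimately be empty

    def put(self, key, members):
        self._markers[key + ":cached"] = "1"
        if members:
            self._sets[key] = set(members)
```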

DynamoDB is useful for services that are likely to have a very large amount of load and require minimal maintenance. For example, Mozilla uses DynamoDB to back its browser push notification service. It would be better if it supported SQL, like Google Spanner, but it's still useful.

Redis is useful because it can do so goddamn much. Caching, task queuing, hyperloglog, geospatial indexing. Set it up once, and it can replace a very large number of cloud services. And it has a lot of options for replication, backups, clustering, etc. It is like a Swiss army knife.

You're not wrong about anything.

   Is this a technology I would only use when building websites at scale or something? 
Basically, though I'd amend that to read "websites that might need to scale at some point."

For a lot of projects, shared memory is a perfect choice. But if you might need to scale up to multiple instances at some point, shared memory obviously won't do. And since it's not like there's a huge penalty for using Redis, a lot of times it's just convenient to use it from the beginning.

> And since it's not like there's a huge penalty for using Redis

There is compared to shared memory.

If you actually need to scale and don't have an in-memory cache layer over Redis you're likely going to get hurt.

The real solution here is to simply write your code with an appropriate amount of abstractions, and then start with the simplest / cheapest backend that fits your needs. When it comes time to add Redis, or shared memory, or both, only a small amount of your code needs to be adjusted for the new backing store.
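A minimal sketch of that abstraction (names are illustrative):

```python
from abc import ABC, abstractmethod

class Cache(ABC):
    """Call sites depend only on this interface, so swapping the dict
    backend for Redis or shared memory later is a local change."""
    @abstractmethod
    def get(self, key): ...

    @abstractmethod
    def set(self, key, value): ...

class InProcessCache(Cache):
    """Simplest backend that fits: a plain dict."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
```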

I find Redis is very useful when you need fast, temporary access to memory in a distributed system. I have used it several times as a distributed lock, or as storage for real-time computations in a distributed system that only use primitive values, for example aggregating page views into different clusters (hashes or sets in Redis) and then saving the result to DB.

Is Redis safe to use for a distributed lock?

The Redlock algorithm suggested for use with Redis has been the subject of some criticism:


Kleppman is the author of Designing Data-Intensive Applications and generally someone whose opinion I trust in such things.

I love Redis and use it extensively at $dayjob, but I stick to Consul for managing distributed locks. Consul was built from the ground up to handle such things; Redis handles it as a bit more of an afterthought / consequence of other features.

True, not a native feature of Redis. I'll put Consul on my list of things to check, thanks for contributing to my knowledge.

Regarding the article, I realized I used the lock for efficiency and not correctness, and it might be why I never encountered issues.

Antirez claims it is. Others are far more critical of his claims.

Personally I'd strongly suggest using a proper consensus implementation such as etcd, consul, zk.

If implemented correctly, there are guarantees of safety and liveness: https://redis.io/topics/distlock

No, it relies on time.

You should use Zookeeper, etcd or Consul

The article mentions RocksDB, which is a great backend used in other databases. It's key value, but it can be used to build other databases, for instance graph databases.

Key-value can also be just part of a database. Like having a key-value index.

For making redis a little more usable beyond pure key-value operations take a look at their tutorial on "Secondary Indexing": https://redis.io/topics/indexes

If you care about typing and complex querying though, you might be better off using SQLite or similar.

Stack Exchange as one example aggressively uses Redis as a caching layer for practically its entire service.

Redis is very fast, dependable and relatively easy to use. The alternative, other than a competing product, is pretty much that you have to write your own custom replacement Redis in some manner. That might be fine at a small scale for fun or in an experimental product that isn't meant to go into production; otherwise you want something well proven, that many other engineers you can hire are likely to have experience with (it's hard to over-emphasize that last point).


A filesystem is a KV store; if you can solve a problem by storing data in files, then you can use a faster and more scalable KV solution.

I think it makes more sense to think of KV as a minimalistic distributed filesystem, rather than as a limited kind of memory or a stripped-down SQL database.
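The correspondence is direct enough to sketch (a toy mapping, not a real filesystem):

```python
class KVFiles:
    """Toy mapping of file operations onto a key-value interface:
    path -> contents, with directory listing as a key-prefix scan."""
    def __init__(self):
        self._kv = {}

    def write_file(self, path, data):
        self._kv[path] = data  # PUT(key=path, value=contents)

    def read_file(self, path):
        return self._kv[path]  # GET(key=path)

    def list_dir(self, prefix):
        # directory listing becomes a key-prefix scan
        return sorted(k for k in self._kv if k.startswith(prefix))

fs = KVFiles()
fs.write_file("/etc/hosts", b"127.0.0.1 localhost")
print(fs.list_dir("/etc/"))  # ['/etc/hosts']
```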

Samsung didn't make this for the long-tail of consumers who will never need hardware specific solutions for their KV stores - it's for Facebook, Google, etc who build hyper-optimized KV stores that operate at huge scale.


Redis is not a key:value store. That's memcached. Redis is a data structure store with native operations that can be performed on those data structures.

I'd put it as "redis is not just a key:value store", because it certainly is one and more.

Redis lets you store and manipulate some basic data structures like linked lists, maps, and sets -- so a value in Redis doesn't necessarily have to be a big, serialized blob (though I bet that a lot of people use it this way, as if it was a drop-in replacement for Memcached).


In addition to the many good answers, it's also used as a backend to SQL databases (like RocksDB which is mentioned in the article).

> but you lose a lot of information (like type)

"We have perfect support for both kinds of type: strings and JSON blobs."

How about storing a centralized cache? Redis has been very helpful for us in that.

It's great when you need to share memory between pods, containers, machines, etc.

Values can be references, and you can query data structures with traversal.

Would be interesting if this evolves into a full filesystem implementation in hardware (they talk about Object Drive but aren't focused on that yet). Some interesting future possibilities:

- A cross-platform filesystem that you could read/write from Windows, macOS, Linux, iOS, Android etc. Imagine having a single disk that could boot any computer operating system without having to manage partitions and boot records!

- Significantly improved filesystem performance as it's implemented in hardware.

- Better guarantees of write flushing (as SSD can include RAM + tiny battery) that translate into higher level filesystem objects. You could say, writeFile(key, data, flush_full, completion) and receive a callback when the file is on disk. All independent of the OS or kernel version you're running on.

- Native async support is a huge win

Already the performance is looking insane. Would love to get away from the OS dictating filesystem choice and performance.
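The callback-style durable write imagined above could look something like this; to be clear, every name here is hypothetical, and none of it comes from Samsung's actual KV SSD interface:

```python
import threading

class KVDrive:
    """Hypothetical sketch of writeFile(key, data, flush, completion):
    the device persists the value under its key and fires a callback
    once the write is (notionally) durable."""
    def __init__(self):
        self._media = {}
        self._lock = threading.Lock()

    def write(self, key, data, flush, completion):
        def worker():
            with self._lock:
                self._media[key] = data  # value lands under its key
            # a real device would only fire this once data hit the media
            # (flush=True meaning RAM + battery backing has committed it)
            completion(key)
        threading.Thread(target=worker).start()
```

A caller would block on an event or continue with other work until the completion fires, independent of whatever OS or kernel version is running.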

> Would love to get away from the OS dictating filesystem choice and performance.

Not wishing to be a wet blanket, but do you really believe having disk manufacturers dictating filesystem choice and performance will actually be an improvement?

They already have a huge impact on performance and in my experience most people don't care which filesystem they use as long as the performance is good.

What most people care about is not what developers care about. I sure hope drives will not dictate what kind of file system we have to use.

Current filesystems are software-defined on top of block-based storage; I don't see a reason why we can't still have software-defined filesystem layers built on top of key-value-based storage. Key-value storage just lets you keep a potentially more meaningful lookup ID than block storage does, so it could offload a little work from the filesystem layer while keeping the exact same filesystem API in place.

> They already have a huge impact on performance and in my experience most people don't care which filesystem they use as long as the performance is good.

Some of us care about not losing metadata randomly too... or suddenly having hardlinks get duplicated...

Absolutely. SSDs have gone from 100+MB/s random (already a huge improvement over the spinning rust of the time) to maxing out PCIe x4. Over the same timeline, what have filesystems done?

- Apple made APFS, which seems to show roughly the same performance as HFS+ on SSD hardware. Some benchmarks show faster, some show slower, but it's certainly no 5X improvement across the board.

- btrfs and ZFS have some amazing features, but remain niche and don't seem to yet have wide deployment. Linux distros seem locked on EXT4.

- Windows has NTFS (and ReFS for the server)

At a consumer level, I care about: a) reliability b) performance c) power. SSD manufacturers so far have delivered hugely on all three. I'd love to see how much more they could deliver by moving a lot of filesystem functionality onto the device. Sure the first few iterations might suck like early SSDs did, but in a few years?

>Absolutely. SSDs have gone from 100+MB/s random (already a huge improvement over the spinning rust of the time) to maxing out PCIe x4. Over the same timeline, what have filesystems done?

This is the most "apples to oranges" comparison I've ever seen!

Filesystems do something totally different from hard disks or SSDs, and you can't replace one with the other. One stores data in hardware form; the other defines the format, reliability (e.g. journaling), querying, metadata, attributes, security, etc. of the data.

The point is there, though: hardware has more room to improve than software does, when it comes to IO-heavy tasks.

Err... Firmware has a history of being closed, buggy and quickly non-supported by the manufacturer. Block level abstraction exposed to the client is as brain-dead as it gets - and it is a good thing. Just imagine that your SSD had an S3-type interface for five years. And you only could use the tools that Amazon released five years ago. No updated versions. No bug fixes. And no "backend changes" to workaround bugs/etc and that was your only copy of the data.

We'd have hacks to support all of our OS level concepts like writable, readable, listable, executable, hidden, system, archive. How would someone implement tail, what about keeping something open for writing? I'm not sure any of this stuff is thought through except by OS writers. If we suddenly put this task to device manufacturers, would tail suddenly never work again? And if that doesn't work, what kind of monstrosity will happen if we simply attempt to start windows or linux on it?

And even though our OSes have bugs, at least we can reasonably debug and patch them. The same cannot be said of completely proprietary and invisible firmware.

Uh, a lot of this is in firmware on the controllers of the SSDs. And often it's buggy as fuck, to the point that Linux has blacklists hardcoded to deal with crappy firmware in SSDs.

I confess I'm not sure why people are talking about bugs now. I was talking about how hardware can access higher levels of performance and efficiently guarantee more things (eg. early persistence guarantees) than software can.

If the firmware has severe bugs (which is incredibly common with all firmware) and is hard to debug and patch, the performance is irrelevant. Not to mention that there are many examples of existing firmware bugs where the performance of "simple" hard drives tanked -- adding more code to already-buggy firmware that has a history of performance problems is unlikely to improve performance (let alone reliability).

This is a rather different claim to saying there wouldn't be a benefit. I'm not saying filesystem accelerator SSDs would necessarily be in sum a good thing, just that in principle it seems like the kind of thing that should perform better than today's abstractions.

Because the hardware is implemented with software.

You are rather missing the argument. If KV SSDs are much faster than doing the same with a software DB on top of a block abstraction, is that not proof that these are inequivalent? GPUs run software too, but that doesn't mean it's in anyone's best interests to run that on the CPU—not all software is the same, precisely because the hardware is different.

The more I think about it, the more I'm thinking that allowing SSDs to present a higher level API than block storage would be an incredible win. Consider:

- Currently, SSD controller firmware goes to incredible lengths to present something that is not a contiguous block device (due to the way flash works) as a contiguous block device to the OS. Likely this includes a huge amount of logic for storing and updating the mapping between logical blocks and actual hardware storage location. Since the control APIs currently are very low level, the controller basically has to buffer everything, transform that into what the flash needs to do, and then execute, not unlike how a CPU decodes instructions and converts to uops or reorders. Presumably this is quite difficult as early controllers had all sorts of problems such as performance tanking when the disk got close to full.

- The filesystem then pretends that the very non-uniform storage provided by the SSD is actually a uniform block and introduces a layer of metadata in a bespoke format to determine the actual disk locations of files, directories, and their associated metadata. Along with that, it can include lots of very nice extra features such as encryption, journaling, hashing (for integrity and de-duplication), CoW, symlinks, hardlinks, directory hardlinks, etc.

- It seems likely then that were the storage API to be raised to a higher level (files and metadata, rather than block storage), the SSD controller can implement every one of those features with superior reliability, performance, and power usage. We've already seen where software encryption went to hardware encryption -- the result is that the encryption is mostly transparent to performance. What would happen if a file's extents mapping was simply the physical location mapping the SSD controller is already doing? If if could transparently perform hashing on read/write or keep a proper journal of file operations? Handle CoW transparently?

And all of that implemented in a way that those just become standard storage features rather than requiring a certain OS to run.
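A minimal sketch of the "double mapping" described above, with entirely made-up page addresses: today the FTL and the filesystem each maintain a map, and a key-value interface could collapse them into one map kept where the placement decisions are already being made.

```python
# FTL inside the SSD: logical block address -> physical flash page
ftl = {0: "die0/page17", 1: "die1/page3"}
# Filesystem on the host: file -> list of logical block addresses
fs_extents = {"/etc/hosts": [0, 1]}

def read_file_via_block_device(path):
    # Two mapping layers are consulted for every access.
    return [ftl[lba] for lba in fs_extents[path]]

# A key-value device could keep a single map, maintained by the
# controller that already decides physical placement.
kv_map = {"/etc/hosts": ["die0/page17", "die1/page3"]}

def read_file_via_kv_device(path):
    return kv_map[path]  # one lookup instead of two

assert read_file_via_block_device("/etc/hosts") == \
       read_file_via_kv_device("/etc/hosts")
```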

I'd really like to go other direction, and have SSDs expose non-uniform storage.

Right now, if my filesystem dies, I have a ton of tools and manuals for recovery. This is because the format is well known, and diagnostic tools are easily available.

If my SSD dies, it's just gone. In two cases I had, the drive just was not allowing any file reads at all -- I had zero chance to recover the data. The drive internals are totally opaque.

So no, I would prefer my SSDs to be super dumb -- I do not trust the manufacturers to get recovery tools right.


I'd love to see something like Open Channel SSD's ( http://lightnvm.io/ , https://openchannelssd.readthedocs.io/en/latest/ ) becoming mainstream.

Spec at http://lightnvm.io/docs/OCSSD-2_0-20180129.pdf

> This specification defines a class of SSDs named Open-Channel SSDs. They are different from traditional SSDs in that they expose the internal parallelism of an SSD and allow the host to manage them. Maintaining the parallelism on the host enables benefits such as I/O isolation, predictable latencies, and software-defined non-volatile memory management, that scales to hundreds or thousands of workers per terabyte.

>We've already seen where software encryption went to hardware encryption -- the result is that the encryption is mostly transparent to performance

Yeah but are you really sure that hardware encryption is safer? Are you sure that it's not secretly saving the key somewhere where it can be extracted? Are you sure that the hardware encryption works at all and that it's not just writing the data unencrypted?

In fact there are so many examples of hardware encryption being objectively broken[1]. I wouldn't trust it at all, and treat it as plain-text storage that probably has more firmware bugs that will lead to data loss.

[1]: https://news.ycombinator.com/item?id=18382975

You do realize it's all software at some point, right?

It's just that instead of running on the main CPU, it would run on the SSD controller's microchip. And it will probably be binary only, so people won't be able to inspect or fix it.

Samsung has good hardware, but not that great software.

I'd rather have the opposite. SSD hardware exposing their internals, to os developers.

The big thing is that you can choose which file system you want. On linux and want a btrfs filesystem? No problem. Oh you want to read a NTFS volume? Just download the ntfs packages and you're good to go.

BTRFS, NTFS, ZFS, EXT4, and APFS/HFS+ don't really matter for general users, sure, but each has certain advantages and limitations that make the flexibility very useful should you need it.

Key word being "on linux". On the other 99.9% of devices, I have no such choice and the filesystem required by the OS simply presents a layer of incompatibility. Modern OSs all run on essentially the same hardware (same CPU architecture even!) but store the same data in different, incompatible ways. I want the filesystem to join the list of things that are no longer OS-specific.

I want my storage hardware to have a higher level API than "block device". I don't want to format a drive in some OS-specific format, I want a built-in format that I can access from anything. I want this format to be durable, well specced so that I can easily access historical data. I want embedded devices to support richer storage functionality than afforded by FAT32+.

Consider that before S3, we stored files on servers in bespoke ways where we had to worry about permissions and filesystem differences (path too long!) and encryption and durability and a hundred things that we now... don't. Instead we have a simple API with dozens of conforming backend providers, including the ability to do it yourself. Imagine if computer storage experienced a similar revolution?

But S3 is so unified because it has heavy limitations -- only basic file ops, no consistency guarantees. It actually has less capability than FAT32 in some ways (for example, no attributes, no way to say "create file but do not overwrite it").

Sure, someone can make super-limited drive which only does 3 basic ops (read, "upsert", list). Would it be useful for general purpose computing? I doubt it.

> It actually has less capability than FAT32 in some ways

I'm not sure I'd agree with that:

- "no attribute": S3 does actually have quite sophisticated ACLs and permissions that can be set at the object level

- "no way to say "create file but do not overwrite it": there are a few ways you can do that. Versioning, bucket policies, etc.

In fact I disagree with both yourself and the GP with your assumption that S3 is simple. If you want to do anything "enterprisey" with S3 then you absolutely do need to be aware of the maximum size of allowed objects (these days it's 5TB but it wasn't so long ago when it was only around 5GB). Then there's bandwidth costs (which means you need to also take into account VPC traffic and thus S3 endpoints) and ACLs (soooo many different places you can set permissions on objects - it's easy for someone untrained to get totally lost). There's different storage options in S3 too - depending on the frequency you want to request the data.

S3 is only superficially simple - but then the same can be said for any file system. If you just want somewhere to dump files then any fs will work. But when you start needing to performance tune and do all the other considerations that the GP highlighted with local server storage, then S3 quickly becomes as complicated as every other option out there.

ZFS, exFAT, and UFS are all supported on at least NT, Darwin, Linux, and I believe the major BSDs. I will happily grant that only exFAT is likely to work for embedded devices, but you do have real options.

Nowadays most SSD failures happen in the SSD controller, not in the physical storage. I wouldn't trust SSD manufacturers to write reliable software. At least not without waiting 3 years to weed out the obviously bad brands.

The people you have building incremental changes with great momentum aren't the people you want working on a revolutionary change with a hugely different problem space, at least from a business perspective.

Yea agreed. Also if hardware manufacturers did a bigger share of the file and storage system, they could in theory add some hardware optimisations for the most important filesystem software requirements. The kind of things you can't do in software alone.

Yes! -- because they at least have an incentive to make it work with more than one operating system. Even if it's not great, at least it'll be compatible.

It's insane that there still isn't a standard file system. They all do 99% the same thing now. Making them incompatible is just a pissing contest, and the users lose.

They most certainly don’t do 99% the same thing, especially now. We’re seeing an advent of new CoW filesystems (ReFS, APFS) alongside the ones that have been around for a while (ZFS, BTRFS). These all differ on multiple axes: handling RAID/volume management at the FS layer (which all do significantly differently), inherent/conventional filename behavior (e.g. 2 byte characters vs single byte, Unicode normalization), direct support of various POSIX filesystem features, bitrot prevention. Plus all the other common filesystems still in use from the last few decades (ext4, xfs, ntfs).

That’s not even including facts of how operating systems use filesystems that aren’t technically enforced in the filesystems itself (e.g. case on Windows). There is a lot more going on here than a pissing contest at user’s expense. In fact, if e.g. Windows threw in the towel and started using UTF-8 file names with case sensitivity tomorrow, the users would be the first to yell that half their software no longer works!

Making it work isn't the same as not making it utter shite. Samsung has been pretty problematic with resolving firmware issues.

So basically, we need another exFAT, but for SSDs?

> A cross-platform filesystem that you could read/write from Windows, macOS, Linux, iOS, Android etc.

I don't see how this is related to have hardware key/value. The only thing affecting filesystem support is os developers supporting filesystems.

> having a single disk that could boot any computer operating system without having to manage partitions and boot records!

Again, not related to object stores at all. You still need your UEFI to be able to find your boot images on the disk.

> receive a callback when the file is on disk. All independent of the OS or kernel version you're running on.

Hardware interrupts are always going to have to be handled by the OS.

> Native async support is a huge win.

All I/O is already async at the hardware level. And again, I don't see the relation to the technology actually discussed in the article.

I'm really not sure about your first point.

> A cross-platform filesystem that you could read/write from Windows, macOS, Linux, iOS, Android etc.

FAT32? There's plenty of newer file systems; for the most part, if they aren't cross-platform by default, then either the vendor has their own solution (Apple) or your system is fairly modular by design (Linux).

> Imagine having a single disk that could boot any computer operating system without having to manage partitions and boot records!

The BIOS/UEFI Still needs to know what to boot, so the drive must be partitioned somehow. You'll want to store that data somewhere, and now you have a boot record.


I'm not sure these drives solve either of those issues. Now the other ones you mentioned, those I agree with.

One of the first things that happened with the advent of k/v databases is that devs immediately re-implemented relational joins in software.

I suppose it’s only fitting that the top-rated HN comment is someone suggesting that a k/v SSD should be used to reimplement...filesystems.

When you say "in hardware", it's not really what you think. These drives generally have low-power general-purpose CPUs in them, like perhaps a little ARM Cortex-M. Any filesystem implemented in the drive will just be regular code running on a different CPU.

I really don't see huge wins here. This will just increase the surface area for bugs, with less ability to fix them, and less ability to recover data when things go wrong.

I definitely do not trust drive manufacturers to write high quality software. It's just not one of their core competencies.

The CPUs may not be as little as you think. Samsung's latest consumer NVMe SSDs - 970 line - can have up to 6 watt TDP at load (and have metal heatspreaders even), and if I remember correctly, the controller is like an 8-core ARM chip that runs at 3 GHz or something. That's more powerful than an Android phone.

> A cross-platform filesystem that you could read/write from Windows, macOS, Linux, iOS, Android etc. Imagine having a single disk that could boot any computer operating system without having to manage partitions and boot records!

You would still have to manage namespaces, so that would be partitions in disguise, and you would still have to manage some special object name for "the thing you load and run to bootstrap the operating system", so that would be boot records in disguise, and you would still have OS specific ways to store their OS specific meta data that represents OS specific semantics, so that would be more or less filesystems in disguise. The only remaining thing would be basic interoperability ... or in other words: What FAT provides now.

> - Significantly improved filesystem performance as it's implemented in hardware.

Obviously, no such thing would be implemented in hardware, but rather just in software on a processor on the disk device. So, it's still software, just with the guarantee that you cannot possibly fix it, improve it, or adapt it to your needs.

> - Better guarantees of write flushing (as SSD can include RAM + tiny battery) that translate into higher level filesystem objects. You could say, writeFile(key, data, flush_full, completion) and receive a callback when the file is on disk. All independent of the OS or kernel version you're running on.

Which has exactly nothing to do with the topic at hand. Any "normal" SSD could do all of that. The fact that storage devices more often than not don't care about correctness has nothing to do with whether that is possible, and everything with whether you can build a cheaper product that gets better benchmark results if you don't care. Having an even more complex interface, if anything, is likely to make manufacturers cheat even more on this.

> - Native async support is a huge win


> Already the performance is looking insane. Would love to get away from the OS dictating filesystem choice and performance.

So, you would like to get away from being able to implement any filesystem you like, or to choose one of a dozen or so that fits your needs best, because that free choice in your mind somehow translates to "dictating your choice", and move to a situation where the hardware manufacturer dictates the filesystem you use (that is, the filesystem implemented by the driver running on the disk's own processor)? Because having no choice of filesystem gives you the freedom to choose, I suppose?

> A cross-platform filesystem that you could read/write from Windows, macOS, Linux, iOS, Android etc. Imagine having a single disk that could boot any computer operating system without having to manage partitions and boot records!

Furthermore, I highly doubt this would be the reality. I imagine SeagateFS, Western Digital HyperFS, paying more for a better FS.

I wouldn't be surprised if we get Filesystem DRM, maybe Filesystem subscriptions, you can only backup Seagate drives to genuine Seagate backup systems, and that costs extra.

But maybe I'm just cynical....

You'd also want a way to identify related groups of files and handle permissions, I suggest a prefix convention so every file that starts with "/home/flukus/" belongs to me and isn't modifiable by other users. /s

From a high level users perspective file systems are already key value stores so this isn't going to change anything. There may be some gains from moving to hardware, but given hardware manufacturers track record with buggy, broken, closed source software I wouldn't be too keen.

> A cross-platform filesystem that you could read/write from Windows, macOS, Linux, iOS, Android etc.

"Future possibilities" ... in 2019.

It frustrates me to no end that there are some areas where programmers are willing to go to extreme lengths to be perfectly compatible (C/C++, Unicode, IEEE FP, ethernet, TCP/IP/HTTP, HTML/CSS/JS, PNG/JPEG, ...), and in other areas it's just accepted that everybody is completely incompatible and users have to deal with the mess (newline characters, SQL dialects, filesystems, syscalls, opcodes, graphics APIs, driver models, some multimedia codecs still, ...).

I can put a "Pile of Poo" character in a text file and have it work correctly everywhere from my telephone to a server in Finland -- but I can't put that text file on a disk and expect to see the file on two PCs that happen to be running different operating systems.

I wish we were far enough along that I could complain we can't (efficiently) run one compiled program on any operating system -- there's really no technical reason we can't do that -- but we're not even close. We can't even look at a list of data files on a disk.

What is everybody working on?! We don't need any more web 2.0. We need to fix 50 years of historical accidents and incompatibilities. In 1969, Richard Hamming said "Today we stand on each other's feet", and AFAICT nothing since then has changed. Nobody cooperates with anybody else. As Alan Kay once said, "The real computer revolution hasn't happened yet". Adding more JavaScript trackers is not going to help us get there.

OK, rant over. Sorry.

Those JavaScript trackers pay for WebAssembly. Right now it's about the closest thing we have to the holy grail of cross-platform applications. The (for-profit) companies pushing the state of the art in browser technologies all happen to run on ads.


>filesystem implementation in hardware

there is no hardware, it's just someone else's embedded firmware

> A cross-platform filesystem that you could read/write from Windows, macOS, Linux, iOS, Android

I wonder if that is possible to implement right now... It’s simple to adapt something like a raspberry pi to become a usb device, pluggable into another computer. It could emulate a USB drive, but depending upon the host computer’s OS, it could present the raw drive data as a different file system, fat/ntfs/ext4/whatever, and translate the host’s reads and writes back into the appropriate reads and writes of an internal ‘universal’ filesystem.

You’d have to make compromises to cope with FS-specific features (like how to translate users/groups onto a more basic FAT representation of the storage, etc) but in principle it could work.

...except I’m not sure if a usb device can detect the OS of the system it is plugged into?

I think that's partly what MTP was supposed to be, without the os detection - a way to present the files to a host OS without needing to expose the underlying FS

> ...except I’m not sure if a usb device can detect the OS of the system it is plugged into?

Great idea, and if done inside an FPGA or just using some microcontroller without involving a full OS in the stack, it would be absolutely perfect. But hey, isn't that essentially what they are doing, just with a different "API"?

When it comes to fingerprinting the host OS from a USB device: https://media.blackhat.com/us-13/US-13-Davis-Deriving-Intell...

If you are willing to have your transport layer be IP, then the disks could expose NFS. USB has the ethernet device class, so a USB device could look like an ethernet adapter. There would need to be a discovery phase when a device is inserted, probably using something like zeroconf.

I'd rather not have yet another poorly cooled, poorly programmed chip that's prone to catastrophic failure (most common failure of SSDs is the controller, and USB 3.0 controllers are known to fail a lot) in the middle.

It seems unlikely that the performance would be significantly better from implementing a filesystem in hardware, especially since current filesystems can store data in system memory as a cache, which is always going to be lower latency and higher bandwidth than an external device. It is also nice to be able to decide at runtime how much memory to allocate to the filesystem cache vs other uses.

I think you'd still need bit-level data addressing to do stuff like RAID, so you can handle drive failures with parity, or you'd have to mirror the data 1:1.

This is huge. It's been obvious for a long time that the standard block-device abstraction isn't a great fit for SSDs. This development finally gives us a better abstraction that will immensely improve performance of a wide variety of applications (possibly even SQL databases).

Why isn't a block device a good fit for an SSD?

NAND flash has two inherent sizes: pages and erase blocks. NAND flash pages are several times larger than the 4kB virtual memory pages that your OS prefers to deal with, and erase blocks are a few orders of magnitude larger than that. There's a lot of complexity involved in making it possible to do 512-byte or 4kB IO to NAND flash.

Block sizes can often be set to at least 64KB, does this not help the problem?

It's the erase blocks that cause the real problems. If a page is 8KB or 16KB then raw writes would hit the IOPS cap before the bandwidth cap, but that wouldn't be a major source of issues. With erase blocks at a megabyte or larger, they introduce enormously complex remapping and garbage collection requirements. You can't do a whole lot about those problems with a block device API.

It helps a bit, but keep in mind that just because your filesystem is doing most of its allocation in 64kB chunks, doesn't mean it is issuing only 64kB IOs to the drive itself.

Additionally it doesn't mean that your 64KB chunks aligned to your current partition are also aligned with the 64KB boundary of the underlying block device.
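A back-of-the-envelope sketch of the garbage-collection cost erase blocks impose (sizes here are illustrative, not from any specific drive): reclaiming a block requires copying its still-live pages elsewhere first, which multiplies every host write.

```python
PAGE_SIZE = 16 * 1024          # NAND page, several times the OS's 4 kB
ERASE_BLOCK = 4 * 1024 * 1024  # erase granularity, orders of magnitude larger
pages_per_block = ERASE_BLOCK // PAGE_SIZE  # 256 pages

def write_amplification(live_fraction: float) -> float:
    """Pages physically written per page of new host data, when blocks
    are reclaimed while `live_fraction` of their pages are still valid."""
    live = int(pages_per_block * live_fraction)
    free = pages_per_block - live
    # Copying `live` pages out frees `free` slots for host writes.
    return (live + free) / free

print(write_amplification(0.0))   # empty blocks: 1.0, no overhead
print(write_amplification(0.75))  # mostly-full drive: 4.0x amplification
```

This is why performance tanks as a drive fills up: the fuller the erase blocks, the more live data must be shuffled around per host write.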

Interesting. I remember a short period where Seagate was really hyping their own ethernet connected key/value hard drives back in 2015[1]. It seems like that project died and this appears to be a completely different API (and not ethernet based).

[1]: https://www.theregister.co.uk/2015/08/17/seagate_kinect_open...

Network attached HDDs were killed by ethernet. In their infinite wisdom, ethernet decided there was no need for speeds between 1GbE and 10GbE. They finally caved to reality with 40GbE, 25GbE, and 50GbE. They eventually ratified 2.5GbE, but it was too late for network drives.

The transfer rate of an HDD is well over 1Gbps.

> In their infinite wisdom, ethernet decided there was no need for speeds between 1GbE and 10GbE. They finally caved to reality with 40GbE, 25GbE, and 50GbE.

I’m not sure that’s accurate. There were no intermediate steps to get to 10GbE, it was a straight jump from 1GbE and it was relatively easy.

To get to 40GbE, it involved running four lanes of 10GbE. To get to 100GbE, it involved running four lanes of 25GbE. 50GbE is just two lanes of 25GbE.

We’ve gone from a single serialised stream to multiple parallel streams in order to reach next order speeds. This is magnified when you start looking at 400GbE and 800GbE services that operate on 8 lanes (QSFP-DD or the confusingly named OSFP).

>it was a straight jump from 1GbE and it was relatively easy.

It wasn't relatively easy, because you can't run 10GBASE-T over Cat5e, and because a 10GbE NIC was as expensive as the rest of the drive put together.

Today you can run 2.5GBASE-T over Cat5e, but the standard was too late, and there was very little demand between 1GbE and 10GbE, so there are virtually no controllers. What controllers you can find are usually 10GbE controllers that can also do 2.5GbE. Until 2Gbps home internet and LANs become popular, there's no reason to expect change.

>There were no intermediate steps to get to 10GbE

That's the problem. Disk transfer rates can approach 1.5Gbps, so a 1GbE is a serious bottleneck. But 10 GbE hardware was significantly more expensive. What am I going to do, build a JBOD with a dozen disks and 1x10GbE, or put a dozen disks onto the network with 10GbE interfaces?

So instead you need 2x1GbE on every disk, which complicates management and doubles cabling, switches, and cost.

>We’ve gone from a single serialised stream to multiple parallel streams in order to reach next order speeds.

Which is hilarious when you consider PATA.

10gbit networking gear is massively more expensive than 1gbit

It's finally beginning to fall in price; MikroTik are doing relatively inexpensive routers with 10 gig SFP ports.

If I were going to sell a network attached spinning disk at scale, then I would probably target 2.5GBASE-T. The problem there is that no routers support it.

3 years back WD put out a special variant of their Ultrastar He8 drives that ran a Ceph OSD on the drive itself and supported 2.5 GbE to each drive over a SATA port with an odd and incompatible pinout.


One thing to note, in that test they were using a chassis that limited the drives to two 1 Gbe connections but that's not a limitation of the drives. Now that WDLabs has fizzled out who knows what became of that platform, it was actually quite interesting but it wasn't really directly using the drive to natively support object storage, internally it was still presenting itself as a block device to the Ceph daemon running on it.

Exactly. It's a beautiful concept: the server is roughly half the TCO per gigabyte. That can decrease significantly with better architectures, as shown by this.

But making it 2x1GbE is a huge complicating factor. And 2.5GbE is way out of the cost curve per bps.

I don’t understand the comments saying that this obsoletes databases, or that this is a good substrate for relational databases.

The interface is for random access. Quite a few database optimizations depend on sequential access, i.e. accessing records in key order following a random access. This is why B-trees are so important. Sequential access in key order does not appear to be a possibility with this technology.

> The interface is for random access.

I'm not sure where you are getting this idea. How keys are organized would be up to the device, and devices could support multiple schemes. Clustering keys in a sorted order and fast in-order iteration seems like an essential requirement.

Just scanning over the specification [1], I see an iterator interface for key groups (6.4) and a setting for ordering (5.4.3)

Edit: just to clarify, I agree that this would in no way obsolete databases. But software DBs could utilize key/value disks internally.

[1] https://www.snia.org/sites/default/files/technical_work/KVSA...
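For a feel of what host code against such an iterator might look like, here is a toy Python model; the class and method names are invented for illustration and are not the actual SNIA C API.

```python
class KVIterator:
    """Toy model of a device-served prefix scan over a key group."""
    def __init__(self, store: dict, prefix: bytes):
        # A device that clusters keys in sorted order could serve this
        # scan directly, without any host-side index.
        self._keys = sorted(k for k in store if k.startswith(prefix))
        self._store = store
        self._pos = 0

    def next(self):
        if self._pos >= len(self._keys):
            return None  # end of key group
        key = self._keys[self._pos]
        self._pos += 1
        return key, self._store[key]

store = {b"log:2": b"b", b"log:1": b"a", b"tmp:9": b"x", b"log:3": b"c"}
it = KVIterator(store, prefix=b"log:")
scanned = []
while (item := it.next()) is not None:
    scanned.append(item[0])
assert scanned == [b"log:1", b"log:2", b"log:3"]  # returned in key order
```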

Very interesting presentation.

Slide 43 is for write performance though. RocksDB does a lot of clever things with writes, like delayed and batched flushing with a WAL for recovery. Would be interesting how that benchmark was written.

Other slides with read benchmarks show very significant performance improvements.

But comparing directly to a software DB seems inappropriate to me anyway. We can't expect (and definitely wouldn't want) a hard disk to offer something as complex as RocksDB/LevelDB/etc.

I would rather imagine that the software DBs start taking advantage of key/value disks internally.

Is that basically like having B-trees on the keys? I guess you would also need an API to retrieve multiple values at once, like `MGET` in Redis.

Aggregations and the like require sequential access, but random access is very valuable for RDBMS where in most cases the vast majority of operations are random (write a record, pull a record, update an index, etc). Many of the systems in play have an extraordinary number of optimizations for sequential access -- assuming terrible random access -- but would still hugely benefit from something like this.

However, those who say it obsoletes anything beyond the most trivial of KV stores are way off the mark. It's a possible optimization for all sorts of database systems, certainly including RDBMS systems (many of which are layered over a KV store of sorts). There are disk systems for Oracle DBs that can run a subset of SQL right at the storage layer, filtering by predicates, doing index searches, etc.

In addition to what the other commenter said: sequential access is considerably less critical for SSDs than it is for magnetic hard drives. It's still somewhat faster, but only by maybe 2x or so, compared to a factor of several hundred for a magnetic drive.

Yes, but...

You might have many, many sequentially related records in one 4k block of an index retrieved from a SSD. Maybe 200. Then in turn you can retrieve those index blocks sequentially and get a performance improvement.

In turn, when you're doing a merge or bitmap scan over the index, this can make a really big difference.

I agree. If all you can do is ask for a key, then you can't optimize the order in which you fetch them. That does matter, because if you fetch keys in the same block you can load the block once. Maybe the new optimizations the SSD can do with this level of abstraction will make up for the difference in performance. Or maybe they can try to organize keys by sorting or something like that.
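As a sketch of the optimization a host-side store can do but a key-only interface cannot (block size and layout here are assumed): sort pending fetches by physical block so each block is read once, however many wanted keys it holds.

```python
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block size

def plan_reads(key_to_offset, keys):
    """Group requested keys by the block that holds them, so each
    block is fetched once even if it contains many wanted keys."""
    by_block = defaultdict(list)
    for k in keys:
        by_block[key_to_offset[k] // BLOCK_SIZE].append(k)
    return dict(by_block)

# Hypothetical layout: three keys in block 0, one key in block 5.
layout = {"a": 100, "b": 2000, "c": 3000, "d": 5 * 4096 + 10}
plan = plan_reads(layout, ["a", "b", "c", "d"])
print(len(plan))  # 2 block reads instead of 4 naive fetches
```

With a pure key interface this batching has to happen behind the firmware's back, if it happens at all.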

These have been built many times, and I have even been involved in the design of one (that was ultimately scrapped). The value proposition of these products is not what many people seem to be assuming, so I will elaborate. Under the hood, most implementations are just mildly modified LevelDB/RocksDB/etc running on an ARM processor.

In theory, there should be no performance advantage to embedding the storage engine this way -- but in practice there is; back to that in a moment. A properly optimized storage engine will run just as fast on the host CPU, and for applications like databases there are advantages to keeping it there. If you are designing a state-of-the-art storage engine for complex, high-performance storage work, these devices are not for you; you can always do better with bare storage.

The key phrase is "properly optimized". The I/O scheduling and management underlying a typical popular KV storage engine is actually quite far from properly optimized for modern storage hardware, with significantly adverse consequences for performance and durability. The extent to which this is true is significantly underestimated by many developers. From the perspective of the storage manufacturers, more and more theoretical performance of their product is being wasted because the most popular open source storage engines are incapable of taking advantage of it as a matter of architecture. The kind of architectural surgery required to address this is correctly seen as something you can't upstream to the main open source code base.

The software in these devices is typically an open source storage engine where they ripped out the I/O scheduler, storage management, and whatnot, replacing it with one properly optimized to take advantage of the hardware. This could be done in software but storage companies aren't in that business. Their hope is that people will use these devices instead of LevelDB etc, with the promise of superior performance that justifies higher cost.

In practice, these devices never seem to do well in the market. People that are using the KV stores these are intended to replace are the kind of people that do not have particularly performance-sensitive applications, and therefore won't pay a premium. And it adds no value, and has some significant disadvantages, for companies with serious storage engine implementation chops or software storage-engines that are well-optimized for this kind of hardware.

tl;dr: These are like in-memory databases. A simple way to improve the performance of applications instead of investing in hardcore software design and implementation but providing no other value.

> The kind of architectural surgery required to address this is correctly seen as something you can't upstream to the main open source code base.

Why is this?

What's the advantage of this over a simple hardware interface (which should be really simple since I thought NVMe was basically just a PCIe node) that directly exposes the flash and then let the application/filesystem/whatever layer handle it?

Nobody wants to deal with raw flash. It's awful.

The error rate of raw flash is incredibly high -- it takes loads of error correction to make it reliable, especially with modern MLC and QLC memories -- and the structure of flash erase blocks means that it doesn't support random-access writes. A proper flash translation layer, implemented in hardware, means that software can forget about all the strange features of flash and use it as general-purpose block storage.

Well, on the other hand, better to have the ability to deal with it rather than have everything obfuscated in unfixable proprietary firmware…


If anyone else is interested in understanding the internals of flash memory, see: https://www.youtube.com/watch?v=s7JLXs5es7I (I was curious to see why flash memory has error rates and wears out on multiple writes.)

You may need a basic background in solid state physics for some parts.

I know how flash works. Linux, for example, supports all the "weirdness" of flash with an MTD module which is low-resource enough for single-core sub 1GHz routers with 16MB and 32MB of RAM to handle. I don't see how that is a burden to a 32-core or higher amd64 architecture CPU in a database server whatsoever.

And SSDs already abstract that without adding a key-value store on top of it.

The Linux mtd stack is incredibly primitive by today's standards. It's primarily targeted at small SLC flash devices, like the ones you're describing that might be found in a small embedded device, not the large MLC (or beyond) devices that are used in modern storage.

It's telling that high-end embedded Linux devices, like Android phones, typically use storage devices which implement their own translation layer, like eMMC or NVMe devices, not raw flash.

> Linux, for example, supports all the "weirdness" of flash with an MTD module which is low-resource enough for single-core sub 1GHz routers with 16MB and 32MB of RAM to handle.

Managing 16MB of SLC NOR flash is trivial for a 580MHz MIPS core with 32MB of DRAM. Managing 16TB of TLC NAND is completely different. There's a reason that SSDs almost always have 1GB of DRAM for every 1TB of NAND.
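That DRAM rule of thumb falls out of the mapping table: with a flat logical-to-physical map of one 4-byte entry per 4kB page (typical values, assumed here), the arithmetic is:

```python
def ftl_table_bytes(capacity_bytes, page_size=4096, entry_size=4):
    """DRAM needed for a flat logical-to-physical page map:
    one entry per logical page of the drive."""
    return (capacity_bytes // page_size) * entry_size

tb = 10**12
print(ftl_table_bytes(16 * tb) / 10**9)  # 15.625 -- roughly 1 GB of map per TB
```

A 16MB NOR device, by the same math, needs a map measured in kilobytes, which is why the router case is trivial.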

I don't think my router's 16MB of flash is optimised for performance. Additionally, I will need an implementation for Windows, one for Linux, one for OpenBSD. Then there are many different kinds of flash memory on the market; do you really want to debug why your OS is losing data with a new SSD that uses some weird flash memory?

It usually isn't implemented in hardware, but in firmware(software) on the storage device. It's still just as messy, but now no one can fix it.

Are modern systems much different from the initial M-sys TrueFFS?

The article explains it. You offload a lot of the processing required by the CPU onto the SSD, and you minimize reads and writes for the SSD architecture, reducing emulation requirements.

Adding more dedicated hardware that does what you can do on the hardware you already have is not necessarily an advantage.

Not necessarily, but often enough that the most expensive part in a mid range or high end gaming PC is an accelerator card meant to offload certain forms of computation that the CPU can't do as fast...

Or perhaps the h.264/h.265 codecs built into modern CPUs and GPUs?

And this isn't at all a new phenomenon, either. We've been using accelerators and coprocessors (as they were often originally known) for decades.

Special purpose silicon is almost always faster and offloading stacks of IO cycles from the main cores is definitely a great thing.

Latency matters. You can execute instructions much faster from firmware on the SSD, and a higher-abstraction-level instruction usually translates to many low-level instructions.

CPU offload. No futzing with blocks. Just send a key and value.

Then unplug the drive, fly it to Amsterdam, plug it in and read away.

Never underestimate the bandwidth of a plane full of SSDs hurtling through the sky!
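The old station-wagon joke still checks out. A rough estimate with assumed numbers (8TB per drive, 10,000 drives in the hold, a 10-hour flight):

```python
def sneakernet_gbps(drives, tb_per_drive, flight_hours):
    """Effective bandwidth of flying drives, in gigabits per second."""
    bits = drives * tb_per_drive * 1e12 * 8
    return bits / (flight_hours * 3600) / 1e9

print(round(sneakernet_gbps(10_000, 8, 10)))  # 17778 -- about 17.8 Tbps
```

Latency, of course, is terrible.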

Can someone tell me what the use case for this would be?

Quite a few databases lately are using RocksDB as their backend. Really, relational databases are just a key-value store with a well-optimized query planner on top of it.


  -CockroachDB (inspired by Google's Spanner)
  -MySQL's MyRocks (Facebook)
  -YugaByte (Postgres-compatible sharded SQL)
  -Ceph (open-source alternative to Amazon S3)
I am guessing this is looking into improving the performance even further.

It's removing a layer of complexity from your storage: just as the article explains (does anyone read the articles? Anyone?), there's currently a lot of overhead in SSDs mapping their underlying model into pretending to be hard drives with 512-byte or 4K sectors, just like a 40-year-old spinning platter.

Your filesystem is then layering a bunch of work on top of that to map the things you care about - files - into a bunch of fragments and metadata into those 4K chunks. You would gain the ability to do things like:

1. Throw away all the spinning disk emulation code in the SSD.

2. Align the FS-level primitives with the storage: if you've set your RAID chunks to 64K per array member (for example), store a 64K object in one write, not broken into 16 × 4K blocks. If your ZFS filesystem is set to 1 MB records, write 1 MB objects to disk, not many 4K chunks.

3. Variable sized objects mean your filesystem could simply dispatch whole files as objects: if the FS knows your photo is a 20 MB file and your source code file is 1K, it no longer has to break the photo into many blocks, or waste a whole 4K block on a 1K file, it writes a 20 MB object and a 1K object.

4. Applications could access the storage even more directly where it makes sense: Postgres, for example, stores large records via the TOAST mechanism, where a very large column in a row is stored separately from the rest of the table (so as not to blow out the table files). You could extend that special case to simply address the storage directly, and not bother with filesystem overhead at all.
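A quick illustration of point 3's padding overhead with fixed 4kB blocks (the 4K size is assumed):

```python
def block_overhead(file_size, block_size=4096):
    """Bytes wasted padding a file out to whole fixed-size blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size - file_size

print(block_overhead(1024))        # 3072 bytes wasted on a 1K source file
print(block_overhead(20 * 2**20))  # 0 -- a 20 MB file happens to align exactly
```

With variable-sized objects, that per-file padding (and the per-file block bookkeeping) disappears.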

You can see for yourself in the third paragraph. Streamlines key-value storage and might replace database backends.

Why is this worth paying a premium for, though? Maybe a SQL SSD could get some adoption, but a simple KV store? How would that ever be a bottleneck (they are relatively easy to make blazing fast, unlike SQL) in your application? How many people/corporations actually want something like this?

> How would that ever be a bottleneck

Disk I/O is always the bottleneck.

Disk IO is far from the only bottleneck, or necessarily even the most common one.

Source: My day job has been as a DBA for ~15 years.


In web apps

I have been working on a hobby database that uses a key-value store as the backend. Currently, it is structured as a binary tree such that an append-only log is possible (ideal for handling updates on a traditional SSD). This is a very complex and time-consuming part of the implementation.

With this KV drive, I could potentially 'hardware accelerate' my database backend and all of the complexity of log-structured merge trees falls away. It would just be a handful of calls into a KV SSD library that handles the magic of getting a KV pair to disk. All I would need to worry about are the higher-order primitives I can build on top of a KV store. Using hashing schemes and clever metadata structures, you can encode virtually unlimited amounts of information into a 256 bit key. One simple flat store of KV items can contain all of your indexes, objects, scripts, jobs, settings, etc. Additionally, using hashed metadata keys lends itself well to sharing the store across unlimited drives and nodes. 256 bits comprises an unimaginably large key space, and SHA512 is actually even faster to execute on most x86 64-bit hardware if you can eat the 2x overhead on keys (which can be safely assumed to be inexhaustible through the heat death of the universe at this point).

The hash table for mapping keys to values will be implemented in hardware on the device, rather than needing to come off the SSD to the CPU.

Now your SSD can be web scale too!

Any application using LevelDB/RocksDB as their backend, e.g. Bitcoin.

This is a cool idea, and I'm excited to see what kind of perf wins it produces. It seems like the other solution would be to expose all the details of the SSD to the OS: wear leveling, GC etc, and make that a driver concern. Probably a lot harder to get right, but more debuggable and more opportunities to tune for your specific workload.

That's what the Open Channel SSD concept is. It's been getting a lot of attention in the past few years, but it seems like many potential users still balk at the idea of having that thin of an abstraction layer. The open-channel stuff has been influencing the addition of other new features to the NVMe protocol that expose most of the information an open-channel SSD would, but don't break compatibility by requiring the host to use that information.

Several vendors are also supporting the Zoned Storage concept of making SSDs that have similar IO constraints to shingled magnetic recording (SMR) hard drives. Those constraints aren't a perfect match for the true characteristics of NAND flash memory, but it does handle the problem of large erase blocks.

I understand how it can achieve better performance by bypassing the file system.

But I'd like to compare this to hypothetical key-value software that stores its data directly on a partition (instead of in files). Isn't this essentially the same thing? The only difference that I can see is that the software would be much harder to update, and you can offload some CPU on to the processor on the drive.

Am I looking at this correctly? I don't get why you would want this to be a hardware device.

As someone on the periphery of comp-sci my whole life, I find myself wondering, is this similar in concept to content-addressable memory?

No, because we are addressing it by key.

How is this exposed to userspace? What interface?

It is accessible directly to applications through the SNIA KVS API (and Samsung has its own API as well). There is no filesystem in the middle, if you are using the KV controller.

On the one hand I think this is very cool. On the other, I think it's funny that Redis can fit in my RAM and not on my SSD.

So now we will have yet another computer IN THE SSD too? Is this a good idea? https://boingboing.net/2018/11/12/cpus-in-your-cpus.html

This has been the case for a while already. Even spinning-platter hard disks had some reasonably powerful CPUs to implement caching, bad-block remapping, and SMART data retention (among other features); SSDs pushed this to another level by demanding higher performance and more complex remapping.

Chances are that this Samsung device isn't even using different hardware from their standard SSDs -- just different firmware.

[1]: https://spritesmods.com/?art=hddhack&page=1

Your current SSD already has a beefy computer with lots of ram to handle the flash translation layer.

Raw flash has very onerous constraints on how it can be written: (numbers are from recent QLC chips)

The smallest unit that can be accessed is the individual page (64kB). Those pages belong to blocks (18MB). Each page can be either dirty or clean. You can read any page, but you can only write to clean pages (turning them dirty). To turn dirty pages back to clean, you have to erase them, and you can only erase the entire block containing the page at once. Also, each block can only be erased a certain number of times (500) until it can no longer be written to.

As you can see, the API this provides is just awful, especially considering what kind of operations (random read/write of 4kB) operating systems offer to programs. So, in order to provide an API operating systems can actually use, there is a massive garbage-collected translation layer on top of the raw flash that makes it usable.

The idea of key-value SSDs is that given that the block layer api is entirely artificial, it might not be the best choice to present to the OS.
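A miniature translation layer illustrating the point above. Toy sizes (4 pages per block) and no garbage collection; real FTLs are vastly larger and must also erase, wear-level, and relocate live pages:

```python
class ToyFTL:
    """Minimal flash translation layer: random-access 'sector' writes
    implemented on top of append-only pages and an indirection map."""

    PAGES_PER_BLOCK = 4

    def __init__(self, num_blocks):
        self.flash = [[None] * self.PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.map = {}          # logical sector -> (block, page)
        self.cursor = (0, 0)   # next clean page

    def write(self, sector, data):
        blk, pg = self.cursor
        self.flash[blk][pg] = (sector, data)    # program a clean page
        self.map[sector] = (blk, pg)            # the old copy becomes garbage
        pg += 1
        if pg == self.PAGES_PER_BLOCK:
            blk, pg = blk + 1, 0                # naive: no GC in this toy
        self.cursor = (blk, pg)

    def read(self, sector):
        blk, pg = self.map[sector]
        return self.flash[blk][pg][1]

ftl = ToyFTL(num_blocks=8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")   # the overwrite goes to a fresh page, not in place
print(ftl.read(0))    # b'v2'
```

The KV-SSD argument is that since this indirection map already exists, indexing it by key instead of by artificial sector number costs little extra.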

The most active area of R&D in the storage industry at the moment is stuff falling under the umbrella term "computational storage". Everyone is looking into ways to make SSDs smarter, and move some processing closer to the storage where it can be done more efficiently or at least free up CPU time.

Some of the companies making enterprise SSD controller ASICs have decided to throw in extra ARM Cortex-A53 or similar cores for the customer to put software on, giving the application access to the data without a PCIe bottleneck. Some companies are putting machine learning accelerators on the SSD. Some are adding dedicated compression or crypto engines to transform data at wire speed. Some are just putting an FPGA on the drive, or implementing the controller on an FPGA and leaving leftover LUTs for the customer's use.

Almost any idea you can come up with about how to move storage hardware past the hard drive-like block storage paradigm has been at least prototyped and demoed at Flash Memory Summit.

It's like NFS 3 (NeFS) in hardware! (NeWS for disks: A PostScript interpreter in the kernel as a file system API.)


Network Extensible File System Protocol Specification (2/12/90)

Comments to: sun!nfs3 nfs3@SUN.COM

Sun Microsystems, Inc. 2550 Garcia Ave. Mountain View, CA 94043

1.0 Introduction

The Network Extensible File System protocol (NeFS) provides transparent remote access to shared file systems over networks. The NeFS protocol is designed to be machine, operating system, network architecture, and transport protocol independent. This document is the draft specification for the protocol. It will remain in draft form during a period of public review. Italicized comments in the document are intended to present the rationale behind elements of the design and to raise questions where there are doubts. Comments and suggestions on this draft specification are most welcome.


Although it has features in common with NFS, NeFS is a radical departure from NFS. The NFS protocol is built according to a Remote Procedure Call model (RPC) where filesystem operations are mapped across the network as remote procedure calls. The NeFS protocol abandons this model in favor of an interpretive model in which the filesystem operations become operators in an interpreted language. Clients send their requests to the server as programs to be interpreted. Execution of the request by the server’s interpreter results in the filesystem operations being invoked and results returned to the client. Using the interpretive model, filesystem operations can be defined more simply. Clients can build arbitrarily complex requests from these simple operations.

There is already a computer in the SSD, replete with DRAM, multicore CPU, custom math acceleration (depending on SSD design and vendor), etc.

The HP EX920 has a dual-core Arm Cortex.[1] The Samsung 970 Pro has the "Samsung Phoenix NVMe controller".

1: https://www.tomshardware.com/reviews/hp-ex920-ssd,5527.html

It won't be much different than what the current generation storage devices contain.

You already need to upgrade SSD firmware these days. Linux's fwupdmgr has support for a few of them.

I try not to think about the fact that horrible proprietary firmware is in everything now.

One of the comments to the article mentions https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Databa...

How flexible are the limits on key and data size? I imagine a lot of apps would need more than just 255 byte keys and 2MB values. Is there an efficient way to virtually increase the value size, at least?
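One common workaround for a value-size cap like the 2MB mentioned above is chunking a large value across derived sub-keys. The scheme below is hypothetical, just to show the shape of it:

```python
MAX_VALUE = 2 * 1024 * 1024  # assumed per-value cap

def put_large(store, key: bytes, value: bytes):
    """Split an oversized value into MAX_VALUE-sized chunks under
    derived sub-keys; the root key records the chunk count."""
    chunks = [value[i:i + MAX_VALUE] for i in range(0, len(value), MAX_VALUE)] or [b""]
    store[key] = len(chunks).to_bytes(4, "big")
    for i, c in enumerate(chunks):
        store[key + b"#" + i.to_bytes(4, "big")] = c

def get_large(store, key: bytes) -> bytes:
    n = int.from_bytes(store[key], "big")
    return b"".join(store[key + b"#" + i.to_bytes(4, "big")] for i in range(n))

store = {}  # stand-in for the KV device
big = bytes(5 * 1024 * 1024)  # 5 MB -> 3 chunks
put_large(store, b"blob", big)
print(get_large(store, b"blob") == big)  # True
```

The cost is that a multi-chunk write is no longer atomic, so anything needing crash consistency has to layer its own journaling on top.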

But who will be the first to use a Key-Value SSD to implement a file system?

Samsung has already published a Ceph backend for their KV SSD, so you can use CephFS to get a POSIX filesystem backed by these drives.

How would you do a backup of this KV store if it's not exposed as a filesystem or block device? You wouldn't (it's for ephemeral caching?), or you'd just walk the entire keyspace?

There are iterators in the API I believe.

This is really cool! I find it very interesting that moving the key-value store directly into the hardware could result in performance gains.

I suppose these wouldn't be as generally useful?

SNIA KV stack gets rid of the filesystem, and the disk is directly accessible by applications through the API. Hence, it won't be generally useful. The target is a very specific segment of the SSD market.

They need to make write-once SSD/USB drives. So you can back up photos and not accidentally delete them or be ransomwared!

This is incredibly interesting. Can't wait to see more application-specific NVMe hardware come out!

What would a RAID-equivalent setup look like over this architecture?

Well the general idea of a key value store is you are referring to variable size values not 4k sectors. So you'd need a higher level service to ensure object redundancy across SSDs and likely a scrub function to make sure that key/values that you expect are still there, still readable, and have the correct value.
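A sketch of what that higher-level service might do -- N-way replication plus a checksum scrub -- with the devices modeled as plain dicts (my own illustration):

```python
import hashlib

def put_replicated(devices, key, value, n=2):
    """Write the value, tagged with its checksum, to n devices chosen
    by hashing the key -- redundancy the KV drives don't provide themselves."""
    digest = hashlib.sha256(value).digest()
    start = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    for i in range(n):
        devices[(start + i) % len(devices)][key] = (digest, value)

def scrub(devices):
    """Verify every stored value still matches its checksum."""
    bad = []
    for d, dev in enumerate(devices):
        for key, (digest, value) in dev.items():
            if hashlib.sha256(value).digest() != digest:
                bad.append((d, key))
    return bad

devs = [{}, {}, {}]
put_replicated(devs, b"obj1", b"payload")
print(sum(b"obj1" in d for d in devs), scrub(devs))  # 2 []
```

Variable-sized values actually make this easier than block RAID, since whole objects can be replicated or erasure-coded without worrying about stripe alignment.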

A RAID driver would use it as a dumb object-store-backed block device.

The absolutely last thing that any fs designer would want is to outsource to hardware vendor implementation of anything other than a basic operation. We already saw this in hardware RAID which sucks donkey balls compared to software RAID.

This is a pure value add play in a commodity market.

KV-SSD replaces the filesystem with an Abstract Device Interface (ADI), which sits between the KV API exposed directly to applications and the KV device driver. To my understanding, although some modern filesystems combine RAID and the file system, these are separate, independent abstractions. However, it would be interesting to explore how RAID will work with KV-SSDs. You can take a look at a commercial implementation of Samsung's Mission Peak system to benchmark KV-SSDs (I am not sure whether the following one is an officially approved implementation or not) - https://www.broadberry.com/performance-storage-servers/cyber...

How does the cost of these compare with existing SSDs?

Same hardware, different firmware. Cost will be basically the same if the idea catches on.

A hello from the Samsung SSD developer :)

Eh, what a title. Wouldn't it be better if it said key-value ?

Maybe it's a cheap SSD that's important for Samsung's business.

Or a slash; key/value or an indirect key->value.

I was really hoping they were going to finally be selling an SSD that was an order of magnitude less costly than spinning rust. That's about what it would take for the tradeoffs in reliability to be worth it.

QLC-based SSDs are coming very close to parity with cost/capacity of traditional hard drives. Terabyte SSDs are now sub-$100, which isn't that bad at all.

Note that the reliability concerns about SSDs have seemed mostly overblown, after years of abuse and testing.

I've seen the prices on QLC SSDs. I feel like currently they MIGHT reach the same price as spinning rust. At that point it's a trade between seek time and reliability.

Within that context I still see spinning rust as the longer-term archive / bulk storage media. The QLC storage might provide a good precaching layer in more complex systems, and might eventually reach its seemingly intended price point (rather than being a small discount to TLC drives that are more mature and higher performing).

The 'order of magnitude' I'm hoping for might shrink to as little as a factor of two (half the price) before I'd consider it... but 10X the storage per dollar is where I'd hope something with no moving parts and less drive housing should land.

The reliability of flash is fine for everything except leaving unplugged for years. And if you want to do that you don't use drives, you use tapes. It's not a tradeoff at all when you compare to hard drives.

Whining about titles is white-noise that contributes zero value to the discussion, and it's absolutely rampant on HN.

It definitely is not, because mods often change bad titles when it is pointed out.

Which also just happened in this submission.

The fact that the mods humor you doesn't make it not white noise, and in fact it makes them part of the problem since it encourages it.

It is incredibly tedious to see people arguing about titles on every single topic instead of actual discussion. On some topics with less discussion it is literally the entirety of the discussion.

Are you saying that having clear and accurate titles for submissions has zero value? I think most of HN (and even the official guidelines here) would disagree very strongly.

Ok.... I can just store the mapping of content key to blkid. Then I can load the mapping into memory. Then I have a key value store on an SSD using cheap, available hardware.

Obviously, key-value stores aren't a new thing... but pushing that abstraction to hardware offers far better latency.

No it doesn't. The SSD lookup is measured in microseconds. The memory lookup for the mapping to blkid is measured in nanoseconds.

Unless they're putting some expensive, power-hungry CAM in these devices, they're doing exactly what I described above.

The only real advantages could be variably-sized values, and offloading CPU cycles. But a map lookup isn't exactly expensive, especially for an I/O heavy system. And variably-sized blocks aren't particularly relevant, since DBs are pretty good at packing in data.

There's not much there to justify an unusual, expensive device that's far more complicated than the alternative.

> The SSD lookup is measured in microseconds. The memory lookup for the mapping to blkid is measured in nanoseconds.

Some people care about write performance, too.

It's the same in both cases.

You write a dirty mapping in memory. You persist the value, and you persist the mapping. Then you flip the dirty bit.
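The protocol the commenter describes, sketched with a stand-in persistence primitive (the `persist` helper here is hypothetical, standing in for an fsync'd write):

```python
durable = {"values": {}, "mapping": {}}  # stands in for the SSD

def persist(region, key, val):
    durable[region][key] = val  # stand-in for a synced write to disk

def write(mapping, key, blkid, value):
    """Host-side KV-on-block write path: stage the mapping dirty,
    persist the value, then the mapping, and only then mark it clean."""
    mapping[key] = {"blkid": blkid, "dirty": True}   # dirty entry in memory
    persist("values", blkid, value)                  # 1. value hits disk
    persist("mapping", key, blkid)                   # 2. mapping hits disk
    mapping[key]["dirty"] = False                    # 3. flip the dirty bit

m = {}
write(m, "k1", 42, b"data")
print(m["k1"])  # {'blkid': 42, 'dirty': False}
```

The ordering matters: a crash between steps 1 and 2 leaves an orphaned value (harmless, garbage-collectable) rather than a mapping that points at data which was never written.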

Moving the abstraction to the hardware level gets rid of the multi-layer indirection that maps key-value pairs to actual physical pages. It replaces 3 layers of mappings with 1 layer. That reduces the CPU load caused by the multi-layer indirection of software KV stores.

You haven't eliminated them. You've just moved them into a dedicated co-processor.

Co-processors are cheap when they're ubiquitous, like DMA. Co-processors are expensive when they're in custom hardware, like K-V stores. I'll get more performance/$ by using standard SSDs than you will by buying specialized hardware. And because the workload is IO dominated, we'll probably both get the same absolute performance from the same server (that differs only in storage devices).

You'll only recoup those CPU cycles back if you bin pack CPU heavy workloads next to IO heavy workloads, which is rarely desirable for storage services, because it adds a great deal of variance. But you just spent shit loads of money eliminating variance by going to SSD.

I see these as pointless technology because you still have to deal with data protection explicitly at a higher level. Given that, what does KV per drive really get me?

You'll get benchmarks that show higher numbers compared to non-KV SSDs and Samsung may use them in its marketing. But yes, key-value APIs are not going to make durability (and security) any easier. It's easier with lower level APIs, i.e. raw flash or some primitive mapping around raw flash, not higher level APIs.

