Hacker News new | past | comments | ask | show | jobs | submit login
[dupe] Fast key-value stores: An idea whose time has come and gone (ai.google)
335 points by godelmachine 21 days ago | hide | past | web | favorite | 140 comments

I have some real beefs with this paper. Their number about how long it takes to encode or decode a protobuf is wrong and misleading. It seems to be based on this benchmark[1] which encodes a huge repeated int32 stuffed with random numbers. This does not resemble a key-value workload at all. In a KV system you would have something like key and value as bytes fields. It would be extremely simple. By contrast this "benchmark" is the worst possible case for protobuf because encoding random data as varint guarantees that the average field takes 9 bytes instead of 8 and hits the slowest possible path in the codec. The whole paper rests on this number, so the conclusions are crap. They are not even consistent with the practical performance of memcacheg which the authors should have been very familiar with. 1: https://github.com/hq6/ProtobufBenchmark/blob/master/Benchma...

> In a KV system you would have something like key and value as bytes fields.

Isn't that kind of custom interchange format what they're complaining about when they say that remote stores just push the complexity to the client?

In the kv stores I’ve used the client and the server are mostly the same process, or started out that way.

I don’t know many sane people who want to use the kv store as a system of record, and even the people who expect it to be exhaustive (all possible keys) make me doubt their sanity for that and other reasons.

So far, everyone has a bit of code that looks for a key and, if it is missing, performs the work necessary to build the payload. The code that creates the payload is never more than a couple function calls away from the retrieval code.

I don't know but how does complaining that it takes a microsecond to encode 256 random numbers lead to the conclusion that remote KV caches are unworkable? That's just a non sequitur.

> a huge repeated int32 stuffed with random numbers.

Is this not exactly the behavior you'd see if you used UUIDs as your keys? Asking honestly.

If you knew your keys were uniformly distributed you would never use a varint encoding to store them because you'd stand a very high chance of encoding them into a field longer than a primitive integer. A varint can only hold 28 bits of number in the first four bytes, so your odds of getting 5-byte output is 15/16 i.e. very likely. If you really had to encode them I would use either two fixed64 fields, the experimental fixed128 type, or 'bytes' with the exactly-36-byte-long string representation of a UUID. In no case could I imagine packing a huge vector of random numbers into a protobuf int32 field.

The point is that Protobuf has variable length ints by default. That’s an optimization for many common use cases, but slower and larger for random data, including GUIDs. Use Protobuf’s fixed ints for those.

Not to mention protobufs have awful performance compared to more modern alternatives in use today like Flatbuffers, Thrift, Cap'n Proto, SBE.

In the case of Google's own Flatbuffers, the layout is going to be far more performant.

I think it's irrelevant. In fact the protobuf might be the best choice. If it was just defined as so:

  bytes key = 1;
  bytes value = 2;
... your overhead can be as little as 4 bytes and you can alias the memory of the key and value (using a type like std::string_view) instead of copying it. It takes a few nanoseconds to decode a message like this.

> protobufs have awful performance compared to more modern alternatives in use today like Flatbuffers, Thrift, Cap'n Proto, SBE

Do you have a source on that? Genuinely curious.

Hi, I wrote Protobuf v2 (the version everyone uses) and Cap'n Proto.

I don't know if I'd say Protobuf has "awful" performance. It's certainly much better that text-based formats like JSON. But the format is rather branch-y. You have to process it byte-by-byte, because e.g. integers are encoded in a variable-width encoding where each byte contains 7 bits of data plus 1 bit to indicate if this is the last byte. This results in a compact encoding, but takes a lot of cycles to encode and decode. Moreover, since everything is variable-width, in order to find any one field of the message, you must scan through all previous fields, parsing them one by one.

Cap'n Proto, FlatBuffers, and SBE all use "zero-copy" encodings, meaning the data is laid out on the wire in a format that is easy for a CPU to use directly. This means, for example, that integers are fixed-width, and fields are located at fixed offsets. This is must faster to parse (or even use in-place without parsing at all), but does result in somewhat larger encodings. (But then, you can always layer on independent compression when bandwidth matters more than CPU.)

My understanding is that Thrift is closer to Protobuf and contemporaneous with it, so I don't know why GP included it the list.

For simple protocols protobuf decoding has no taken branches. I.e. if you only use the first 15 field numbers (all your tags are 1 byte) and if all the types are the expected types, and if all the variable-length items are < 128 bytes long then you can decode the message without taking any branches. In C++. Most of the other languages have simpler and slower codecs.

This is the hot path in C++[1]. A really large amount of work has gone into protobuf C++ performance in the last 3 years or so.

1: https://github.com/protocolbuffers/protobuf/blob/master/src/...

And all your integer fields must be < 128, right?

Yes, I suppose the branches in Protobuf can be pretty predictable. Still, you do generally have to examine each byte individually.

Sure. In this specific case of a kv store it's hard to imagine how to simplify it dramatically from protobuf. As a proto you might have: tag-length-key-tag-length-value. Instead you could store the key and value lengths in host format using 8-16 bytes: length-length-key-value. It's not _dramatically_ faster to decode this, and you traded away extensibility to get a marginal speedup.

Sure, I was speaking in general, not specifically about the key-value case.

I think most serialization frameworks are likely to be overkill for such a use case, spending more time on setup than actual parsing.

Also note that storing the value (and maybe the key) with proper alignment might make it easier to use the data in-place, saving a copy.

Hit 'y' before copying the link; the line numbers have already shifted.

Which version of Thrift? Apache Thrift looks roughly the same as Protobuf on our benchmarks. Perhaps this is fbthrift?

> It seems to be based on this benchmark[1] which encodes a huge repeated int32 stuffed with random numbers

However, if your entire world consists of pushing around images to GPUs/TPUs, that's pretty much the only workload you ever need to optimize from your larger ops system.

Edit: not sure that this particular use case is exactly what's going on in that benchmark, now that I look more closely. Also, I wouldn't want to assert that it's a use case that matters a ton.

Encoding an image as a repeated sequence of variable-length integers would obviously be a poor choice.

Except.. that's pretty much how raw images are represented. Substituting "integers" for "bytes" if you want channel resolution, or leaving integers if you have 32bit pixel depth.



You overlooking the critical term: variable-length.

is anybody talking to GPU through memcached GETs/PUTs?

This paper is very light on details. It defines "RInK architecture" as something which uses stateless application servers w/ key-value store as backend (Redis/Memcached). Section 3 then shows that it's faster to use a stateful application server, but skips any details about scaling/durability which makes the comparison kinda strange. Are they really just comparing a single-threaded in-memory application server with a scalable system of stateless application servers?

Their solution (LInK) is a key-value system "as a library": You link it into your application servers and you can then store data (they become stateful). The system will automatically handle sharding, replication, consistency and so on. From your perspective you're working directly on memory. You still need to implement marshalling, but this is only used for sharding/replication across nodes (which the system handles completely for you).

They say nothing about how this system handles data consistency/replication/durability, so it's a bit tricky to assess it. They also have this nice note: "(since, for performance, persistence of state is not guaranteed)".

Summary: This is a system which couples your application logic with your persistence layer (in the same OS process). This can give you great performance as they co-side with each other.

Looks like a very cool project, and I'm actually surprised there not more "database as a library"-projects out there.

> Summary: This is a system which couples your application logic with your persistence layer (in the same OS process).

I don't think the paper is talking about the persistence layer. Rather it's focusing on the cache layer.

From the paper:

"Modern internet-scale services often rely on remote, in-memory, key-value (RInK) stores such as Redis and Memcached. These stores serve at least two purposes. First, they may provide a cache over a storage system to enable faster retrieval of persistent state. Second, they may store short-lived data, such as per-session state, that does not warrant persistence."

If you look at Fig.2 in the paper, the database (persistence) layer is always there. What the paper is advocating is to move from an out-of-process cache server like Redis to an in-process cache library like Java Caffeine

So for the canonical use case of session state in a load balanced server farm, is the session state being replicated across every single one of these in-process caches? Or, if a request is routed to a server without a copy of the session state, does the in-process cache have to go find a peer server with a copy of the session state?

This seems like an easy “solution” to show how it speeds up the happy path while not fully addressing the real-world tradeoffs.

Stateless application servers come and go without a care in the world. You can reason very simply about the cost and effects of bringing them up and down.

Add in a KV store with sharding, replication, a discovery protocol, a heartbeat protocol, a sync and recovery protocol, etc... and put it all in-process on every application server?

Have fun monitoring and debugging this.

I much prefer the 3-tier system of a local RAM cache, a Redis cache, and a persistence layer. As much state as possible is idempotent so you can cache it locally as well as in Redis. The load balancer can best-effort route back to the same app server, and your KV lookups automatically check local RAM first. But the central KV store handles replication and is easy to monitor and scale out if needed.

The paper assumes that you have a good amount of related context. I'd recommend at least reading the Slicer paper mentioned several times (https://www.usenix.org/system/files/conference/osdi16/osdi16...).

How's in-process issuing the same library much different than using a separate process? What's complicated there in monitoring/debugging that's different that's not otherwise?

I think that debugging with one less process is a good thing. And monitoring with one less process boundary to traverse is easier as well.

For me the most important part was in the abstract: autoscaling. It’s not just about going to zero but going to hundreds of thousands. I can remember my two most painful times with memcache:

1. A client scaled rapidly & we had repeated hits on the same key. We were saturating the NIC in the memcache server. We had to addd indirection for some data and introduced an local in-memory layer in front of memcache.

2. We had to extend the ring size to give memcache better CPU utilization. We had to teach all our code to handle a migration process (reads were fall through to the new ring and writes purged both). We couldn’t just turn off memcache because a total dump would overwhelm the database.

> I'm actually surprised there not more "database as a library"-projects out there.

For distributed systems, you tend to need quite a bit of configuration (how to find the peers, authentication, things like namespaces).

And then if the communication is TCP-based, you tend to keep connections open to avoid the handshake latency when you need to retrieve a value. Which sounds like something you typically don't want running inside a library, but rather as a separate process.

And then you've basically arrived at the memcached design.

So the crux of this paper is "if you have perfect auto-sharding as a service, it's better to cut out a middleman and have it shard stateful app servers, rather than just sharding a KV store layer." In many ways, this paper is an extension of the 2016 paper it references, which introduces their internal sharding-as-a-service, Slicer: https://www.usenix.org/system/files/conference/osdi16/osdi16...

Once you separate out Slicer, this result is pretty obvious in theory - in fact, in that Slicer paper they mention machine learning as a driving example, where it absolutely makes sense to colocate application logic with user-specific models cached in memory.

Honestly, this paper, with its one-paragraph implementation section, reads almost like clickbait - "let's say something controversial, release an un-reproducable paper that adds very little to the literature besides 'this is something cool you can do if you have a tool only Google has,' btw do you want to work for Google?" The Slicer paper itself has much more detail and IMO is a much better read.

SQLite (:memory:), lmdb, berkeleydb, memcached-over-unix-socket, APCu (php), EHCache (java)..

I'm sure there are others; I don't get what this paper presents that is new

Oh, sorry, I was imprecise: I meant "distributed database as a library". As in, you compile/bundle your application logic together with the persistence/replication/transaction logic and that is what you're running in a cluster.

Datomic is a distributed database written in Clojure, and is best described as "cqrs-as-a-service". It allows you to run a "peer" inside your local Java VM, which essentially does exactly this: it caches the parts your local vm needs, and if not it looks elsewhere.


This is do-able with FoundationDB. You could distribute your application colocated with the database nodes and write a router to redirect requests to the machine which owns the data you're looking for. Then your application logic is computing against data on the same machine.

FoundationDB exposes the key-range to node mapping to you to make this router possible.

The BeamVM languages (Erlang, Elixir) seem to have this in the form of Mnesia, and libraries that are built on top of it, such as riak-core.

Ha! I was just about to ask about that.

‘I bet Elixir already has something like this’.

Riak uses Mnesia?

Not that I know, but it uses ETS to handle state, and the global process registry to store peer information, as far as I know.

Also, be aware that OP mentioned riak-core, which is a different thing from Riak. riak-core is a library to build dynamo-style distributed applications, which gives you virtual nodes, consistent hashing and data migration as building blocks.

Riak KV uses leveldb, but as @ergl mentioned, riak-core is essentially a hash-ring and replication (aka dynamo) layer.

EHCache has tons of capabilities and is bundled directly into your app('s war file)

Apache Geode, Hazelcast, how some people are using Kafka. I am just listing a few others, please don't consider this endorsement.

Apparently I have designed a few production "LInK" systems over the years (now that someone else has named it), with relatively sophisticated algorithms for load balancing and replication. These architectures are great for fast-twitch mixed workload use cases, rapid mutation rates concurrent with relatively light, low-latency analytic queries. Think operational real-time data models.

Conceptually, these are similar to the simple database kernels used as caches inside BI systems like Tableau but extended to support write workloads and scale-out. These may have a large amount of storage in them but only to increase the quantity of data each server can handle (i.e. swap space). Storage is not for durability, other systems handle that. The primary advantages of this architecture is it dramatically reduces data motion and wasteful computation, and you can often push complex data model logic to this layer since it implicitly contains more application context.

All "databases as a library" (e.g. SQLite) are designed around a much simpler set of constraints. The combination of high write rates and continuous background resharding/rebalancing across servers requires low-level architectural support that you can't graft onto a database kernel that was not explicitly designed for it. To start with, any query execution has to have sophisticated topology awareness in a system where the topology is constantly changing due to resharding/rebalancing and you can't rely on global consensus for everything. The library you are linking into your application contains all the clever algorithms for navigating a data cache that behaves like this, you don't want the app developer to have to reason about it.

The level of dynamic data model "fluidity" in the underlying organization and algorithms means that a mature design has little in common with traditional in-memory databases even though there are many superficial similarities. This isn't something you grind out over the weekend, there is a lot complex and non-obvious theory behind robust designs.

(I agree that the paper is not very clear about what they are talking about. This is my interpretation of the paper based on my recognition of the claims and architecture.)

Do you have any reference for your load balancing and replication algorithms?

On the topic of pruning logic more complex than put & ge in the application, to me that is kind of like want to add iterable functions to a language that only has for loops. That was enough for me to appreciate the paper. I was recently working on adding some query logic to an app that talks to Redis, and it definitely felt like more work than I cared for. At a minimum I would have liked to see that logic go into a more feature rich Redis client

> I'm actually surprised there not more "database as a library"-projects out there


Reminds me of the talk "Building Scalable Stateful Services" by Caitie McCaffrey about stateful app servers for Microsoft Halo (the multiplayer shooter game). She explained that stateless app servers are wasteful and the user's data should be cached in an instance and routing should be sticky to always reach that instance, and once it is warm performance is good. If the instance goes down, the user data will get loaded into another instance and will get stickied there.




We are an all or nothing sort. If we use statefulness and sticky sessions we almost always tend to use it in a way where migration to another server is next to impossible.

If we had better tools for the moderate position then life would probably be easier than at the extremes.

While the idea of migration is useful for resilience in the face of hardware failure, it’s more attractive from a standpoint of elastic scalability.


We are still living in a bubble where we believe that cloud providers aren’t going to oversubscribe their hardware the way ISPs have been doing for decades. As the ratio of private server rooms declines, that bubble starts to wear thin.

We are still living in a bubble where we believe that cloud providers aren’t going to oversubscribe their hardware

AWS doesn't oversubscribe their main class of cloud servers, but they will sell you a cheaper t-series if you're willing to accept throttling under heavy CPU use.

They used to have "noisy neighbor" issues with some of their older instance types where another server on the same physical hardware could monopolize the disk/network bandwidth (but not the CPU), but they've resolved that now, every server can use it's published resources (except for network which can burst up on 10 Gig on some instances types, but sustained except on large instance types that can sustain 10Gig)

They've built their business model on selling dedicated resources, and I don't see that changing.

If we use statefulness and sticky sessions we almost always tend to use it in a way where migration to another server is next to impossible.

I'm developing a server cluster with statefulness and "sticky sessions" (1) for a multiplayer game. Migration to another server is a feature I'm working on, and already have prototyped in demo form.

(1) -- "Sticky sessions" are actually Player agents which live in a particular server.

Meanwhile, the rest of the CRUD world is still using local RDBMS in place of local key-value stores.

I understand the point the abstract is trying to make, but the title is sensationalized at best.

I remember a headline on HN from a month or two ago. Something like "You're not Google".

Turns out that a run of the mill RDBMS will fit 99% of the problems you're trying to solve.

Edit: https://news.ycombinator.com/item?id=19576092

What makes an RDBMS "run of the mill" and a key-value store not? If anything key-value stores tend to be much simpler, both conceptually and in terms of implementation.

A SQL DBMS (run of the mill) gives you a well thought out data model (the SQL model, derived from the relational model) that implements your business logic in terms of a well designed, well understood data model.

Conceptually, the relational model has relations and atoms. The algebra is built on the primitives conjoin, disjoin, project and rename. You can't get much simpler while being as expressive.

The SQL model adds a good deal of complexity, but at least it's standardized and you can generally ignore the complexity you don't need.

A KV store is conceptually simple if you stick to using it as a KV store. Once you try to implement any kind of schema, you have to build that from scratch. As you add business logic to it, you wind up inventing an ad hoc data model, which is likely to be conceptually more complex than either the SQL or relational model.

You may not be aware of the complexity of your ad hoc model, but math will inevitably remind you.

Redis with GraphBLAS [1] gives you a "well thought out model", exteme performance, and all the goodness of a linear algebra relational model under the hood, which can easily run on CPUS/GPUS/TPUS or a combination thereof. Pairs well with tensor models too.

For more context and depth, see previous discussions...


[1] GraphBLAS http://graphblas.org

You may not be aware of the complexity of your ad hoc model, but math will inevitably remind you.

I think this idea needs to be expressed more clearly in CompSci and programming. What is the exact nature of the complexity explosion you are talking about here? Is there something analogous to the increase in complexity going from regular expressions to stack machines? What math are we talking about here? It seems like everyone just explains the relational algebra, then leaves it right there.

> It seems like everyone just explains the relational algebra, then leaves it right there.

Fair point.

> What is the exact nature of the complexity explosion you are talking about here?

I'm thinking in terms of the entities of Occam's razor: "entities should not be multiplied unnecessarily." "Entities" is pretty abstract, we can't identify what complexity is directly, but what we can do is imagine, "what if we tried to build a mathematical model that captures a real life application?"

If we did that, and had a mathematical description of a thing we wrote, we can formalize it by trying to reduce it to some minimal set of axioms.

And then, your more complex rules are derived from those axioms, and if you get your math right those complex rules will be consistent. If you're very clever, you can make it reasonably intuitive.

If you have something that's very complex, what you'd observe after modelling it is you have mostly axioms and very few rules are able to be derived from those axioms. That is, the rules are just the rules and there's no broader reason for them to be so, or deeper consistent patterns. And, maybe some of those rules wind up being contradictory, and they may lack orthogonality.

The relational algebra, being an algebra, is a set of operatations that are closed over the universe of relations, so it's very nicely orthogonal and reduces to a small set of primitives. As relations can be visualized as "tables" they're relatively intuitive, and using techniques such as normalization you can also structure around potential anomalies that add unwanted complexity.

You can also control where your complexity goes. An integrity constraint can enforce some rules about what may be in a relational variable, and if that's enforced, all code that wants to use data from that relation can simply assume that the data maintains that structure. Thus the complexity can be centralized in the system.

Now, that's in the ideal world with a True Relational DBMS, we have to deal with SQL and vendor-specific SQL at that, so we have a rather complex underlying model. But it's typically Pretty Good.

When we build such a model ad hoc, we go by our intuition and are constrained by the market. Our intutition leads us to use more complex structures, that is, structures that if they were expressed mathematically would use a larger set of axioms to describe them. Then, as we want those structures to interoperate, we are unwittingly merging two sets of axioms.

Worse, because we're typicaly expressing it in application libraries, we wind up repeating code, so the complexity is spread all over the application and it especially multiplies if "conventions" are copy-pasta'd.

Hopefully that explains the nature of the explosion of entities / complexity; let me know if I can expand on anything.

Hopefully that explains the nature of the explosion of entities / complexity; let me know if I can expand on anything.

It does not satisfy me. One can show that a regular expression or finite state machine is limited in specific ways, as compared to a stack machine or a Turing machine. One can write proofs concerning the number of states a specific machine can be in, given an input of a certain length. The explosion in complexity can be quantified, as can the impact on the effectiveness of testing. By comparison, "using techniques such as normalization you can also structure around potential anomalies that add unwanted complexity," is just an aphorism.

Of course there are times (such as encrypt/decrypt requirement) where using predictable paths as keys, and encrypted json as values can be easier. Also, a K/V store can easily have the underlying system swapped out... replace with Mongo/Redis/PostgreSQL/RocksDB/LevelDB ... you keep your app interface the same at a higher level, and swap out the lower layer as needed for a given environment.

Somebody please hard-code this comment to the top, because it is spot on.

An RDBMS can handle considerably more tasks due to the power of the relational tools, and there’s a huge existing library of tools built around that model. That means that there are far more projects which can be implemented using only an RDBMS than only a KV store, and in the modern era both are easy cloud options with SSDs pushing caching past the level that most projects will ever need.

There's even databases that are focused on increasing the capabilities of rdbms.

Point is that sqlite could take people way further than they think. Postgres incredibly far. By the time you need a caching layer where key value stores shine best you have many options you can do and resources to do them.

> Postgres incredibly far

Up to the millions of TPS actually.


I think the point is that RDBMS are considered stodgy and uncool despite the fact that they can be extremely sophisticated and powerful.

Cheap, "faster," less-reliable stores became very big around the Mongo explosion - and don't get me wrong they can have a place, particularly Memcached and Redis which are great for ephemeral data - but you can do a lot of that with a RDBMS.

There was also a point where "internet scale" was impossible to handle on regular RDBMS's without massive amounts of pain. But disk IO performance has gone up dramatically since, and capacities likewise.

So a lot of caching that many of us had to do to scale systems a decade or two ago is now not necessary until/unless you get several magnitudes up in scale, and while the number of internet users has gone up, so has competition, and the proportion of services that gets successful to need scale up to several magnitudes beyond the old chokepoints at the time is tiny.

I think you're misunderstanding what "run of the mill" means. It isn't an expression of simplicity. Anyway, look it up, and you'll have the answer to your question.

"Run of the mill" is not about simplicity, but refers to something common, average, commotidized.

Lots of dev opt for DBs like Redis or Mongo specifically because they're simpler and easier to develop for. Not for scaling. I think that can easily be a trap though because they are very easy for simple use cases, but get hairy really quickly when you start needing complex relationships.

You can use RDBMS for anything that requires complex relationships and map them to unique identifiers in the Mongo documents.

In most cases though, Mongo fits use cases, we never try to use it as a wholesale RDBMS replacement.

Agreed... Mongo also works well for denormalizing data structures into something that renders faster to the browser. A single key/document->json->get->parse-> render is much faster than a query across many joins and/or sub-queries.

As an aside, if you're doing it anyway, it's also a good point to dump to .json.gz in S3 (or similar) as a secondary backup system.

The authors advocate for run of the mill rdbms plus their sharing system.

Fast key-value stores are really great for caching layers, as the main source of truth on data not as much. Whether these are local in-memory or external like Redis/memcached/mongo etc.

Any good large scale system will have a caching layer for in memory or external cache store, even small/medium apps may use them for performance and ultimately cost when it comes to cloud, putting more on cpu/local than external databases. It is really what works for the system the best as with anything in engineering, sometimes needed, sometimes not, it depends.

Memcached really started out as a caching layer, it is right there in the name. Once NoSQL started the caching initial focus was lost for a bit, but I think fast key-value stores are returning to mostly being cache layers, which work great with any system, including RDBMS backed of any size.

Cloud key-value stores or local in-memory are almost required to compete on speed/performance even for marketing (search results) and even cloud costs of going to the database as well as taking weight off the database which is usually the hardest hit.

Caching layers are useful in all systems/platforms, usually local in-memory first, then shared/external if costs/budget/performance needed. You can even do local caching on the instance, and a remote key-value store, both in front of the database for backup on performance/speed, though cache busting should be implemented from the start when data changes to prevent stale data.

the real irony is that RDBMS are all built on fast k:v stores :)

> the real irony is that RDBMS are all built on fast k:v stores :)

Not if the RDBMS is a column store. Or if the storage model is based on log-structured merge trees. There are many ways to map the relational model to storage. Even for the traditional row-oriented case in InnoDB there are structures like B+-tree indexes that do not directly correspond to a K:V store.

The real point is that you don't want to reimplement a poor, ad-hoc RDBMS inside your application.

What do you mean, a filesystem?

Pretty much all RDBMSs can be considered a set of key-value collections, and their storage backends often conceptually are pretty much that. Assuming a row-oriented storage engine, a simple approach is:

* A main key/value collection consisting of primary key (key) mapping to a serialization of the row (value)

* One key/value collection per index consisting of each index function applied to the row (key) mapping to the primary key (or some internal object id; the value)

It's a bit of a simplification as an RDBMS needs to maintain a sort-order and allow e.g. range queries, and need interfaces that allow reading/writing lots of rows with low per-row overhead, and so it's not that they'd be a good match for just any key-value store.

I think it's a reference to RocksDB, and maybe also InnoDB

I’m not sure what you mean? Aren’t most databases built on assumptions of continuous disk space mapped by indexed arrays? How does k:v fit into the picture?

The most common high-level use of a DBMS is the pair "given this key, get me the record that is its value" and "store this record under a new key". It's not exactly equal to the usual k:v usage (because it uses meaningless keys), but it's very similar.

That said, I don't think this is the usage scenario most DBMS are optimized for, because despite being common, it's a very fast case, so there is relatively little to gain. At the same time, this is the only usage of k:v storage, so you bet it's optimized for that.

Stateful applications with the linklet abstraction seem a nightmare to operate just like plain old stateful webapps that keep sessions in memory (requiring sticky sessions and annoying users when shut down).

I understand the dislike for using in-memory databases as a json store (by reading/writing entire objects at once), but if you're using Redis this way, you're doing it wrong.

Another thing that the authors authors seem to forget is that "rinks" also provide coordination across multiple instances. Locks, atomic operations, transactions, even pub/sub is useful to get the right amount of coordination between instances (see my take on preventing cache stampedes https://github.com/kristoff-it/redis-memolock).

Once you move to their new version of stateful instances you get all the operational downsides and lose all the coordination features. "auto sharding" is not going to be enough to solve all coordination problems.

Calendars are a good example of a domain that is tricky to model as a bunch of kv operations, but nothing that can't be reasonably decomposed by applying DDD + the repository pattern + transactions + lua scripts (in the case of Redis). And if you end up with something that is computationally expensive, then it means that you need a custom data structure and maybe an algorithm optimized for it, and this is where Redis Modules come in.

The real domain of "rinks" is not your business domain, but the implementation domain of data structures, algorithms, asymptotic complexity etc. Thinking that the focus at that level should not be on compsci, but on the business domain, will inevitably produce bad performance.

> I understand the dislike for using in-memory databases as a json store (by reading/writing entire objects at once), but if you're using Redis this way, you're doing it wrong.

I guess... is it so terrible to use a dictionary/hash-type data structure with values more than one level deep? Redis doesn't provide a data structure for that.

> Redis doesn't provide a data structure for that.

Both Redis and Aerospike provide more than enough mechanisms for one to roll their own composite hierarchy. Both provide server side scripting via Lua as well so the database can maintain consistency and prevent chatty access to the store.

KV database modeling is just as rich as its relational counterpart, but if one isn't using a KV store for a compelling reason, use Postgres.



I think in the paper they are talking about serializing the entire object as a single string key that needs to be loaded and written all at once even to do simple things.

Nested data structures are definitely going to be better than that but they also do have some caveats. If you really want to nest arbitrarily, take a look at RedisJSON, but I would recommend to stick with vanilla structs as much as you can.

More often than not you can quench your thirst for nesting by good use of key prefixes.

Microsoft have been pushing this kind of thinking for a while with Service Fabric. If you buy in completely and use both the framework and the infrastructure you get structures which are in-memory and replicated for you.

A couple of the .Net guys we hired preached that stateless architecture is a little old-fashioned - over time I've come to agree. A lot of things can be shoe-horned in to a stateless world but become much easier in a stateful one.

We've been building this way ever since we moved to Elixir which makes distributing state across your application easy. Memory on application servers is severely underutilized

> Memory on application servers is severely underutilized

That's highly dependent on your choice of language and how you're running your app.

Came here to add this. At least based on this paper, I feel better about where things are headed with my choice of using elixir/erlang/OTP as the primary stack for our web services, for exactly this reason.

While we don't have the scale yet to need to cluster out and handle these kinds of issues, I've told my cofounder, that thanks to Elixir, I'm not terribly worried about the costs when does happen (not as worried as I would be if we were using something like ruby)

Can you explain more?

Systems like Erlang/OTP, Akka, Orbit and Orleans/Service Fabric are built on an actor model where domain objects of the system (e.g. Users, Accounts, Invoices, etc.) exist within the cluster and have a an address so they are like having a bunch of mini servers. These servers typically keep their state in memory so they can respond to query messages quickly. Plus the application can unload idle (or no longer necessary) actors and restore their state when they are needed again. It's very similar to the Linked in-memory key-value idea mentioned in the paper.

I think it is an anti-pattern to design a system where each domain object is a process. Sometimes data is just data and should be managed as such.

I think what makes actor models so nice is the explicit ownership of state. It is not possible to declare "var x = 1" in one file, and access x in another file. You always have to retrieve state explicitly, otherwise it won't be accessible within the scope of your function.

We do this for several key elements of a high throughput service. It keeps the critical path local only.

The flow is to dual write to our rdms and a local process. The process either stores in ETS or an process state map. It also broadcasts out the event to the cluster. When a node in the cluster gets a message, it executes on it. Process startup is restored from the DB which is still authority.

Generally this is really easy to do in Elixir. Maybe 50 lines of code for the system described (minus DB code), much of it definition lines.

Erlang/OTP provides a system of stateful processes that communicate via message passing, with a focus on resiliency and fault tolerance.

These processes can seamlessly interoperate while running on multiple machines.

This makes building stateful application servers a pretty natural fit.

probably just using ETS out of the box

This is interesting, but I think the authors don't talk enough about CPU and memory utilization. To me, the "classic" Google distributed systems architecture puts different tasks in different logical servers (doesn't matter if they're separate physical servers or not), which gives those servers more predictable memory and CPU usage, which in turn enables tighter bin-packing of jobs in the datacenter. The price they pay is needing a really, really fast in-datacenter network, but in the past they've been okay with this.

The paper proposes putting application-specific processing and memory caching on the same host, which might give the combined server less consistent CPU usage and therefore lower utilization, but will also eliminate the network hop from application server to in-memory cache. It seems intuitively reasonable to me to give up some CPU utilization in exchange for eliminating an all-to-all network connections stage, but I would like to see a real cost and speed comparison.

An interesting take. When borg was written the main machine class was dual-dual-core opteron. Now I imagine the typical borglet has dual Haswell or Skylake CPUs with I guess between 40 and 88 cores. Do you think (or do you have data that indicates) the typical Google container/vm has grown to keep pace with the machine size, or do you think the mainstream container is still 1CPU/4GB?

No idea, I don't work there any more. It would be really interesting to know though.

Provocative title but fundamentally the paper is "what if we just run the application on redis and skip the app server." Its not like no one has tried sticky sessions before.

It seems like the paper is simply arguing we should bang our heads against this yet again without a solid reason why we would succeed this time. Sharding/routing wasn't the only problem with sticky sessions.

That is definitely what I took from the article. Nothing but a shock-title. Data stores were created because previous solutions were not working as well..

Generational oscillation: stateful -> stateless -> stateful -> stateless -> ...

This paper isn't great.

Yes, I can use a custom in-memory data structure, write it in Go and cluster it using some nice Raft replication. It's not all that hard and it's much, much faster than Redis. (we do this at Stream for activity feeds and chat)

For most apps doing this is totally impractical. If you're using PHP, Python, Node or Ruby for a small to large (anything that doesn't have 100 million plus users) the overhead of implementing the pattern the authors suggest is just too large.

I think Redis will only keep on becoming more popular...

I am confused. The major problem is claimed to be un/marshalling. But they take no closer looks. Advise an architecture without it just to introduce another with it.

Seems like protocol overhead will stay.

Yeah, the only thing we can really learn from this paper is that the authors aren't very familiar with protobuf. Anybody who claims it takes 10 microseconds to decode a 1KiB protocol message is either intentionally constructing the worst-case message or doesn't know what they are talking about.

> Yeah, the only thing we can really learn from this paper is that the authors aren't very familiar with protobuf.

This is stupid. Three of the four authors are engineers at Google.

That should not surprise you. When Microsoft was at its dumbest, its corporate identity was that they were hiring the best of the best.

Now Google has that baton. Wonder who they’ll pass it to...

I did not intend to express agreement with the OP's stupid comment.

If marshalling is the problem, fix marshalling. Erlang (built in distribution), CapnProto, Flatbuffers, etc


The authors are the owners of Google Slicer, a sharding system that enables stateless frontends to connect to sharded data stores (user stickiness).

And so basically, the paper's tl;dr can be: It's time for everyone to stop using well-established best practices on X, and start using our product Y.

It's more like: He got much better results with Y than well-established practices (that many people disagreed were best) X, so we packaged it into a product.

It may be still be a marketing lie, but it's not something to dismiss automatically, mostly because a lot of people disagreed the practice was "best" all the time.

This does not seem a very smart thing to do for any standard application especially without a well tested LInK library.

When it was necessary and was easy, I had local in memory caches for some specific hot data acting as another level of caching. In the simplest case it us just one variable + timestamp. Otherwise usually a Map or a simple library will do it.

But substituting your caching layer with in process caching, it is usually very risky. In cases where the system cannot work without a hot cache, in memory caching can easily lead to extened downtimes of the whole system.

First of all, the deployment process needs to be reengineered so the new version of the app can load the cache of the current running process.

Then if there is a bug that crashes the process, you end up losing all the cache (unless like Redis you also save it on disk).

Until there is a well tested library that can do that or the system really need a system like this, I am going to keep using Redis and use a second in memory caching where it is easy and low risk to do.

I have noticed that in most use cases of redis/memcached/etc, one will abstract it into some domain specific data-structure. At scale that abstraction becomes a service of it's own (job servers, caches, pub sub, etc.) at which point it doesn't make sense to keep the data separated from the service, because all it does is it adds an extra layer of indirection. I presume at google it's really not that big of a deal to write a fast priority queue service that keeps things in it's own memory and talks protobuf directly.

Relatedly I also noticed that teams that overuse memstores predominantly use application languages that have a terrible concurrency story (e.g. python), fall back to multiprocessing where creating anything stateful is not as obvious as keeping it in memory, so they offload that state to redis.

As a maintainer of KeyDB I think there is some truth to this in the form of naive get/set queries. More of the processing should be done server side to avoid wasteful network traffic and the latencies that entails.

Memcache was released in 2003, we didn't start to see server SSDs until 2008. For many places, if you put the database on SSDs, it's fast enough.

It strikes me that you could easily satisfy this by running a Redis process local to your application node as a cache and, indeed, you probably should instead of trying to re-implement functionality yourself.

Also redis is cool for pub/sub.

If your rate of churn is low. It’s fairly easy to get a server with excess memory, especially if you’ve pushed state out of your services, such that having a kv sidecar isn’t that onerous. But then you get cache invalidation issues you have to find a solution for.

Makes total sense. Here is a presentation from AWS reInvent from PlayStation folks: https://www.slideshare.net/AmazonWebServices/aws-reinvent-20...

You just can't beat memory

> Instead, data center services should be built using stateful application servers or custom in-memory stores with domain-specific APIs, which offer higher performance than RInKs at lower cost.

We've been using Sirius [1] (on the JVM) for this.

[1] https://github.com/comcast/sirius

I'm working at ScyllaDB we have observed that long time ago and even blogged about it back in 2017 https://www.scylladb.com/2017/07/31/database-caches-not-good...

There is db with integrated application server that mail.ru was writing about in 2016 - https://medium.com/@denisanikin/how-to-save-one-million-doll...

Isn't a stateful service basically the actor model with optional persistence, like in Akka.Persistence model?

I'm not entirely sure I understand this paper.

Are the frontend servers hitting the auto shard service with say, a key, and it returns a list of backend server(s) to hit where it knows that key is stored in memory?

Wouldn't the auto sharding service come with the usual challenges of HA, load balancing, integrity etc?

This looks like it is trying to build a technical rationale for keeping user data on servers, in contrast to the privacy-centric approach being pushed by their main competitor in the mobile space, which advocates keeping user data local to the user device for privacy reasons.

Well. They may not of intended to do so. But they convinced me more than I already was, that data stores like Redis are needed. This article sounds like they already decided they had a problem with key value stores and tried to rationalize their bias desperately...

"Application logic as a library" is the right way to think. We can't even make "threads into a library", let alone "database as a library".

The challenge is to design safety so badly written application logic doesn't crash the database.

Thank you. I finally have a good reference to point folks to when I get the question about why I'm not using redis. Although not as fun as the "web scale" response to mongo questions.

> I finally have a good reference to point folks to when I get the question about why I'm not using redis.

What are you using instead?

.net memory cache. It is not distributed, but we have a single tenant architecture so it works well

I need a place to store values organized with just a bit of logic and a nice api while my workflow runs. I don't want the overhead of a file system or psql... enter redis.

ITT: the mix of haters, supporters, and curious observers is the exact type of reaction that the HotXXX series of workshops is intended to surface.

I don't understand. I skimmed the paper but it sounds like NoSQL but using the application servers themselves as the data store. ?

You're not far off, I'd analogise as

"redis, but compiled into your app, with all the necessary bits to make that work like you'd hope (consistent hashing/replication), (native data structures)"

When I suppose the separation architecture of computing layer and store layer is go without saying, others begin talks about stateful service.

The title is click bait in my opinion as the article (as others noted) is a proposal for:

"... LInKstore is a high level abstraction over an auto-sharder that provides a distributed, in-memory, key-value map withrich application objects as values, rather than strings or sim-ple data structures".

This comes from Section 4 of the .pdf which has the hard core content. The issue is interesting.

k/v mem store is only supposed to be a read cache. it's not supposed to handle business logic, and it's not supposed to persist storage. if you need any of that, use an rdbms.

> custom in-memory stores with domain-specific APIs

CIMSDA is the new CORBA?

Only skimmed the paper, but to me this sounds like an incoherent description of Reliable Actors on Azure.

I don't think it benefits the paper to position this model directly against something like Redis.

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact