Isn't that kind of custom interchange format what they're complaining about when they say that remote stores just push the complexity to the client?
I don’t know many sane people who want to use the kv store as a system of record, and even the people who expect it to be exhaustive (all possible keys) make me doubt their sanity for that and other reasons.
So far, everyone has a bit of code that looks for a key and, if it is missing, performs the work necessary to build the payload. The code that creates the payload is never more than a couple function calls away from the retrieval code.
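That "look for a key, build the payload on a miss" shape is the classic cache-aside pattern. A minimal sketch (the dict, `build_payload`, and the key format are all illustrative stand-ins, not anything from the paper):

```python
# Cache-aside sketch: check the cache, do the work on a miss, backfill.
cache = {}

def build_payload(user_id):
    # Stand-in for the builder that's "a couple function calls away":
    # in real code this would hit the database or do the expensive work.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    payload = cache.get(key)
    if payload is None:            # miss: do the work and backfill
        payload = build_payload(user_id)
        cache[key] = payload
    return payload
```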
Is this not exactly the behavior you'd see if you used UUIDs as your keys? Asking honestly.
In the case of Google's own Flatbuffers, the layout is going to be far more performant.
bytes key = 1;
bytes value = 2;
Do you have a source on that? Genuinely curious.
I don't know if I'd say Protobuf has "awful" performance. It's certainly much better than text-based formats like JSON. But the format is rather branch-y. You have to process it byte-by-byte, because e.g. integers are encoded in a variable-width encoding where each byte contains 7 bits of data plus 1 bit to indicate if this is the last byte. This results in a compact encoding, but takes a lot of cycles to encode and decode. Moreover, since everything is variable-width, in order to find any one field of the message, you must scan through all previous fields, parsing them one by one.
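The byte-by-byte nature is easy to see in a toy varint decoder (this follows the standard protobuf wire-format rule of 7 data bits per byte; the function itself is just a sketch):

```python
def decode_varint(buf, pos=0):
    """Decode a protobuf-style varint: 7 data bits per byte,
    high bit set means 'more bytes follow'."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift   # low 7 bits carry the data
        if not (b & 0x80):              # high bit clear: last byte
            return result, pos
        shift += 7

# 300 encodes as two bytes: 0xAC 0x02
assert decode_varint(bytes([0xAC, 0x02]))[0] == 300
```

Note the data-dependent branch on every single byte — that's the cost the compact encoding pays for.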
Cap'n Proto, FlatBuffers, and SBE all use "zero-copy" encodings, meaning the data is laid out on the wire in a format that is easy for a CPU to use directly. This means, for example, that integers are fixed-width, and fields are located at fixed offsets. This is much faster to parse (or even use in-place without parsing at all), but does result in somewhat larger encodings. (But then, you can always layer on independent compression when bandwidth matters more than CPU.)
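The fixed-offset idea can be illustrated with Python's `struct` module (the layout here is made up for the example, not any real Cap'n Proto/FlatBuffers schema):

```python
import struct

# Zero-copy-style access: with fixed-width fields at fixed offsets,
# any field can be read directly without scanning earlier ones.
# Illustrative layout: u32 id | u32 flags | f64 score, little-endian.
buf = struct.pack("<IId", 7, 3, 1.5)

# Read 'score' straight from its known offset (8), skipping id and flags.
(score,) = struct.unpack_from("<d", buf, 8)
assert score == 1.5
```

No varint loop, no scanning of earlier fields — just a single load from a known offset, which is what makes these formats usable in-place.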
My understanding is that Thrift is closer to Protobuf and contemporaneous with it, so I don't know why GP included it in the list.
This is the hot path in C++. A really large amount of work has gone into protobuf C++ performance in the last 3 years or so.
Yes, I suppose the branches in Protobuf can be pretty predictable. Still, you do generally have to examine each byte individually.
I think most serialization frameworks are likely to be overkill for such a use case, spending more time on setup than actual parsing.
Also note that storing the value (and maybe the key) with proper alignment might make it easier to use the data in-place, saving a copy.
However, if your entire world consists of pushing around images to GPUs/TPUs, that's pretty much the only workload you ever need to optimize from your larger ops system.
Edit: not sure that this particular use case is exactly what's going on in that benchmark, now that I look more closely. Also, I wouldn't want to assert that it's a use case that matters a ton.
Their solution (LInK) is a key-value system "as a library": You link it into your application servers and you can then store data (they become stateful). The system will automatically handle sharding, replication, consistency and so on. From your perspective you're working directly on memory. You still need to implement marshalling, but this is only used for sharding/replication across nodes (which the system handles completely for you).
They say nothing about how this system handles data consistency/replication/durability, so it's a bit tricky to assess it. They also have this nice note: "(since, for performance, persistence of state is not guaranteed)".
Summary: This is a system which couples your application logic with your persistence layer (in the same OS process). This can give you great performance, since the two live side by side.
Looks like a very cool project, and I'm actually surprised there aren't more "database as a library" projects out there.
I don't think the paper is talking about the persistence layer. Rather it's focusing on the cache layer.
From the paper:
"Modern internet-scale services often rely on remote, in-memory, key-value (RInK) stores such as Redis and Memcached. These stores serve at least two purposes. First, they may provide a cache over a storage system to enable faster retrieval of persistent state. Second, they may store short-lived data, such as per-session state, that does not warrant persistence."
If you look at Fig. 2 in the paper, the database (persistence) layer is always there. What the paper is advocating is moving from an out-of-process cache server like Redis to an in-process cache library like Java's Caffeine.
This seems like an easy “solution” to show how it speeds up the happy path while not fully addressing the real-world tradeoffs.
Stateless application servers come and go without a care in the world. You can reason very simply about the cost and effects of bringing them up and down.
Add in a KV store with sharding, replication, a discovery protocol, a heartbeat protocol, a sync and recovery protocol, etc... and put it all in-process on every application server?
Have fun monitoring and debugging this.
I much prefer the 3-tier system of a local RAM cache, a Redis cache, and a persistence layer. As much state as possible is idempotent so you can cache it locally as well as in Redis. The load balancer can best-effort route back to the same app server, and your KV lookups automatically check local RAM first. But the central KV store handles replication and is easy to monitor and scale out if needed.
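The 3-tier lookup order described above can be sketched as follows (dicts stand in for the real Redis and database clients, so the tiering logic is the only real content here):

```python
# 3-tier lookup: local RAM, then a shared Redis-like store, then the DB.
local_cache = {}
redis = {}           # stand-in for the central KV store
db = {"k1": "v1"}    # stand-in for the persistence layer

def lookup(key):
    if key in local_cache:        # 1. local RAM, no network hop
        return local_cache[key]
    if key in redis:              # 2. shared cache, one network hop
        local_cache[key] = redis[key]
        return redis[key]
    value = db[key]               # 3. authoritative store
    redis[key] = value            # backfill both cache tiers
    local_cache[key] = value
    return value
```

A real version would add TTLs and invalidation, but the ordering — and the fact that only tier 2 needs central monitoring — is the point.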
I think that debugging with one less process is a good thing. And monitoring with one less process boundary to traverse is easier as well.
1. A client scaled rapidly and we had repeated hits on the same key. We were saturating the NIC on the memcache server. We had to add indirection for some data and introduced a local in-memory layer in front of memcache.
2. We had to extend the ring size to give memcache better CPU utilization. We had to teach all our code to handle a migration process (reads fell through to the new ring and writes purged both). We couldn't just turn off memcache because a total dump would overwhelm the database.
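One plausible reading of that migration logic, sketched with dicts standing in for the two memcache rings and the database (the direction of the fall-through is my assumption):

```python
# Migration sketch: reads check the new ring first, fall through to the
# old ring (warming the new one), then to the DB; writes purge both rings.
old_ring, new_ring, database = {}, {}, {}

def migrating_read(key):
    if key in new_ring:
        return new_ring[key]
    if key in old_ring:                # fall through to the old ring...
        new_ring[key] = old_ring[key]  # ...and warm the new one
        return new_ring[key]
    value = database.get(key)          # cold miss: go to the database
    if value is not None:
        new_ring[key] = value
    return value

def migrating_write(key, value):
    database[key] = value
    old_ring.pop(key, None)            # purge both rings so neither
    new_ring.pop(key, None)            # can serve a stale copy
```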
For distributed systems, you tend to need quite a bit of configuration (how to find the peers, authentication, things like namespaces).
And then if the communication is TCP-based, you tend to keep connections open to avoid the handshake latency when you need to retrieve a value. Which sounds like something you typically don't want running inside a library, but rather as a separate process.
And then you've basically arrived at the memcached design.
Once you separate out Slicer, this result is pretty obvious in theory - in fact, in that Slicer paper they mention machine learning as a driving example, where it absolutely makes sense to colocate application logic with user-specific models cached in memory.
Honestly, this paper, with its one-paragraph implementation section, reads almost like clickbait - "let's say something controversial, release an un-reproducable paper that adds very little to the literature besides 'this is something cool you can do if you have a tool only Google has,' btw do you want to work for Google?" The Slicer paper itself has much more detail and IMO is a much better read.
I'm sure there are others; I don't get what this paper presents that is new
FoundationDB exposes the key-range to node mapping to you to make this router possible.
‘I bet Elixir already has something like this’.
Also, be aware that OP mentioned riak-core, which is a different thing from Riak. riak-core is a library to build dynamo-style distributed applications, which gives you virtual nodes, consistent hashing and data migration as building blocks.
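A toy version of two of those building blocks — consistent hashing with virtual nodes — for anyone unfamiliar with the idea (this is illustrative only, not how riak-core actually implements its ring):

```python
import bisect
import hashlib

def _h(s):
    # Stable hash onto the ring (md5 chosen only for determinism here).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node claims several points ("virtual nodes") on
        # the ring, which smooths out the key distribution.
        self._points = sorted(
            (_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [p for p, _ in self._points]

    def owner(self, key):
        # A key belongs to the first ring point clockwise from its hash.
        i = bisect.bisect(self._keys, _h(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
```

Adding or removing one node only moves the keys adjacent to its ring points, which is what makes data migration tractable.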
Conceptually, these are similar to the simple database kernels used as caches inside BI systems like Tableau, but extended to support write workloads and scale-out. These may have a large amount of storage in them, but only to increase the quantity of data each server can handle (i.e. swap space). Storage is not for durability; other systems handle that. The primary advantage of this architecture is that it dramatically reduces data motion and wasteful computation, and you can often push complex data model logic to this layer since it implicitly contains more application context.
All "databases as a library" (e.g. SQLite) are designed around a much simpler set of constraints. The combination of high write rates and continuous background resharding/rebalancing across servers requires low-level architectural support that you can't graft onto a database kernel that was not explicitly designed for it. To start with, any query execution has to have sophisticated topology awareness in a system where the topology is constantly changing due to resharding/rebalancing and you can't rely on global consensus for everything. The library you are linking into your application contains all the clever algorithms for navigating a data cache that behaves like this, you don't want the app developer to have to reason about it.
The level of dynamic data model "fluidity" in the underlying organization and algorithms means that a mature design has little in common with traditional in-memory databases, even though there are many superficial similarities. This isn't something you grind out over a weekend; there is a lot of complex and non-obvious theory behind robust designs.
(I agree that the paper is not very clear about what they are talking about. This is my interpretation of the paper based on my recognition of the claims and architecture.)
If we had better tools for the moderate position then life would probably be easier than at the extremes.
While the idea of migration is useful for resilience in the face of hardware failure, it’s more attractive from a standpoint of elastic scalability.
We are still living in a bubble where we believe that cloud providers aren’t going to oversubscribe their hardware the way ISPs have been doing for decades. As the ratio of private server rooms declines, that bubble starts to wear thin.
AWS doesn't oversubscribe their main class of cloud servers, but they will sell you a cheaper t-series if you're willing to accept throttling under heavy CPU use.
They used to have "noisy neighbor" issues with some of their older instance types, where another tenant on the same physical hardware could monopolize the disk/network bandwidth (but not the CPU), but they've resolved that now: every server can use its published resources (network being the exception, in that some instance types can only burst up to 10 Gig while larger instance types can sustain it).
They've built their business model on selling dedicated resources, and I don't see that changing.
I'm developing a server cluster with statefulness and "sticky sessions" (1) for a multiplayer game. Migration to another server is a feature I'm working on, and already have prototyped in demo form.
(1) -- "Sticky sessions" are actually Player agents which live in a particular server.
I understand the point the abstract is trying to make, but the title is sensationalized at best.
Turns out that a run of the mill RDBMS will fit 99% of the problems you're trying to solve.
Conceptually, the relational model has relations and atoms. The algebra is built on the primitives conjoin, disjoin, project and rename. You can't get much simpler while being as expressive.
The SQL model adds a good deal of complexity, but at least it's standardized and you can generally ignore the complexity you don't need.
A KV store is conceptually simple if you stick to using it as a KV store. Once you try to implement any kind of schema, you have to build that from scratch. As you add business logic to it, you wind up inventing an ad hoc data model, which is likely to be conceptually more complex than either the SQL or relational model.
You may not be aware of the complexity of your ad hoc model, but math will inevitably remind you.
For more context and depth, see previous discussions...
 GraphBLAS http://graphblas.org
I think this idea needs to be expressed more clearly in CompSci and programming. What is the exact nature of the complexity explosion you are talking about here? Is there something analogous to the increase in complexity going from regular expressions to stack machines? What math are we talking about here? It seems like everyone just explains the relational algebra, then leaves it right there.
> What is the exact nature of the complexity explosion you are talking about here?
I'm thinking in terms of the entities of Occam's razor: "entities should not be multiplied unnecessarily." "Entities" is pretty abstract, we can't identify what complexity is directly, but what we can do is imagine, "what if we tried to build a mathematical model that captures a real life application?"
If we did that, and had a mathematical description of a thing we wrote, we can formalize it by trying to reduce it to some minimal set of axioms.
And then, your more complex rules are derived from those axioms, and if you get your math right those complex rules will be consistent. If you're very clever, you can make it reasonably intuitive.
If you have something that's very complex, what you'd observe after modelling it is that you have mostly axioms, and very few rules can be derived from those axioms. That is, the rules are just the rules: there's no broader reason for them to be so, and no deeper consistent patterns. Worse, some of those rules may wind up being contradictory, or lack orthogonality.
The relational algebra, being an algebra, is a set of operations that is closed over the universe of relations, so it's very nicely orthogonal and reduces to a small set of primitives. As relations can be visualized as "tables", they're relatively intuitive, and using techniques such as normalization you can also structure around potential anomalies that add unwanted complexity.
You can also control where your complexity goes. An integrity constraint can enforce some rules about what may be in a relational variable, and if that's enforced, all code that wants to use data from that relation can simply assume that the data maintains that structure. Thus the complexity can be centralized in the system.
Now, that's in the ideal world with a True Relational DBMS, we have to deal with SQL and vendor-specific SQL at that, so we have a rather complex underlying model. But it's typically Pretty Good.
When we build such a model ad hoc, we go by our intuition and are constrained by the market. Our intuition leads us to use more complex structures, that is, structures that, if they were expressed mathematically, would require a larger set of axioms to describe. Then, as we want those structures to interoperate, we are unwittingly merging two sets of axioms.
Worse, because we're typically expressing it in application libraries, we wind up repeating code, so the complexity is spread all over the application, and it multiplies especially when "conventions" are copy-pasta'd.
Hopefully that explains the nature of the explosion of entities / complexity; let me know if I can expand on anything.
It does not satisfy me. One can show that a regular expression or finite state machine is limited in specific ways, as compared to a stack machine or a Turing machine. One can write proofs concerning the number of states a specific machine can be in, given an input of a certain length. The explosion in complexity can be quantified, as can the impact on the effectiveness of testing. By comparison, "using techniques such as normalization you can also structure around potential anomalies that add unwanted complexity," is just an aphorism.
Point is that SQLite could take people way further than they think, and Postgres incredibly far. By the time you need a caching layer, where key-value stores shine best, you have many options available and the resources to pursue them.
Up to the millions of TPS actually.
Cheap, "faster," less-reliable stores became very big around the Mongo explosion - and don't get me wrong they can have a place, particularly Memcached and Redis which are great for ephemeral data - but you can do a lot of that with a RDBMS.
So a lot of the caching many of us had to do to scale systems a decade or two ago is now not necessary until/unless you get several magnitudes up in scale. And while the number of internet users has gone up, so has competition, and the proportion of services that become successful enough to need scale several magnitudes beyond the old chokepoints is tiny.
In most cases though, Mongo fits use cases, we never try to use it as a wholesale RDBMS replacement.
As an aside, if you're doing it anyway, it's also a good point to dump to .json.gz in S3 (or similar) as a secondary backup system.
Any good large-scale system will have a caching layer, whether in-memory or an external cache store, and even small/medium apps may use one for performance and, ultimately, cost in the cloud, putting more on CPU/local memory than on external databases. As with anything in engineering, it's about what works best for the system: sometimes it's needed, sometimes not, it depends.
Memcached really started out as a caching layer, it is right there in the name. Once NoSQL started the caching initial focus was lost for a bit, but I think fast key-value stores are returning to mostly being cache layers, which work great with any system, including RDBMS backed of any size.
Cloud key-value stores or local in-memory caches are almost required to compete on speed/performance, even for marketing (search results), and to control the cloud costs of going to the database, as well as taking weight off the database, which is usually the hardest hit.
Caching layers are useful in all systems/platforms, usually local in-memory first, then shared/external if costs/budget/performance needed. You can even do local caching on the instance, and a remote key-value store, both in front of the database for backup on performance/speed, though cache busting should be implemented from the start when data changes to prevent stale data.
Not if the RDBMS is a column store. Or if the storage model is based on log-structured merge trees. There are many ways to map the relational model to storage. Even for the traditional row-oriented case in InnoDB there are structures like B+-tree indexes that do not directly correspond to a K:V store.
* A main key/value collection consisting of primary key (key) mapping to a serialization of the row (value)
* One key/value collection per index consisting of each index function applied to the row (key) mapping to the primary key (or some internal object id; the value)
It's a bit of a simplification, as an RDBMS needs to maintain a sort order, allow e.g. range queries, and provide interfaces that support reading/writing lots of rows with low per-row overhead, so not just any key-value store would be a good match.
That said, I don't think this is the usage scenario most DBMSs are optimized for: despite being common, it's already a very fast case, so there is relatively little to gain. For a k:v store, by contrast, this is the only usage, so you can bet it's optimized for it.
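The two collections from the bullets above can be sketched with plain dicts (JSON serialization and the composite index key are illustrative choices; real engines would use sorted structures like B+-trees or LSM trees):

```python
import json

rows = {}        # main collection: primary key -> serialized row
idx_email = {}   # secondary index: (email, pk) -> pk

def insert(pk, row):
    rows[pk] = json.dumps(row)             # serialize the whole row
    idx_email[(row["email"], pk)] = pk     # index function applied to row

def find_by_email(email):
    # Index lookup, then fetch each matching row from the main collection.
    return [json.loads(rows[pk])
            for (e, pk) in sorted(idx_email) if e == email]

insert(1, {"email": "a@example.com", "name": "Ann"})
```

The composite `(email, pk)` key is one common trick to keep index entries unique even when the indexed column isn't.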
I understand the dislike for using in-memory databases as a json store (by reading/writing entire objects at once), but if you're using Redis this way, you're doing it wrong.
Another thing that the authors seem to forget is that "rinks" also provide coordination across multiple instances. Locks, atomic operations, transactions, even pub/sub are useful to get the right amount of coordination between instances (see my take on preventing cache stampedes https://github.com/kristoff-it/redis-memolock).
Once you move to their new version of stateful instances you get all the operational downsides and lose all the coordination features. "auto sharding" is not going to be enough to solve all coordination problems.
Calendars are a good example of a domain that is tricky to model as a bunch of kv operations, but nothing that can't be reasonably decomposed by applying DDD + the repository pattern + transactions + lua scripts (in the case of Redis). And if you end up with something that is computationally expensive, then it means that you need a custom data structure and maybe an algorithm optimized for it, and this is where Redis Modules come in.
The real domain of "rinks" is not your business domain, but the implementation domain of data structures, algorithms, asymptotic complexity etc. Thinking that the focus at that level should not be on compsci, but on the business domain, will inevitably produce bad performance.
I guess... is it so terrible to use a dictionary/hash-type data structure with values more than one level deep? Redis doesn't provide a data structure for that.
Both Redis and Aerospike provide more than enough mechanisms for one to roll their own composite hierarchy. Both provide server side scripting via Lua as well so the database can maintain consistency and prevent chatty access to the store.
KV database modeling is just as rich as its relational counterpart, but if one isn't using a KV store for a compelling reason, use Postgres.
Nested data structures are definitely going to be better than that, but they also have some caveats. If you really want to nest arbitrarily, take a look at RedisJSON, but I would recommend sticking with vanilla structs as much as you can.
More often than not you can quench your thirst for nesting by good use of key prefixes.
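A sketch of what "nesting via key prefixes" looks like in practice (a dict stands in for Redis; the colon-separated convention is the common one, but the helper functions here are made up for the example):

```python
# Flatten a hierarchy into colon-separated keys instead of nested hashes.
store = {}  # stand-in for Redis

def kv_set(*path_and_value):
    *path, value = path_and_value
    store[":".join(path)] = value

def kv_get(*path):
    return store[":".join(path)]

kv_set("user", "42", "settings", "theme", "dark")
```

With real Redis you also get prefix scans (`SCAN MATCH user:42:*`) over such keys, which recovers most of what nesting would have given you.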
A couple of the .Net guys we hired preached that stateless architecture is a little old-fashioned - over time I've come to agree. A lot of things can be shoe-horned in to a stateless world but become much easier in a stateful one.
That's highly dependent on your choice of language and how you're running your app.
While we don't have the scale yet to need to cluster out and handle these kinds of issues, I've told my cofounder that, thanks to Elixir, I'm not terribly worried about the costs when that does happen (not as worried as I would be if we were using something like Ruby).
I think what makes actor models so nice is the explicit ownership of state. It is not possible to declare "var x = 1" in one file, and access x in another file. You always have to retrieve state explicitly, otherwise it won't be accessible within the scope of your function.
The flow is to dual-write to our RDBMS and a local process. The process stores the data either in ETS or in a process-state map. It also broadcasts the event out to the cluster, and when a node in the cluster gets the message, it acts on it. On process startup, state is restored from the DB, which is still the authority.
Generally this is really easy to do in Elixir. Maybe 50 lines of code for the system described (minus DB code), much of it definition lines.
These processes can seamlessly interoperate while running on multiple machines.
This makes building stateful application servers a pretty natural fit.
The paper proposes putting application-specific processing and memory caching on the same host, which might give the combined server less consistent CPU usage and therefore lower utilization, but will also eliminate the network hop from application server to in-memory cache. It seems intuitively reasonable to me to give up some CPU utilization in exchange for eliminating an all-to-all network connections stage, but I would like to see a real cost and speed comparison.
It seems like the paper is simply arguing we should bang our heads against this yet again without a solid reason why we would succeed this time. Sharding/routing wasn't the only problem with sticky sessions.
Yes, I can use a custom in-memory data structure, write it in Go and cluster it using some nice Raft replication. It's not all that hard and it's much, much faster than Redis. (we do this at Stream for activity feeds and chat)
For most apps doing this is totally impractical. If you're using PHP, Python, Node or Ruby for a small to large (anything that doesn't have 100 million plus users) the overhead of implementing the pattern the authors suggest is just too large.
I think Redis will only keep on becoming more popular...
Seems like protocol overhead will stay.
This is stupid. Three of the four authors are engineers at Google.
Now Google has that baton. Wonder who they’ll pass it to...
It may still be a marketing lie, but it's not something to dismiss automatically, mostly because a lot of people disagreed that the practice was "best" all the time.
When it was necessary and easy, I had local in-memory caches for some specific hot data, acting as another level of caching. In the simplest case it is just one variable + timestamp; otherwise a Map or a simple library will usually do.
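The "one variable + timestamp" case is about as small as a cache gets — a minimal sketch (the fetch function and TTL are illustrative stand-ins):

```python
import time

# Minimal "one variable + timestamp" local cache.
_cached = None
_cached_at = 0.0
TTL = 30.0  # seconds; pick per workload

def fetch_hot_data():
    # Stand-in for the expensive remote call being cached.
    return {"rates": [1, 2, 3]}

def get_hot_data():
    global _cached, _cached_at
    now = time.monotonic()
    if _cached is None or now - _cached_at > TTL:
        _cached = fetch_hot_data()   # refresh on first use or expiry
        _cached_at = now
    return _cached
```

Note this is per-process and best-effort only — exactly why it works as an extra level in front of a shared cache rather than a replacement for one.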
But substituting in-process caching for your caching layer is usually very risky. In cases where the system cannot work without a hot cache, in-memory caching can easily lead to extended downtimes of the whole system.
First of all, the deployment process needs to be reengineered so the new version of the app can load the cache of the current running process.
Then if there is a bug that crashes the process, you end up losing all the cache (unless like Redis you also save it on disk).
Until there is a well-tested library that can do that, or the system really needs something like this, I am going to keep using Redis, plus a second in-memory cache where it is easy and low-risk to do.
Relatedly, I also noticed that teams that overuse memstores predominantly use application languages with a terrible concurrency story (e.g. Python): they fall back to multiprocessing, where keeping anything stateful in memory is not as straightforward, so they offload that state to Redis.
Also redis is cool for pub/sub.
You just can't beat memory
We've been using Sirius (on the JVM) for this.
Are the frontend servers hitting the auto shard service with say, a key, and it returns a list of backend server(s) to hit where it knows that key is stored in memory?
Wouldn't the auto sharding service come with the usual challenges of HA, load balancing, integrity etc?
The challenge is to design safety so badly written application logic doesn't crash the database.
What are you using instead?
"redis, but compiled into your app, with all the necessary bits to make that work like you'd hope (consistent hashing/replication), (native data structures)"
"... LInKstore is a high level abstraction over an auto-sharder that provides a distributed, in-memory, key-value map with rich application objects as values, rather than strings or simple data structures".
This comes from Section 4 of the .pdf which has the hard core content. The issue is interesting.
CIMSDA is the new CORBA?
I don't think it benefits the paper to position this model directly against something like Redis.