Hacker News new | past | comments | ask | show | jobs | submit login
KeyDB: A Multithreaded Redis Fork (github.com)
186 points by ddorian43 10 days ago | hide | past | web | favorite | 77 comments

What we are doing instead: just started a Redis Proxy project (designed by Fabio and me, coded by Fabio Nicotra, so we started yesterday). We will start abstracting a Redis Cluster this way so that you can talk with to it like it was a single instance, but this is not enough. We also are planning for:

1. Tooling to make it simpler to start a Redis Cluster in a single host if all you want is to leverage all the cores.

2. Smartness in the AOF rewriting so that instances on the same host will do a better use of the disk.

3. A caching mode for Redis Cluster. This is useful for deployments where most accesses are idempotent and you have for instance a cluster of 10 physical hosts running, for instance, 32 Redis instances each. When an host is no longer reachable we don't want failover in that case, but just reassignment of hash slots to the other masters. Think at consistent hashing but on the server side.

Two years ago I had this private branch that I wrote in a few days that threaded Redis, with N event loops on N threads. After realizing that it worked and scaled, I considered how many bugs such Redis would have, how harder it would be to add features, and how incompatible is to have a single huge instance with Redis persistence method. In that moment and after consulting with other engineers I decided that for a few years our strategy will be Redis Cluster. Maybe I'll change idea in the future? Maybe, but first let's explore what can be done with a shared-nothing architecture that keeps the promise of continuing to deliver software that hardly crashes ever.

EDIT: Obviously the proxy itself is multithreaded :-D

EDIT2: Btw jdsully already sent a PR to the original Redis project, so I guess a smart developer exposed to the Redis core is great anyway!

The problem with clustering still becomes lower queries per GB as instances can’t share data. Redis itself runs in RAM so storage is at a premium.

One of my main reasons for doing multithreading, and FLASH in the first place was to make Redis work well for much larger value sizes.

I really think we have different use cases in mind.

You mean the use case where you have a single very "hot" key? Yes, that's a limitation, but in that case replicas can scale read load very well. But indeed this is a special case where threads are a better model: yet this is a very niche use case that anyway ends being inherently unscalable because the object to manipulate is a single one, so accesses end being serialized in one way or the other.

I understand the point about Redis on Flash. Redis Labs is also doing a lot of things with this model. I believe it is super useful for some folks. Yet I believe what makes Redis "Redis" is that there are no tricks: it runs the same always whatever the access pattern is. I want to stress this part over the ability to work with bigger data sets. Moreover I believe that persistent RAM with huge amounts is in the horizont now.

> yet this is a very niche use case

I've always thought that one of the sensible basic data structures that Redis could support operation on is an (int->int) digraph.

There are a lot of things you can build on top of "digraphs as a toolkit object"—just see the lengths people go to to get PostGIS+pg_routing installed in Postgres, just to do some random digraph queries to their (non-geographic) RDBMS data.

IMHO digraphs very similar to Redis Streams in that sense: they're something that might seem like a single high-level "use-case" at first glance, but which is actually a fundamental primitive for many use-cases.

However, unlike a hashmap, list, set, etc., a digraph is necessarily—if you want graphwise computations to run efficiently—one large in-memory object with lots of things holding mutual references to one-another that can't just be updated concurrently in isolation. But, it is likely that people would demand multiple writers be able to write to disjoint "areas" of the graph in parallel, rather than having to linearize those writes just in case they end up touching overlapping parts of the graph data.

If I understand Redis Streams, it was created because there's no real need for things like Kafka to do their own data-structure persistence. Data-structure persistence is not the "comparative advantage" of a thing like Kafka; network message-passing with queuing semantics is. High-level infrastructure pieces like Kafka should rely on lower-level infrastructure pieces like Redis for their data persistence, and then they can just do what they do best. Is that accurate?

If so, then I would argue the same thing applies to graph databases: persisting and operating on data-structures is not really their job; it's just something they do because there isn't anything to delegate that work to. Redis could be such a thing. But not if it's single-threaded.

This is, just from the top of my head, a use-case (set of use-cases?) that needs threading (or at least "optimistically threaded, with fallback to linearization") for writes. And it's not really all that niche, is it? Not any more than message queues are niche.

It seems like RedisGraph would fill this need. It builds upon Redis to provide a graph database. https://oss.redislabs.com/redisgraph/

I think this is an interesting idea.

What I did a year ago was just make Redis cluster able to do Lua scripting transactions on arbitrary keys across the cluster. With rollbacks, like my 2015 patch.


Instead of reinventing a bad wheel, you could have something better.

This sounds really interesting. Is this work available publicly? Could you comment on how you did this and what the pros and cons were (surely a compromise had to be made somewhere)?

It's not not available for public consumption. After finishing, I realized 2 things: SSL/TLS was necessary for nontrivial cluster solutions, and very few people use Redis Cluster in practice (the multi-key issues being the primary reason).

In 2015 I released a patch to allow for transactions with rollbacks in Lua scripts; it used dump/restore to snapshot keys. This idea was extended to be the cluster solution I released last year. The primary compromise is that as implemented, it more or less relied on the equivalent of a temporary key migration. You want a transaction on server X to involve keys from servers X, Y, Z? You migrate the keys to X explicitly, run your transaction, then migrate them back.

Caveat: big keys are slow to snapshot, migrate, prepare for rollback, etc.

I've got incoming changes to my own fork that allow for fast state transfer for large structures (useful on multiple machines, or just heavily threaded + transactions), more or less eliminating the caveats. But that's also not yet publicly available, and I've spent 5-6 of the last 7 weeks sick (maybe 4 full workdays in that period), so I don't have an ETA on any of it.

This approach was used successfully in bespoke data services like Roshi https://github.com/soundcloud/roshi, the design seems sound, nice to hear.

Any reason you need more than ha proxy? (assuming you get the part about spinning up on a server for multicore support out of the way?)

Redis Cluster can be abstracted away only understanding the Redis protocol and how clients should communicate with the different Redis instances. It's not just a matter of sending TCP data to multiple ends.

Is abstracting a master-slave-sentinel setup also in the scope for the proxy or only Redis Cluster ? For use case where replicas are used to scale reads, a proxy shipped with Redis could be useful.

The proxy will still not allow to run a LUA script that span keys that live in different hosts right?

Right, the usual Redis Cluster limitations initially but for MGET that probably will get support via multi-fetch.

These honest, technically correct answers where you acknowledge limitations, but highlight pluses are terrific.

A really nice contrast to the typical hand-waving, defensiveness, etc, I get from a lot of vendors, product owners, etc.

Is it the same proxy that is used in redis enterprise today?

Nope, a different project created from scratch from the OSS project. Yet sponsored by Redis Labs like all the rest of the developments Fabio and I are doing.

Redis being single-threaded has always been its single most important advantage in my opinion. It eliminates whole classes of possible bugs and makes the software much simpler.

I’ve managed to get a few fixes upstreamed that are more inline with Antirez’s vision. I hope to continue that in the future.

The beauty of open source is choice, and getting to try new things.

That's not a good enough reason not to write multithreaded software. You can apply your reasoning to almost any software, and it will fail all the same. Performance and latency matter. A lot.

> That's not a good enough reason not to write multithreaded software

Possibly not, but it's a very good reason not to use multithreaded software, when a simple single-threaded alternative is present.

It's a very practical decision in my case: I'd rather rely on something that I know eliminates entire categories of bugs, if performance is acceptable (it is for me). Also, if it breaks, I can dive in, understand it and fix it. Simplicity should not be underrated.

memcached is multi threaded and I've never had an issue with it.

Multi-threading is not some crazy new fangled idea. A huge portion of the software you use every day is multi-threaded. It's not that hard to get right. I don't see why the redis maintainers don't have the skills to do it.

Retrofitting multi-threading into an existing design, is actually a pretty big challenge. Especially if you're going with shared everything multi-threading as is common in C. I won't name and shame, but I've started using a well known "internet scale" project that recently added multithreading, and have run into three major locking errors already.

These are smart people, they've written a great program, but they didn't make the right locking decisions.

The approach antirez mentioned in a sibling thread of a multithreaded proxy process that communicates to multiple single threaded instances sounds much easier to implement without ending up with data in the wrong places. Although, even that doesn't necessarily need to be multithreaded either. A master process doing healtchecking and sending status to the worker processes, that are using a eventloop would probably work just as well.

Heh, I did this back in 2012: http://thredis.org/ (Not only is it threaded, it also supports SQL).

And Alibaba did it a couple years ago with Pedis! https://github.com/fastio/pedis (looks like the project is still active, but I don't really know of anyone else who uses it.)

Good name too.

For anyone wanting an overview of how KeyDB works, its at the bottom of the Readme.

"KeyDB works by running the normal Redis event loop on multiple threads. Network IO, and query parsing are done concurrently. Each connection is assigned a thread on accept(). Access to the core hash table is guarded by spinlock. Because the hashtable access is extremely fast this lock has low contention. Transactions hold the lock for the duration of the EXEC command. Modules work in concert with the GIL which is only acquired when all server threads are paused. This maintains the atomicity gurantees modules expect."

So it is still single threaded in the semantics that people expect (which is great news). And should also not expose users or the codebase to nasty threading bugs.

> In addition we wanted open source implementations of features currently only available in proprietary modules. We feel a fork is the best way to accelerate development in the areas of most interest to us.

This is probably the most important part of the fork.

I beg to differ: a much better route would be to send PRs, or even add in the fork itself, modules APIs support to do what those modules do (there are Redis Labs modules that cannot be modeled via the current module interface). I did this error with Disque and now I'm implementing Disque back in terms of Redis modules, because keeping a fork in sync is a nightmare.

Considering that modules are (like the name suggests), detached components, how's that relevant with the fork? If anybody wants to build an alternative to Redis Labs' (or anybody else's) modules they can already do so.

In that vein we have FLASH support baked in. It works a bit differently than Redis Enterprise (uses File System snapshotting), but it’s also a lot less complicated.

What tradeoffs have you made by doing so?

The trade off is the kernel decides what is paged into RAM and what stays on disk. Less control, but also less overhead. This is a similar approach to Squid.

I don’t think users are going to be happy with that approach in the long term. It’s well advised to control eviction behavior yourself. (Varnish learned this lesson ages ago, notwithstanding phk’s original paper.)

InnoDB’s buffer pool documentation is instructive: https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.h...

If you're running a Redis instance as the only thing on the VM instance it's on (very likely in my experience), is there a reason to reimplement all this policy logic in Redis, rather than simply tuning the caching policy of the existing OS kernel to make it perform well for Redis (maybe at the expense of performing well for any other process you might run under that same kernel)?

In theory that makes sense, but I'm not aware of any mechanism that allows you to tune a kernel's VMM to suit an arbitrary workload/access pattern. Are you?

this is kind of a pain the container world

Certainly we can do smarter things if we know an access is going to result in a page. But we're not in a position to take advantage of that until green threads are in place.

Kafka also leaves the caching to the OS.

... which is great until a consumer or replica decides to replay a topic/partition from the beginning, and forces all the recently-written data to be evicted, adversely impacting performance for all the well-behaved consumers :(

> forces all the recently-written data to be evicted

That's unlikely unless 1) your hosts are seriously underpowered and 2) you have no per-consumer quotas set

Can you clarify what you mean by underpowered? Not every site wants to replace terabytes of disk with NVMe storage to make paging faster - especially sites that have a lot of throughput and long retention demands.

Second, even if quotas could be used - and there are plenty of sites where they cannot - there’s still a broker-replacement scenario that needs to be accounted for. Most folks don’t want to intentionally impede the recovery rate of a failed broker.

This seems to keep the important bit about Redis single threading, that is that you have exclusive access to the db in a script or `multi`. That property allows for a lot of things that wouldn't be possible otherwise. Also leads to very predictable performance characteristics.

That said, Redis has always been faster than I've needed, and that's leaning on it pretty heavily, so I kind of agree with antirez that adding this complexity may not come with much benefit. Will be interesting to see where this goes.

According to the README all access to the data is effectively single threaded. Only the network work is actually multi-threaded.

> Access to the core hash table is guarded by spinlock. Because the hashtable access is extremely fast this lock has low contention. Transactions hold the lock for the duration of the EXEC command.

This will be improved in future versions. One of my major goals is to make KeyDB work well for large value sizes, >1MB. There’s no reason you shouldn’t be able to cache images as well as text.

FLASH support was the first step, and we’ve removed a bunch of unnecessary copies along the way but there is still more work to do.

FWIW - my notes on how locking was done in Thredis: https://github.com/grisha/thredis/blob/master/README-THREDIS

Thanks! I did look at thredis originally, but the github seemed to not have been maintained at the time.

I find it really surprising that it took someone a month (I think the author stated this on Reddit) and yet redis has been so hesitant to do it. What are the downsides at this point?

I don't think it potentially taking a long time was the primary reason for not making redis multithreaded. In the redis manifesto it is stated that one of the guiding principles is to avoid complexity if they can.

Also not that the benchmarks shown seem to be an apples and oranges comparison. If you want to compare the performance if redis vs a multithreaded fork the proper comparison involves running a redis cluster (one node per core) versus the multithreaded single node.

All else being equal I'd prefer a multithreaded approach as KeyDB is pursuing but that's just because it makes it easier to more easily utilise system resources of a single (virtual) machine.

I’m kinda against running a “cluster” on the same machine. That was a primary motivator for doing this.

It’s offloading complexity from the developers to the users which I don’t think is the right approach.

Also clusters have limitations over single instance redis. You’ll also get more Queries/GB with a Multithreaded approach.

How is a cluster more difficult for users in 2019? Most devs I know never even set up Redis. It just works through Docker/Ansible/cloud.

Have you benchmarked cluster VS locks and threads approach?

It took a month mainly because the redis codebase is so well organized. So kudos to Antirez for that. I can’t deny adding locking makes the code a bit more complicated - but there are more users than developers.

That codebase really has exceptional organization/comments.

Multithreaded code is a lot more difficult to reason about once the codebase gets large.

Redis's codebase is a little too readable compared to the complexity of the things it does. This isn't.

My experience is that it is much more a matter of architecture.

This is a lock free key value store in a single file - it doesn't have the network protocol, but any threads dealing with the network IO could use it without doing anything differently.


antirez recently provided a full response http://antirez.com/news/126

He says most setups are network or memory bound, so I don't understand this reasoning:

"Well, that’s the plan: I/O threading is not going to happen in Redis AFAIK, because after much consideration I think it’s a lot of complexity without a good reason. Many Redis setups are network or memory bound actually. Additionally I really believe in a share-nothing setup, so the way I want to scale Redis is by improving the support for multiple Redis instances to be executed in the same host, especially via Redis Cluster."

The RAM would have to be a multiple of the dataset, because its shared-nothing, running right into one of the two common limits, right?

I would assume a cluster on a single node would partition the data, rather than replicate it. Without an in-depth knowledge of redis, I can't really predict if 4 instances sized for 1GB would have more overhead than 1 instance of 4GB, but I'm willing to hazard a guess that it would be fairly close. On the network side, it's a wash if you're already bound by the interface line rate.

This reminds me a bit of the K/kdb approach of just starting another process for almost certainly the same reasons.

Multithreading is great, getting data to walk across sockets on multi-socket machines can be a challenge. Source: have dual-socket E5-2670.

Multithreading eliminates 99.8% of snapshot overhead: https://www.slideshare.net/secret/sLTnWqcPV7G0HK

Can also help reduce snapshot create and load time by 20-70% or so: https://www.structd.com/

But hey, I'm right behind you :)

If using `db-s3-object`, does KeyDB uses that as a seed on startup?

Eg: Running KeyDB on AWS, the EC2 instance fails and is replaced by another one running the same config. Will it read the last S3 dump on startup?

This is a nice project.

Hardcoding "aws" in the core seems like an odd choice though. Wouldn't it be better to make it agnostic and provide some sort of a trigger where a an external script or util to handle backups? That is, why S3 and why not any other service?

Not yet implemented. I’m hoping to get that fixed today.

I mostly use slaves for recovery with the database dumps as a last resort. So I overlooked that.


Do you have any special config to sync to replica before accepting incoming connections in a straightforward manner?

Yea we have external management software we wrote. We’ve been looking into releasing this as well, but it’s not cleaned up yet.

Looking forward to that. In the meanwhile I'll wait and test the S3 option when it's live.

I just implemented S3 loading in this change: https://github.com/JohnSully/KeyDB/commit/a45f212693956a6fb1...

Its currently in the unstable branch, once it gets shaken out a bit I'll release 0.9.3 with it included.

How fast are writes compared to normal redis? Is this being used in production anywhere?

Why scrolling on their website https://eqalpha.com/ is so broken?

> Unlike most databases the core data structure is the fastest part of the system. Most of the query time comes from parsing the REPL protocol and copying data to/from the network.

That's the TL;DR; of "What tradeoffs are required?"

From what I can tell, they extracted an embedded redis DB and wrapped it in a multi-threaded protocol layer. So execution and behavior of Redis is identical, but all the "not actually a DB" parts are faster and more concurrent.

Author here. We added fine grain locking to the redis codebase. The goal is to expand what’s concurrent over time.

Currently network IO and query parsing are multithreaded. Access to the core hash table is under lock. However network IO and query parsing take up most of the time in Redis queries. Hash tables are pretty fast. The goal is to add a reader/writer lock here in the future.

The lock itself is also interesting. We used a ticket lock which is a variant of a spin lock. This ensures fair locking and reduces the variance in latency dramatically. Standard POSIX locks are too slow.

I've always wondered what the best concurrency approach would be for a fully compatible fully concurrent Redis. The single-threaded nature makes it naturally expose some strict semantics to the user, which is tricky. What are you planning on doing to push past the global lock for the hashtable?


This is an example that is completely lock free.

Does anyone have a link to the discussions on why redis stayed single threaded?

antirz reply about why not threaded http://antirez.com/news/126

Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact