
KeyDB: A Multithreaded Redis Fork - ddorian43
https://github.com/JohnSully/KeyDB
======
antirez
What we are doing instead: we just started a Redis Proxy project (designed by
Fabio and me, coded by Fabio Nicotra; we started yesterday). We will start by
abstracting a Redis Cluster this way, so that you can talk to it as if it were
a single instance, but this is not enough. We are also planning:

1. Tooling to make it simpler to start a Redis Cluster on a single host if
all you want is to leverage all the cores.

2. Smarter AOF rewriting, so that instances on the same host make better use
of the disk.

3. A caching mode for Redis Cluster. This is useful for deployments where
most accesses are idempotent and you have, say, a cluster of 10 physical
hosts running 32 Redis instances each. When a host is no longer reachable we
don't want a failover in that case, but just reassignment of its hash slots
to the other masters. Think of consistent hashing, but on the server side.

Two years ago I had a private branch, written in a few days, that threaded
Redis, with N event loops on N threads. After realizing that it worked and
scaled, I considered how many bugs such a Redis would have, how much harder it
would be to add features, and how incompatible a single huge instance is with
the Redis persistence model. At that moment, and after consulting with other
engineers, I decided that for a few years our strategy would be Redis Cluster.
Maybe I'll change my mind in the future? Maybe, but first let's explore what
can be done with a shared-nothing architecture that keeps the promise of
continuing to deliver software that hardly ever crashes.

EDIT: Obviously the proxy itself is multithreaded :-D

EDIT2: Btw jdsully already sent a PR to the original Redis project, so I guess
a smart developer exposed to the Redis core is great anyway!

~~~
jdsully
The problem with clustering is still lower queries per GB, as instances can't
share data. Redis itself runs in RAM, so storage is at a premium.

One of my main reasons for doing multithreading and FLASH in the first place
was to make Redis work well for much larger value sizes.

I really think we have different use cases in mind.

~~~
antirez
You mean the use case where you have a single very "hot" key? Yes, that's a
limitation, but in that case replicas can scale read load very well. Indeed,
this is a special case where threads are a better model: yet it's a very niche
use case that ends up being inherently unscalable anyway, because the object
being manipulated is a single one, so accesses end up being serialized one way
or the other.

I understand the point about Redis on Flash. Redis Labs is also doing a lot of
things with this model. I believe it is super useful for some folks. Yet I
believe what makes Redis "Redis" is that there are no tricks: it always runs
the same, whatever the access pattern is. I want to stress this part over the
ability to work with bigger data sets. Moreover, I believe persistent RAM in
huge amounts is on the horizon now.

~~~
derefr
> yet this is a very niche use case

I've always thought that one of the sensible basic data structures that Redis
could support operations on is an (int->int) digraph.

There are a lot of things you can build on top of "digraphs as a toolkit
object": just see the lengths people go to in order to get PostGIS+pg_routing
installed in Postgres, just to run some digraph queries against their
(non-geographic) RDBMS data.

IMHO digraphs are very similar to Redis Streams in that sense: they're
something that might seem like a single high-level "use case" at first glance,
but which is actually a fundamental primitive for many use cases.

However, unlike a hashmap, list, or set, a digraph is necessarily (if you want
graph-wise computations to run efficiently) one large in-memory object, with
lots of things holding mutual references to one another that can't just be
updated concurrently in isolation. But it is likely that people would demand
that multiple writers be able to write to disjoint "areas" of the graph in
parallel, rather than having to linearize those writes just in case they end
up touching overlapping parts of the graph data.

If I understand Redis Streams, it was created because there's no real need for
things like Kafka to do their own data-structure persistence. Data-structure
persistence is not the "comparative advantage" of a thing like Kafka; network
message-passing with queuing semantics is. High-level infrastructure pieces
like Kafka should rely on lower-level infrastructure pieces like Redis for
their data persistence, and then they can _just_ do what they do best. Is that
accurate?

If so, then I would argue the same thing applies to graph databases:
persisting and operating on data-structures is not really their job; it's just
something they do because there isn't anything to delegate that work to. Redis
could be such a thing. But not if it's single-threaded.

This is, just off the top of my head, a use case (set of use cases?) that
_needs_ threading for writes (or at least "optimistically threaded, with
fallback to linearization"). And it's not really all that niche, is it? Not
any more than message queues are niche.

~~~
gavrie
It seems like RedisGraph would fill this need. It builds upon Redis to provide
a graph database.
[https://oss.redislabs.com/redisgraph/](https://oss.redislabs.com/redisgraph/)

------
jwr
Redis being single-threaded has always been its single most important
advantage in my opinion. It eliminates whole classes of possible bugs and
makes the software much simpler.

~~~
harryh
Multi-threading is not some crazy newfangled idea. A huge portion of the
software you use every day is multi-threaded. It's not _that_ hard to get
right, and I see no reason the Redis maintainers wouldn't have the skills to
do it.

~~~
toast0
Retrofitting multi-threading into an existing design is actually a pretty big
challenge, especially if you're going with shared-everything multi-threading,
as is common in C. I won't name and shame, but I've started using a well-known
"internet scale" project that recently added multithreading, and I have run
into three major locking errors already.

These are smart people, they've written a great program, but they didn't make
the right locking decisions.

The approach antirez mentioned in a sibling thread, a multithreaded proxy
process that communicates with multiple single-threaded instances, sounds much
easier to implement without ending up with data in the wrong places. Although
even that doesn't necessarily need to be multithreaded either: a master
process doing health checking and sending status to worker processes that
each run an event loop would probably work just as well.

------
gtrubetskoy
Heh, I did this back in 2012: [http://thredis.org/](http://thredis.org/) (Not
only is it threaded, it also supports SQL).

~~~
biggestdummy
And Alibaba did it a couple years ago with Pedis!
[https://github.com/fastio/pedis](https://github.com/fastio/pedis) (looks like
the project is still active, but I don't really know of anyone else who uses
it.)

------
redm
For anyone wanting an overview of how KeyDB works, it's at the bottom of the
README.

"KeyDB works by running the normal Redis event loop on multiple threads.
Network IO, and query parsing are done concurrently. Each connection is
assigned a thread on accept(). Access to the core hash table is guarded by
spinlock. Because the hashtable access is extremely fast this lock has low
contention. Transactions hold the lock for the duration of the EXEC command.
Modules work in concert with the GIL which is only acquired when all server
threads are paused. This maintains the atomicity guarantees modules expect."

~~~
sitkack
So it is still single-threaded in the semantics that people expect (which is
great news). And it should also not expose users or the codebase to nasty
threading bugs.

------
mmastrac
> In addition we wanted open source implementations of features currently only
> available in proprietary modules. We feel a fork is the best way to
> accelerate development in the areas of most interest to us.

This is probably the most important part of the fork.

~~~
jdsully
In that vein, we have FLASH support baked in. It works a bit differently than
Redis Enterprise (it uses file-system snapshotting), but it's also a lot less
complicated.

~~~
otterley
What tradeoffs have you made by doing so?

~~~
jdsully
The trade-off is that the kernel decides what is paged into RAM and what stays
on disk. Less control, but also less overhead. This is a similar approach to
Squid's.

~~~
otterley
I don't think users are going to be happy with that approach in the long term.
You're well advised to control eviction behavior yourself. (Varnish learned
this lesson ages ago, notwithstanding phk's original paper.)

InnoDB’s buffer pool documentation is instructive:
[https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.html](https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.html)

~~~
derefr
If you're running a Redis instance as the only thing on the VM instance it's
on (very likely in my experience), is there a reason to reimplement all this
policy logic in Redis, rather than simply tuning the caching policy of the
existing OS kernel to make it perform well for Redis (maybe at the expense of
performing well for any other process you might run under that same kernel)?

~~~
otterley
In theory that makes sense, but I'm not aware of any mechanism that allows you
to tune a kernel's VMM to suit an arbitrary workload/access pattern. Are you?

------
nicpottier
This seems to keep the important bit about Redis single-threading: that you
have exclusive access to the db in a script or `multi`. That property allows
for a lot of things that wouldn't be possible otherwise. It also leads to very
predictable performance characteristics.

That said, Redis has always been faster than I've needed, and that's leaning
on it pretty heavily, so I kind of agree with antirez that adding this
complexity may not come with much benefit. Will be interesting to see where
this goes.

~~~
rbranson
According to the README all access to the data is effectively single threaded.
Only the network work is actually multi-threaded.

> Access to the core hash table is guarded by spinlock. Because the hashtable
> access is extremely fast this lock has low contention. Transactions hold the
> lock for the duration of the EXEC command.

~~~
jdsully
This will be improved in future versions. One of my major goals is to make
KeyDB work well for large value sizes, >1MB. There’s no reason you shouldn’t
be able to cache images as well as text.

FLASH support was the first step, and we’ve removed a bunch of unnecessary
copies along the way but there is still more work to do.

~~~
gtrubetskoy
FWIW - my notes on how locking was done in Thredis:
[https://github.com/grisha/thredis/blob/master/README-THREDIS](https://github.com/grisha/thredis/blob/master/README-THREDIS)

~~~
jdsully
Thanks! I did look at Thredis originally, but the GitHub repo didn't seem to
have been maintained at the time.

------
DrJosiah
Multithreading is great, getting data to walk across sockets on multi-socket
machines can be a challenge. Source: have dual-socket E5-2670.

Multithreading eliminates 99.8% of snapshot overhead:
[https://www.slideshare.net/secret/sLTnWqcPV7G0HK](https://www.slideshare.net/secret/sLTnWqcPV7G0HK)

It can also help reduce snapshot creation and load times by 20-70% or so:
[https://www.structd.com/](https://www.structd.com/)

But hey, I'm right behind you :)

------
jzoch
I find it really surprising that it took someone a month (I think the author
stated this on Reddit), and yet Redis has been so hesitant to do it. What are
the downsides at this point?

~~~
elif
antirez recently provided a full response
[http://antirez.com/news/126](http://antirez.com/news/126)

~~~
MrBuddyCasino
He says most setups are network or memory bound, so I don't understand this
reasoning:

"Well, that’s the plan: I/O threading is not going to happen in Redis AFAIK,
because after much consideration I think it’s a lot of complexity without a
good reason. Many Redis setups are network or memory bound actually.
Additionally I really believe in a share-nothing setup, so the way I want to
scale Redis is by improving the support for multiple Redis instances to be
executed in the same host, especially via Redis Cluster."

The RAM would have to be a multiple of the dataset, because it's
shared-nothing, running right into one of the two common limits, right?

~~~
toast0
I would assume a cluster on a single node would partition the data, rather
than replicate it. Without an in-depth knowledge of redis, I can't really
predict if 4 instances sized for 1GB would have more overhead than 1 instance
of 4GB, but I'm willing to hazard a guess that it would be fairly close. On
the network side, it's a wash if you're already bound by the interface line
rate.

------
xfalcox
If using `db-s3-object`, does KeyDB use that as a seed on startup?

Eg: Running KeyDB on AWS, the EC2 instance fails and is replaced by another
one running the same config. Will it read the last S3 dump on startup?

~~~
jdsully
Not yet implemented. I’m hoping to get that fixed today.

I mostly use slaves for recovery with the database dumps as a last resort. So
I overlooked that.

~~~
xfalcox
Amazing!

Do you have any special config to sync from a replica before accepting
incoming connections, in a straightforward manner?

~~~
jdsully
Yea we have external management software we wrote. We’ve been looking into
releasing this as well, but it’s not cleaned up yet.

~~~
xfalcox
Looking forward to that. In the meanwhile I'll wait and test the S3 option
when it's live.

~~~
jdsully
I just implemented S3 loading in this change:
[https://github.com/JohnSully/KeyDB/commit/a45f212693956a6fb1...](https://github.com/JohnSully/KeyDB/commit/a45f212693956a6fb1aacf465d88e940bbbfd56f)

It's currently in the unstable branch; once it gets shaken out a bit I'll
release 0.9.3 with it included.

------
rubyn00bie
How fast are writes compared to normal Redis? Is this being used in production
anywhere?

------
vishwajeetv
Why is scrolling on their website [https://eqalpha.com/](https://eqalpha.com/)
so broken?

------
phamilton
> Unlike most databases the core data structure is the fastest part of the
> system. Most of the query time comes from parsing the RESP protocol and
> copying data to/from the network.

That's the TL;DR of "What tradeoffs are required?"

From what I can tell, they extracted an embedded redis DB and wrapped it in a
multi-threaded protocol layer. So execution and behavior of Redis is
identical, but all the "not actually a DB" parts are faster and more
concurrent.

~~~
jdsully
Author here. We added fine-grained locking to the Redis codebase. The goal is
to expand what's concurrent over time.

Currently network IO and query parsing are multithreaded. Access to the core
hash table is under lock. However network IO and query parsing take up most of
the time in Redis queries. Hash tables are pretty fast. The goal is to add a
reader/writer lock here in the future.

The lock itself is also interesting: we used a ticket lock, which is a variant
of a spinlock. This ensures fair locking and dramatically reduces the variance
in latency. Standard POSIX locks are too slow.

~~~
rbranson
I've always wondered what the best concurrency approach would be for a fully
compatible fully concurrent Redis. The single-threaded nature makes it
naturally expose some strict semantics to the user, which is tricky. What are
you planning on doing to push past the global lock for the hashtable?

~~~
BubRoss
[https://news.ycombinator.com/item?id=19369794](https://news.ycombinator.com/item?id=19369794)

This is an example that is completely lock-free.

------
DSingularity
Does anyone have a link to the discussions on why redis stayed single
threaded?

