Streams: a new general purpose data structure in Redis (antirez.com)
588 points by darwhy on Oct 2, 2017 | 147 comments

One thing I like about this post is the story of how the feature came to be: someone who understood Redis very well, thinking about the problem over literally years, eventually resulting in more targeted and goal-driven thinking, and even then, as he puts it, the "specification remained just a specification for months, at the point that after some time I rewrote it almost from scratch in order to upgrade it with many hints that I accumulated talking with people about this upcoming addition to Redis."

I've been thinking about this sort of thing for a while, wanting to maybe call it "slow code" (like "slow food"). This is how actual quality software that will stand the test of time gets designed and made, _slowly_, _carefully_, _intentionally_, with thought and discussion and feedback and reconsideration. And always based on understanding the domain and the existing software you are building upon. Not jumping from problem to a PR to a merge. (And _usually_ by one person, sometimes a couple/several working together, _rarely_ by committee).

That's an excellent point. I always try to do this; I think that waiting and thinking more about problems has huge benefits for the final solution. But... when there is a community waiting for new features, sometimes this attitude is seen as wasting time, and it is very hard to explain, so it is important to keep people aware of the design process. Also because via communication it is possible to receive high-quality feedback. Btw, the "slow code" term is great :-)

> And always based on understanding the domain and the existing software you are building upon

This is good. My version of "slow code" is taking note of requests from users and storing them in my long-term memory to stew over.

1. User A says "It would be nice to have feature X"

2. I ask for more info, what problem are they trying to solve

3. Repeat steps 1 and 2 until users stop saying "nice to have" and start saying "must have". The stew is now ready, and development can begin now that I have thought about a solution that addresses everyone's problems... just in time, before users get angry :)

I like this kind of development too. But I don't think it is for everyone (the person has to keep thinking about the problem for a long time).

A good talk about it, from the creator of Clojure:


I heard that Ruby was developed like that. It went through a lot of freeze-and-thaw cycles over many good years before it finally got released.

This "slow code" interpretation resonates with me. Instinctively, I try to refrain from implementing the new feature that looks important but not urgent. The positive outcome is not only that you have more time to think about the proper approach, but also that sometimes the feature eventually turns out not to be needed at all.

Thought/discussion usually leads to quality. Time has nothing to do with it.

Of course it does: thought/discussion take time.

It seems people are confused. Of course everything takes physical time, but thinking slowly about something isn't objectively a sign of better outcomes.

Everyone works differently, but for me thinking in a leisurely way does produce better outcomes. It's less that my actual thinking is somehow "slower" than that by thinking about something, letting it rest, and coming back to it, I reach better conclusions than if I rushed to judgment.

I guess the phrase 'rush to judgement' is instructive. We're not just thinking, we're deciding of course.

If you can find out about a problem or issue or desire and very quickly come up with an architectural solution that will stand the test of time for years (at least without backwards incompatibilities, which is what something like Redis requires), more power to you, but in my experience and observation most people cannot.

But I get it, you just don't agree with my basic suggestion, from your experience. That's fine. It's not really an argument about the speed of one's thinking. Nobody is "confused".

I think the salient point is that thinking takes more time than not thinking. At least in my experience, the policy is generally to push features, not to think (slowly or otherwise).

> This is how actual quality software that will stand the test of time gets designed and made ..

No, not really. It's called waterfall, and we had it for decades.

And it was largely abandoned in favour of agile development because, despite architects spending months designing something, there was always something that was left out. Or the scope changed during those months. Or a million other things changed while you're hidden away from your customers/end users instead of delivering them new value every week.

For me personally the truth is somewhere between agile and waterfall: some upfront design, but not the slow code you refer to.

I disagree, I think "waterfall" (at least in stereotype) would be more like coming up with the specifications in advance and then _sticking to them_ regardless of new information from the world or from the experience of developing the software. That's not the story the OP told.

I also agree with the person who responded that there is a difference between building products, and building tools for building products -- building shared libraries/code/tools may require different approaches. One of the things I've noticed about shared code is that it's _much harder_ to fix rethought decisions or change paths. "Backwards incompatibility" doesn't really even exist as a thing in your local app; you can always, at least theoretically, global search-and-replace when you change APIs.

I don't think we've figured out any magic bullet process for developing software. It's definitely not the stereotype of "waterfall", but I don't think the stereotype of "agile" is it either -- at _least_ when it comes to shared library/tools. And I'm absolutely convinced that those shared tools which do well were almost always designed carefully and intentionally with understanding of the domain, not just slapped together with stimulus-response.

I think you're being downvoted because trying to slap a brand like "waterfall" on the very idea of slow, careful design is kind of ridiculous. Bonus points for throwing "agile" in to the mix as well.

But waterfall or agile implies code is being written. My take on this (and experience) is that sometimes you just take weeks, months, even years to make notes in a notebook, ponder the problem, jot down more notes, look around at other people's code, do research to see if anyone else has done work in this area, do more pondering and thinking and note-taking, talk it over with other people, etc. At some point it all gels together (when that happens it is almost magical, such an amazing moment of clarity), and then you are ready to use some sort of development process to actually write the code. Though by that point writing the code is extremely easy, since you have such clear insight into what needs to be written. I love when that happens.

You missed the part where the definition of the problem was being percolated during this process. Agile won't save you from not understanding the problem you are trying to solve.

There is a difference between building products and tools for building products.

I have a confusion about ID structure/format:

   The ID is composed of two parts: a millisecond time and a
   sequence number.  The number after the dot is the
   sequence number, and is used in order to distinguish
   entries added in the same millisecond. 
Does this mean, for example, that 1506872463535.11 comes after 1506872463535.2 (because 11 > 2)? If so, treating these as decimals (which will be easy to do inadvertently) will yield the wrong order (as would sorting them lexicographically), and it seems like something other than a decimal point would be a better separator (a colon, perhaps).
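To make the pitfall concrete, here is a small Python sketch (the IDs are hypothetical, in the `<milliseconds>.<sequence>` format from the post): both the decimal and the lexicographic comparison order the IDs wrongly; only parsing the two parts as separate integers gets it right.

```python
# Two hypothetical IDs created in the same millisecond; .11 (the 12th
# entry) logically comes after .2 (the 3rd entry).
a = "1506872463535.2"
b = "1506872463535.11"

# Wrong: treating the ID as a decimal number says b < a.
print(float(b) < float(a))        # True -- wrong order

# Wrong: lexicographic comparison also says b < a ("1" sorts before "2").
print(b < a)                      # True -- wrong order

# Right: parse each part as an integer and compare as a tuple.
def parse_id(s):
    ms, seq = s.split(".")
    return (int(ms), int(seq))

print(parse_id(b) > parse_id(a))  # True -- correct order
```

Changing the separator doesn't fix the tuple comparison being necessary, but it does remove the temptation to call `float()` on the ID in the first place.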

Yes, actually, maybe it's a good idea to change the point to something else. Thanks for the hint.

Supporting `MAXSIZE` as well as `MAXLEN` on `XADD` would also handle a nice Kafka feature (the ability to define your log size in either number of messages or size on disk).

Something like: `XADD MAXSIZE ~ 2147484000 * foo bar` to cap the stream at 2GB + 1 node.

And if ids are timestamps, maybe we can define it as MAXTIME as well.

What would be your preferred alternative syntax? `:`?

Not sure... `:` looks ok actually; even `_` or `-` could make some sense. The `#` is a bit too heavy on the eyes :-)

I'll probably eventually hate myself for even bringing this up but I can't help but notice the similarity between this ID structure and Version 1 (aka timestamp) UUIDs. While I wouldn't go as far as recommending that you fully adopt that form, it might be worth considering if you could make these IDs compatible with UUIDs by defining a canonical transform. The critical differences are:

- UUIDs use a different epoch (15 Oct 1582 vs 1 Jan 1970)

- UUIDs count 100 ns blocks instead of ms

- UUIDs include a 6-byte "node id"

- UUIDs allow only up to 15 bits of "sequence"

I think that last one is the biggest deal, since as currently specced Redis allows 64 bits of sequence, which is obviously much bigger than 15. The options I see are: up the time resolution used by Redis, encode some of Redis's sequence bits into the UUID's time bits, or just live with it as a limitation. In practice 2^15 is a lot of messages to get in a single millisecond (though in cases of clocks jumping back it might not be enough).

You'd also need to come up with some thing for the node id, perhaps the first 6 bytes of a cluster node ID or similar.

Thinking about this some more I realized you could also encode sequence into the low order bits of the timestamp, and rereading the RFC showed it actually makes this recommendation[1]. There are 10000 100-nanosecond periods per millisecond which gives about 13 more bits. Between that and the 13-15 bits available in the clock sequence you've got 26+ bits of sequence or ~67MM values per millisecond.

Since 64 bits is overkill for milliseconds (45 bits covers the next 1000 years or so) I was thinking you could put 2 bytes of the node id in the high order bytes there (perhaps could call this the "clock id"?) and the remaining 4 bytes of the node id could go in the high order bytes of the sequence, which would still leave 32 bits for actual sequence values (but we should only use 26 or so). This means we'd get a translation roughly as follows (numbering bytes and bits from high to low significance):

   Redis                                    Version 1 UUID
   Timestamp half (8 bytes):
     Bytes 0-1  "clock id"                  bytes 4&5 of node id
     Bytes 2-7  millis since 1 Jan 1970     * 10000 => ~45 high-order timestamp bits
   Sequence half (8 bytes):
     Bytes 0-3  "node id"                   bytes 0-3 of node id
     Bytes 4-7:
        6 bits wasted space                 ignored
       26 bits actual sequence value:
         13 high-order bits                 => clock sequence
         13 low-order bits                  => low-order timestamp bits
Another implication of this scheme is that if redis has access to a clock that offers higher than millisecond resolution it could store everything more precise than millisecond into the sequence portion of the id.

On a side note it seems that the clock sequence in the UUID is intended to be reset to a random value at start up and every time a clock jump is detected rather than just incremented. Redis could do something similar by incrementing some of the 13 high-order bits of the sequence every time a clock jump is detected (and/or if the 13 low-order bits overflow)

[1] https://tools.ietf.org/html/rfc4122#section-
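For the curious, here is a rough pure-Python sketch of the byte layout proposed above. To be clear, the field names and widths are this thread's proposal, not anything Redis or RFC 4122 actually defines, and the further split of the 26 sequence bits into clock-sequence and sub-millisecond timestamp bits is omitted for brevity:

```python
import struct

def pack_id(clock_id, node_id, millis, seq):
    """Pack the proposed 128-bit ID: a timestamp half (2-byte "clock id"
    + 6 bytes of milliseconds) and a sequence half (4-byte "node id"
    + 6 unused bits + a 26-bit sequence)."""
    assert clock_id < (1 << 16) and node_id < (1 << 32)
    assert millis < (1 << 48) and seq < (1 << 26)
    hi = (clock_id << 48) | millis       # timestamp half
    lo = (node_id << 32) | seq           # sequence half
    return struct.pack(">QQ", hi, lo)    # 16 bytes, big-endian

def unpack_id(raw):
    hi, lo = struct.unpack(">QQ", raw)
    return (hi >> 48,                    # clock id
            lo >> 32,                    # node id
            hi & ((1 << 48) - 1),        # milliseconds
            lo & ((1 << 26) - 1))        # sequence
```

A round trip through `pack_id`/`unpack_id` preserves all four fields, which is the property the proposed canonical transform would need.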

I think a mostly vertical symbol is better; ":" or "|" is preferable to "_" or "-".

I agree. Perhaps a forward-slash (/) to denote subordination, e.g. 1507035873/11

Why is that?

Verticality conveys a difference in the kind of information better, IMO.

Why keep the separation at all? Are clients expected to be able to query for a given timestamp precisely? Because then you get all the problems of clock synchronization, especially given that the Streams' clock is monotonic and I'd expect clients' clocks not to be.

It might be useful to be able to query by server time regardless of whether your client clock is in sync. You retrieve some set of data and the next time you can ask the server to give you everything newer than x, where x was the highest time stamp you got from the server previously.

Yes, exactly: you want to ask what is newer than x, where x is the last event you're aware of, but you don't really care about the date and time in that case. If you just naively store the last ID given by Redis Streams, then you don't even care that they're timestamps; at that point my question is, why even bother with the distinction? Just ask for everything after x and be done with it.

Redis also has TIME, which returns the current server time as a unix timestamp in seconds plus microseconds. I'm reasonably sure that's what's being used to get the first part of the ID anyway.

Ah, I didn't know that. Although the post says that the timestamp of an id might also be the timestamp of the last message, since the clock can go backwards, so in the worst case a client might get some duplicate messages.
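The clock-goes-backwards behavior described in the post (reusing the last millisecond and bumping the sequence so IDs stay monotonic) can be sketched like this; this is a toy model of the idea, not Redis's actual implementation:

```python
import time

class StreamIdGenerator:
    """Hand out monotonically increasing (ms, seq) IDs, even if the
    wall clock jumps backwards."""

    def __init__(self, clock=lambda: int(time.time() * 1000)):
        self.clock = clock
        self.last_ms = 0
        self.seq = 0

    def next_id(self):
        now = self.clock()
        if now > self.last_ms:
            # New millisecond bucket: reset the sequence.
            self.last_ms, self.seq = now, 0
        else:
            # Same millisecond, or the clock went backwards: keep the
            # old millisecond and bump the sequence instead.
            self.seq += 1
        return (self.last_ms, self.seq)
```

With this scheme a client that only remembers the last ID it saw never receives out-of-order IDs, even across a clock jump; the trade-off is that the "timestamp" part can briefly lag behind (or run ahead of) wall-clock time.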

I vote that you use a dash instead.

It turns out that's what it will be [0], as dash retains the ability to easily copy the whole identifier in most terminals.

[0]: https://github.com/antirez/redis/commit/1189d90d749c84e98424...

You can also have a look at the technique used here to create collision-free sequential unique IDs across a cluster, even if it is just for inspiration: https://www.npmjs.com/package/cuid


c - h72gsb32 - 0000 - udoc - l363eofy

The groups, in order, are:

1. 'c' - identifies this as a cuid, and allows you to use it in html entity ids. The fixed value helps keep the ids sequential.

2. Timestamp

3. Counter - a single process might generate the same random string. The weaker the pseudo-random source, the higher the probability. That problem gets worse as processors get faster. The counter will roll over if the value gets too big.

4. Client fingerprint. For example, the first two chars are extracted from the process.pid. The next two chars are extracted from the hostname.

5. Pseudo random (Math.random())
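A toy version of that scheme, just to make the five groups concrete. This is an illustration only, not the real cuid library: the field widths, the fingerprint derivation, and the helper names are all made up here.

```python
import os
import random
import socket
import time

def base36(n, width):
    """Encode n in base 36, padded/truncated to a fixed width."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out.rjust(width, "0")[-width:]

_counter = 0

def toy_cuid():
    """Illustrative cuid-like ID: 'c' prefix + timestamp + counter
    + host/pid fingerprint + pseudo-random block."""
    global _counter
    _counter = (_counter + 1) % 36**4              # rolls over when too big
    ts = base36(int(time.time() * 1000), 8)        # 2. timestamp
    count = base36(_counter, 4)                    # 3. counter
    fingerprint = (base36(os.getpid() % 36**2, 2)  # 4. pid + hostname
                   + base36(sum(map(ord, socket.gethostname())) % 36**2, 2))
    rand = base36(random.getrandbits(40), 8)       # 5. pseudo-random
    return "c" + ts + count + fingerprint + rand   # 1. 'c' prefix
```

The counter is what keeps two IDs generated in the same millisecond by the same process distinct, while the fingerprint separates processes and hosts.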

I don't think you can ever get a total order between events. My understanding is that the two events in your example happened in parallel (by Redis's definition of time granularity), and there's no correct ordering between the two. What I want to say is that even if you change to a ":" you'd get wrong results.

I think antirez is saying that, by serializing the id in this way, you can also get timeseries at the ms precision for free.

Edit: nvm, antirez just replied :)

In the XADD example, you have

XADD mystream * sensor-id 1234 temperature 10.5

to let the server choose an id.

It could also be interesting to have

XADD mystream 1506872463535.* sensor-id 1234 temperature 10.5

to be able to add an element in a specific msec bucket

What about regions that treat the comma as a decimal separator? I agree with the other commenter that this is no different than using periods in IPv4 addresses.

> What about regions that treat the comma as a decimal separator?

A good reason not to change it from a period to a comma, not really relevant to whether to change it from a period to something else.

> I agree with the other commenter that this is no different than using periods in IPv4 addresses.

IP addresses have no useful concept of "before" or "after", whereas these do.

As a member of such a region, I suspect that those regions are well aware that the prevalent notation throughout programming uses '.' as a decimal separator. IPv4 addresses usually have 4 components and do not have an inherently fractional unit as their first component.

Disagree. '.' is used for version numbers everywhere, even when there are only 2 components (e.g. WordPress versioning, Django, etc.), and it seems pretty clear that 1.11 > 1.2 there.

Once the specification is stated clearly as it is right now it is not a problem.

Another symbol could be used (for instance #) but I really don't see a need for that.

Perhaps decoupling the <$.#> as (# x 0.01) would solve the problem right?

Not sure if that's what they do, but it isn't very complex.

The dot doesn't make that a decimal, any more than it makes IP addresses or version numbers decimals.

As for treating them as decimals inadvertently, well, hopefully client libraries will expose IDs as pairs of integers, not as strings. If users convert them into strings and then back into meaningless pseudo-decimals, well, great, we'll have an entertaining post about someone's outage to read.

This is a terrible attitude to have when responding to an obvious potential UX confusion, particularly when it will only come up in edge cases (>10 per millisecond).

10 per milli as an edge case is domain specific.

But, it has milliseconds on the left. Before I read the explanation, I immediately thought it was a floating point timestamp. IP addresses have no reasonable decimal interpretation.

It's not much of a problem for (v4) IPs, because they almost always consist of 4 numbers separated by a dot, making them immediately distinguishable from decimal numbers. If two-component IPs were common (they are sometimes seen in CIDR notation, but not often), the dot would have been an unfortunate separator choice as well.

For versions with only two components, I would argue that the dot can be confusing already.

Why use a separator that has the potential of confusion when there are several other choices with less potential?

I had the same thoughts until the haha-outage hyperbole. Perhaps this little feature spurring so much discussion about delimiters is caused by a lack of thought by the developer, releasing an idea before it fully matured. A sort of race towards innovation mixed with a hint of it-works-ship-it.

Ah yes, like trap answers on a multiple choice exam. I suppose the Zen of that design would be: "There should be more than one obvious way to do it, but only one correct way."

Projects tend to gain more and more functionality to match the new workloads they're being used to accomplish, and it must certainly be a difficult decision for project visionaries. Do I listen to my users and implement features that will solve their new woes, but in return accept increased complexity and higher learning barriers? Complexity sucks, but it's even harder to say no to users in pain.

I wonder if this opens such projects up to disruption, in the original sense of the word. I've seen the same teams forgoing apache for nginx, then forgo nginx for haproxy when it matches their needs. With additional layers of complexity, there may be accompanying opportunities for "simple but good" projects to gain traction.

I hate complexity, and because of that I spent a lot of time creating the Redis modules subsystem so that I can keep a very small core. However, general data structures like streams are, in my opinion, not bloating Redis, which is still, in the "streams" branch, at just 85k lines of code. For two reasons:

1) Data structures are self-contained beasts in Redis; they do not interact with other features to multiply complexity. Other features are instead like that: for instance, expires interact with replication, AOF, scripting and so forth.

2) IMHO bloating Redis would be adding too narrow, use-case-specific things inside it. But as Redis has lists, hashes, sets, ... the "log" really was part of the general purpose things that Redis was lacking. As a result of this lack, people used something else, adding complexity inside their code in order to model the same problem with the wrong tool, like sorted sets or lists or Pub/Sub, which are good for certain things but not for time series or certain event streaming tasks.

Btw, the fact that after around 8 years we are at 85k lines of code, totally understandable by a single individual in a matter of weeks, positions Redis as one of the simplest software projects out there.

Another data point: streams are, if not completely, at least 70% done as we speak, and yet `git log --stat unstable..streams` reports:

12 files changed, 1323 insertions(+), 10 deletions(-)

We'll end at 2000 lines of code more or less, I guess. Just to say, Redis is at a complexity scale very far from many other things we are used to these days.

It's also why software is a pop culture. Sophistication and completeness is seen as complexity and cruft by each successive generation, who start something new and simple.

I don't think it's very avoidable. Tech is genuinely getting incrementally better, but it's usually in a sawtooth pattern.

I'll agree that sophistication and completeness are often seen as complexity/cruft, but they also always come with actual cruft, since improvement is incremental and breaking APIs is annoying.

My favorite aspect of this cycle is when some features in the complex software become seen as so useful as to be required and standard, so when the new simpler version is created they have to figure out a novel way of providing that useful functionality in a simple and elegant way. And they do it, sometimes knowing they have made a significant advance and sometimes without knowing.

Do you have any examples of that second case?

Sounds too good to be true... but I'd love to be proven wrong :)

I'd agree in general, but Redis is architected in a way where these features aren't really new "layers", but rather just horizontal modular additions.

A new data structure + associated commands being supported in Redis is like, say... a new filesystem being supported in the Linux kernel. It's a few files that you could just avoid compiling in if you didn't want them, and which add code-paths that are never run unless you intentionally use that specific new thing.

> I'd agree in general, but Redis is architected in a way where these features aren't really new "layers", but rather just horizontal modular additions.

> A new data structure + associated commands being supported in Redis is like, say... a new filesystem being supported in the Linux kernel. It's a few files that you could just avoid compiling in if you didn't want them, and which add code-paths that are never run unless you intentionally use that specific new thing.

Isn't it exactly how features are implemented in Apache? Most people think that Apache is too bloated, but the same people also leave a lot of those modules enabled without trying to spend time to understand if they actually need them.

I made that very specific comparison for a reason: most "plugin systems" for software (like Apache) act sort of like audio VSTs—they can insert themselves anywhere in the "processing chain" of a request, making following the logic more complex.

Redis modules, like filesystem drivers, are comparatively simple: they register a set of commands that they respond to, and each such command is handled exclusively by that module when received. Less like Apache plugins, more like scripts in a cgi-bin. You don't need to know what else is in the cgi-bin besides your own script, because everything there is entirely independent.

Look at the source code. Redis is very simple, single threaded, primitive networking... I would hardly call it complex even with this extra feature.

I love redis, and this API looks amazingly simple. I'm sure I can think up a good use case for this, but the only problem I have with it is that time-series log data of this nature is increasingly becoming the defacto source of truth in the various models resembling some version or another of event-sourcing.

Obviously the general thinking is that event sourced time series data allows you to treat every other data source as derived state from the log data. If it gets written to the log, it's safe and that's the only real data that cannot be considered a redundant read layer. A common structure might look like:

1.) Event Sourced/Time Series Layer: Kafka/Kinesis>S3 takes in and saves log data

2.) Operational State Layer: RDBMS Constraints / Application Logic determine how operational state is derived from log data

3.) Indexing Layer: Query optimizations occur with redundant read layers in what is essentially all just various forms of indexing. This can be RDBMS index, ElasticSearch, MapReduce, Redis, etc.

Redis' place has historically been at #3, for many reasons. Whether an application has an event-sourced layer in which their operational state is derived from log data or is actually considered the primary source is something that is hugely variable. I would say that most applications do not make a distinction between #1 and #2, and just write state directly to an RDBMS. But while the operational state may or may not be considered redundant, depending on the application, the indexing layer is almost guaranteed to be a redundant layer. The redundancy of the index layer means that Redis operating purely in-memory allows the whole thing to be blown away with no consequence. To move it two steps down the data model to the defacto source of truth is a monumental shift in responsibility. Redis as an in memory caching layer has only ever had me have a cursory awareness of its capabilities in terms of saving to disk, but I would think that a fundamentally different use case like this will have me taking a serious look at where that functionality is at today.
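The "everything else is derived state" idea is easy to show concretely. In this toy sketch (standing in for no particular system), the log is the only real data; the read-optimized index can be blown away and rebuilt from it with no consequence:

```python
# Layer 1, the source of truth: an append-only log of events.
log = [
    {"id": 1, "user": "alice", "delta": +5},
    {"id": 2, "user": "bob",   "delta": +3},
    {"id": 3, "user": "alice", "delta": -2},
]

def rebuild_index(events):
    """Derive the read-optimized state (layer 3) by folding over the log."""
    balances = {}
    for e in events:
        balances[e["user"]] = balances.get(e["user"], 0) + e["delta"]
    return balances

index = rebuild_index(log)   # {'alice': 3, 'bob': 3}
index = {}                   # "blow away" the redundant read layer...
index = rebuild_index(log)   # ...and rebuild it with no data loss
```

That disposability is exactly why an in-memory store is comfortable at layer 3 and why moving it down to layer 1 changes its durability obligations so dramatically.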

All of this being said, there are plenty of use cases with kafka/kinesis which are done today which actually don't even save the log data at all, and just use them as an intermediary buffer to have multiple consumers on an event stream. There's also nothing stopping us from just having one of the consumers of this stream to be saving it to S3/Disk ourselves.

I would imagine this is best used for the "store a ton of events with some capped maximum size" Kafka use case (ie - realtime analytics, IoT data, etc). I just can't imagine using Redis as your source of truth for an event sourced system, especially without the partitioning and log compacting features that Kafka has. Still this seems like a pretty amazing feature to have in Redis, can't wait to start playing with it.

It's been a long time since I looked into this: is there now a way to configure a cluster of Redis instances such that you won't lose messages on node failure? If not, all the nice at-least-once delivery (or "effectively once" when you add message dedupe) you get with something like Kafka/Kinesis/GCP PubSub is gone.

If not, either people's messages don't matter /that/ much (which is fine, just not great for most of my usecases at the moment) or everyone's in for another round of "oh shit, where did the data go?"

Edit: Just in case we end up in CP vs AP datastore wars, please go read https://martin.kleppmann.com/2015/05/11/please-stop-calling-...

At-least-once delivery requires neither CAP consistency (linearisability) nor CAP availability (any non-failed node must return a response in a non-infinite time), but is a very useful property!

Hello, streams have basically the same characteristics as any other Redis data structure: from the POV of a local node you can configure strong persistence on disk, but on node failures you basically have different tunable amounts of best-effort consistency, which means you cannot guarantee that no messages are lost. So basically you can:

1. Use the default asynchronous replication, and live with the fact (if the use case permits) that on failover the message may not yet have reached the slave that gets promoted.

2. Use WAIT to force synchronous replication to N slaves. This still will not let Redis guarantee, in mathematical terms, that the failover will pick a slave that received the message under complex partitions, but it narrows the real-world failure modes leading to lost data to more "unlikely" cases. You still have just best-effort consistency, but with better real-world outcomes.

So Redis streams will be a good choice if one of the above is acceptable.

I forgot to add that with the Redis modules API for the cluster, it could be possible in the future to write a module exposing a CP version of XADD without changing Redis default semantics.

What is the consistency model of redis?

It sounds like anything can be lost in redis during normal HA operations even with WAIT pushing to a majority of slaves. Is that right?

Eventually consistent, unless you're only using a single node. I believe that Redis itself commits to disk at various checkpoints in time, so if a failure happens, you're really only guaranteed to fail over onto a pool of data that's consistent up to the last checkpoint of the node you're moving to.

EDIT: And as antirez said above, you can use WAIT to force synchronization to N slaves, so you would be pretty likely to fail over onto a node that has n-1 messages if it didn't sync in time. That still isn't guaranteed, however.

That's good to know. Thanks for the explanation!

Not currently useful to me, but I'm sure this hits the sweet spot for a bunch of people.

Last I checked, neither Redis Sentinel nor Redis Cluster were linearizable systems; you get neither C, A, nor P. Redis Cluster failed Aphyr's Jepsen tests back in 2013. I'm not sure what the current status is, but I don't think the fundamental architecture has changed since then.

With vanilla Redis master/slave replication, I believe the best way to avoid data loss is to set replication to be synchronous (it's async by default) so that slaves are always guaranteed to be in sync with the master, in case you need to promote (using Sentinel) a slave to master.

The clear writing/documentation, concise API design, and clever implementation of antirez and the rest of the Redis team continues to amaze me. Redis is easily my most admired OSS project.

Very cool! We've been doing time series with Redis using sorted sets, referencing items by timestamps and integer offsets, and using the clock shift workaround described in the article. Having this kind of thing consolidated down into a few Redis commands would be handy. The API looks clean, too.

Two comments on effectively once stream processing.

1. Consider adding an example of a stateful event stream processor client that saves the last-read stream offset in Redis together with its current state, and continues reading from that offset, as an atomic operation. For example, a client that sums a stream of numbers would, to have effectively-once semantics, need to persist the sum and the offset to Redis together.

2. Consider adding a stream-read deduplication example to mitigate clients that reinserted the same event twice. It is not clear how the client should behave if it didn't get an ack and it resends an event. What are the correct resending semantics so the reader can effectively dedup? What is the right data structure to dedup message IDs without consuming too much memory, etc.?
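Point 1 could be sketched like this. It is a toy in-memory model: the `checkpoint` dict stands in for two Redis keys that a real client would update together in one MULTI/EXEC transaction (or a Lua script). Because the sum and the last-read ID are persisted as a single unit, redelivered entries are skipped, which also gives a cheap answer to point 2 when IDs are monotonic: anything at or below the checkpoint is a duplicate.

```python
# Durable "checkpoint": state and offset are always written together.
checkpoint = {"sum": 0, "last_id": 0}

def process(entries):
    """Consume stream entries of the form (id, number), effectively once."""
    global checkpoint
    state, last_id = checkpoint["sum"], checkpoint["last_id"]
    for entry_id, value in entries:
        if entry_id <= last_id:
            continue                 # duplicate delivery: already summed
        state += value
        last_id = entry_id
    # In Redis this single assignment would be a MULTI/EXEC (or Lua)
    # write updating both keys atomically.
    checkpoint = {"sum": state, "last_id": last_id}

process([(1, 10), (2, 20)])
process([(2, 20), (3, 5)])           # entry 2 redelivered, not re-added
```

If the two writes were not atomic, a crash between them would either double-count (offset lost) or silently drop data (offset advanced without state), which is exactly the failure mode the example would demonstrate.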

I've been using redis + resque[1] for a few side projects, and I have to say I'm glad that streams are getting first-class support in Redis. I was always a little wary of hacking this sort of functionality on top of Redis lists. It worked, but it seemed a little fragile.

[1] https://github.com/resque/resque

I'm very excited about this. I've been eyeing HTTP EventSource for a while now, but there hasn't been a good solution for the backend broker. Kafka is overkill, and Amazon Kinesis' pricing isn't viable if you have lots of topics. This fills the need perfectly, and Redis is already part of my stack.

I don't know if you've ever looked at it, but CouchDB has had an EventSource endpoint for a long time now (http://docs.couchdb.org/en/2.1.0/api/database/changes.html?h...). CouchDB is extremely easy to install, use and maintain, and there's a number of public providers out there if you don't want to host everything yourself.

As a more generic solution, there's also pushpin (http://pushpin.org/), which is the backend of fanout (https://fanout.io/), so that may also be a nice addition to your stack if you want a more direct redis->clients link

Check out nchan.io, an nginx module that does everything you need to connect EventSource to Redis-backed pub/sub.

+1, good to see other developers out there using EventSource and nchan. We've been happily using both in production for almost a year now.

Just wanted to chip in and say HTTP EventSource has been really nice to work with, we've been using it in production for 1+ year

Which polyfill do you use for IE/Edge?

Amazing! nit: Change COUNT to LIMIT while there's still time. Also, can this primitive be used to replicate Redis data instead of sentinel/cluster?

Under what circumstances would one prefer Redis streams over Kafka and vice versa?

I can think of some circumstances: 1 - You already have a Redis infrastructure and don't want to, or don't have the resources to, deploy a full Kafka infrastructure (3 Kafka brokers + 3 Zookeeper nodes)

2 - Kafka clients are not available (or are poorly available) for every programming language. Redis has a simpler protocol, so it has more, and better, clients available, and even if you use an exotic language it is easy to write a client for it (well... easier than for Kafka)

3 - Kafka AFAIK does not have any internal cache implementation, so every read is served from disk (+ page cache). This means that Redis Streams will (probably) perform much better for use cases when the consumers need to fetch data from old offsets.

edit: added reason number 3.

Regarding point 3: unless your system is under massive memory pressure, no caught-up Kafka consumer should be serviced from disk. Old offsets that were flushed out of memory because you don't have enough of it are obviously served from disk, with essentially linear reads of disk blocks of the requested file (of consecutive logical addresses, if flush sizes are large enough — they can end up on disk in any number of ways depending on how much the firmware lies, I know).

I really cannot see how Redis is going to perform "much better" reading from disk once the entries are no longer in RAM. At that point both Kafka and Redis have to read from disk, and you either have the IOPS to serve all the lagging consumers or you don't. Maybe you have enough of them to service 1 or 2 concurrent reads, maybe 10-12. But for the same message counts, sizes and concurrent consumers, your workload will become IOPS-bound rather fast.

Note: "much better" to me implies 10x+ better, not "my C library read() is 2.3% better than your Java".

re: client support - I dunno, this seems like a pretty comprehensive list to me? I mean, there's even a rust client: https://cwiki.apache.org/confluence/display/KAFKA/Clients

Confluent only officially supports the Java client (and now has Python, Go and .NET clients as well, which I didn't know about), and it is really recommended to use a client with the same version as the broker due to protocol incompatibilities.

Most Kafka client implementations are open-source projects of their own; this is also true for most Redis client implementations, but again: the Kafka protocol is much more complicated than the Redis one.

I haven't used Kafka with other languages besides Java or Scala, so I can't really say how mature the other clients are.

But my point about how easy it is to implement a client for Redis, if needed, is still valid. =)

They also support a C reference implementation (which is how they get the others).

As someone who has been burned by non-Java Kafka drivers, beware the perception of ecosystem support here. The Kafka design pushes a huge amount of complexity onto the client, and in our experience only the Java client deals with this complexity well. We started out using Python clients but eventually moved to Confluent's REST API (wrapping the Java driver) because we had so many problems with it.

One that immediately comes to mind is cases where Kafka is overkill. Kafka is a great tool, but there's a lot of overhead in setting up and maintaining it (e.g. Zookeeper), so if your throughput needs are low, it's a poor fit. Spinning up a Redis server is dead simple, and if you're already using Redis for other things, then there's no need to bring an additional tool into the mix.

Genuine question - Why does everyone seem to think running a zookeeper cluster is so hard? You can run it on three small VMs and basically forget about it. We didn't have any zookeeper experience at my last startup before we started using it for Kafka and we used a very simple puppet module to install it on three instances in each of our AWS regions. It really never gave us many problems in the several years since.

Also, all the tooling around it is quite mature - there are great monitoring and management tools for probing at the internals which helped when we were dealing with more exotic kafka surgery.

For some reason Zookeeper is unjustly seen as uncool technology. I've even seen it being blamed for issues that it had nothing to do with.

People say that setting up a ZK cluster is a huge issue, yet they don't see a problem spinning up etcd, or Sentinel nodes in the case of Redis.

When I learned about ZK I was skeptical, and didn't like that it was written in Java, but ZK proved to be extremely robust.

I'm not sure about it being "so hard", but it's extra stuff to deploy, maintain, monitor, and pay for. Very small teams benefit from keeping things small and simple. Again, if you actually need Kafka, then it's worth it. If you just need something similar, but can deal with the limitations of redis streams, then it's an easy choice.

Because nothing in the current hype cycle depends on zookeeper so people use that as a crutch for following hype.

Simplicity. Redis 4.0 with the rxlists module [1] provides a fast queue system with full persistence. Unless you need replay, Redis is often easier and has plenty of throughput. This streams feature now solves the replay disadvantage.

1. http://redismodules.com/modules/rxlists/

Slightly off-topic, but could the blog be adjusted a little for mobile reading?

    <meta name="viewport" content="width=device-width, initial-scale=1">
    #content { max-width: 800px; } /* replaces width: 800px */
seems to do the job, and also behaves better in narrow desktop browser windows.

Thanks! I'll change it tonight.

Done hopefully... EDIT: undone, makes things much worse on Android :-)

Fantastic news, congrats Salvatore! Cannot _wait_ to replace some hacky Kafka uses with tried-and-true Redis4! :)

In what sense is Kafka (or your use of it) hacky? I have never used Kafka, but I have always thought of it as being more solidly engineered than Redis but also more complicated and perhaps tricky to deploy (based on blog posts I read).

In any context it's used where the demand (by whatever measure you care to use: bandwidth, throughput, message durability, etc.) doesn't justify it, or that isn't a good use case for Kafka, for starters. That happens all the time, because every data and infrastructure engineer in the Bay Area wants to put Kafka on his resume.

But that doesn't explain why Kafka has any minimum throughput requirement. Does it have usability issues?

A good tool should be able to be used at any scale.

Kafka has very poor tooling in my experience (a folder full of fairly buggy bash scripts...), and due to ZooKeeper requires a lot of operational care. For example, it's extremely easy to destroy a Kafka cluster by bringing a new, empty ZK server online with newer but incorrect data in its volume. ZK will happily trash the entire cluster thinking it has new instructions. So network isolation is key, which, while obvious, is another source of potential failure.

Kafka also has the JVM, which requires a lot of love to scale in my experience. I do not want my programmers messing around with GC options when writing to what should (to them) be exposed just like a regular file handle (except distributed across many systems). I strongly prefer to avoid Java applications at all costs - in my experience it takes years and years and years for Java based infrastructure to become relatively stable & reliable (see ElasticSearch 5.0, or ask anyone who has been oncall for a Tomcat based application). This is almost certainly personal bias, but it's my bias regardless.

Redis also has a _massive_ number of tooling / monitoring / ecosystem advantages, including hosted options, and can run on a single instance without configuration changes from the developers perspective.

I also have personal reasons to prefer Salvatore's work over the work of Confluent.

As someone that has both Kafka and Redis in use without issue, for years, (and is about to replace a lot of misused Redis instances with Kafka) I really fail to follow your points.

So, a Zookeeper cluster can't survive accidentally injecting just the right malicious data that will make it keel over. I'm sorry, how do you accidentally achieve that? Do you also accidentally configure your Redis Sentinel to replicate from /dev/null?

As a matter of fact, this announcement comes at a very inopportune time for me. antirez had the epiphany of reading on IRC about replicated logs instead of looking at the opening paragraphs of the Kafka documentation, and all the Redis evangelists at my job will now try to shoe-horn the wrong usecase back into Redis because Redis!1cos(0)!. Sigh.

> For example, it's extremely easy to destroy a Kafka cluster by bringing a new, empty ZK server online with newer but incorrect data in its volume. ZK will happily trash the entire cluster thinking it has new instructions.

How does that happen? I mean, a new, empty ZK server with newer data than the rest of the cluster?

Also, please note that ZK is not meant to be a database but a coordination service; its guarantee is that all nodes are always in a consistent state, and none of its nodes allow any changes if there's no quorum. So if a new node somehow has more recent data with a higher serial number, it's expected that the remaining nodes will sync to it.

Exactly right - in my case the situation was another team accidentally bringing a new ZK node with "bad" but "new" data online. Had there been network isolation, no issues. Had there been static cluster identifiers, also no issues. It was a messy environment, and it should have been prevented by operational diligence, but my point is Redis is "harder to mess up". As an on-call engineer, I'll always go with simpler, foolproof tools. Another quibble is how gnarly the client-side driver for Kafka is...

I don't hate Kafka, I just don't like ZK and find redis has better tooling and a better track record at my shops :)

In order to connect a ZK host to the cluster, its IP needs to be included in the configuration of all the nodes.

It's hard to accidentally add node to a cluster. A person who can "accidentally" add a ZK node has enough permission to do a lot of more devastating things accidentally.

Yep. All it takes is service discovery and a jr. sysadmin not totally familiar with ZK.

This is all in service to my point about simplicity and safety.

> ...in my experience it takes years and years and years for Java based infrastructure to become relatively stable & reliable (see ElasticSearch 5.0, or ask anyone who has been oncall for a Tomcat based application).

This is almost the exact opposite of how I determine what tools to use. If it's written in C or Java, I'm usually pretty confident that it is engineered by a team of experienced developers. Both because the languages are technically more difficult to use, and not as cool.

In contrast, if a tool is written with javascript (Node) or Ruby, and often times Python, I'm very hesitant.

In fact, this whole topic of streams has me wondering just how many developers out there are setting up clusters of Kafka or Redis or whatever is new and hip, when they could have saved themselves a huge amount of pain by using tried and true tools like JMS or ZeroMQ.

Most companies do NOT have a need for scaling like Netflix or LinkedIn, and I'm beginning to wonder if Kafka, Redis, etc are this year's version of NoSQL and MongoDB hype.

I should have clarified - web applications in Java, built by web developers, are not to be trusted. ES -usage-, not ES itself, is/was the nightmare before es5 (which is much much much more defensive against anti-patterns).

I love redis, so I certainly don't have issue with code written in C, heh. Just code written in C by junior developers :P

> A good tool should be able to be used at any scale.

I don't necessarily agree that a tool should be used at any scale even if it's technically possible to. Cost (multiple dimensions, including money and engineering effort) factors in.

> However a special ID of “$” means: assume I’ve all the elements that there are in the stream right now, so give me just starting from the next element arriving.

I can already see lazy users just repeatedly reading $, and then dropping messages when they arrive faster than they read them.

Might it be safer to instead have command to ask what the latest ID in the stream is? You'd start off by using that to work out where the streams are, then construct an XREAD command to read from there. To construct your next XREAD, it should be easier to update the IDs from the ones you just read, rather than fetching the latest IDs again. Maybe.

I've had a break from Antirez' blog for a while - it's fun to go back and see how good his English has gotten!

Surely any discussion about the concept of logs and mention of Kafka should have a reference to this excellent article by Jay Kreps on LinkedIn:


It [OP] was a bit cringeworthy, frankly. Obviously the Kafka papers about 'reconsidering the log' and 'unified view' and all that have been out there for years now.

The Kreps article is quite excellent and well worth the read.

> A final important thing to note about XRANGE is that, given that we receive the IDs in the reply, and the immediately successive ID is trivially obtained just incrementing the sequence part of the ID, it is possible to use XRANGE to incrementally iterate the whole stream, receiving for every call the specified number of elements.

Love it. No need for an expensive SCAN command.
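The successor computation the quote describes is trivial (a Python sketch; the pagination pattern in the comment is illustrative pseudo-redis-cli, not output I've run):

```python
def next_id(stream_id):
    """Successor of a Redis stream ID "ms-seq": bump the sequence part."""
    ms, seq = stream_id.split("-")
    return f"{ms}-{int(seq) + 1}"

# Incremental iteration sketch: after each call, restart the range
# just past the last ID you received, e.g.
#   XRANGE mystream <next_id(last_returned)> + COUNT 10
cursor = next_id("1506871964177-0")  # "1506871964177-1"
```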

The consumer groups proposal breaks the FIFO abstraction of a stream by allowing multiple clients to process a single stream.

Have you considered adding a semantic layer inside streams that allows each client to consume a substream? In effect the stream becomes multiplexed substreams.

If substreams makes the design too complex... have you considered server side stream 403 semantics? When a stream is manually deprecated it enters an immutable state and provides a redirect response with a link to another stream. This would allow multiplexing and demultiplexing streams without changing the client implementations too much.

For completeness I would state the obvious cases where FIFO grouping is needed:

1. Scaling stateful event processing by splitting streams and adding more clients (CPU limit)

2. Scaling cross-region replication by splitting streams and adding more TCP connections (network limit)

3. Handling more throughput by splitting a stream into two Redis nodes (disk I/O limit)

Don't use consumer groups and every client will get a complete copy of the stream. What is broken with that?

The use case I'm referring to is when the client must be sharded to avoid CPU or network or disk bottlenecks.

Then that's exactly what consumer groups help with but it sounds like you want partitioning then - which is exactly what Kafka does but with a little more automation.

Run multiple Redis instances and use a simple hash based on whatever key you want to route messages and get the throughput you need. It's probably never going to be part of the core Redis logic but should be possible as a module to do the routing when used in a Redis cluster.
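The client-side routing part can be as simple as this sketch (instance names are illustrative; any stable hash works):

```python
import hashlib

def route(key, instances):
    """Pick one of N independent Redis instances for a stream key by
    stable hash, so the same key always lands on the same instance."""
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return instances[h % len(instances)]

nodes = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]
target = route("orders-stream", nodes)
```

Note that, unlike real partitioning, naive modulo hashing reshuffles most keys when you add a node; consistent hashing avoids that.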

What I'm referring to is changing the partitioning dynamically by splitting streams, not just redis nodes. Here is one implementation example http://docs.aws.amazon.com/streams/latest/dev/kinesis-using-...

Doing it without server support is tricky.

Is there a reason to choose millis as the granularity instead of micros or nanos? Is it because there's a stronger expectation of machines in a cluster agreeing on what milli it is "now" vs the other granularities?

I'm kind of thrown by the idea of putting the timestamp / stream-id in the XADD command, I would have thought the server would assign that, since one of the strengths of redis's single threaded nature is consistency: what's in redis is the truth. If you allow clients to specify timestamp, what happens when ntpd isn't running on some? I probably misread or misunderstood it.

Could you allow specifying `$` as the timestamp to tell the server you want it to use whatever it thinks the current time is as the timestamp / stream-id?

Hello, the stream implementation does not need the different servers (for instance a master and its slaves) to agree about the time. Simply, the server that receives the XADD command will generate the ID (and the time part of the ID) to attach to the item. All the other participants in the replication will accept the same ID, because clients will use "*" to specify the ID, while the command is rewritten to slaves with a specific ID. Example, I run on the master:

    > xadd stream * a 1 b 2

But this is replicated as (output of redis-cli --slave):

So XADD allows specifying an ID just for replication / AOF purposes, not because clients should normally specify an ID. However, if clients really want to do that, they can, but at the risk of getting errors, for instance:

    > xadd stream 10.0 a 1 b 2
    (error) ERR The ID specified in XADD is smaller than the target stream top item

Redis will anyway not accept any ID which is smaller than the current top-item ID.

The reason milliseconds were chosen instead of nanoseconds is that, for most applications, querying for sub-millisecond ranges is likely not useful, so seeing even larger numbers in the ID would just be unpleasant without being useful; however we are still in time to change this if there are good motivations. But since the time is produced by the local host, after a failover the IDs are generated by another host. Milliseconds can still more or less match with good time synchronization, but nanoseconds? So this additional precision would just be used to store non-valid info.
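A sketch of how a receiving server can mint such IDs (plain Python, my illustration of the scheme, not Redis source): millisecond time plus a sequence that breaks ties within the same millisecond and never lets IDs go backwards, even if the clock does.

```python
import time

def next_stream_id(last_ms, last_seq, now_ms=None):
    """Return the (ms, seq) ID for the next XADD with '*'.

    If the clock reads the same millisecond as the previous entry, or
    has stepped backwards, keep the old millisecond and bump the
    sequence so IDs remain strictly increasing."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if now_ms <= last_ms:
        return last_ms, last_seq + 1
    return now_ms, 0
```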

> Redis will anyway not accept any ID which is smaller than the current top-item ID.

I was reading on mobile earlier and maybe missed this point, excellent. I also didn't realize that clients would use (star) to specify the ID and that the receiving server turns that into an actual ID before replicating it / AOFing it.

> But being the time the one produced by the local host, after a failover the IDs are generated by another host. Milliseconds can still more or less match with good time synchronization, but nanoseconds? So it's like if this additional precision will be just used to store non-valid info.

I think most failovers necessarily take longer than a millisecond so any resolution smaller than millis would _probably_ be okay, but yeah this is not a compelling reason to switch to micros/nanos. My suggestion to switch to micros/nanos was more to try to reduce the number of collisions requiring the server to de-dup / assign sequential sub-epoch numbers to events arriving during the same server tick. I guess that's not a big issue though.

Thanks for the reply, Salvatore. Redis is one of my favorite codebases and projects.

Redis Streams could be a nice real-time counterpart to http://traildb.io: For instance, use Streams in Redis to record data in real-time and periodically store it in TrailDBs for long-term archival and analysis.

This may actually solve an issue I was about to tackle, which is high-speed notification delivery. Currently I was going to do a nasty wrestle of PUB/SUB with Lists and blocking keys to try and get a cluster of message processing servers to digest notifications when they get added to the "queue". My purpose in this is notifications, as in actual push notifications for mobile/web/etc. This seems like it fits perfectly with what I had planned. Especially since this ensures message delivery instead of fire-and-forget as you mentioned. As a question though... is the XACK command working in your current branch? Since that is key to my usage for ensuring message consumption.

Essentially, I was going down this route if anyone's curious....


Could the difference between `MAXLEN ~ 1000000` and `MAXLEN 1000000` be handled internally, by marking the overflowing items as deleted until a whole block can be removed? It looks like this tombstone functionality is already planned, and it would make the API simpler.

I would suggest making the efficient behavior the default and let people use `= 1000000` when they know they really want the expensive but exact behavior.

Are there plans for a "unique count" over an XRANGE?

I currently use multiple Sorted Sets (one set every 5 minutes) and Union 30-60 worth to produce a "rolling window" of uniques.

I can see an alternative where I just request a unique count of elements within an XRANGE.
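The current workaround the parent describes amounts to this (a plain-Python sketch; on the Redis side it would be ZUNIONSTORE over the bucket keys followed by ZCARD):

```python
def rolling_uniques(buckets, window):
    """Each bucket is the set of ids seen in one 5-minute slot; the
    rolling count is the size of the union of the last `window`
    buckets."""
    union = set()
    for bucket in buckets[-window:]:
        union |= bucket
    return len(union)

# three 5-minute buckets, rolling window of the last two:
count = rolling_uniques([{1, 2}, {2, 3}, {3, 4}], window=2)
```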

What sort of compression do the blocks undergo? E.g. does periodicity of the timeseries help reduce the space of the 64/128-bit timestamp? Gorilla[1] style compression would be great, although it'd likely make sub-block-level range queries tough.


Hello, yes, IDs are delta-compressed so they actually use just a few bytes per entry (often just 2) instead of 16.
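The idea, sketched in Python (my illustration of delta encoding in general, not the actual listpack encoding Redis uses): store the first (ms, seq) ID in full, then each successive ID only as the difference from its predecessor, which for a busy stream is a tiny number.

```python
def delta_encode(ids):
    """ids: list of (ms, seq) tuples in stream order."""
    deltas = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(ids, ids[1:])]
    return ids[0], deltas

def delta_decode(first, deltas):
    """Rebuild the full ID list from the first ID plus deltas."""
    out = [first]
    for dms, dseq in deltas:
        ms, seq = out[-1]
        out.append((ms + dms, seq + dseq))
    return out
```

Small deltas can then be stored in one or two bytes with a variable-length integer encoding.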

I have one follow up question - is TTL a planned feature? Being able to set a TTL on the _stream itself_ and -also- on the messages would be extremely nice. While MAXLEN prevents a queue from being extremely large, I also want to remove "stale" data after a configurable time period.

Use case: A log of network latencies, where a user who might currently `XREAD` with a timestamp 10 minutes in the past would be able to save on memory usage by expiring log entries older than 10 minutes, and then simply `XREAD STREAMS strm 0` and let Redis (and therefore the infrastructure, not my code) manage data retention.

Also, how does this work re: evictions? Say a node is at max memory, will entire _streams_ be evicted, or (I hope) the oldest messages in the LRUed or LFUed queues.

There's a video[1] at the end explaining this new feature.

[1] https://www.youtube.com/watch?v=ELDzy9lCFHQ

Here is an approach I successfully used before: since timestamps are read from the clock, the epoch could be a persisted value, and a few nibbles could be used for the instance id and a sequential number. With that you get 64-bit numbers for easy use and computation, without losing the time information, at the cost of a simple transformation function. It simplifies the interface a lot and generally makes things faster.

Many clients won't even need that timestamp anyway.
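For concreteness, a sketch of that packing (the bit widths here are illustrative, not the commenter's exact choice): milliseconds since a persisted epoch, an instance id, and a sequence number folded into one 64-bit integer.

```python
def pack_id(ms_since_epoch, instance, seq):
    """42 bits of milliseconds since a persisted epoch, 6 bits of
    instance id, 16 bits of sequence -> one 64-bit integer."""
    assert ms_since_epoch < 1 << 42 and instance < 1 << 6 and seq < 1 << 16
    return (ms_since_epoch << 22) | (instance << 16) | seq

def unpack_id(packed):
    """Recover (ms_since_epoch, instance, seq) from a packed id."""
    return packed >> 22, (packed >> 16) & 0x3F, packed & 0xFFFF
```

42 bits of milliseconds covers roughly 139 years from the persisted epoch.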

Why not just use sequence ID? I'm confused about why a timestamp is important. The sequence ID gives us ordering, is always guaranteed to be increasing.

Because with the way stream IDs are conceived you also get time-based range queries for free. With time series this is very important in many use cases.

I see, so this composite structure is in lieu of having two distinct fields exposed in the API?

This looks very cool. Can't wait to try it out.

After reading this I'm still not sure I understand. Time-series data I can see, but is there a use-case outside of this?

Poor-man’s Kafka?

I am looking forwards to trying this out as an easy access cqrs / event sourcing entry point.

The use of `+` and `-` in XRANGE seems inconsistent. Why not use `0` and `-1` like LRANGE?

Because they do not mean a position, but a special ID.

It seems that "$" is a special ID for the last message, as opposed to the last possible message.

I would humbly suggest that "^" would be a suitable symbol for the first message in a stream. ^ and $ are used in regex (and vim) in a similar way.

That way you could write "XREAD BLOCK 5000 STREAMS newstream ^" and get all the messages in a stream from the beginning, and then block until a new message comes in all with a single command. You would still be able to add a count if needed, to prevent client flooding.

Exactly, $ is the last message ID, + the greatest, - the smallest. I chose the dollar exactly because of regex assonance. However the corresponding ^ is kinda useless because with XREAD we specify the last ID we got, so it would result in not returning the first element of the stream. It means that it's more useful to specify just 0 in that case.

Funny, we also built a lot of our technology around the Streams concept.


Nice to know they have added a new data structure!
