Redis streams as a pure data structure (antirez.com)
344 points by itamarhaber 29 days ago | 53 comments

I get that the tennis match use-case is meant to be trivial and an example, but I don't buy it.

> Before Streams we needed to create a sorted set scored by time: the sorted set element would be the ID of the match, living in a different key as a Hash value.

I think the sorted set would be a much better choice, because then you could still insert items in the past, like when that admin remembers there was a tennis match last week he never recorded. Same goes for modifying past values, or deleting values. These operations are trivial using a sorted set & hash, not so using streams.
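The sorted set + hash pattern is easy to sketch. Below is a toy model using plain Python dicts (no Redis required; all names are illustrative), with each operation annotated with the Redis command it stands in for:

```python
import time

# In-memory stand-ins for the two Redis keys described above:
# a sorted set (member -> score) and one hash per match.
matches_by_time = {}   # ZADD matches <timestamp> <match_id>
match_details = {}     # HSET match:<id> field value ...

def record_match(match_id, details, when=None):
    """Insert a match, possibly backdated -- trivial with a sorted set."""
    matches_by_time[match_id] = when if when is not None else time.time()
    match_details[match_id] = dict(details)

def amend_match(match_id, **changes):
    """HSET again to modify past values."""
    match_details[match_id].update(changes)

def delete_match(match_id):
    """ZREM + DEL to drop a match entirely."""
    del matches_by_time[match_id]
    del match_details[match_id]

def matches_between(t0, t1):
    """ZRANGEBYSCORE matches t0 t1, then fetch each hash."""
    ids = sorted((m for m, t in matches_by_time.items() if t0 <= t <= t1),
                 key=matches_by_time.get)
    return [(m, match_details[m]) for m in ids]

# The admin remembers last week's match and backdates it:
record_match("match:17", {"winner": "Federer"}, when=1000.0)
record_match("match:18", {"winner": "Nadal"}, when=2000.0)
amend_match("match:17", winner="Djokovic")   # editing the past is just HSET
print(matches_between(0, 3000))
```

Backdated inserts, edits, and deletes are each a single command in this model, which is the commenter's point: streams (pre the XADD option discussed below) make all three awkward.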

I'm excited for streams and I'm glad Antirez is taking time to blog and evangelize, but this article didn't convince me there's a compelling use-case for streams aside from the Kafka-like use-case.

We are going to add an option to XADD to insert elements in the middle. I commented more extensively about it in another reply, so inserting out of order will be possible later. However, note that the pattern still works if you use the time as a field, when you don't need range queries but just want single-item identifiers. In any case, the XADD option to insert out of order is really a thing that will hit Redis ASAP.

Excellent to hear this.

We use sorted sets as queues heavily, and this would be necessary for us to consider giving streams a go, which would indeed be interesting from a memory-savings perspective (we sometimes have millions of items in our queues for a short time). Sometimes, say on error conditions, you want to stuff something back at the start of the queue instead of at the end, because the order of processing matters, as one example; priority is another.
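The requeue-at-the-front trick falls out of the score semantics. A toy sketch, again modeling the sorted set with a plain Python dict (names are made up):

```python
# Toy model of the sorted-set-as-queue pattern: score = position,
# ZPOPMIN-style pop, and "requeue at the front" by scoring the item
# below the current minimum.
queue = {}  # member -> score, i.e. ZADD queue <score> <member>

def push(item, score):
    queue[item] = score

def pop():
    """ZPOPMIN: take the lowest-scored item."""
    item = min(queue, key=queue.get)
    return item, queue.pop(item)

def requeue_front(item):
    """On error, put the item back *before* everything else."""
    front = min(queue.values(), default=0.0)
    queue[item] = front - 1.0

push("job:a", 1.0)
push("job:b", 2.0)
job, _ = pop()        # job:a comes out first
requeue_front(job)    # processing failed: back to the head of the queue
print(pop()[0])
```

The same scoring trick handles priorities: a lower score simply means "process sooner".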

Just like it's useful to have both SQLite and Postgres available for smaller and larger data projects (and Spatialite and PostGIS for smaller/larger geo-data projects), it could be great to have Redis and Kafka for smaller and larger pipeline projects.

Does anyone have good patterns for joining across entries from two or more Redis streams? This is one of the most interesting aspects of Kafka/Flink/Spark/Storm/etc. Would be useful to be able to develop with streaming joins in Redis playgrounds.

This seems pretty simple when events are logged as they happen with little or no latency and you can let the stream set the timestamp. I wonder, though, about the case where events may be buffered, perhaps due to an unreliable network? The time that the event occurred might be significantly earlier than the time it's inserted, and furthermore events are arriving out of order. It seems like things get much more complicated?

Let's say tennis games are recorded on a piece of paper and entered into the computer later. What is different?

Two solutions: 1. add a timestamp as a field and just use the ID, but in that case range queries are going to be a problem. 2. exactly because of what you stated, XADD will soon have a special argument to say: I'm going to insert an element in the middle; this is the time in milliseconds (find the counter part for me if I did not specify one). It could be confusing for streaming, but as a data structure, inserting in the middle is spot-on and there is nothing preventing it.

Has anyone used Redis streams to store actual logs yet? Like server logs, application logs, etc.

I understand that Elasticsearch is a common place to put logs, also because I assume that searching through logs is a common use case, but I wonder whether Redis has particular benefits for this use case. The data structure seems particularly tailored to it (but not so much to searching I guess).

Log volume can easily exceed reasonable memory sizes. Even a small company can generate TBs of logs each month. Having a single box with TBs of memory wouldn't be desirable.

For logs without full indexing, Loki (https://github.com/grafana/loki) is a recent entry into the space, and it is probably a good option to look at. It indexes metadata (labels), so it allows searching by labels but not full text. It is also supposed to be horizontally scalable, which is probably something you want in a log storage solution.

My guess is that this would work fine until the working set size exceeds available memory. Redis (unless something new has happened in the last couple of years since I used it) requires that data fit in RAM. So it could work well for low-frequency logging like alerts, but not as a general-purpose log system.

I wish we could standardize on using Redis as general interprocess transactional memory. I could drop 95% of our application code for our Embedded Linux platform by using stock Redis and stock SQLite, but of course there are political obstacles.

Is this embedded in the same process, or just within the same unit?

Aside: would an embeddable redis be a useful thing for apps and other isolated devices?

There is basically no gain in practical terms in running Redis as an embedded library in embedded contexts; at this point I think I'm able to summarize the key reasons:

1. Embedded systems are often used in environments where you need very resilient software. Crashing the DB because there is a bug in your app is usually a bad idea.

2. As a variation of "1", it's good to have different modules as different processes, and Redis works as a glue (message bus) in that case. So again, all should talk to Redis via a unix socket or similar.

3. Latency is usually very acceptable even for the most demanding applications; when it is not, a common pattern to solve the problem is to write to a buffer from within the embedded process, which a different thread then moves to Redis. In any case, if you see Redis latencies of any kind, you don't want to block your embedded app's main thread.
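Point 3 can be sketched as a local buffer drained by a worker thread. The actual Redis call is stubbed out here (`send_to_redis` is a stand-in for e.g. an XADD over a unix socket, not a real API):

```python
import queue
import threading

# The embedded main thread never blocks on Redis: it appends to a local
# buffer, and a worker thread drains the buffer toward Redis.
buf = queue.Queue()
sent = []

def send_to_redis(item):
    # Stand-in for the real network call (e.g. XADD via a unix socket).
    sent.append(item)

def drain():
    while True:
        item = buf.get()
        if item is None:   # shutdown sentinel
            break
        send_to_redis(item)

worker = threading.Thread(target=drain)
worker.start()

# The main thread only ever does a cheap, non-blocking enqueue:
for reading in ("temp:21.5", "temp:21.7"):
    buf.put(reading)

buf.put(None)
worker.join()
print(sent)
```

If Redis stalls, only the worker thread waits; the main loop keeps enqueueing (or can start dropping, depending on the application's policy).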

4. Redis persistence is not compatible with that approach.

5. Many tried such projects (embedded Redis forks or reimplementations) and nobody cared. There must be a reason.

I beg to differ. SQLite is a very popular embedded database. There is inherent simplicity to just reading and writing flat files.

Redis feels like that. It’s a simple data structure server. Now if we could have those data structures sync with flat files through the same Redis API, a lot of applications would become much simpler.

I’m not sure how big of an undertaking it is though.

I’m willing to bet a fast, general data structures database syncable to flat files would open up many possibilities.

Having an in-memory datastore that is compact and supports fast queries and flexible data types is very useful.

I use sqlite for this purpose, essentially as an in-memory cache of data populated from disk and incoming server packets. Having Redis as an option to replace sqlite (or at least to compare memory use and speed) would be great.

I looked for an embedded Redis fork and came up blank, do you have links? I found Vedis, but I would rather have something built off of the Redis code than a re-implementation.

Sorry, I don't have links since I did not track such forks in the past. However, I have a question: for your use case, isn't it an option to have a library that looks like Redis from the point of view of the API, but actually stores objects in memory as data structures native to your programming language? This way the API works as a mental proxy for the DSL to access Redis and the time complexity you expect from given operations, but you are just writing to local objects.

That's an option but I would rather not re-invent the wheel unless necessary!

The current use of sqlite is to allow our scripted code (lua and actionscript) to make queries of the exposed data without having to write C++ code for every possible query and data object type (and implement new ones on demand).

Redis might not be the correct thing for this exact use case (some of the queries are more complex than a simple key or range look-up) but I may be prepared to take those limitations in exchange for a substantial speed and/or memory use improvement.

It sounds to me like there might be some area where sqlite is "too much" but LMDB/BerkeleyDB/Tokyo Cabinet is "too little".

I would be surprised if "actual Redis" was ever the right answer to "sqlite is too much".

But I do wonder if there are some lessons to take from the Redis API to wrap something like LMDB/BDB etc.

I'm not familiar enough with Redis to know when/if this would make sense over just using sqlite, though.

One limitation of SQLite is that it doesn’t support any kind of “notify me when some other process does X” operation. (If you Google it, you’ll find sqlite3_update_hook, but that only works for updates performed by the same process.) If you want to use SQLite as an event queue, you can have one process writing rows to a table and another process reading them, but you need some external signal to tell the second process “wake up, there’s new stuff in the queue”. Or you can have it poll on a timer, but that’s suboptimal in many different ways.
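A minimal sketch of the polling workaround described above, using only the stdlib `sqlite3` module (the table, path, and function names are made up for illustration):

```python
import os
import sqlite3
import tempfile

# SQLite-as-queue: one connection writes rows, another reads past a
# remembered rowid. What is missing is any way to *block* until new rows
# appear -- the reader must poll on a timer (or be signalled out of band).
db = os.path.join(tempfile.gettempdir(), "queue-demo.sqlite3")
writer = sqlite3.connect(db)
writer.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, body TEXT)")
writer.execute("DELETE FROM events")
writer.commit()

reader = sqlite3.connect(db)
last_seen = 0

def poll():
    """One polling pass; in real code this would run on a timer."""
    global last_seen
    rows = reader.execute(
        "SELECT id, body FROM events WHERE id > ? ORDER BY id", (last_seen,)
    ).fetchall()
    if rows:
        last_seen = rows[-1][0]
    return [body for _, body in rows]

writer.execute("INSERT INTO events (body) VALUES (?)", ("hello",))
writer.commit()
print(poll())   # picks up the new row -- but only because we happened to poll
```

The poll interval becomes a latency/CPU trade-off, which is exactly the suboptimality the comment describes; a blocking XREAD on a Redis stream has no such trade-off.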

Which is topical, because watching for updates is a core feature of Redis streams (and Redis already had pub/sub channels before that). For that use case, SQLite is too little, even if your needs are otherwise quite basic.

Unfortunately, this difference in capabilities seems to be partly a result of limitations in the underlying OS APIs. SQLite uses POSIX advisory locks to lock ranges of the database file, but I don’t think there’s any similar API that provides an event or semaphore associated with a given file, instead of a lock.

There are plenty of messaging APIs that aren’t associated with an arbitrary file – there are semaphores, message queues, and shared memory objects, in fact two sets of APIs for each of those (SysV and POSIX), plus signals, etc. But those all have their own namespaces, and if the two processes trying to synchronize with each other are in different containers, they might not share those namespaces.

There are Unix sockets – those are a decent option, but they require one process to set itself up as the server, which is a bit weird in the SQLite model where all the processes are on an equal footing, and any may quit at any time. They also don’t work over NFS (whereas locking does, at least sometimes). You can try to mmap a regular file and then treat it as shared memory, but that’s not guaranteed to work in all cases, and again doesn’t work over NFS. I suppose you could try to abuse a lock as a semaphore, but that has its limitations…

But it’s not like many people use SQLite over NFS anyway. Whatever the approach, I’d love to see a “SQLite for notifications”. It would probably be a pretty simple library, but with the needed bells and whistles like bindings to higher level languages. If a library like this exists, I’d be very interested to hear, because a while back I searched for one in vain.

In my case the target application is video games (console and PC), so we are performance- and/or memory-usage-critical.

I've used Redis very successfully on the backend, so maybe I'm just trying to find some reason, any reason, to play with it in player facing code!

When I got started building RediSLQ, I wanted an interprocess, fast data store that supported SQL manipulation.

It may be useful to you as well: RediSLQ.com

Or on GitHub: https://github.com/RedBeardLab/rediSQL

Full disclaimer: I am the author

Did you mean RediSQL.com? Or is that someone else?

Indeed you are right!

Yes https://redisql.com

I’ve used Realm for this very successfully. It is a bit limited in the number of languages it supports (outside of mobile where it seems to support pretty much everything), but it has really nice support for node.js and .net which is where I have used it.

It is pretty cool to be able to share live interconnected objects between processes with full transactional safety.

Embedded is a huge field. It can be everything from aviation, military and medical up to toys.

For some of those, using something like Redis as an existing service might be interesting; for others it will be a no-go.

I worked on automotive infotainment systems in the past, and throwing Redis onto an embedded Linux system there would have been fine if it had fulfilled a particular task in a good fashion. I think I even proposed it once for something.

"basically no gain"

My payloads are lists of int64s. I need to do set operations on those lists before sending the result over the wire. If you advise against embedding Redis, can I instead embed my logic in Redis? As a filter of sorts?

You can run Lua natively in Redis.

I hardly ever comment, but this is a really cool idea. Could you elaborate a little more? (On technical aspects, not political obstacles.)

Complete conjecture, I am not the GP.

Hydrating/deserializing data from Sqlite into types/objects and doing whatever goodness those need, then using Redis to make "updating the database" super fast (in memory after all) and let Redis write it back to Sqlite as there is IO/time/lull in traffic.

Kinda like how Epic Cache does its transaction journal flushing every X minutes?
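A conjectural sketch of that write-back pattern: an in-memory dict plays the role Redis would (updates at memory speed), with dirty keys journaled back to SQLite during a lull. All names here are illustrative.

```python
import sqlite3

# Durable store (SQLite) behind a hot in-memory copy.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

cache = {}      # hot copy, updated at memory speed (the "Redis" role)
dirty = set()   # keys changed since the last flush

def put(k, v):
    cache[k] = v
    dirty.add(k)

def flush():
    """Run on a timer / lull in traffic: write dirty keys back to SQLite."""
    for k in dirty:
        db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (k, cache[k]))
    db.commit()
    dirty.clear()

put("player:1", "alive")
put("player:1", "dead")   # many fast in-memory updates, one eventual write
flush()
print(db.execute("SELECT v FROM kv WHERE k = ?", ("player:1",)).fetchone())
```

Note that repeated updates to the same key coalesce into a single disk write at flush time, which is where the speedup comes from.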

You could have a look at RediSLQ (RediSLQ.com), a Redis module that embeds SQLite, giving crazy fast performance.

It gives you a lot of interesting concepts like "lightweight databases" or pushing query results into streams.

Here is the GitHub repo: https://github.com/RedBeardLab/rediSQL

Full disclaimer, I am the author.

I'm pleased you took the time to send this little targeted advert my way. I will be glad to check out that repo.

Thanks! Any feedback is welcome!

Some silly feedback on typos:

> Carefully tested for correcteness and tuned for performance – never loose a bit.

correcteness -> correctness (slightly ironic :-) )

loose -> lose

> RediSQL inheritanve all the SQLite knobs and capabilities

inheritanve -> inherits

> RediSQL is written in Rust which provides more guarantess agains common bugs

agains -> against

> Only a very minor part of RediSQL is not releases as open source

releases -> released

Just fyi :-) Looks like a really interesting tool :-)

FYI the SSL cert has expired on whoever is hosting your download link (plasso.com)

Thanks! Indeed, you should not have seen that link! Can you point me to where you clicked?

The correct link is the following now: https://payhip.com/RediSQL

Plasso got acquired and shut down...

On mobile there is a "Buy Now" button at the top of the screen that goes to plasso. https://imgur.com/a/XYD9slH

The correct URL is https://redisql.com

It looks like an interesting project - but I'm not sure I understand how it's better than a RAM-backed sqlite instance. It forces/lets you use the Redis protocol to connect rather than embedding?

It allows you to access the same dataset from different processes (also possible with SQLite) and machines.

> RediSLQ

> rediSQL

why different spelling

I threw together a few words here about how we are using Streams combined with Sorted Sets to "upgrade" legacy databases to streams of data. Not revolutionary, but it could be interesting to some people. I can write more, if there's any demand: http://nicois.github.io/posts/databases-to-streams/

I wonder how this compares to streams in Kafka or Kinesis. One of the main advantages of redis is that I see it used in many cases as a replacement for memcache (just a key/value store for bytes/strings) so it already exists in many infrastructures.

I shared my experience sometime back in another HN thread [1]:

"A key difference I observed was that if a Kafka consumer crashes, a rebalance is triggered by Kafka after which the remaining consumers seamlessly start consuming the messages from the last committed offset of the failed consumer.

Whereas with Redis streams I had to write code in my application to periodically poll and claim unacked messages pending for more than some threshold time."

[1] https://news.ycombinator.com/item?id=19231178
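The poll-and-claim logic can be modeled in a few lines. A pending-entries dict stands in for what XPENDING/XCLAIM/XACK operate on in real Redis (this is a sketch of the bookkeeping, not redis-py code):

```python
# Toy model of the pending-entries list (PEL): message id ->
# (consumer, last delivery time). A reaper scans for entries idle
# longer than a threshold and reassigns them, mirroring what
# XPENDING + XCLAIM do in Redis.
pending = {}

def deliver(msg_id, consumer, now):
    pending[msg_id] = (consumer, now)

def ack(msg_id):
    pending.pop(msg_id, None)   # XACK removes the entry

def claim_stale(new_consumer, min_idle, now):
    """XPENDING scan + XCLAIM for entries idle longer than min_idle."""
    claimed = []
    for msg_id, (owner, t) in list(pending.items()):
        if now - t >= min_idle and owner != new_consumer:
            pending[msg_id] = (new_consumer, now)
            claimed.append(msg_id)
    return claimed

deliver("1-1", "worker-a", now=0.0)
deliver("1-2", "worker-a", now=0.0)
ack("1-1")   # worker-a finished one message, then crashed
print(claim_stale("worker-b", min_idle=30.0, now=60.0))
```

The commenter's point is that this reaper loop lives in *your* application, whereas Kafka's rebalance does the equivalent handover for you.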

From my experience, Kafka has the best API for handling read-once, distributed streams. Almost every other streaming solution, like Redis in this case, has a non-ideal or non-existent way to coordinate stream consumers so as to prevent double-reads. And lots of streaming applications need to ensure read-once (think about what a double read ends up as - maybe a twice-sent message, or a duplicate metric), so I'm not sure why they all struggle so much with just copying Kafka's pretty simple consumer API.

How so? Kafka only accepts offsets, which apply to batches of items or even an entire partition. This means that a single item not being processed within a batch requires your own code to compensate. It's the weakest of all the messaging models.

Per-message acknowledgement is an advancement. Redis requires manual lookups for unacknowledged items but you can also use Apache Pulsar for a more scalable distributed disk-based system which itself is a solid evolution over Kafka's design.

Also note that "exactly-once" semantics are actually impossible. Messaging systems are either "at-least-once" or "at-most-once". Kafka has some attempts at using transactions to solve this but that's only when using Kafka streams and only ensures read progress, not the processing result.

I’m not sure what you mean by “only accepts offsets which are meant for batches”, but with Kafka the offsets are per-partition and you have the flexibility to control exactly when an offset is marked as processed. In our systems we always used manual offset committing and would only commit an offset once processing of the message had completed successfully, to ensure both at-most-once delivery and seamless failover.

Offsets are a marker that says everything before it (in that partition) is processed. This creates two issues:

1) Your application must coordinate and make sure that everything up to that offset is indeed processed successfully. 2) Your application must stop if it encounters an error (because it can't commit an offset greater than that item) or handle it separately by logging to another topic, database, etc.

Other systems like Redis, Pulsar, Google PubSub provide per-message acknowledgement to allow items to be individually processed without blocking other forward progress.
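The coupling in issue 1 is easy to illustrate: with an offset model you can only commit the highest *contiguous* processed position, so one straggler holds back the committable offset for the whole partition. A small sketch:

```python
# Offset model: the committable offset is the largest position such that
# every item before it has been processed.
def committable_offset(processed, start=0):
    off = start
    while off in processed:
        off += 1
    return off

# Offsets 0, 1, 3 and 4 are done, but offset 2 failed or is still running:
processed = {0, 1, 3, 4}
print(committable_offset(processed))   # stuck at 2, despite 3 and 4 being done

# Per-message acknowledgement (Redis/Pulsar/PubSub style) has no such
# coupling: each of 0, 1, 3 and 4 can be acked individually.
```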

Ah, I see what you’re saying now. #2 was always a pain to deal with, but I think other systems have similar problems. Other messaging systems deal with this with things like dead letter queues etc, but no matter what you use for message processing you will need some specialized logic to handle records which can’t be processed normally. In Kafka, you can raise an exception for the offset and then move on. When dealing with the exception, you can seek directly to the record offset and take it from there.

For #1, any application which has an in-order requirement would suffer from this problem. I worked with event processing systems so we never really had to worry about this, since each event was independent. However, there were instances where we would need to track state for certain objects getting processed to make sure all of their child objects were also processed. For this we would use an external store with a short TTL since the lifetime of the object during processing would only be a few minutes.

All in all it just comes down to what your app’s requirements are. I don’t think Kafka is meant to replace every pub/sub service out there, but it definitely has some great use cases.

That doesn’t sound like “at-most-once”. What if your consumer crashed after processing the message but before committing the offset?

This is a fair thing to point out. For us the “at-most-once” guarantee was based on committing. Luckily our processing was idempotent so in the rare case of the above scenario it wouldn’t cause any duplication

Streams are kinda cool, but they have a distinctly different feel than the other data types in Redis. They've got this invisible statefulness: last IDs, consumer group state, etc. I've tried implementing a couple of little things with streams, and of course it's not necessary to use the consumer group stuff. I wonder why streams weren't made using the modules API, though? They seem just weird/different enough to warrant exclusion from the core data types, in my thinking. Anyway, just reading the title referring to streams as "pure" made me go wtf, because there's a lot of hidden state in there.

Pure means that when you don't use consumer groups, there is no hidden state at all, and they are just a boring data structure like everything else in Redis. Only if you use the messaging part do they have state, but that is an accessory part, like a shell on top of what is otherwise exactly a vanilla data structure.
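Assuming the XADD/XRANGE semantics described in the article, the "pure data structure" view can be modeled as an append-only list read by ID range (a toy model, not real Redis code):

```python
# Without consumer groups, a stream is just an append-only sequence of
# (id, fields) entries queried by id range: no cursors, no server-side
# per-reader state. IDs are (milliseconds, sequence) pairs.
stream = []
counter = 0

def xadd(fields, ms):
    """Append with an auto-generated, monotonically increasing id."""
    global counter
    counter += 1
    entry_id = (ms, counter)
    stream.append((entry_id, fields))
    return entry_id

def xrange(start_ms, end_ms):
    """A stateless range query over the structure, like XRANGE by time."""
    return [(i, f) for i, f in stream if start_ms <= i[0] <= end_ms]

xadd({"winner": "Serena"}, ms=100)
xadd({"winner": "Venus"}, ms=200)
print(xrange(0, 150))
```

Every reader runs the same stateless range query; the "hidden state" only appears once consumer groups start tracking last-delivered IDs and pending entries per consumer.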

Can you read from a stream that doesn't exist yet?

Streams are great! I've written a small library for Node which attempts to wrap some of the complexity (particularly for handling multiple connections for UNBLOCK calls, etc).

So far I haven't used it outside of hobby projects for webGL games and such, but it's worked brilliantly, and no Kafka required for hobby async-streaming infrastructure!

Hopefully it's useful to someone out there! https://github.com/erulabs/redis-streams-aggregator
