
Redis streams as a pure data structure - itamarhaber
http://antirez.com/news/128
======
apeace
I get that the tennis match use-case is meant to be trivial and an example,
but I don't buy it.

> Before Streams we needed to create a sorted set scored by time: the sorted
> set element would be the ID of the match, living in a different key as a
> Hash value.

I think the sorted set would be a much better choice, because then you could
still insert items in the past, like when that admin remembers there was a
tennis match last week he never recorded. Same goes for modifying past values,
or deleting values. These operations are trivial using a sorted set & hash,
not so using streams.
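For readers who haven't used the pattern, here is a minimal in-process sketch of the sorted-set-plus-hash approach (plain Python structures stand in for a real Redis server; the class and field names are made up for illustration):

```python
import bisect

# In-process stand-in for the "sorted set scored by time + hash per match"
# pattern (illustrative only; against a real Redis server this would be
# ZADD/ZRANGEBYSCORE plus HSET/HGETALL).
class MatchIndex:
    def __init__(self):
        self._zset = []    # sorted list of (timestamp, match_id)
        self._hashes = {}  # match_id -> dict of match fields

    def add_match(self, ts, match_id, fields):
        # ZADD matches <ts> <match_id>; HSET match:<id> field value ...
        bisect.insort(self._zset, (ts, match_id))
        self._hashes[match_id] = dict(fields)

    def matches_between(self, start, end):
        # ZRANGEBYSCORE matches <start> <end>, then one HGETALL per ID
        lo = bisect.bisect_left(self._zset, (start, ""))
        hi = bisect.bisect_right(self._zset, (end, "\xff"))
        return [self._hashes[mid] for _, mid in self._zset[lo:hi]]

idx = MatchIndex()
idx.add_match(1000, "m1", {"players": "A vs B", "winner": "A"})
idx.add_match(2000, "m2", {"players": "C vs D", "winner": "D"})
# Inserting "in the past" is trivial with a sorted set:
idx.add_match(500, "m0", {"players": "E vs F", "winner": "E"})
```

Modifying or deleting a past match is just a hash update or a `ZREM` plus `DEL`, which is the flexibility the comment is pointing at.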

I'm excited for streams and I'm glad Antirez is taking time to blog and
evangelize, but this article didn't convince me there's a compelling use-case
for streams aside from the Kafka-like use-case.

~~~
antirez
We are going to have an XADD option to insert elements in the middle; I
commented more extensively about it in another reply, so inserting out of
order will be possible later. Note, however, that the pattern still works if
you use the time as a field, as long as you don't need range queries and just
want single-item identifiers. Either way, the XADD option to insert out of
order is really a thing that will hit Redis ASAP.

~~~
nicpottier
Excellent to hear this.

We use sorted sets as queues heavily, and this would be a prerequisite for us
to consider giving streams a go, which would indeed be interesting from a
memory-savings standpoint (we sometimes have millions of items in our queues
for a short time). Sometimes, say on error conditions, you want to stuff
something back at the start of the queue (because the order of processing
matters) instead of at the end, as one example; priority being another.

------
drewda
Just like it's useful to have both SQLite and Postgres available for smaller
and larger data projects (and Spatialite and PostGIS for smaller/larger geo-
data projects), it could be great to have Redis and Kafka for smaller and
larger pipeline projects.

Does anyone have good patterns for joining across entries from two or more
Redis streams? This is one of the most interesting aspects of
Kafka/Flink/Spark/Storm/etc. Would be useful to be able to develop with
streaming joins in Redis playgrounds.
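Lacking a built-in answer, the simplest starting point is to merge entries from two or more streams into one time-ordered sequence by their IDs and correlate in application code. A purely illustrative stdlib sketch (nothing here is a Redis or Flink API; real engines add windows, watermarks, and managed state):

```python
import heapq

# Two already-fetched batches of stream entries, as (timestamp, fields)
# pairs; the data is made up for illustration.
stream_a = [(1000, {"sensor": "a", "v": 1}), (3000, {"sensor": "a", "v": 3})]
stream_b = [(2000, {"sensor": "b", "v": 2})]

# Merge by timestamp; a join would then pair up adjacent entries that fall
# within some time window.
merged = list(heapq.merge(stream_a, stream_b, key=lambda e: e[0]))
```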

------
skybrian
This seems pretty simple when events are logged as they happen with little or
no latency and you can let the stream set the timestamp. I wonder, though,
about the case where events may be buffered, perhaps due to an unreliable
network? The time that the event occurred might be significantly earlier than
the time it's inserted, and furthermore events are arriving out of order. It
seems like things get much more complicated?

Let's say tennis games are recorded on a piece of paper and entered into the
computer later. What is different?

~~~
antirez
Two solutions: 1. add a timestamp as a field and just use the ID, but in that
case range queries are going to be a problem. 2. exactly because of what you
stated, XADD will soon have a special argument to say: I'm going to insert an
element in the middle; this is the time in milliseconds (find the counter
part for me if I did not specify one). It could be confusing for streaming,
but as a data structure, inserting in the middle is spot-on and there is
nothing preventing it.
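For context: Redis stream IDs have the form `<milliseconds>-<sequence>`, and auto-generated IDs must be monotonically increasing, which is exactly why plain XADD can't write into the past. A rough sketch of that ID rule (illustrative, not the real implementation):

```python
import time

# Sketch of Redis stream ID auto-generation. IDs are (ms, seq) pairs and
# must only move forward, so an entry with an earlier time can never be
# appended; the XADD option discussed above would relax this.
def next_stream_id(last_id, now_ms=None):
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    last_ms, last_seq = last_id
    if now_ms > last_ms:
        return (now_ms, 0)
    # Clock did not advance past the last entry: bump the sequence part.
    return (last_ms, last_seq + 1)
```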

------
nicois
I threw together a few words here about how we are using Streams combined with
Sorted Sets to "upgrade" legacy databases to streams of data. Not
revolutionary, but it could be interesting to some people. I can write more,
if there's any demand: [http://nicois.github.io/posts/databases-to-
streams/](http://nicois.github.io/posts/databases-to-streams/)

------
_pmf_
I wish we could standardize on using Redis as general interprocess
transactional memory. I could drop 95% of our application code for our
Embedded Linux platform by using stock Redis and stock SQLite, but of course
there are political obstacles.

~~~
cordite
Is this embedded in the same process, or just within the same unit?

Aside: would an embeddable redis be a useful thing for apps and other isolated
devices?

~~~
antirez
There is basically no gain in practical terms in running Redis as an embedded
library in embedded contexts; at this point I think I'm able to summarize the
key reasons.

1. Embedded systems are often used in environments where you need very
resilient software. Crashing the DB because there is a bug in your app is
usually a bad idea.

2. As a variation of "1", it's good to have different modules as different
processes, with Redis working as glue (a message bus) between them. So again,
all should talk to Redis via a unix socket or alike.

3. Latency is usually very acceptable even for the most demanding
applications; when it is not, a common pattern to solve the problem is to
write to a buffer from within the embedded process, which a different thread
moves to Redis. Anyway, if you have Redis latencies of any kind, you don't
want to block your embedded app's main thread.

4. Redis persistence is not compatible with that approach.

5. Many tried such projects (embedded Redis forks or reimplementations) and
nobody cared. There must be a reason.

~~~
midnightclubbed
Having an in-memory datastore that is compact and supports fast queries and
flexible data types is very useful.

I use sqlite for this purpose, essentially as an in-memory cache of data
populated from disk and incoming server packets. Having redis as an option to
replace mysql (or at least to compare memory use and speed) would be great.

I looked for an embedded Redis fork and came up blank, do you have links? I
found Vedis, but I would rather have something built off of the Redis code
than a re-implementation.

~~~
antirez
Sorry, I don't have links since I did not track such forks in the past.
However, I have a question: for your use case, isn't it an option to have a
library that _looks like Redis_ from the POV of the API, but actually stores
objects in memory as data structures native to your programming language?
This way the API acts as a mental proxy for the Redis DSL and for the time
complexity you expect from given operations, but you are just writing to
local objects.
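A toy sketch of that idea, assuming Python: an object exposing a handful of Redis-shaped methods backed by native local structures (the command subset and semantics shown are simplified and made up for illustration; this is not redis-py):

```python
from collections import defaultdict

# Local, serverless object with a Redis-like surface. You get the familiar
# command names and complexity expectations, but reads and writes are just
# native dict/list operations in the host process.
class LocalRedis:
    def __init__(self):
        self._strings = {}
        self._hashes = defaultdict(dict)
        self._lists = defaultdict(list)

    def set(self, key, value): self._strings[key] = value
    def get(self, key): return self._strings.get(key)
    def hset(self, key, field, value): self._hashes[key][field] = value
    def hget(self, key, field): return self._hashes[key].get(field)
    def rpush(self, key, *values): self._lists[key].extend(values)
    def lrange(self, key, start, stop):
        # Redis-style inclusive stop; only -1 ("to the end") handled here.
        l = self._lists.get(key, [])
        return l[start:] if stop == -1 else l[start:stop + 1]

r = LocalRedis()
r.hset("match:1", "winner", "A")
r.rpush("recent", "match:1", "match:2")
```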

~~~
e12e
It sounds to me like there might be some area where sqlite is "too much" but
lightningdb/berkeleydb/tokyo cabinet is "too little".

I would be surprised if "actual Redis" was ever the right answer to "sqlite is
too much".

But I do wonder if there are some lessons to take from Redis api and wrap
something like lmdb/bdb etc.

I'm not familiar enough with Redis to know when/if this would make sense over
just using sqlite, though.

~~~
comex
One limitation of SQLite is that it doesn’t support any kind of “notify me
when some other process does X” operation. (If you Google it, you’ll find
sqlite3_update_hook, but that only works for updates performed by the same
process.) If you want to use SQLite as an event queue, you can have one
process writing rows to a table and another process reading them, but you need
some external signal to tell the second process “wake up, there’s new stuff in
the queue”. Or you can have it poll on a timer, but that’s suboptimal in many
different ways.
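A minimal sketch of that queue pattern using the stdlib `sqlite3` module (the table name and schema are illustrative; an in-memory database stands in for the file both processes would share, and, as described, the reader has no choice but to poll):

```python
import sqlite3

# Writer appends rows; reader polls for rows past the last id it has seen.
# The missing piece, as the comment notes, is any cross-process "wake up"
# signal, so a real reader has to poll on a timer.
conn = sqlite3.connect(":memory:")  # in practice: a file both processes open
conn.execute(
    "CREATE TABLE IF NOT EXISTS events "
    "(id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)

def enqueue(payload):
    with conn:  # commit so other connections can see the row
        conn.execute("INSERT INTO events (payload) VALUES (?)", (payload,))

def poll_new(last_seen_id):
    # Caller remembers the highest id it has processed and passes it back.
    return conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()

enqueue("a")
enqueue("b")
rows = poll_new(0)
```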

Which is topical, because watching for updates is a core feature of Redis
streams (and Redis already had pub/sub channels before that). For that use
case, SQLite is too little, even if your needs are otherwise quite basic.

Unfortunately, this difference in capabilities seems to be partly a result of
limitations in the underlying OS APIs. SQLite uses POSIX advisory locks to
lock ranges of the database file, but I don’t think there’s any similar API
that provides an event or semaphore associated with a given file, instead of a
lock. There are plenty of messaging APIs that _aren’t_ associated with an
arbitrary file – there are semaphores, message queues, and shared memory
objects, in fact two sets of APIs for each of those (SysV and POSIX), plus
signals, etc. But those all have their own namespaces, and if the two
processes trying to synchronize with each other are in different containers,
they might not share those namespaces. There are Unix sockets – those are a
decent option, but they require one process to set itself up as the server,
which is a bit weird in the SQLite model where all the processes are on an
equal footing, and any may quit at any time. They also don’t work over NFS
(whereas locking does, at least sometimes). You can try to mmap a regular file
and then treat it as shared memory, but that’s not guaranteed to work in all
cases, and again doesn’t work over NFS. I suppose you could try to abuse a
lock as a semaphore, but that has its limitations…

But it’s not like many people use SQLite over NFS anyway. Whatever the
approach, I’d love to see a “SQLite for notifications”. It would probably be a
pretty simple library, but with the needed bells and whistles like bindings to
higher level languages. If a library like this exists, I’d be very interested
to hear, because a while back I searched for one in vain.

------
skrebbel
Did anyone yet use Redis streams to store actual logs? Like server logs,
application logs, etc.

I understand that Elasticsearch is a common place to put logs, also because I
assume that searching through logs is a common use case, but I wonder whether
Redis has particular benefits for this use case. The data structure seems
particularly tailored to it (but not so much to searching I guess).

~~~
antoncohen
Log volume can easily exceed reasonable memory sizes. Even a small company can
generate TBs of logs each month. Having a single box with TBs of memory
wouldn't be desirable.

For logs without full indexing, Loki
([https://github.com/grafana/loki](https://github.com/grafana/loki)) is a
recent entry into the space, and it is probably a good option to look at. It
indexes metadata (labels), so it allows searching by labels but not full text.
It is also supposed to be horizontally-scalable, which is probably something
you want in a log storage solution.

------
coleifer
Streams are kinda cool, but they have a distinctly different feel than the
other data types in Redis. They've got this invisible statefulness: last IDs,
consumer group state, etc. I've tried implementing a couple of little things
with streams, and of course it's not necessary to use the consumer group
stuff. I wonder why streams weren't made using the modules API, though? They
seem just weird/different enough to warrant exclusion from the core data
types, in my thinking. Anyways, just reading the title referring to streams
as "pure" made me go wtf, because there's a lot of hidden state in there.

~~~
antirez
Pure means that when you don't use consumer groups, there is no hidden state
at all, and they are just a boring data structure like everything else in
Redis. Only if you use the messaging part they have state, but this is an
accessory part like a shell on top of what is otherwise exactly a vanilla data
structure.

~~~
coleifer
Can you read from a stream that doesn't exist yet?

------
reggieband
I wonder how this compares to streams in Kafka or Kinesis. One of the main
advantages of redis is that I see it used in many cases as a replacement for
memcache (just a key/value store for bytes/strings) so it already exists in
many infrastructures.

~~~
rainhacker
I shared my experience sometime back in another HN thread [1]:

"A key difference I observed was that if a Kafka consumer crashes, a rebalance
is triggered by Kafka after which the remaining consumers seamlessly start
consuming the messages from the last committed offset of the failed consumer.

Whereas with Redis streams I had to write code in my application to
periodically poll and claim unacked messages pending for more than some
threshold time."

[1]
[https://news.ycombinator.com/item?id=19231178](https://news.ycombinator.com/item?id=19231178)
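That "poll and claim" loop can be sketched in-process like so (plain dicts stand in for the server; XPENDING/XCLAIM are the real Redis command names, but the helper itself and its data layout are illustrative):

```python
# With Redis streams, the application itself must scan the pending-entries
# list (XPENDING) and re-claim (XCLAIM) entries whose idle time exceeds a
# threshold, e.g. because the consumer that received them crashed.
def claim_stale(pending, now_ms, min_idle_ms, new_consumer):
    """pending: {msg_id: {"consumer": str, "delivered_at": ms}}"""
    claimed = []
    for msg_id, entry in pending.items():
        if now_ms - entry["delivered_at"] >= min_idle_ms:
            entry["consumer"] = new_consumer  # XCLAIM transfers ownership
            entry["delivered_at"] = now_ms    # idle timer restarts
            claimed.append(msg_id)
    return claimed

pending = {
    "1-0": {"consumer": "c1", "delivered_at": 0},      # c1 crashed
    "2-0": {"consumer": "c2", "delivered_at": 9_500},  # still being worked
}
claimed = claim_stale(pending, now_ms=10_000, min_idle_ms=5_000,
                      new_consumer="c3")
```

Newer Redis versions fold the scan-and-claim pair into a single `XAUTOCLAIM` command, but the application still has to drive it, unlike Kafka's automatic rebalance.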

~~~
opportune
From my experience, Kafka has the best API for handling read-once,
distributed streams. Almost every other streaming solution, like Redis in
this case, has a non-ideal or non-existent way to coordinate stream consumers
so as to prevent double-reads. And lots of streaming applications need to
ensure read-once semantics (think about what a double read ends up as: maybe
a twice-sent message, or a duplicated metric), so I'm not sure why they all
struggle so much with just copying Kafka's pretty simple consumer API.

~~~
manigandham
How so? Kafka only accepts offsets which are meant for batches of items or
even an entire partition. This means a single item not being processed within
that batch requires your own code to compensate. It's the weakest of all the
messaging models.

Per-message acknowledgement is an advancement. Redis requires manual lookups
for unacknowledged items but you can also use Apache Pulsar for a more
scalable distributed disk-based system which itself is a solid evolution over
Kafka's design.

Also note that "exactly-once" semantics are actually impossible. Messaging
systems are either "at-least-once" or "at-most-once". Kafka has some attempts
at using transactions to solve this but that's only when using Kafka streams
and only ensures read progress, not the processing result.

~~~
aarbor989
I’m not sure what you mean by “only accepts offsets which are meant for
batches”, but with Kafka the offsets are per-partition and you have the
flexibility to control exactly when an offset is marked as processed. In our
systems we always used manual offset committing and would only commit an
offset once processing of the message had completed successfully, to ensure
both at-least-once delivery and seamless failover.

~~~
manigandham
Offsets are a marker that says everything before it (in that partition) is
processed. This creates two issues:

1) Your application must coordinate and make sure that everything up to that
offset is indeed processed successfully.

2) Your application must stop if it encounters an error (because it can't
commit an offset greater than that item), or handle it separately by logging
to another topic, a database, etc.

Other systems like Redis, Pulsar, and Google Pub/Sub provide per-message
acknowledgement, allowing items to be individually processed without blocking
other forward progress.
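Issue (1) can be made concrete with a small sketch: given the set of offsets an application has finished individually, the highest committable offset is the end of the contiguous processed prefix (an illustrative helper, not any Kafka client API):

```python
# A Kafka commit means "everything before this offset is done", so when
# items complete out of order the application must track the contiguous
# prefix itself. Per-message-ack systems avoid this bookkeeping.
def committable_offset(committed, processed):
    """committed: next offset the broker already has; processed: done set."""
    next_offset = committed
    while next_offset in processed:  # advance over the contiguous prefix
        next_offset += 1
    return next_offset

done = {0, 1, 2, 4, 5}  # offset 3 failed or is still in flight
# Only offsets 0-2 can be committed, even though 4 and 5 are finished.
```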

~~~
aarbor989
Ah, I see what you’re saying now. #2 was always a pain to deal with, but I
think other systems have similar problems. Other messaging systems handle it
with things like dead-letter queues, but no matter what you use for message
processing you will need some specialized logic to handle records which can’t
be processed normally. In Kafka, you can raise an exception for the offset
and then move on; when dealing with the exception, you can seek directly to
the record’s offset and take it from there.

For #1, any application which has an in-order requirement would suffer from
this problem. I worked with event processing systems so we never really had to
worry about this, since each event was independent. However, there were
instances where we would need to track state for certain objects getting
processed to make sure all of their child objects were also processed. For
this we would use an external store with a short TTL since the lifetime of the
object during processing would only be a few minutes.

All-in-all it just comes down to what your app’s requirements are. I don’t
think Kafka is meant to replace every pub sub service out there, but
definitely has some great use cases.

------
erulabs
Streams are great! I've written a small library for Node which attempts to
wrap some of the complexity (particularly for handling multiple connections
for UNBLOCK calls, etc).

So far I haven't used it outside of hobby projects for webGL games and such,
but it's worked brilliantly, and no Kafka required for hobby async-streaming
infrastructure!

Hopefully it's useful to someone out there! [https://github.com/erulabs/redis-
streams-aggregator](https://github.com/erulabs/redis-streams-aggregator)

