
Roshi: a CRDT system for timestamped events - sagichmal
https://developers.soundcloud.com/blog/roshi-a-crdt-system-for-timestamped-events
======
boothead
> The tl;dr on CRDTs is that by constraining your operations to only those
> which are associative, commutative, and idempotent.

Interesting. We have an event stream database implemented in Haskell, and this
looks like an excellent way to index it. Especially associative, commutative
and idempotent can (probably) all be encoded in the type system!

~~~
rdtsc
Interesting, I have seen CRDTs mostly in the Erlang context (monitoring and
following works Basho has been doing for Riak).

Basically a lot of this:

[http://basho.com/tag/crdt](http://basho.com/tag/crdt)

But using Haskell's type system is an interesting idea.

You can start with one level higher than just looking for associative,
commutative and idempotent, because that actually comes the definition of a
semilattice

[http://en.wikipedia.org/wiki/Semilattice](http://en.wikipedia.org/wiki/Semilattice)

The crucial part is that you have an appropriate meet or join operator (and
that it satisfies the above requirements).

Informally think of a functions like max() over natural numbers or a union()
over sets. Those are some examples.

I don't know Haskell but found this interesting gist of someone who has tried
this, and frankly a lot of stuff there is above my head, but it might help
you:

[https://gist.github.com/pozorvlak/6540639](https://gist.github.com/pozorvlak/6540639)

~~~
boothead
That is interesting (but also sadly over my head too).

I think some of the theory there is starting to get into operational
transformation territory [1]. If that's the case then a really interesting
application might be applying the same semantics indexing a stream of events
to expressing events as a diff against state and propagating them to
clients...

[1]
[http://en.wikipedia.org/wiki/Operational_transformation](http://en.wikipedia.org/wiki/Operational_transformation)

~~~
tel
This is definitely aligned with OT. OT typically comes in two steps—merge and
simplify. Merge takes two divergent histories and puts them together into a
collective view. Simplify takes that collective view and reduces it to a
single "consistent" view according to domain rules.

With CRDTs the merge operation is made such that the simplify operation is id.

It's also interesting to note that the merge operation is just a pullback in
the appropriate category.

------
peterwwillis
I don't know why, but this post gave me warm and fuzzy feelings all over. It's
like they actually thought out a novel solution to a problem, not just picking
up a trending tool in the startup dev sphere. ( And their README looks like it
was written by an experienced ops engineer or open source dev! How retro :3 )
I wonder if SoundCloud is hiring.

~~~
sagichmal
Good news, friend! [http://soundcloud.com/jobs](http://soundcloud.com/jobs) :)

------
tlarkworthy
So I am a little unsure about where all the data is. Am I right in thinking
that a user has a cached merge of all their follower's events. So when they
first read they get an inconsistent view which can be populated in the
background on demand. The massive data gains being: most people don't read all
their data so you waste a lot of time on fan-out writing data that is never
used.

The read view just has to be caught up from when it was last accessed. You
don't do this all the time, only when a user requests their timeline. So still
an individual CDRT set still has to be persisted for each individual user a
bit like an out-of-date inbox. You don't create the whole inbox from scratch
each request or do you _?

_ maybe you do because you only need recent events??

\---------------- EDIT.

I think I worked it out ... the inboxes are created dynamically. This is what
was meant by stateless. All the time series data is able to fit into one
server's memory, so the IO overhead of assemblage is low

~~~
sagichmal
You got it on your edit. The inbox is generated dynamically from multiple
outboxes. The time series data, however, is _not_ able to fit in one server's
memory; it's sharded over multiple non-communicating Redis instances,
collectively known as a pool. And multiple pools (complete replica sets)
constitute the system.

~~~
tlarkworthy
Ahhhhhhh. yes yes. Nice.

Weirdly enough I am implementing a CRDT list today (not a set). The TreeDoc
example in the main paper has a problem I think, if two writes with same key
go in, the disambiguator keeps an ordering. But if that occurs, you now can't
insert between those two edits because the disambiguators are not dense.

The set stuff looks good though.

------
dangduc
I'm trying to understand CRDTs, and Roshi provides an implementation of how
CRDTs can be put to use in real-world applications. I've assess the
documentation on Roshi and would like assent or clarification if I'm wrong.

After glossing over the readme of github.com/soundcloud/roshi I came away with
the impression that Roshi's LWW-element-set implementation does not strictly
adhere to the qualities of a CRDT, mainly that operations must be commutative.

Roshi documents two uses-cases:

If first we apply an add operation to the set, then apply a remove operation
with the same tuple, the resulting set will ignore the remove operation:
A(a,1) R() + remove(a,1) = A(a,1) R()

If we apply the remove operation, then next the add operation, the resulting
set will ignore the add operation: A() R(a,1) + add(a,1) = A() R(a,1)

This means the resulting state of the set depends on the order of operations,
which violates the commutative property of CRDTs.

As a consequence, if we assume this CRDT is replicated on each redis node in a
Roshi cluster, then there is a case (admittedly rare and short-lived) where a
Roshi cluster cannot be eventually consistent:

Given no more future operations on a set with key K, if, node 1 contains a key
K with set A(a,1) R(), and node 2 contains a key K with set A() R(a,1), then
the system cannot be eventually consistent.

Roshi seems to have implemented a weak form of CRDT, which under rare and
short-lived situations impedes eventual consistent. But given the operational
realities, this is totally acceptable.

------
e12e
Great stuff, but:

> Of course, reads are difficult. If you follow thousands of users, making
> thousands of simultaneous reads, time-sorting, merging, and cutting within a
> typical request-response deadline isn't trivial.

Even within a "typical request-response deadline" blasting through some
thousands of "events" does sound kind of trivial? (Which I suppose is
reflected in the relative simplicity and elegance of this solution :-)

Now, if you absolutely need the inboxes of user A and user B who follow the
same thousand users to end up _identical_ if they are accessed in a similar
time-frame, then things do indeed sound a bit hairy -- but do you really need
that? (Or even demand that both A and B see X re-sharing track N, before user
Y re-shared track M -- if the sharing events are within the threshold of
clock-skew internal to your systems)?

I suppose it's great to be able to say that these guarantees will hold, but
I'm not sure I see them as needed for (all of) soundcloud's events?

~~~
sagichmal

        > Even within a "typical request-response deadline" 
        > blasting through some thousands of "events" does sound 
        > kind of trivial?
    

If they were all on the same physical machine, sure. But the events are spread
across multiple nodes, so a single stream read can potentially touch every
Redis instance.

    
    
        > I suppose it's great to be able to say that these 
        > guarantees will hold, but I'm not sure I see them as 
        > needed for (all of) soundcloud's events?
    

I'm all about cheating, if it nets me a big win. How would you cheat, here?
Honest question; it's not apparent to me what corners we could cut, given the
whole point, more-or-less, is to avoid pre-materializing views.

~~~
peterwwillis
Why do you want to avoid pre-materializing views? Storage issues? Concerns
about not-live-enough information? Both?

~~~
erichocean
Both, due to the Power Law property/problem inherent in natural networks (of
which social networks are an obvious example).

For example, (I'm told that) Twitter has effectively two separate systems for
distributing updates for users, one for normal users (you and I), and one for
celebrities. Propagating Lady Gaga's tweets to all 30 million followers can
take something like five minutes to accomplish. It's even more fun when these
celebrities start tweeting back and forth with each other.

Maybe SoundCloud's CRDT approach would work better for natural networks? At
some point, network communication could start to be an issue with a compile-
the-list-at-read design, but I could be wrong.

I personally don't see why a CRDT is necessary if you used something like
Twitter's snowflake approach to assign ids. Pull in events is sorted order
from each machine, merge, then push to the user. It's not clear to me what
adding the CRDT abstraction adds to that besides overhead. Maybe there are
other benefits I'm not grasping?

~~~
sagichmal

        > At some point, network communication could start to be 
        > an issue with a compile-the-list-at-read design, but I 
        > could be wrong.
    

You're absolutely right. Optimizing for that bottleneck occupied most of the
development time.

------
encoderer
Something I learned working in engineering at Formspring (which also used
Cassandra to solve a similar problem) was that fan out is expensive (both in
terms of necessary horizontally scaled infrastructure AND in terms of working
with the complexity.)

------
malandrew
For those interested in CRDT and using NodeJS, check out:

[https://github.com/dominictarr/scuttlebutt](https://github.com/dominictarr/scuttlebutt)

[https://github.com/dominictarr/crdt](https://github.com/dominictarr/crdt)

------
ecesena
Great work! Any chance to see it extended to non-time-series events, like an
HN over Roshi?

At a very superficial level it seems to me that you can't implement a reorder
operation within the model, but perhaps you can circumvent the problem with
some tricks...

~~~
sagichmal
You're right: it's strictly coupled to time order. All of the CRDT semantics
rely on that invariant. Arbitrary reordering would turn it into a different
beast indeed...

------
jonahkagan
sagichmal, could you elaborate on "we were able to map a fan-in-on-read stream
product to a data model that could be implemented with a specific type of
CRDT"? The post (which is awesome btw) glosses over exactly how Roshi is used
to solve the problem presented in the first half of the post - or at least it
seems that way to me as I am unfamiliar with CRDTs.

