
The short answer is a write-through cache.

You write the update directly to the cache closest to the user, and also into an eventually consistent queue.

We did this at reddit. When you make a comment, the HTML is rendered and put straight into the cache, and the raw text is put into the queue to go into the database. Same with votes. I suspect they do this client side now, since that's the closest cache to the user, but back then it was the server cache.
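
Roughly, the pattern looks like this (a minimal sketch with generic in-memory stand-ins, not Reddit's actual code):

    // Write-through sketch: rendered HTML goes straight into the cache readers
    // hit, while the raw text is queued for the database. Everything here is a
    // generic stand-in, not Reddit's actual stack.
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;

    public class WriteThroughComments {
      private final Map<String, String> cache = new ConcurrentHashMap<>();         // cache closest to the user
      private final BlockingQueue<String[]> dbQueue = new LinkedBlockingQueue<>(); // eventually consistent queue

      public void postComment(String commentId, String rawText) throws InterruptedException {
        cache.put(commentId, renderHtml(rawText));      // readers see the comment immediately
        dbQueue.put(new String[]{commentId, rawText});  // a worker drains this into the database later
      }

      public String readComment(String commentId) {
        return cache.get(commentId);                    // served from cache, no database round trip
      }

      private String renderHtml(String rawText) {
        return "<p>" + rawText + "</p>";                // placeholder for real markdown rendering
      }
    }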




In his book Big Data, Nathan Marz (the article's author) describes this and calls it the Speed Layer. I haven't fully finished the article yet, but the components it's describing seem to be equivalent to what he calls the Batch Layer and the Serving Layer in his book.

But I'm kind of getting the impression this works without any speed layer and is expected to be fast enough as-is.


Rama codifies and integrates the concepts I described in my book, with the high level model being: indexes = function(data) and query = function(indexes). These correspond to "depots" (data), "ETLs" (functions), "PStates" (indexes), and "queries" (functions).

Rama is not batch-based. That is, PStates are not materialized by recomputing from scratch. They're incrementally updated either with stream or microbatch processing. But PStates can be recomputed from the source data on depots if needed.
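
For concreteness, a minimal module has roughly this shape. This is a sketch adapted from the public word-count demo; the module, depot, and PState names are made up, and the exact class and method names are from memory, so treat them as approximate:

    // Depot -> ETL -> PState in one module; adapted from the public word-count
    // example, so exact names/signatures may differ from the current API.
    import com.rpl.rama.*;
    import com.rpl.rama.module.*;
    import com.rpl.rama.ops.*;

    public class WordCountModule implements RamaModule {
      @Override
      public void define(Setup setup, Topologies topologies) {
        setup.declareDepot("*wordDepot", Depot.random());                      // depot: the source data

        StreamTopology s = topologies.stream("wordCount");                     // ETL: indexes = function(data)
        s.pstate("$$wordCounts", PState.mapSchema(String.class, Long.class));  // PState: the index

        s.source("*wordDepot").out("*word")
         .hashPartition("*word")
         .compoundAgg("$$wordCounts", CompoundAgg.map("*word", Agg.count()));
      }
    }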


So the idea is that you could do:

1. send event data to depot

2. trigger localized ETLs (or put it high-priority in queue) to recalculate just the impacted data into relevant PStates

3. await completion of aforementioned ETLs

4. run query from updated PStates

Maybe too heavy for an upvote, but very appropriate for an important transaction like a purchase.
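
Something like this from the client side, if I'm reading it right. ShopModule, *purchaseDepot, and $$orderStatus are invented names, and the client class names are from memory of the docs, so they may not match the current API exactly:

    // Hypothetical client-side view of steps 1-4 above. Module/depot/PState
    // names are invented; client class names are from memory and approximate.
    import com.rpl.rama.*;
    import com.rpl.rama.path.*;
    import java.util.Arrays;
    import java.util.List;

    public class PurchaseFlow {
      public static void main(String[] args) {
        RamaClusterManager manager = RamaClusterManager.open();
        Depot purchases = manager.clusterDepot("com.example.ShopModule", "*purchaseDepot");
        PState orders   = manager.clusterPState("com.example.ShopModule", "$$orderStatus");

        // Steps 1-3: by default the append doesn't return until the colocated
        // stream topology has finished updating its PStates.
        List<Object> purchase = Arrays.asList("order-123", "user-9", 4999);
        purchases.append(purchase);

        // Step 4: query the freshly updated PState.
        Object status = orders.selectOne(Path.key("order-123"));
        System.out.println("order-123 -> " + status);
      }
    }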


Forgive me if I’m misunderstanding things, but this seems quite similar to what Materialize and ReadySet do, but like “as a library”, because Rama doesn’t use a “separate” layer for the storage stuff. Is that correct-ish?


This explains so many bugs I came across on Reddit. I guess it works, but man I dislike this implementation.


Rama should bundle a write-through cache! Another in-memory JVM cluster thingamabob (Apache Ignite) used to promote write-through caching as its primary selling point: https://ignite.apache.org/use-cases/in-memory-cache.html

Or, maybe their pitch is that the streaming bits are so fast, you can just await the downstream commit of some write to a depot and it'll be as fast as a normal SQL UPDATE.
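
For reference, Ignite's version of this is basically JCache write-through: flag the cache as write-through and plug in a CacheStore backed by your database. Rough sketch from memory; PostStore's persistence is stubbed out rather than wired to a real database:

    // Write-through via Ignite's JCache integration: puts hit the cache and the
    // backing store synchronously. Sketch from memory; persistence is stubbed.
    import javax.cache.Cache;
    import javax.cache.configuration.FactoryBuilder;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.store.CacheStoreAdapter;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class IgniteWriteThrough {
      public static class PostStore extends CacheStoreAdapter<Long, String> {
        @Override public String load(Long key) { return null; }   // would SELECT from the database
        @Override public void write(Cache.Entry<? extends Long, ? extends String> e) {
          System.out.println("persisting " + e.getKey());         // would INSERT/UPDATE the database
        }
        @Override public void delete(Object key) { }               // would DELETE from the database
      }

      public static void main(String[] args) {
        CacheConfiguration<Long, String> cfg = new CacheConfiguration<>("posts");
        cfg.setCacheStoreFactory(FactoryBuilder.factoryOf(PostStore.class));
        cfg.setWriteThrough(true);   // writes are pushed through to PostStore
        cfg.setReadThrough(true);    // misses are loaded via PostStore

        try (Ignite ignite = Ignition.start()) {
          IgniteCache<Long, String> posts = ignite.getOrCreateCache(cfg);
          posts.put(1L, "<p>hello</p>");  // lands in the cache and the store
        }
      }
    }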


Rama is extremely fast, as you can see for yourself by playing with our Mastodon instance.


It’s fast until it’s not. Making a post and then hitting reload and not seeing it can be very jarring for the user. Definitely something to think about.


What do you mean? Every post I do shows up instantly.

Reloading the page can be slow because Soapbox redoes a lot of work asynchronously from scratch (Soapbox is the open-source Mastodon interface we're using to serve the frontend). https://soapbox.pub/


I think the concern is whether this will still be true if Mastodon reaches Twitter scale.


Rama is scalable. So as your usage grows, you add resources to keep up. Scaling a Rama module is a trivial one-line command at the terminal.

Rama's built-in telemetry provides the information you need to know when it's time to scale.


Is there a way to guarantee reading your own writes from a client perspective?


Yes. Depot appends by default don't return success until colocated streaming topologies have completed processing the data. So this is one way to coordinate the frontend with changes on the backend.

Within an ETL, when the computations you do on PStates are colocated with them, you always read your own writes.
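
For example, inside a topology you can write a PState and read it back in the very next step. This is a rough sketch with made-up module, depot, and PState names, and the dataflow method names are from memory, so take them as approximate:

    // Sketch of read-your-own-writes inside an ETL: the localSelect immediately
    // after the localTransform observes the write, because the PState partition
    // is colocated with the computation. Names/signatures are approximate.
    import com.rpl.rama.*;
    import com.rpl.rama.module.*;
    import com.rpl.rama.ops.*;
    import com.rpl.rama.path.*;

    public class LastSeenModule implements RamaModule {
      @Override
      public void define(Setup setup, Topologies topologies) {
        setup.declareDepot("*pageViews", Depot.random());

        StreamTopology s = topologies.stream("lastSeen");
        s.pstate("$$lastSeen", PState.mapSchema(String.class, String.class));

        s.source("*pageViews").out("*userId")
         .hashPartition("*userId")                                       // move to the partition owning this key
         .localTransform("$$lastSeen", Path.key("*userId").termVal("just now"))
         .localSelect("$$lastSeen", Path.key("*userId")).out("*value")   // colocated read sees the write
         .each(Ops.PRINTLN, "read own write:", "*value");
      }
    }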


It makes sense, but wouldn’t the write be slow? Especially when you have many streaming pipelines.


That's part of designing Rama applications. Acking is only coordinated with colocated stream topologies – stream topologies consuming that depot from another module don't add any latency.

Internally Rama does a lot of dynamic auto-batching for both depot appends and stream ETLs to amortize the cost of things like replication. So additional colocated stream topologies don't necessarily add much cost (though that depends on how complex the topology is, of course).


DynamoDB’s DAX cache espouses the same approach.

I have to say that in my ~12 years as an active Redditor I can't recall a time when I saw any real state issues, even with rapidly changing votes, etc. Bravo!? Now that we're beyond the days of molten servers, its overall reliability in the face of massive spiky traffic is quite a feat.


Really? I see this all the time even now.



