Also, this is an open-source project, not something I want to sell you. Feel free to open a PR/issue about any topics that are not covered in the docs.
Then I fail to understand how it works. How does EventReduce become aware of these "write events"?
>this is an open source project, not something I want to sell you
You made it open source so others can use it, right? They'd better be able to make an informed decision about whether your solution suits their needs.
And yes, you should always test open-source software before you use it. There is no warranty; use it at your own risk.
Anyway, to maintain consistency, you have to limit yourself to a single process for your app: no sharding, load balancing, etc. This is a significant limitation, and it's not obvious. I encourage you to mention it in README.md.
It is a simple algorithm that is implemented as a function with two inputs and one output.
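To make the "two inputs, one output" shape concrete, here is an entirely hypothetical sketch in Python. This is not the library's actual API; it hard-codes one query (all docs with age > 18, ordered by id) purely for illustration:

```python
# Hypothetical sketch of the "two inputs, one output" shape:
#   newResults = eventReduce(oldResults, event)
# NOT the event-reduce library's real API; one fixed query for illustration.

def matches(doc):
    # The query's predicate: age > 18.
    return doc["age"] > 18

def event_reduce(old_results, event):
    # Drop any stale copy of the written document, then re-add it
    # if (after the write) it still matches the query.
    doc = event["doc"]
    kept = [d for d in old_results if d["id"] != doc["id"]]
    if event["op"] != "DELETE" and matches(doc):
        kept.append(doc)
    return sorted(kept, key=lambda d: d["id"])  # the query's ORDER BY

old = [{"id": 1, "age": 30}]
new = event_reduce(old, {"op": "INSERT", "doc": {"id": 2, "age": 25}})
print(new)  # [{'id': 1, 'age': 30}, {'id': 2, 'age': 25}]
```

The real algorithm decides much more cleverly (via a decision tree) whether the event can be merged at all or a fresh query is needed, but the function signature is this simple.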
Some DBs expose an event stream, for example, PG:
Is this intended to be an optimisation on top of localStorage and so on? If so, at least you don't have to worry about multiple writers.
The work in general purpose streaming grew out of the continuous query work in the early 2000s. Marrying truly general purpose streaming with transactional databases is not a solved problem (as opposed to transactional databases providing streams), but the notion of continuously running database queries has been around for a long time.
This algorithm is limited to single table aggregates I think?
And this doesn't even take into account recursive queries.
It's too bad that the EventReduce algorithm doesn't come with a paper; I'd love to compare the two.
If EventReduce needs joins to be delegated to views, then it doesn't solve the really hard parts of view maintenance. Interleaving nested joins and aggregates also wouldn't be possible, because the aggregation and the join live in separate worlds.
And a lot of queries actually have that property.
I guess you could push down all joins into a single megatable and then group-by the living hell out of stuff to get back the structure but that would create an O(insane) cartesian product from your database.
Materialize on the other hand is basically the holy grail of view maintenance.
It supports ANSI SQL, which is a lot, and it builds on the accumulated work on differential dataflow and timely dataflow.
Also it's written in Rust.
My prediction is that in 10 years we'll all be using it in our backends and frontends. You'll stream live updates from your SQL database to your frontend, where it will completely replace React and JS with Rust and a local DB that has the UI as a maintained view.
But a simpler solution might be to have a standard WAL format, and use that as a basis for having replicas of the system of record be a first class citizen in my data center instead of something we all cobble together or pay for.
I've a radically different approach: cast the queries in question as VIEWs, materialize them, use triggers to update the materializations where you can write those triggers easily and the updates are quick, and schedule an update where they're not.
If your RDBMS is good about pushing WHERE constraints into VIEWs, and depending on how complex a VIEW's query is, you might be able to make the update automatic: just query the materialization's underlying VIEW with the appropriate WHERE constraints derived from the rows being INSERTed/UPDATEd/DELETEd. You can tell which VIEWs might be suitable for this by checking that the table whose row the trigger is running for is a "top-level" table source of the VIEW's query: meaning a table source that is either the left side of a top-level LEFT JOIN, or either side of an INNER JOIN. If you can run a query on the VIEW with a timeout, then you can just do that in the trigger and mark the materialization as needing an update if the query is too slow. Lastly, a scheduled or NOTIFYed job can run to perform any slower updates to a materialization.
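A minimal sketch of this trigger-maintained-materialization pattern, using SQLite for portability (all table, view, and trigger names here are made up; a real setup would also need UPDATE/DELETE triggers plus the timeout / mark-as-stale logic described above):

```python
import sqlite3

# Sketch of the approach above: a VIEW, its materialization table, and a
# trigger that re-queries the VIEW with a WHERE constraint taken from the
# written row, touching only the affected part of the materialization.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL);

-- The query we want kept up to date, expressed as a VIEW.
CREATE VIEW open_orders_v AS
  SELECT id, amount FROM orders WHERE status = 'open';

-- Its materialization.
CREATE TABLE open_orders_mat (id INTEGER PRIMARY KEY, amount REAL);

-- Trigger for the INSERT case only (UPDATE/DELETE omitted for brevity).
CREATE TRIGGER orders_ins AFTER INSERT ON orders
BEGIN
  DELETE FROM open_orders_mat WHERE id = NEW.id;
  INSERT INTO open_orders_mat
    SELECT id, amount FROM open_orders_v WHERE id = NEW.id;
END;
""")
db.execute("INSERT INTO orders VALUES (1, 'open', 9.5)")
db.execute("INSERT INTO orders VALUES (2, 'closed', 3.0)")
print(db.execute("SELECT * FROM open_orders_mat").fetchall())  # [(1, 9.5)]
```

Because the trigger constrains the VIEW query to `NEW.id`, the update cost stays proportional to the written row rather than to the whole view.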
1. Do you have a paper on this with a rigorous justification of the algorithm?
2. Surely this has to rely on the isolation level being fairly high, or EventReduce might be reading while n other processes are updating. I don't see that mentioned.
3. Surely you need logical clocks for this? If not, could you point me to a high-level description of the algorithm that shows why they aren't necessary?
4. Why does sort order matter? A timestamp yes (see 3. above), but I don't understand why the order matters.
thanks (and trying to understand this might be the thing that gets me looking at BDDs again. I never understood their value).
2. EventReduce is mostly useful for realtime applications. I myself use it in a NoSQL database (RxDB). There you stream data and events and a single document write is the most atomic 'transaction' you can do. If you need transactional serial writes and reads that depend on each other, you would not use EventReduce for that.
3. EventReduce is just the algorithm that merges oldResults+Event. It assumes that you feed in the events in the correct order. Mostly this is meant to be used with databases that provide a changeStream where you can be sure that the order is correct.
4. Sort order matters because EventReduce promises to always return the same results as a fresh query over the database would. When the sort order is not deterministic, the rows returned by a query depend on how the data is physically stored in the database. EventReduce cannot predict that order, which means it would return a wrong result set.
PS: BDDs are awesome :)
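The sort-order point in (4) can be shown in a few lines of Python: with a tie on the sort field, two equally "fresh" queries over the same data can legitimately disagree depending on storage layout, while a total order (e.g. adding a unique id as tiebreaker) cannot. Field names here are illustrative:

```python
# With ties on the sort field, a fresh query's row order depends on
# physical storage, which no incremental algorithm can know in advance.
storage_order_1 = [{"id": "b", "age": 20}, {"id": "a", "age": 20}]
storage_order_2 = list(reversed(storage_order_1))  # same data, different layout

key_ambiguous = lambda d: d["age"]           # ties left unresolved
key_total = lambda d: (d["age"], d["id"])    # unique tiebreaker => total order

# Python's sort is stable, so ties keep the (storage) input order --
# exactly the ambiguity a database exhibits:
print(sorted(storage_order_1, key=key_ambiguous)
      == sorted(storage_order_2, key=key_ambiguous))  # False
# With a total order, the result no longer depends on storage layout:
print(sorted(storage_order_1, key=key_total)
      == sorted(storage_order_2, key=key_total))      # True
```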
There are many different trade-offs between a paper and the current repository with source code.
For me the biggest argument was that EventReduce is a performance optimization. So to be sure that it really works and is faster, you always need an implementation, since you cannot predict the performance from a paper.
Because I did not have time for both, I only created the repository with the implementation. Maybe a paper will be published afterwards.
Edit: your example uses replaceExisting(), so that supports an update of some kind.
But it's easier to do that by using a database that already provides a changestream, like CouchDB, Postgres, MongoDB and so on.
Forgive my ignorance, but that is the whole point of working with a relational database. If you cannot use JOINs, then this solves only a very limited use case.
I'm thinking about the application layer. If you have an application that writes data to a table, it's typical to run multiple instances of that application to support scale and reliability requirements.
If I send a write to one instance, how does it communicate and synchronise that write with the other application instances?
I ask because this can be a tricky thing to do, especially when consensus is required, as consensus algorithms such as Raft/Paxos require a number of network roundtrips which will introduce latency, and actually account for much of that latency in the database examples given in some cases.
See it as a simple function that does oldResults + event = newResults, as shown in the big image at the top of the readme.
I must admit, with limitations like this I’m struggling to figure out the use cases for this.
Edit: so I guess this is easier using the change subscriptions you mention in other comments. That does mean many subscribers, but hopefully that’s minimal load. This has the trade-off that it’s now eventually consistent, but I suppose that’s not a problem for many high read applications.
I’m still feeling like this could be solved in a simpler way with just simple data structures and a pub/sub mechanism. Now that I think of it, we do a similar thing with Redis for one service, and a custom Python server/pipeline in another, but we’ve never felt the need for this sort of thing.
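For comparison, a minimal in-process pub/sub really is only a few lines (a hypothetical sketch, not how Redis or EventReduce actually work). What it deliberately leaves unsolved is the part EventReduce targets: turning each raw event into the updated query result without re-running the query:

```python
from collections import defaultdict

# Deliberately naive in-process pub/sub. Subscribers get raw events;
# deriving new query results from an event is left entirely to them.
class PubSub:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        for callback in self._subscribers[topic]:
            callback(event)

bus = PubSub()
received = []
bus.subscribe("users", received.append)  # naive subscriber: just accumulate
bus.publish("users", {"op": "INSERT", "id": 1})
print(received)  # [{'op': 'INSERT', 'id': 1}]
```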
Do you have more details about specific applications/use cases, and why this is better than alternatives?
e.g. making deterministic changes to some data before you've heard back from any sort of central single-source-of-truth.
The difference with CRDTs is that the goal is to never actually hear back from a central authority and still have consistency between clients.
Maybe this is a sort of hybrid CRDT/OT.
EventReduce assumes there are no other systems interacting with the DB state (since it uses the old state that the current system saw). If there are no other systems, simple batching would work fine.
Specifically, I’m wondering about isolation levels which determine whether uncommitted changes are queryable before commit/rollback.
In addition, it also tracks the history of changes and hence allows the cursor to go back if needed via a "resumeToken".
For more information about the difference I recommend the video "Real-Time Databases Explained: Why Meteor, RethinkDB, Parse & Firebase Don't Scale" https://www.youtube.com/watch?v=HiQgQ88AdYo&t=1703s
Have you actually tried it from the mongo shell or any MongoDB client driver?
In the official documentation link which I shared, it is clearly mentioned that it supports the aggregation pipeline. Any operator compatible with the aggregation pipeline framework, including "$sort" and "$skip", can be used. You can also use JOIN-like operators such as "$lookup" or "$graphLookup".
See this link for info
Can you share a link to the code and to the data in the database you are querying, to back up your claim?
However, your query won't return the query results.
And regular PostgreSQL queries don't let you return the query plan they used on top of the query results.
Also see the limitations of EventReduce in the readme.
However, you can't ask PostgreSQL to run a query and also return the query plan used for said query.
Explain will calculate the query plan for a query.
Explain analyze will calculate the query plan, run the query and compare the query plan expectations to reality.
However, if statistics change, so does the query plan.
So if you run a query then run the same query again with explain analyze, you don't have a guarantee of getting the same information back.
And since explain analyze doesn't return the query results, you are obliged to run two separate queries.
If that's a problem you can use auto_explain.
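The comments above are about PostgreSQL, but the plan-versus-results separation is easy to demonstrate with SQLite's EXPLAIN QUERY PLAN, which is only an analogue of Postgres's EXPLAIN: the prefixed query returns plan rows instead of the data:

```python
import sqlite3

# Demonstrating that a plan-returning query does not return results
# (SQLite analogue of the PostgreSQL behavior discussed above).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (x INTEGER)")
db.execute("INSERT INTO t VALUES (1), (2)")

results = db.execute("SELECT x FROM t").fetchall()
plan = db.execute("EXPLAIN QUERY PLAN SELECT x FROM t").fetchall()

print(results)      # [(1,), (2,)] -- the data
print(plan[0][-1])  # a plan description such as 'SCAN t' -- not the data
```

To get both the actual plan and the results in PostgreSQL you really do need two statements (or auto_explain, as mentioned above).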
Materialized views solve a similar problem, but in a different way with different trade-offs. When you have many users all subscribing to different queries, you cannot create that many views, because they are all recalculated on each write to the database. EventReduce, however, scales better because it does not affect write performance, and the calculation is done when fresh query results are requested, not beforehand.