
EventReduce: An algorithm to optimize database queries that run multiple times - eventreduce
https://github.com/pubkey/event-reduce/blob/master/README.md
======
danbruc
Did I understand this correctly? You have a single set of items which you
query by evaluating a predicate on each of them and then sort the matching
ones. After the initial query you update the query result by looking at all
the data update events, i.e. you remove deleted items from the result, you
insert matching new items in the correct position according to the sort order
and you insert, remove, or move updated items as they start or stop matching
the predicate and change their position according to the sort order.
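
A minimal TypeScript sketch of that update step (with hypothetical document, event, and query shapes, not the actual EventReduce API) might look like this:

```typescript
// Hypothetical shapes for illustration only; the real EventReduce API differs.
type Doc = { id: string; [key: string]: unknown };
type WriteEvent =
  | { op: 'INSERT'; doc: Doc }
  | { op: 'UPDATE'; doc: Doc }
  | { op: 'DELETE'; id: string };

type Query = {
  matches: (doc: Doc) => boolean;       // the query predicate
  compare: (a: Doc, b: Doc) => number;  // deterministic sort order
};

// Insert doc into results at the position given by the comparator.
function sortedInsert(results: Doc[], doc: Doc, compare: Query['compare']): Doc[] {
  const idx = results.findIndex(r => compare(doc, r) < 0);
  if (idx === -1) return [...results, doc];
  return [...results.slice(0, idx), doc, ...results.slice(idx)];
}

// oldResults + event = newResults, without touching the database.
function applyEvent(query: Query, oldResults: Doc[], event: WriteEvent): Doc[] {
  switch (event.op) {
    case 'DELETE':
      return oldResults.filter(r => r.id !== event.id);
    case 'INSERT':
      return query.matches(event.doc)
        ? sortedInsert(oldResults, event.doc, query.compare)
        : oldResults;
    case 'UPDATE': {
      // Remove the old version, then re-insert if it still matches.
      const without = oldResults.filter(r => r.id !== event.doc.id);
      return query.matches(event.doc)
        ? sortedInsert(without, event.doc, query.compare)
        : without;
    }
  }
}
```

As described further down the thread, the actual algorithm additionally uses a precomputed binary decision diagram to pick the cheapest applicable action for each event (or to fall back to a full re-query), rather than always running the full logic above.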

~~~
eventreduce
Yes, this is correct. The performance benefit comes from doing all of this on
the CPU instead of using disk I/O. Also, the internal binary decision diagram
of EventReduce is optimized to run less logic than a query would. This makes
it even faster than running the query again with an in-memory database.

~~~
ComodoHacker
And the main cost of this (questionable, IMO) benefit is losing consistency:
you lose any change to the DB that does not come from the calling app. You
haven't mentioned this cost anywhere.

~~~
eventreduce
The writes are not tunneled through this algorithm somehow. You still use the
database like you normally would, so consistency is not affected.

Also this is an open source project, not something I want to sell you. Feel
free to make a PR/issue with any open topics that are not mentioned in the
docs.

~~~
ComodoHacker
>The writes are not tunneled somehow through this algorithm

Then I fail to understand how it works. How does EventReduce become aware of
these "write events"?

>this is an open source project, not something I want to sell you

You made it open source so others can use it, right? Then they had better be
able to make an informed decision about whether your solution suits their
needs.

~~~
eventreduce
You have to provide the events yourself. See EventReduce as a simple
function that computes oldResults + event = newResults.

And yes, you should always do your own testing before you use open source
software. There is no warranty; use it at your own risk.

~~~
ComodoHacker
OK, so you don't "tunnel writes through" EventReduce, you "tee" them to
EventReduce.

Anyway, to maintain consistency, you have to limit yourself to one process of
your app. No sharding, load-balancing etc. This is a significant limitation, and
it's not obvious. I encourage you to mention it in README.md.

~~~
eventreduce
I encourage you to read the readme and check out the demo. EventReduce is
not something that magically drills into your database and affects the
consistency of your write accesses.

It is a simple algorithm that is implemented as a function with two inputs and
one output.

------
pgt
Materialize exists to efficiently solve the view maintenance problem:
[https://materialize.io/](https://materialize.io/)

~~~
fahrradflucht
There is also Noria [0], which is basically a SQL database with integrated
materialized views and memcached-like performance. But I think it is still
more of a research project.

[0]: [https://github.com/mit-pdos/noria](https://github.com/mit-pdos/noria)

~~~
scott_s
Any database which offers what are called "continuous queries" also solves
this problem. I'm glad the ACM Digital Library is open right now, because I
can just point to:
[https://dl.acm.org/action/doSearch?AllField=continuous+queri...](https://dl.acm.org/action/doSearch?AllField=continuous+queries)

The work in general purpose streaming grew out of the continuous query work in
the early 2000s. Marrying truly general purpose streaming with transactional
databases is not a solved problem (as opposed to transactional databases
providing streams), but the notion of continuously running database queries
has been around for a long time.

------
cryptonector
How does this work for complex queries with sub-queries, LEFT OUTER JOINs,
LATERAL JOINs, aggregation (DISTINCT/GROUP BY), window functions, CTEs,
RECURSIVE CTEs?

I have a radically different approach: cast the queries in question as VIEWs,
materialize them, use triggers to update the materializations where you can
write those triggers easily and the updates are quick, or schedule an update
where they're not.

If your RDBMS is very good about pushing WHERE constraints into VIEWs, and
depending on how complex a VIEW query is, you might be able to make the update
automatic by just querying the materialization's underlying VIEW with
appropriate WHERE constraints from the ROWs being INSERTed/UPDATEd/DELETEd.
You can tell which VIEWs might be suitable for this by checking that the TABLE
whose row the trigger is running for is a "top-level" table source for the
VIEW's query: meaning a table source that's either the left side of a top-
level LEFT JOIN, or either side of an INNER JOIN. If you can run a query on
the VIEW with a timeout then you can just do that in the trigger and mark the
materialization as needing an update if the query is too slow. Lastly, a
scheduled or NOTIFYed job can run to perform any slower updates to a
materialization.

~~~
eventreduce
It sounds like you did not read the README. What you are describing is
something that scales with more data but not with more requests. When you
have an application where thousands of users subscribe to different queries,
your view maintenance would kill the write performance, while EventReduce
does not.

------
throwaway_pdp09
Forgive me if I missed stuff, please point me in the right direction if you've
covered it, but some questions if I may (after I've said well done!). I've
considered this problem before and it seems very difficult. So:

1. Do you have a paper on this with a rigorous justification of the
algorithm?

2. This surely has to rely on the isolation level being pretty high, or
EventReduce might be reading while n other processes are updating. I don't see
that mentioned.

3. Surely you need logical clocks for this? If not, could you point me to a
high-level description of the algorithm that shows why they aren't necessary.

4. Why does sort order matter? A timestamp, yes (see 3. above), but I don't
understand why the order matters.

thanks (and trying to understand this might be the thing to get me into
looking at BDDs again. I never understood their value).

~~~
eventreduce
1. No, I do not have a paper. I thought a lot about publishing a paper first
but then decided against it, because I think that good code, tests, and demos
are more valuable.

2. EventReduce is mostly useful for realtime applications. I myself use it in
a NoSQL database (RxDB). There you stream data and events and a single
document write is the most atomic 'transaction' you can do. If you need
transactional serial writes and reads that depend on each other, you would not
use EventReduce for that.

3. EventReduce is just the algorithm that merges oldResults + event. It assumes
that you feed in the events in the correct order. Mostly this is meant to be
used with databases that provide a changeStream where you can be sure that the
order is correct.
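
To illustrate why the event order has to be correct, here is a small self-contained TypeScript example (hypothetical shapes, not the EventReduce API) in which replaying a delete and a stale update in the wrong order resurrects a removed row:

```typescript
// Toy result set keyed by id; events are applied naively in the order received.
type Row = { id: string; value: number };
type Ev = { op: 'UPSERT'; row: Row } | { op: 'DELETE'; id: string };

function apply(results: Map<string, Row>, ev: Ev): Map<string, Row> {
  const next = new Map(results);
  if (ev.op === 'DELETE') next.delete(ev.id);
  else next.set(ev.row.id, ev.row);
  return next;
}

const update: Ev = { op: 'UPSERT', row: { id: 'x', value: 2 } };
const remove: Ev = { op: 'DELETE', id: 'x' };
const start = new Map<string, Row>([['x', { id: 'x', value: 1 }]]);

// Correct order: update, then delete -> 'x' is gone.
const correct = apply(apply(start, update), remove);
// Wrong order: delete, then the stale update -> 'x' is wrongly resurrected.
const wrong = apply(apply(start, remove), update);

console.log(correct.has('x'), wrong.has('x')); // false true
```

This is why a database change stream with a guaranteed ordering is the natural event source here.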

4. Sort order matters because EventReduce promises to always return the
same results as a fresh query over the database would have returned. When the
sort order is not predictable, the returned rows from a query depend on how
the data is stored in the database. This order cannot be predicted by
EventReduce which means it will then return a wrong result set.
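
A small TypeScript illustration of the sort-order point (the shapes are hypothetical): a comparator that ties on the sorted field leaves the database free to return tied rows in any order, while adding a unique key as a tiebreaker makes the order fully predictable.

```typescript
type Person = { id: string; age: number };

// Ties on `age`: not a total order, so the row order is unspecified.
const ambiguous = (a: Person, b: Person) => a.age - b.age;
// Unique `id` as tiebreaker: a total, deterministic order.
const deterministic = (a: Person, b: Person) =>
  a.age - b.age || a.id.localeCompare(b.id);

const rows: Person[] = [
  { id: 'b', age: 30 },
  { id: 'a', age: 30 },
];

console.log(ambiguous(rows[0], rows[1]));                     // 0: tie, order unspecified
console.log([...rows].sort(deterministic).map(r => r.id));    // always ['a', 'b']
```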

PS: BDDs are awesome :)

~~~
DarkWiiPlayer
It would probably be a good idea to write a paper at some point; it's simply
easier to read a document explaining the algorithm with some pseudocode than
to dig through an actual codebase with all the messy language-details in
between the parts that actually matter.

~~~
eventreduce
I understand that reading the plain source code is more painful than reading a
paper.

There are many different trade-offs between a paper and the current repository
with source code. For me the biggest argument was that EventReduce is a
performance optimization. So to be sure that it really works and is faster, you
always need an implementation, since you cannot predict the performance from a
paper.

Because I did not have time for both, I only created the repository with the
implementation. Maybe a paper will be published afterwards.

------
telekid
This seems conceptually similar to differential dataflow.

[https://github.com/timelydataflow/differential-dataflow/blob/master/differentialdataflow.pdf](https://github.com/timelydataflow/differential-dataflow/blob/master/differentialdataflow.pdf)

------
cntlzw
"EventReduce can be used with relational databases but not on relational
queries that run over multiple tables/collections."

Forgive my ignorance, but that is the whole point of working with a relational
database. If you cannot use JOINs, then this solves only a very limited use
case.

~~~
eventreduce
The biggest use case for EventReduce is realtime applications. Most
technologies for these, like Firebase, AWS AppSync, etc., work on
non-relational data. If you want to use EventReduce with relational queries,
you have to make them non-relational beforehand, for example by using
materialized views. If you do not want to do that, you should not use this
algorithm with its current feature set.

~~~
danpalmer
My guess would be that if you're at a scale where you're thinking about these
sorts of things, you are also at a scale where you're running on multiple
machines. How does EventReduce share writes across the cluster?

~~~
eventreduce
EventReduce is an algorithm, not a database wrapper. It does not care about
your writes or whether your database layer is a cluster, and so it does not
affect them either.

~~~
danpalmer
Sorry I wasn't clear in my original post.

I'm thinking about the application layer. If you have an application that
writes data to a table, it's typical to run multiple instances of that
application to support scale and reliability requirements.

If I send a write to one instance, how does it communicate and synchronise
that write with the other application instances?

I ask because this can be a tricky thing to do, especially when consensus is
required, as consensus algorithms such as Raft/Paxos require a number of
network roundtrips which will introduce latency, and actually account for much
of that latency in the database examples given in some cases.

~~~
eventreduce
EventReduce is a simple algorithm. It does not care or affect how you handle
propagation of writes or how you handle your events, transactions or
conflicts.

See it as a simple function that computes oldResults + event = newResults, as
shown in the big image at the top of the readme.

~~~
danpalmer
This means that if you run multiple application servers, as most do, you'll
need to implement a data distribution mechanism of some sort.

I must admit, with limitations like this I’m struggling to figure out the use
cases for this.

Edit: so I guess this is easier using the change subscriptions you mention in
other comments. That does mean many subscribers, but hopefully that’s minimal
load. This has the trade-off that it’s now eventually consistent, but I
suppose that’s not a problem for many high read applications.

I’m still feeling like this could be solved in a simpler way with just simple
data structures and a pub sub mechanism. Now I think of it, we do a similar
thing with Redis for one service, and a custom Python server/pipeline in
another, but we’ve never felt the need for this sort of thing.

Do you have more details about specific applications/use cases, and why this
is better than alternatives?

~~~
eventreduce
I think the best example of why this is useful is described by David Glasser
in his talk about the oplog driver used in Meteor.js:
[https://www.youtube.com/watch?v=_dzX_LEbZyI](https://www.youtube.com/watch?v=_dzX_LEbZyI)

------
joefreeman
This sounds like (a simpler version of?) Lambda Architecture [1, 2]

[1]
[https://en.wikipedia.org/wiki/Lambda_architecture](https://en.wikipedia.org/wiki/Lambda_architecture)
[2] [https://www.manning.com/books/big-data](https://www.manning.com/books/big-data)

~~~
fastball
To me this sounds a lot like how CRDTs work (just simpler and not rigorously
proven).

e.g. making deterministic changes to some data _before_ you've heard back from
any sort of central single-source-of-truth.

The difference with CRDTs is that their goal is to never actually hear back
from a central authority and still have consistency between clients.

Maybe this is a sort of hybrid CRDT/OT.

------
Zaheer
The goal of this is to reduce DB queries? Why not just queue up / batch
writes? What benefits does this provide over application-side batching of the
events?

EventReduce assumes there are no other systems interacting with the DB state
(since it uses the old state that the current system saw). If there are no
other systems, simple batching would work fine.

~~~
eventreduce
All your assumptions are wrong. Please read the other comments here, or at
least the readme of the repository. I will happily answer any remaining
questions you have afterwards.

------
pfarrell
I’m going to look at the code, but how are transactions in the database
handled in EventReduce?

Specifically, I’m wondering about isolation levels which determine whether
uncommitted changes are queryable before commit/rollback.

------
truth_seeker
IMO, an open cursor on a Change Stream with an aggregation pipeline (for the
given use case) in MongoDB is a more flexible solution to achieve this
functionality.

In addition, it also tracks the history of changes and hence allows the cursor
to go back if needed with "resumeToken"

[https://docs.mongodb.com/manual/changeStreams/](https://docs.mongodb.com/manual/changeStreams/)

~~~
eventreduce
There is a big difference between a change stream and a realtime query. For
example, MongoDB's cursor stream is a good way to observe the events that
happen to a specific collection or to documents that match some criteria. But
if you want the realtime results of a query that has sorting, skip, limit,
etc., then it is really hard to wrap the change stream into this. In fact,
this is exactly what EventReduce can do for you.

For more information about the difference I recommend the video "Real-Time
Databases Explained: Why Meteor, RethinkDB, Parse & Firebase Don't Scale"
[https://www.youtube.com/watch?v=HiQgQ88AdYo&t=1703s](https://www.youtube.com/watch?v=HiQgQ88AdYo&t=1703s)

~~~
truth_seeker
>> There is a big difference between a change stream and a realtime query. For
example, MongoDB's cursor stream is a good way to observe the events that
happen to a specific collection or to documents that match some criteria. But
if you want the realtime results of a query that has sorting, skip, limit,
etc., then it is really hard to wrap the change stream into this.

Have you actually tried it from the mongo shell or any MongoDB client driver?

In the official documentation link which I shared, it is clearly mentioned
that it supports the aggregation pipeline. Any operator which is compatible
with the aggregation pipeline framework, including "$sort" and "$skip", can be
used. You can also use JOIN-like operators such as "$lookup" or
"$graphLookup".

See this link for info: [https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/](https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/)

~~~
eventreduce
Yes, I have used it. I actually know it really well. I also did performance
comparisons with MongoDB and MongoDB's change streams and cursors. What I
posted here is just an algorithm. You could now compare it to MongoDB (a
product) and say it is a "more flexible solution", but I do not see the point
in directly comparing them simply based on the documentation of both.

~~~
truth_seeker
>> Yes I used it. I actually know it really well. I also did performance
comparisons with mongodb and mongodbs change stream and cursors.

Can you share the link to the code and the data in the database against which
you are querying, to prove your claim?

~~~
eventreduce
No, and I also do not want to "claim" anything. Feel free to do your own
tests.

------
marceloabsousa
Very cool! This reminds me of some research I did a few years ago on program
consolidation:
[https://dl.acm.org/doi/10.1145/2594291.2594305](https://dl.acm.org/doi/10.1145/2594291.2594305)

~~~
eventreduce
Thanks that looks interesting. I added it to my to-read list.

------
LunaSea
Databases like PostgreSQL don't offer insights into query plans. Does
EventReduce parse the SQL statements to determine which tables and rows will
be affected by a query, and then run the appropriate caching or
cache-invalidation logic?

~~~
theandrewbailey
> Databases like PostgreSQL don't offer insights into the query plans

How so?

[https://www.postgresql.org/docs/current/sql-explain.html](https://www.postgresql.org/docs/current/sql-explain.html)

[https://stackoverflow.com/questions/7359702/how-do-i-obtain-a-query-execution-plan-in-sql-server](https://stackoverflow.com/questions/7359702/how-do-i-obtain-a-query-execution-plan-in-sql-server)

[https://mariadb.com/kb/en/analyze-statement/](https://mariadb.com/kb/en/analyze-statement/)

[https://docs.oracle.com/cd/B19306_01/server.102/b14211/ex_pl...](https://docs.oracle.com/cd/B19306_01/server.102/b14211/ex_plan.htm)

~~~
LunaSea
This is only for queries where you explicitly ask for the query plan.

However, such a query won't return the query results.

And regular PostgreSQL queries don't let you return the query plan they used
on top of the query results.

------
AmericanBlarney
Many applications solve this by using in-memory caching (e.g. Redis,
memcached, etc.) of performance-sensitive datasets. There are a lot of
drawbacks to that approach, to the point that I would avoid it altogether.

~~~
eventreduce
The difference is that with simple caching you have to run the full query
again whenever the cache becomes invalidated. This is often expensive,
especially for write-intensive workloads.
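
A toy TypeScript sketch of that contrast (hypothetical, not taken from any library): a plain cache must re-run the full query after every invalidating write, while incremental maintenance patches the result set directly.

```typescript
// Count how often the "expensive" full query is executed.
let fullQueries = 0;

function runFullQuery(db: number[]): number[] {
  fullQueries++;
  return db.filter(n => n % 2 === 0).sort((a, b) => a - b);
}

const db: number[] = [];
let cached: number[] | null = null;
const incremental: number[] = [];

for (let i = 0; i < 100; i++) {
  db.push(i);                            // a write invalidates the cache...
  cached = runFullQuery(db);             // ...so a read after it re-queries fully
  if (i % 2 === 0) incremental.push(i);  // incremental: O(1) patch, no query
}

console.log(fullQueries); // 100 full queries with caching, none for the patched copy
```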

------
pfarrell
Have you compared it with an in-memory data store like Redis? Due to the lack
of support for joins, that seems like a more natural comparison than a
relational database.

------
gitgud
Interesting, this seems similar to Firebase's Firestore NoSQL database, in
that you can create a complex query and receive real-time updates on it.

------
regularfry
So... a write-through cache?

------
alecco
The use case seems better fit for a streaming database.

~~~
eventreduce
There is a big difference between a database with an event stream and a
'realtime query' that can be created with EventReduce.

~~~
scott_s
What is that difference?

~~~
eventreduce
I recommend the video "Real-Time Databases Explained: Why Meteor, RethinkDB,
Parse & Firebase Don't Scale"
[https://www.youtube.com/watch?v=HiQgQ88AdYo](https://www.youtube.com/watch?v=HiQgQ88AdYo)

~~~
scott_s
That does not answer what differentiates your solution. I work on streaming
systems. I am aware of the spectrum of online, latency-aware data processing.
But from what I can tell, in your solution the changes are coming from the
database itself. Since, as I understand it, the database is still the source
of all data, I don’t see why your solution is any faster than continuous
queries in a database.

------
dahauns
Sooo...Materialized/Indexed Views?

~~~
eventreduce
No, see the FAQ in the readme:

Materialized views solve a similar problem, but in a different way with
different trade-offs. When you have many users, all subscribing to different
queries, you cannot create that many views, because they are all recalculated
on each write access to the database. EventReduce, however, scales better
because it does not affect write performance, and the calculation is done when
the fresh query results are requested, not beforehand.

~~~
dahauns
Others have pointed this out already: this is wholly dependent on the RDBMS
used, and Oracle offers incremental refresh on demand. I have to admit,
though, that I mistakenly thought that MSSQL would as well...

------
Asraful56
Nice

