The low temporal locality part of the design space is particularly interesting. It is extremely general, but for many data models "low temporal locality" means highly variable windows on a per-entity basis rather than across the entire collection, which is a challenging criterion for the way most systems are designed. Physical-world sensor observation data models are a good example. Existing solutions for low temporal locality data models scale quite poorly in practice, even on systems nominally designed for them.
At the limit, low temporal locality platforms converge on the same unusual design requirements as symmetric mixed-workload database kernels. The latter is something we really don't (know how to) build, since it has some interesting computer science constraints. It is a missing piece of technology to make this stuff really work well at scale.
I'm writing a reactive programming language ( http://www.adama-lang.org/ ), and I'm currently looking at how to optimize it (prematurely, of course). I believe a combination of static analysis and a good runtime working together are the future (at least, for me).
For instance, I currently statically optimize "iterate _table where id == x" to "_table.get(x)", and this was a good optimization. I use the same analysis to introduce some limited indices.
The paradox for my use case (board games) is that going further isn't really needed, and the overhead of optimization starts to cost more than brute force table scans. However, I'm excited about this approach beyond board games.
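To illustrate the scan-to-lookup rewrite (a toy Python sketch, not Adama's actual implementation -- the table class and method names are made up):

```python
# Hypothetical sketch of rewriting a full-table scan into a keyed lookup,
# in the spirit of optimizing "iterate _table where id == x" to "_table.get(x)".
# Names here are illustrative, not Adama's real runtime.

class Table:
    def __init__(self, rows):
        # rows keyed by primary id, mimicking an indexed table
        self._by_id = {row["id"]: row for row in rows}

    def scan(self, predicate):
        # brute force: O(n) over every row
        return [row for row in self._by_id.values() if predicate(row)]

    def get(self, key):
        # indexed path: O(1) hash lookup
        row = self._by_id.get(key)
        return [row] if row is not None else []

table = Table([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])

# Unoptimized query: iterate _table where id == 2
slow = table.scan(lambda row: row["id"] == 2)

# Statically optimized form: _table.get(2)
fast = table.get(2)

assert slow == fast  # same result, different cost
```

The static analysis just has to prove the predicate is an equality test on the indexed column; then the rewrite is mechanical.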
> Most of these systems are equally expressive and can emulate each other
But a quick stab at how I think it could fall under this taxonomy:
[structured]
If using the DataFrame API
[unstructured]
If using the DStream API
[low temporal locality]
If you don't watermark and just let the state grow (which I don't have much experience with, but I think it could run into scalability issues), you'd have to write some imperative logic to cull state--but that's exactly what low temporal locality requires.
[high temporal locality]
If using watermarks, it will close the window and take the output for you
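Roughly what watermarking does, as a toy Python sketch (not Spark's actual mechanism; the window size and allowed lateness are made-up parameters):

```python
# Toy watermark sketch: the watermark trails the max event time by the
# allowed lateness; once it passes a window's end, that window is closed,
# its aggregate emitted, and its state dropped.

from collections import defaultdict

def run(events, window_size, allowed_lateness):
    windows = defaultdict(int)   # window start -> count
    watermark = float("-inf")
    emitted = []
    for ts in events:
        watermark = max(watermark, ts - allowed_lateness)
        start = (ts // window_size) * window_size
        if start + window_size > watermark:  # window still open
            windows[start] += 1
        # close (emit and forget) every window entirely below the watermark
        for s in sorted(w for w in windows if w + window_size <= watermark):
            emitted.append((s, windows.pop(s)))
    return emitted, dict(windows)

emitted, open_windows = run([1, 2, 6, 14], window_size=5, allowed_lateness=2)
```

The key point: state is bounded because closed windows are forgotten, which is precisely why this only works for high temporal locality.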
[internally consistent / internally inconsistent]
Not really sure where Spark falls here. I think if you use the DataFrame API you can have it be consistent, but a lot of operations aren't supported, so you may end up having to switch to DStreams and write code imperatively.
And then further, what if one of the streams is processed faster than the other in an aggregation? I'm not sure if there's a way to specify business rules around making things atomic--but I'm sure you could hack it if you dropped low-level enough.
I think this is touched upon in the other article:
> When combining multiple streams it's important to synchronize them so that the outputs
> of each reflect the same set of inputs.
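That synchronization idea can be sketched as: only release joined output up to the minimum of the two input frontiers, so every output reflects the same set of inputs from both sides (toy Python, not any real engine's API):

```python
# Sketch of synchronizing two streams: results are buffered, and only
# those at or below the slower stream's frontier are emitted, so the
# output never reflects stream A's inputs without B's (or vice versa).

def consistent_upto(frontier_a, frontier_b):
    # outputs are only sealed up to the slower stream's progress
    return min(frontier_a, frontier_b)

def emit_ready(buffered, frontier_a, frontier_b):
    cutoff = consistent_upto(frontier_a, frontier_b)
    ready = [(t, v) for t, v in buffered if t <= cutoff]
    held = [(t, v) for t, v in buffered if t > cutoff]
    return ready, held

# stream A has processed through t=10, stream B only through t=6
ready, held = emit_ready([(4, "x"), (8, "y")], frontier_a=10, frontier_b=6)
```

If one stream races ahead, its results just sit in the buffer until the other catches up--trading latency for consistency.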
That said, I'm really excited by the future of Spark Structured Streaming--I want to write declarative code and say what my data should be when it gets from A to B--not how it needs to get from A to B.
It didn't make it into https://scattered-thoughts.net/writing/internal-consistency-... because it has severe limitations for low temporal locality operations:
* As of Spark 2.4, you can use joins only when the query is in Append output mode. Other output modes are not yet supported.
* As of Spark 2.4, you cannot use other non-map-like operations before joins. Here are a few examples of what cannot be used.
* Cannot use streaming aggregations before joins.
* Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode before joins.
* There are a few DataFrame/Dataset operations that are not supported with streaming DataFrames/Datasets. Some of them are as follows.
* Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DF) are not yet supported on streaming Datasets.
* Limit and take the first N rows are not supported on streaming Datasets.
* Distinct operations on streaming Datasets are not supported.
* Sorting operations are supported on streaming Datasets only after an aggregation and in Complete Output Mode.
* Few types of outer joins on streaming Datasets are not supported.
The rest of Spark doesn't really fit the post at all afaict - there is no streaming or incremental update, just batch stuff.
We have a homegrown push-based streaming library in .NET, based on reactive extensions (it supports indexed joins of "tables"), and we're looking for something more infrastructural as we move to the cloud.
But I've only played with Souffle, and I'm not sure where it lands in this taxonomy. I think it lands on the high temporal locality side, but compared to Differential Datalog, the biggest difference in my opinion is that Souffle solves problems as a batch (i.e., all the inputs are known at command issue time) while Differential Datalog may receive inputs at runtime.
There's also IncA, which I think competes with Differential Datalog: https://github.com/szabta89/IncA
So, how I see Datalog (which is a subset of prolog) falling into this taxonomy:
Datalog (batch-processing) would fall under structured/high-temporal locality/internally consistent.
Datalog (incremental) would fall on structured/low-temporal locality/internally consistent.
Looking a little closer at what is defined as "consistent" here, incremental Datalog might not be "consistent", because it might return an "incorrect" or "unavailable" computation for a past input if you update or modify the input. But if one restricts the subset of Differential Datalog to follow functional semantics, then that would be consistent. I'm not really super knowledgeable about these projects beyond playing with them and reading some papers on Datalog/Souffle.
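To make the batch-vs-incremental distinction concrete, here's a toy semi-naive evaluation of the classic transitive-closure program in Python (nothing like Souffle's or Differential Datalog's actual machinery, just the shape of the idea):

```python
# Datalog program: path(x,y) :- edge(x,y).
#                  path(x,z) :- path(x,y), edge(y,z).
# A batch engine runs this to fixpoint from scratch; an incremental engine
# re-derives only what a new fact adds.

def transitive_closure(edges):
    # batch: iterate the rules to fixpoint over all known edges
    path = set(edges)
    delta = set(path)
    while delta:
        delta = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2} - path
        path |= delta
    return path

def add_edge(edges, path, new_edge):
    # incremental: seed the delta with the new edge plus every existing
    # path it extends, then propagate only genuinely new derivations
    edges = set(edges) | {new_edge}
    y, z = new_edge
    delta = ({new_edge} | {(x, z) for (x, yy) in path if yy == y}) - path
    path = set(path) | delta
    while delta:
        delta = {(x, zz) for (x, yy) in delta for (y2, zz) in edges if yy == y2} - path
        path |= delta
    return edges, path

edges = {(1, 2), (2, 3)}
batch = transitive_closure(edges)              # {(1,2),(2,3),(1,3)}
edges, inc = add_edge(edges, batch, (3, 4))
assert inc == transitive_closure(edges)        # incremental == re-run batch
```

Differential Datalog's real trick is handling retractions too, which this sketch doesn't attempt.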
There's another great blog, with writings supported by patrons, whose name escapes me (I think the domain ends in something about "lime", with the '.me' ccTLD).
Whenever I see those domains on HN, I always click, since I have a high expectation the content will be good, given how it was created.
I suspect it's also going to be fragile. The next time we hit a recession or, god forbid, another pandemic, sponsorships will probably be the first thing to go.
But it does help fill gaps. There are a lot of things I've wanted to code, study or write about that I think people will get value from but that I could not get any employer to pay for. So I really appreciate being able to work on things just because they're valuable, rather than because there is a way to capture the value and make a profit.
I've been following the rust content, it's a readable and useful (to me) blog
Maybe persistent vs in-memory? Like whether the data is typically completely blown away and recomputed on page reload (e.g., UI state, DOM trees, scene graphs).
(e.g. 'eventually consistent' analytics is so 2021, because it looks like you are doing something quantitative, except it doesn't matter if you get the right answer. It will get you claps from a certain audience, but most people lose patience pretty quickly when the 'numbers don't add up'.)
I also disagree that unstructured is the future. I think most computations actually are structured. Otherwise sql queries wouldn’t be so useful. I think a lot of processes basically start with a big bag of foos and end up with a big bag of roughly corresponding bars, so if one foo changes then there is only a localised change to the output. That can be encoded with an unstructured system but I find they are better for cases where outputs are more scalar and the processes between inputs and outputs are myriad. While you can do structured-like things in an unstructured system, you lose out on a lot of the advantages of that structure.
For instance, a tool like this maintains a database of timely information and can use rules to match events, and could be used to manage the numerous problems of event-based architectures.
That particular tool mangles Java source code badly in the process of compiling rules so it gives error messages that make no sense at all, even if you are looking at the compiler's source code in the other window and at the running compiler in the debugger.
In light of the interest in "low code" and the real success of "business rules" in the domains where they apply, I am amazed there has been so little effort to apply production rules to the "update the UI when the database changes" problem.
Even though that can look unstructured, the implementation of the rules can be done with the same relational operators that all those other methods use. A few years ago you would have had to specify the indexes by hand, but the self-optimizing-database patent from Salesforce has run out now, and there is no reason the system can't learn the frequent query patterns itself.
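A toy forward-chaining sketch of the idea (Python; the facts and rules are made up, and real engines use Rete-style matching rather than this linear scan over rules):

```python
# Production rules over a fact store: asserting a fact fires any rule whose
# condition matches, and rule actions can assert more facts or mark parts
# of a UI projection as dirty.

facts = set()
dirty_ui = []

rules = [
    # (condition over the new fact + existing facts, action)
    (lambda f: f[0] == "order",
     lambda f: facts.add(("pending", f[1]))),
    (lambda f: f[0] == "payment" and ("pending", f[1]) in facts,
     lambda f: dirty_ui.append(("order-complete", f[1]))),
]

def assert_fact(fact):
    facts.add(fact)
    for cond, action in rules:
        if cond(fact):
            action(fact)

assert_fact(("order", 42))    # fires rule 1: order 42 becomes pending
assert_fact(("payment", 42))  # fires rule 2: UI projection marked dirty
```

The same matching could of course be compiled down to the relational operators mentioned above, with indexes chosen from observed query patterns.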
If you're doing any kind of Important Reporting on eventually consistent data, you'd better make sure that you either know you're only including finalised data, or that there's a big fat warning with error bars.
On the low temporal locality side, the only consistent system I've seen so far is differential dataflow (and materialize) which does internally track which results are consistent and gives you the option to wait for consistent results in the output.
This isn't quite what the Helios paper talks about, but there's lots of things like async indexing in there that are kinda similar in nature
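The "wait for consistent results" option can be sketched as outputs tagged with logical times plus a frontier marking which times are complete (toy Python; this is not differential dataflow's or Materialize's actual API):

```python
# Sketch of frontier-based consistency: every output update carries a
# logical timestamp, and a reader only sees updates at times the system
# has declared complete, so partial results for an in-flight time are
# never observed.

class Output:
    def __init__(self):
        self.updates = []        # (time, data)
        self.frontier = 0        # all times < frontier are complete

    def push(self, time, data):
        self.updates.append((time, data))

    def advance(self, frontier):
        # the computation guarantees no further updates at times < frontier
        self.frontier = frontier

    def read_consistent(self):
        # only fully closed-out times are visible to the reader
        return [d for t, d in self.updates if t < self.frontier]

out = Output()
out.push(1, "a")
out.push(2, "b")
out.advance(2)
consistent = out.read_consistent()   # time 2 is not yet complete
```

Waiting for the frontier is the "option to wait for consistent results"; reading everything immediately is the eventually-consistent mode.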