Hacker News
Why isn't differential dataflow more popular? (scattered-thoughts.net)
228 points by jamii 40 days ago | 120 comments



Indirectly answering the question - I've skimmed through the git README [0], the abstract, and all the pictures in the academic paper that it references.

I have no idea what this thing does. Can someone explain in simple terms what it does?

My organisation is currently investigating installing Spark on the theory that it connects to databases and we need analytics. As far as I can tell it breaks analytics work into parallel workloads.

[0] https://github.com/TimelyDataflow/differential-dataflow/


(disclaimer: I work at Materialize and I work with Differential regularly)

Differential dataflow lets you write code such that the resulting programs are incremental. For example, if you were computing the most retweeted tweet in all of Twitter and, 5 minutes later, 1000 new tweets showed up, it would only take work proportional to the 1000 new tweets to update the results. It wouldn't need to redo the computation across all tweets.
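
To make the "work proportional to the new tweets" point concrete, here is a minimal Python sketch of the idea. It is not Differential's API; `TopRetweeted` and `apply` are names made up for illustration. It keeps per-tweet counts in memory between batches and only touches the tweets named in the new batch. It also assumes counts only ever go up, whereas Differential handles retractions too.

```python
from collections import Counter


class TopRetweeted:
    """Hypothetical helper: maintains the most retweeted tweet incrementally."""

    def __init__(self):
        self.counts = Counter()  # state carried over between batches
        self.best = None         # (tweet_id, count) of the current leader

    def apply(self, new_retweets):
        # Work is proportional to len(new_retweets), not to the full history.
        for tweet_id in new_retweets:
            self.counts[tweet_id] += 1
            count = self.counts[tweet_id]
            if self.best is None or count > self.best[1]:
                self.best = (tweet_id, count)
        return self.best


top = TopRetweeted()
first = top.apply(["a", "b", "a"])   # initial batch
second = top.apply(["b", "b"])       # later batch: only "b" is touched
```

After the second batch the leader changes from "a" to "b" without rescanning the first batch.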

Unlike every other similar framework I know of, Differential can also do this for programs with loops / recursion, which makes it possible to express a much wider range of algorithms.

Beyond that, as you've noted it parallelizes work nicely.

I wrote a blog post that was meant to explain "what does Differential do" and "when it is or isn't useful" and give some concrete examples that might be helpful. https://materialize.com/life-in-differential-dataflow/


> if you were computing the most retweeted tweet in all of twitter

Are all the retweet counts of every tweet stored in memory?

Where are the previous counts stored so that they can be merged with the new stream of tweets?


Let's say you have three numbers, a, b, c, and you want to add them together to get the total. Then later, "c" changes, and you'd like to re-compute the total. One option would be to re-run the full sum, a + b + c, which would be fine. However, that repeats the "a + b" calculation.

Would it be possible to improve the efficiency of the total calculation by re-using the pre-computed "a + b" if only "c" changes?

Differential dataflow is one way to do that, but only really applies if you have lots of data with complex calculations. For analytics, maybe the "a + b" calculation would cover your last 5 years of operations, and then when a new day's worth of data comes in, you just compute the changes to the totals, rather than re-computing the analytics for all those years, all without manually having to write distinct "total" and "update" code.


Sounds like basic memoization and topological sort gets you all the way there? If that's really all this is about, I'm sure there are lots of ad hoc implementations of it in many codebases. It doesn't necessarily seem like something you'd need to bring in a Rust framework to do.

Edit: lots of downvotes, yet no replies? Can someone explain why my comment is apparently so terrible...?


You're asking this in good faith, so you don't deserve to get downvoted. I think people on HN are a bit sensitive to comments they perceive as "reductionist" that may oversimplify complex problems, even if they're honest questions.

But you're right, at its core this kind of problem will use techniques like memoization, and most projects that need something like this will have ad hoc approaches to solve it. The advantage of differential dataflow is that it's a generalized approach to this problem. The business logic behind these workflows, between tracking dependencies and updates, can get pretty damn complicated and difficult to maintain. Having a generalized approach would make building these dataflows much simpler.

The paper it's based on is pretty skimmable, so I recommend taking a look at it. https://raw.githubusercontent.com/TimelyDataflow/differentia...


Thanks for linking the paper.

I think I may be a bit environmentally damaged from mainly using Clojure. Algorithms similar to this are fairly common in the Clojure ecosystem. Memoization is part of the standard library too.

The claim (in the article) that no one cares about Differential Dataflow seems to be only true when talking about this specific library. The general concept surely translates to some combination of simple concepts like memoization, topological sorting, partial application, etc., so it's obvious to me that many ad hoc implementations would exist, tailored to more specific needs in different programming languages with different feature sets. Sometimes buying into a framework is a lot more work than rolling your own, especially if it means having to switch to a different programming language.


Importantly, this doesn't just use memoization (it actually avoids having to spend memory on that), but rather uses operators (nodes in the dataflow graph) that directly work with `(time, data, delta)` tuples. The `time` is a general lattice, so fairly flexible (e.g. for expressing loop nesting/recursive computations, but also for handling multiple input sources with their own timestamps), and the `delta` type is between a (potentially commutative) semigroup (don't be confused, they use addition as the group operation) and an abelian group. E.g. collections that are iteratively refined in loops often need an abelian `delta` type, while monoids (semigroup + explicit zero element) allow for efficient append-only computations [0].
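
As a rough illustration of what working with `(time, data, delta)` tuples means, here is a small Python sketch. It is a simplification under stated assumptions: times are plain integers (a total order rather than a general lattice), deltas are integers under addition, and `accumulate` is a made-up helper, not part of Differential.

```python
from collections import defaultdict


def accumulate(updates, as_of):
    """Sum the deltas of every update at or before `as_of`.

    `updates` is an iterable of (time, data, delta) tuples. Records whose
    deltas cancel out to zero are dropped from the result.
    """
    totals = defaultdict(int)
    for time, data, delta in updates:
        if time <= as_of:
            totals[data] += delta
    return {data: n for data, n in totals.items() if n != 0}


updates = [
    (0, "a", +1),  # "a" is inserted at time 0
    (0, "b", +1),  # "b" is inserted at time 0
    (1, "a", -1),  # "a" is retracted at time 1
    (1, "c", +2),  # two copies of "c" are inserted at time 1
]
at_time_0 = accumulate(updates, 0)
at_time_1 = accumulate(updates, 1)
```

The collection at any time is recovered by summing deltas, so a retraction is just an update with a negative delta rather than a special case.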

[0]: https://github.com/frankmcsherry/blog/blob/master/posts/2019...


> Sounds like basic memoization and topological sort gets you all the way there?

I don't really want to pull rank here, but for the benefit of other readers: 100% nope.

I personally find the "make toxic comments to draw folks out" rhetorical style frustrating, so I'll just leave you with a video from Clojure/conj about how nice it would be to be able to use DD from Clojure, to get a proper reactive Datomic experience.

https://www.youtube.com/watch?v=ZgqFlowyfTA


My comment was basically (paraphrasing here) "given that my understanding of the problem is that it can be pulled off using simple constructs X and Y, seems like most people wouldn't need to pull in framework Z".

It's puzzling to me why you _wouldn't_ want to "pull rank", as you say. I did not pretend to be an expert in this domain. I'm really just exposing my knowledge and speculating about why people apparently aren't using this framework, which is what the damn submission is about. Did you even read it?

It seems like I managed to piss off a bunch of users of the framework, who - rather than simply explain in clear terms why I'm supposedly wrong - instead just downvote away and make passive-aggressive comments that assume I'm some sort of troll.

Remind me to never engage with the Rust community again. Jfc.

Edit: Oh, so you're the creator of the framework? If you go straight to calling people toxic when they have questions about it, I think I understand why no one wants to use it.


I completely understand where you're coming from and I've been downvoted for expressing non-popular views here, and I relate to your frustration.

That being said, rest assured that your experience says absolutely nothing about the wider Rust community. It's one of the most helpful ones I've engaged with.

So please don't judge it by one strangely toxic framework creator.


> That being said, rest assured that your experience says absolutely nothing about the wider Rust community. It's one of the most helpful ones I've engaged with.

It's very common to see people with toxic attitudes in and around the Rust community, even in their internal communication about how to use Rust (`actix-web`, anyone?). I don't think it's helpful to lie to yourself about the Rust community like this.

The only thing Rust users who don't want to have these conversations can do is to openly recognize and talk about the extreme fanaticism Rust users commonly display when it comes to priorities in software dev, and the toxic pattern of communication that sometimes comes bundled with it and sometimes appears on its own.


What is “toxic” about the comment? That sounds like a legitimate question to me.


From what I see it's the dismissive way it was posed, with little curiosity about the real challenges. Similar to the 'oh I could build that in a weekend' style comments that are pretty exhausting for creators to have to deal with.


This submission is literally about why people aren't using some Rust framework. I add my two cents as to why that might be and then that gets called toxic and dismissive.

Seems like many people here aren't actually willing to engage in a discussion. I guess this submission is basically just native advertisement for the framework in question.


Thank you. I find this assumption of bad faith quite frustrating. Every comment I make in this thread seems to be instantly downvoted.


It's tautological that anything that can be implemented can be implemented. Libraries and frameworks give you the implementation without taking the time to do it yourself, so you can focus on your core competency.


That wasn't my point at all. My point was that this seems like a basic application of simple concepts from Computer Science, so it's not that odd if people aren't thinking about using this library to do it.


> My point was that this seems like basic application of simple concepts from Computer Science

I'm not sure that differential dataflow is that simple as I haven't checked the paper nor the repository—according to other replies, it isn't—but if it were, that's all the more reason to use a library/tool instead of reimplementing something for the thousandth time, I think.


> My point was that this seems like basic application of simple concepts from Computer Science

Can you give examples of real world programming tasks that don't fit this definition?


The qualifiers are relevant here, specifically:

> basic application

> simple concepts

Topological sorting (along with basic graph theory) is something any beginner's course on discrete mathematics should already have taught you. You would reach for that in most cases where you need to deal with a graph of dependencies.

The other part of the puzzle is about storing calculations, i.e. memoization. This is trivial in my language of choice, but really it's not a hard problem to solve in any language. You map function inputs to outputs somewhere in memory and retrieve the results when needed.

These techniques are broadly applicable in many domains. To many people, the intuition would be to reach for them directly when they have a task that looks like a graph search, rather than go look for a framework or library that they would then need to read the documentation for and spend time integrating with their code. Sometimes less is more.

My point is really just that a lot more people would think about this as a graph or memoization problem than would ever think to go look for a Rust framework. Maybe if their own solution at some point doesn't work out, they will start searching for frameworks or libraries.
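
For readers who want to see what the "stored results plus topological sort" approach described above looks like, here is a minimal Python sketch. The `recompute` function and the `a`/`b`/`c` graph are made up for illustration, and `graphlib` requires Python 3.9 or later. This is the ad hoc technique being discussed, not how Differential works internally; it has no notion of retractions, timestamps, or loops.

```python
from graphlib import TopologicalSorter


def recompute(formulas, deps, values, changed):
    """Recompute only the nodes downstream of `changed`, in dependency order.

    deps maps each derived node to the set of nodes it reads from;
    formulas maps each derived node to a function of the `values` dict.
    """
    dirty = set(changed)
    recomputed = []
    for node in TopologicalSorter(deps).static_order():
        if node in formulas and dirty & set(deps.get(node, ())):
            values[node] = formulas[node](values)  # overwrite the stored result
            dirty.add(node)
            recomputed.append(node)
    return recomputed


deps = {"ab": {"a", "b"}, "total": {"ab", "c"}}
formulas = {
    "ab": lambda v: v["a"] + v["b"],
    "total": lambda v: v["ab"] + v["c"],
}
values = {"a": 1, "b": 2, "c": 3, "ab": 3, "total": 6}

values["c"] = 10                              # only "c" changes
touched = recompute(formulas, deps, values, {"c"})
```

Here only `total` is recomputed and the stored `ab` is reused.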


It's madness to combine application logic with update management in your codebase. Update management is very hard to get right, and there are a lot of corner cases that only show up under extreme conditions with delayed or duplicate delivery of updates. When the update logic is incorrect, you'll occasionally get plausible-but-somewhat-wrong answers that are hard to reproduce. The very worst kind of bug.

It's much better to have the update logic handled in a thoroughly tested library, and build your application logic on top.
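
One of the corner cases mentioned above, duplicate delivery, is easy to show in a few lines of Python. This is a toy sketch: the update ids and both function names are assumptions for illustration, and deduplicating by id is just one possible fix, not a description of how any particular library handles it.

```python
def apply_naive(total, updates):
    """Adds every delta it sees, so a redelivered update is counted twice."""
    for _update_id, delta in updates:
        total += delta
    return total


def apply_deduplicated(total, updates, seen):
    """Skips updates whose id was already applied (ids are hypothetical)."""
    for update_id, delta in updates:
        if update_id in seen:
            continue
        seen.add(update_id)
        total += delta
    return total


# Update 1 is delivered twice, e.g. after a retry.
updates = [(1, 5), (2, 3), (1, 5)]
naive_total = apply_naive(0, updates)
correct_total = apply_deduplicated(0, updates, set())
```

The naive total is plausible but wrong, and it only goes wrong when a retry happens, which is what makes this class of bug hard to reproduce.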


This reply would be much better if you provided some examples of these corner cases.


(Someone correct me if I'm wrong!) I think about differential dataflow as the solution to "I can't batch data operations, because I don't know when my various inputs will land."

If everything exists at 7am, and/or you don't need the freshest computed values, this is not the solution you need.

If data A is ready between 2-4am, data B at noon, and data C sometime between 8am-6pm, this allows you to abstract that uncertainty into code, then let the system solve it on a daily basis.

This is not a problem everyone has. But it is a problem most people working with inventory or events have! And it's usually a problem people feeding things to ML have.


I just found this video: https://youtu.be/yyhMI9r0A9E

and also this: http://muratbuffalo.blogspot.com/2017/11/on-dataflow-systems...

Timely dataflow was inspired by Naiad.


Assume you need to execute the following query:

    SELECT SUM(A) FROM MyTable

For large tables it will take some time to compute the result. Now assume we append a new record and want to get the new result. The traditional approach is to execute this query again. A better approach is to process only the new record, by adding its value in A to the result of the previous query. This is important in (stateful) stream processing.
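
A minimal Python sketch of that idea follows. `RunningSum` is a made-up name, and the `delete` method is an addition beyond the append-only case described above.

```python
class RunningSum:
    """Keeps the result of SUM(A) up to date without rescanning the table."""

    def __init__(self, rows=()):
        self.total = sum(rows)  # one full scan, done once

    def insert(self, a):
        self.total += a         # O(1) per appended record
        return self.total

    def delete(self, a):
        self.total -= a         # a removal is the same update with opposite sign
        return self.total


s = RunningSum([10, 20, 30])
after_insert = s.insert(5)
after_delete = s.delete(20)
```

SUM is the easy case because addition has an inverse. Aggregates like MAX need more state to handle deletions.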

Something similar is implemented in these libraries, which however rely on a different data processing concept (an alternative to map-reduce):

https://github.com/asavinov/prosto - Functions matter! No join-groupby, No map-reduce.

https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last!


I think there are a lot of similarly interesting paradigms that go mostly unnoticed because of a lack of "developer UX".

My personal favorite is "Functional hybrid modelling" - https://github.com/giorgidze/Hydra


I've always been interested in distributed stream processing platforms (Storm, Spark, Samza, Flink, etc) - and I've been interested in a distributed processing platform that wasn't on the JVM (there used to be one called Concord). That said, I came across differential dataflow a while ago (as I also began writing more and more Rust).

I think the biggest issue is the documentation, not so much on writing code, but on building an actual production service using it. I think most of us can now grok that you have a Kafka Stream on one end and a datastore on the other, and the quintessential map/reduce hello world is WordCount.java. That isn't clear from the differential dataflow documentation - I remember thinking how are they getting data from the outside world into this thing, then thinking maybe I don't understand this project at all.

Consider the example in the ReadMe - the hello world is "counting degrees in a graph". While it gives you an idea of how simple it is to express that computation, it isn't interactive - it's unclear how one might change the input parameters (or if that's even possible). The hardest part of most of these frameworks is glue - but once you have that running then exploring what's possible is much easier. Differential Dataflow doesn't provide that for me right off the bat.

That said - I'm not surprised, when I last checked it out Rust Kafka drivers weren't all there and it seemed to be evolving parallel to everything else. I think what would make it more popular is a mental translation of common Spark tasks (like WordCount) to differential dataflow.


> it's unclear how one might change the input parameters (or if that's even possible).

Yeah, the readme is pretty dense on terminology that is unfamiliar (at least to me).

It answers that question like this:

> In the examples above, we can add to and remove from edges, dynamically altering the graph, and get immediate feedback on how the results change

but it would be great to show an example of that in code, as otherwise it is easy to assume that "reachable" is a fixed result set, when the whole point of the system is that presumably you can subscribe to changes in "reachable" as "roots" or "edges" change.
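
To illustrate what "subscribing to changes in reachable" would mean, here is a Python sketch of the observable behaviour only. It is not Differential code and not incremental: it recomputes reachability from scratch and then diffs the two results, which is exactly the work the real system is meant to avoid. The `reachable` and `changes` functions are made up for illustration.

```python
from collections import deque


def reachable(roots, edges):
    """Breadth-first search over a set of (src, dst) edges."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


def changes(before, after):
    """Express the difference between two result sets as (node, delta) pairs."""
    added = [(n, +1) for n in after - before]
    removed = [(n, -1) for n in before - after]
    return sorted(added + removed)


edges = {(1, 2), (2, 3)}
before = reachable({1}, edges)

edges.remove((2, 3))   # dynamically alter the graph
edges.add((1, 4))
after = reachable({1}, edges)
diff = changes(before, after)
```

A subscriber would see only `diff`, i.e. node 3 leaving the result and node 4 joining it, rather than a fresh copy of the whole set.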


This was exactly my experience as well. I needed a non-JVM streaming platform. Differential Dataflow seemed like a possible fit, but I wasn't able to unlock the magic. More likely a product maturity issue (docs, examples, defined use cases, etc.) than a technical one.


I don't think it is possible to compare "differential dataflow" with projects like Spark. (don't know about kafka streams)

Spark is a "product" -- it has extensive documentation, supports multiple languages, and generally is production-grade. It is full of "nice-to-haves", like interactive status monitors, helper functions, serialization layers, and so on. It integrates with existing technologies (Hadoop, K8S, HDFS, etc..). It has this "finished product" feel.

"differential dataflow" seems to be a library. It only supports a single language. The documentation is very basic (it was not even clear if there is a way to run this on multiple machines or not). It is very bare-bones -- there are only a few dozen functions, and no resource monitors or interactive shells. It does not seem to integrate with anything. It has a "research software" feel -- there are random directories in the top-level repo, academic papers, and so on.

(it would probably be more fair to compare "materialize" to Spark...)


Ok. That sort of lines up with "lack of docs" and "missing features" but maybe the bigger thing is the general impression and expectations.

Having spent much of last year pointing kafka at differential dataflow and watching kafka fall over or fail to start, I definitely feel like I would trust differential dataflow more. But I agree that nothing about the presentation of the project gives that impression.


Discovery is another. "Differential dataflow" and "timely dataflow" don't quite convey clearly what problem they're solving or what machine characteristics they rely on or even where to expect more performance from them and how. Not saying that they aren't performant, but we need to say how and where pretty clearly.

For example, Spark makes it clear that its performance comes from exclusively in-memory compute across a cluster.

Spark may also be "good enough" from a performance standpoint for many use cases. New tools can get adoption among small players only if they are radically easier to use and deploy.


I have never used either Kafka or differential dataflow, but I would like to offer a personal anecdote to illustrate the importance of the greater system:

I once needed to set up a web app written in Python. I did this by running the code in a WSGI instance and serving it via nginx. Setting up all the activation files, locked-down permissions, and secure sockets was pretty finicky to get right, and took a non-trivial amount of time. It would have been much easier to use Python's built-in web server and expose it to the internet directly. It has fewer moving parts and is generally more predictable.

I still went with the more complex solution -- because I needed logging, security and large file offload. And I used the built-in web server for development.

The requirements of a "production" system are pretty different from those of a "development" one. Sometimes people are willing to install a bigger (and therefore more fragile) system when they need more features.


That's not the issue here. Kafka is far more fragile than it should be, partly because companies have always approached it as cluster-first and partly because it's enterprise-first software where high setup costs just aren't important. A lot of JVM software ends up like this - there's a big chunk of fiddly O(1) work in getting it all going, just because no-one ever bothers to make it all easy to get started with.

I say this as a huge fan of Kafka, but things like MySQL have better defaults and are easier to get running out of the box, and there's no reason Kafka's starting experience couldn't be the same if someone cared enough to put the time and effort in. And ultimately it's a shame, because it leads people to ignore something that's a much better model and platform in the long term.


> there's no reason Kafka's starting experience couldn't be the same if someone cared enough to put the time and effort in

There's always a perverse incentive when the software provider's model is to monetize through support, consulting, and offering the software as a managed service. If it's too easy to run, then why would one pay for any of these services?


Do you have an opinion on MSK?


AIUI it's pretty expensive compared to, say, RDS or their managed Redis service? Which makes perfect sense relative to how much of a pain running your own Kafka cluster is.

100% worth it IMO, but it's a lot of upfront cost and you only start to see the benefits when a given flow is Kafka end-to-end and you learn how to use it, so I absolutely get why people are skeptical.


My 2 cents.

Compute and storage capabilities keep growing rapidly. If one structures their data well and uses a reasonable query processor and some form of underlying columnar storage, then calculations on TBs of data can be completed in seconds at low cost.

Being able to recompute the world from scratch is a p0 requirement for most analytic workloads, as otherwise migrations and unforeseen computation changes due to new product requirements and other activities become painful.

This leaves differential techniques in an awkward spot where to be effective they need to

1) Operate on vast quantities of data or sufficiently complex calculations such that optimization of compute is a concern to the end-user.

2) Operate in a computational environment that is sufficiently constrained such that all present and future changes to the computation can be reasonably accounted for.

3) Be transparent enough that engineers don't feel that they are duplicating logic.

It's hard to think of applications that would meet these criteria outside of intelligent caching of computation within DB engines.


Depends how you look at it. Single thread perf is pretty much stagnant. Only recently has Ryzen budged multi-core perf, and we have yet to see how long this growth path can last.

I agree that cloud has made computing power more accessible, but there are known limits regarding scalability. Moreover, using distributed computing (spark, etc) kills efficiency (see all those posts about a laptop beating a medium sized big data cluster).


If I am not misremembering, "all those blog posts" are all written by Frank McSherry, one of the cofounders of Materialize and developer of differential dataflow :-)

http://www.frankmcsherry.org/assets/COST.pdf


Very few data processing activities are bottlenecked by single thread performance. Similarly, as the memory, thread counts, and storage capacities of modern servers have continued to follow Moore's law, the number of applications that can fit within the same finite cost has also grown.

In terms of efficiency, the big question is "does an incremental gain in efficiency matter for this application?" If we're talking about a < 10x performance/cost change, the answer will be no for most teams. Consider how many of these big data applications are implemented in Python or other interpreted languages.


>It's missing some important feature, like persistence?

Hard to answer, considering....

>It's had very little advertising?

Personally, I've never heard of it and have no idea what differential dataflow is. Maybe I've done it but never gave it or discovered a name for it. I don't know what Spark or Kafka Streams are. Maybe that's because I've never had a use case for those that wasn't satisfied by a tool that was "good enough", or, more likely, I haven't come across anyone recommending those on projects, because they also don't know what those tools are. I would never have known what RabbitMQ was if a coworker hadn't suggested we use it to build queues, and it turned out to be cumbersome to use and 100x more complicated than writing a stored procedure that turned out to be "good enough". Most tools fall into that space where they are marginally better in some regards than "good enough", but not enough better to make up for the learning curve for other developers, changes in maintenance or design, cost, etc. Advertising is pretty general and it's hard to say which of these it's doing wrong. Depending on their market, none of these might be wrong for them; it may just be that the potential market is content with "good enough" and has no need to search for tools like this.

>Rust is intimidating?

I'm not sure what the stats on Rust are, but I don't think it's that popular among business developers, to the point where you could point to it as the reason a tool has failed the adoption phase.


"Build a better mousetrap and the world will beat a path to your door." is bunk. People don't automatically adopt new better things. I don't know why though.

In my teens and twenties I collected ideas the way some folk collect stamps. The simple fact of the matter is that there are amazing things out there that you've never heard of, and no one really seems to care.

(As an aside, I hate the question, "If FOO is so great why doesn't everyone use it?" I do not know. That's not my department.)

These days some of these things are better known and some even have Wikipedia pages and stuff ( E.g. https://en.wikipedia.org/wiki/Vaneless_ion_wind_generator ) but a lot of others are still obscure (trawl through Rex Research if you want to look for weird tech.)

Like, there's a mechanism that can absorb kinetic energy. The demo has a little car on rails with a ramp at one end and a wall at the other. They put a wineglass at the wall and they put the car on the ramp and let it go: the glass shatters. They activate the device and repeat: car hits glass and halts, glass does not shatter. Messed up, right? They're from Poland IIRC, they've been doing demos at trade shows. I bet you've never heard of them. (Bug me and I'll try to dig up a link; They're in Rex Research.)

I already mentioned the "Vaneless" ion wind generator, an efficient solid-state device for converting wind into electric power without e.g. killing birds with spinning vanes. Cheap, simple, easy, durable, been around for decades, and you just now heard about it, eh? :)

There's a battery that desalinizes salt water. A nuclear reactor made of molten salt. Balloons stronger than steel. There's a guy in Michigan, Wally Wallington, who figured out how to move monoliths single-handedly, they walk just like the old stories say!

Anyway, I'm getting ranty here. To veer back on topic: yeah, it's a bummer, you build an awesome mousetrap and even the people with lots of mice ignore it. I wish I knew what to tell you. Maybe paint it mauve?


These examples provide your answer - digging just past the surface, they're not better at giving people what they want.

The vaneless ion generator is >5x less efficient.

Molten Salt Reactors, on the surface, sound quite safe, but nuclear power's primary problem is economics, and this is an unproven technology operating over the long time scales needed to amortize reactor construction costs. Nuclear power systems present novel failure mechanisms that don't exist in everyday technology, such as the corrosion mechanisms caused by coupling of mechanical, thermal, chemical, and radiological stresses, and MSRs present a new set of these that are poorly understood. Additionally, MSRs exhibit thermal shocks in normal operations orders of magnitude above what may be produced in a water-cooled reactor.

It's just not that simple - it looks simple to you.


And yet the molten-sodium and molten-lead reactors, which sound quite unsafe, actually exist and have been in use since the seventies, even today.


Yes, they actually exist in the West as research reactors, since there has been no demonstration or certification of their suitability for long-term operation.


It seems to me like you're just being circular now, like you're saying "Sure, MS reactors exist, but we don't know if they are good because we haven't done the work to find out."

Going back to the original article "Why isn't differential dataflow more popular?" we could ask "Why aren't Molten Salt reactors more popular?"


There are usually multiple sane reasons why these ideas don't make it; either technological, practical, business or social hurdles. Conformism is one of them. They aren't actually better than what they are going against until they overcome that, or offer some kind of massive advantage that makes it dumb not to switch.


I've heard that, I don't buy it. I think we are just stupid and crazy.

Biodegradable plastic has been a thing for over half a century yet the whole planet is choking on discarded plastic. Where's the sane reason for that?

Cars!

Cars aren't sane, we had to convince ourselves that they were through a deliberate campaign of domestic propaganda![1] Cars kill hella people (I don't have stats for the rest of the world but more Americans have been killed by cars than have died in all the wars we've fought!) And that's just the people that get hit. There's air pollution (car exhaust is a deadly poison) and tires are constantly wearing down and giving off vulcanized rubber particles. I won't deny their convenience, but you'll never convince me they're sane.

Or my old pet peeve: refrigerators. They open like cabinets instead of like drawers. The cold air goes out, the warm air goes in, and you get to pay for the energy to cool it off. Why not make them like a chest of drawers?

Sanity? I think not.

[1] "The Real Reason Jaywalking Is A Crime" (Adam Ruins Everything) https://www.youtube.com/watch?v=vxopfjXkArM

Replying here to the other fella:

> The vaneless ion generator is >5x less efficient.

C'mon, less efficient by what metrics? Total cost over the lifetime of the structure, including maintenance? Killing eagles?

In re: Molten Salt reactors there's an old page[1] about Molten Salt tech in general that talks about reactors towards the end:

> The second MSR was a civilian power plant prototype, the Molten Salt Reactor Experiment (MSRE)7. Hugely successful, it was ignored by the US Atomic Energy Commission (US AEC), which had decided to favor the Liquid Metal Fast Breeder Reactor (LMFBR). The Director of ORNL, Dr. Alvin Weinberg, pushed for the MSR, but was fired for his efforts 8.

> 8. Pages 198 - 200, "The First Nuclear Era : The Life and Times of a Technological Fixer", by Alvin Martin Weinberg (1994).

I've read a longer history of Molten Salt reactors that described the events referenced in the above quote. Evidently both kinds of reactor showed promise early on (MS and LMFB) but for bureaucratic reasons the one was passed over for the other.

[1] https://web.archive.org/web/20040602210408/http://home.earth...


> Or my old pet peeve: refrigerators. They open like cabinets instead of like drawers. The cold air goes out, the warm air goes in, and you get to pay for the energy to cool it off. Why not make them like a chest of drawers?

This one has an easy answer. If you want to isolate each of the drawers so that opening one does not result in the warming of the others, then you'll need a separate mechanism for blowing cold air into each drawer. (Or worse, a separate cooling coil for each drawer.) These are expensive mechanical components prone to failure, and each of these mechanisms takes up valuable space in the fridge. They may provide some long term amortized savings over the energy cost if done correctly, but it's not at all obvious to me that would be the case without seeing the math worked out.


Newer designs are actually integrating drawers of all sorts of configurations and even mini limited drawers/access for guests while entertaining and what not.


Speaking of refrigerators, they should pull cold air from outside and vent hot air into the house. For roughly half the year where we are, a fridge would just cycle outside air. But no, it has to fight the internal house temp for the whole year.

Sanity? I think not.


See, this is exactly the point. When something that seems obvious is not happening, most of the time it just means there is a gap in your understanding.

Unless you live at the North Pole and get outside temps in the 3°C range, this idea is already a no-go. Refrigerators are pretty efficient, using the same energy as a single incandescent lamp. They are not fighting the internal house temperature at all - compressors/heat pumps actually get less efficient at colder temperatures, and the heat they shed offsets your heating costs.


I'll bite, but you haven't shown anything and have sprinkled your comment full of rhetorical devices and fallacies.

The average upright fridge at Home Depot (USA) consumes roughly 500 kWh a year. This is to cool down from an internal ambient of 20C. It absolutely takes more work to cool from 20C than it does from 10, 5, or 0C. This is the conservation of energy. Changes in cooler _efficiency_ as it approaches the setpoint do not mean that I can have free energy by having a larger delta.

Modern refrigerators have been designed specifically (forced by the government) because they are always-on devices that in aggregate consume large amounts of energy. You are conflating low power and always-on consumption of energy.

I'll absolutely engage with you, but it needs to be over facts and mathematics and not hand wavy arguments about "pretty efficient" vs an incandescent lamp.

https://openstax.org/books/physics/pages/12-2-first-law-of-t...
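
To put some numbers on the disagreement, here is a minimal sketch of an idealized model, not a measurement of any real fridge. It assumes heat leaks in through the walls in proportion to the temperature difference and that the compressor runs at the ideal (Carnot) coefficient of performance. The function name, the 4C interior and the leak coefficient of 1 W/K are placeholders chosen for illustration; real compressors fall well short of Carnot and may behave differently at low ambient temperatures, and the model ignores door openings and any heating offset.

```python
def ideal_compressor_power(t_ambient_c, t_inside_c=4.0, leak_w_per_k=1.0):
    """Idealized electrical power (W) needed to hold the interior at t_inside_c.

    leak_w_per_k is a made-up placeholder for wall insulation quality.
    """
    delta = t_ambient_c - t_inside_c
    if delta <= 0:
        return 0.0  # surroundings are already cold enough, nothing to pump
    heat_leak_w = leak_w_per_k * delta      # heat entering through the walls
    t_cold_k = t_inside_c + 273.15          # interior temperature in kelvin
    carnot_cop = t_cold_k / delta           # ideal COP = Tc / (Th - Tc)
    return heat_leak_w / carnot_cop         # work = heat moved / COP


for ambient in (20.0, 10.0, 5.0, 0.0):
    print(ambient, round(ideal_compressor_power(ambient), 4))
```

In this model the required power scales with the square of the temperature difference, so a smaller delta means less work. Whether that saving is worth extra ducting is a separate question the sketch does not answer.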


This is a great thread, thanks. Familiarity is a powerful thing. It helps explain why excel remains the most widely used “programming language” in enterprise.

Part of it is habits, which are not easy to change. And the fact that when they are changed, it’s often incrementally, by substituting them with something similar enough.

Thanks for the reminder of the amazing things that are out there, but just haven’t cracked the mainstream. I think about it a lot with software from the 90s. Things like HyperCard. Stuff that, despite lower processor rates and memory, seemed MORE sophisticated than a lot of what’s standard today.


Cheers!

And yes! HyperCard is a great example. Really picking on the IT industry is like taking candy from a baby, we're so woefully ignorant of our own prior art.


>there's a mechanism that can absorb kinetic energy

Yes, it's called braking.


"The demo has a little car on rails with a ramp at one end and a wall at the other." any link for that?


https://www.youtube.com/watch?v=z-h56N_A3rY It's in Polish. They also drive a real car into one of these things set up as a bumper!

News articles, links to more videos, patents, etc.: http://www.rexresearch.com/lagiewka/lagiewka.htm


I think you fell for a trickster.

"A small Fiat 126p, going 45 km per hour, was driven into a concrete wall. The bumper was not damaged. The driver wore no seatbelts. The inertial reaction, which should have thrown him onto the hood, did not ocur. The stopping distance was only 16 centimetres. Impossible? Yet hundreds of people in, and the stadium, and millions more on television. The use in all vehicles of the absorber of the energy, "Ecollision", can radically improve automobile safety."

Regardless of the mechanism of energy absorption, if a vehicle is decelerated from 45 kph to 0 within 16 centimeters, there needs to be a way to remove the kinetic energy from the driver inside as well (seatbelts). If not, they will simply continue flying along at 45 kph, through the windshield. If the driver in this case was not wearing seatbelts, it's quite likely that the vehicle was not going that quickly.

``Lagiewka says, "The technical idea behind my buffer can be used in very many practical solutions. Another invention which I showed experimentally, is the brake. Connected to the axis on a Mercedes, the car stopped in one-quarter of the distance usually required".''

When braking, the limiting factor is usually not the brake mechanism itself (e.g. the disc brakes), but the interface between tires and the road. That's why cars have anti-lock systems.


> I think you fell for a trickster.

It's possible. I certainly haven't tried to replicate the mechanism.

In the video you can see the driver does experience "the inertial reaction" when the car halts, at least it seems to me that he leans forward due to inertia at that moment. I think that description is just bad. (Or, of course, it was just a weird hoax.)


If I do not calculate wrongly on my back of envelope, to brake from 10m/s (36km/h) in 10cm you need constant 50g acceleration on the driver... he certainly should feel that. Yes, it is a con.


Meaning no disrespect, I don't see why your "back of envelope" estimate (which you didn't actually share) should be given more credit than the video. Like I said, I think that written description is simply wrong. It wouldn't be the first time a journalist misreported the details of some technology, eh?

I just watched the video again with the playback speed at 1/4 (thanks Youtube!) and at 0:46 ( https://youtu.be/z-h56N_A3rY?t=45 ) when the car hits the bumper I can clearly see the driver lean forward due to inertia. Of course you still need to wear a seatbelt, this isn't magic. (You know on Star Trek how the Klingon disruptors make a person disintegrate, clothes and all, yeah? How does the disintegration process know to stop at the floor? What is it about the interface between boots and floor which stops the disintegration? Could you use that to make disintegration-proof armour?) I get what you're saying. I do.

The point is not that the driver magically didn't feel inertia (he did, and you can clearly see it in the video) it's that the car didn't crumple. The kinetic energy given up by the stopping car went into the flywheel rather than into violent deformation of the physical structure of the car. I.e. it works (if it works at all, I don't deny that it might be a hoax) like a "crumple zone" without the crumple:

> Crumple zones are designed to increase the time over which the total force from the change in momentum is applied to an occupant ...

~ https://en.wikipedia.org/wiki/Crumple_zone


> The point is not that the driver magically didn't feel inertia (he did, and you can clearly see it in the video)

He clearly didn't experience 50g deceleration, he would have been sent flying through the windshield.

What's more likely is that the car actually did take a lot more than 16cm to slow down to 0 m/s or that it wasn't going at 45km/h. I would bet on the second, the car looks like it's going at maybe 20km/h in the video.

> The kinetic energy given up by the stopping car went into the flywheel rather than into violent deformation of the physical structure of the car.

That's an interesting idea but if it ends up being less effective at protecting the humans inside I don't think most people will choose it over normal crumple zones.


> if it ends up being less effective at protecting the humans inside

Of course. Who cares if you can just reset the flywheel (as opposed to scrapping the crumpled car) if you still have to scrape the people off of the dashboard?

My whole point is that there are devices like this one that may be more effective, given some R&D, but that get neglected.

Let me put forward another, perhaps less physically controversial, example: the "Rolamite".

> Rolamite is a technology for very low friction bearings developed by Sandia National Laboratories in the 1960s. It is the only elementary machine discovered in the twentieth century and can be used in various ways such as a component in switches, thermostats, valves, pumps, and clutches, among others.

> The Rolamite was invented by Sandia engineer Donald F. Wilkes and was patented on June 24, 1969. It was discovered while Wilkes was working on a miniature device to detect small changes in the inertia of a small mass. After testing an S-shaped metal foil, which he found to be unstable to support surfaces, the engineer inserted rollers into the S-shaped bends of the band, producing a mechanical assembly that has very low friction in one direction and high stiffness transversely. It became known as Rolamite.

https://en.wikipedia.org/wiki/Rolamite

Or the Hilsch-Ranque vortex tube:

> The vortex tube, also known as the Ranque-Hilsch vortex tube, is a mechanical device that separates a compressed gas into hot and cold streams. The gas emerging from the "hot" end can reach temperatures of 200 °C (392 °F), and the gas emerging from the "cold end" can reach −50 °C (−58 °F). It has no moving parts.

https://en.wikipedia.org/wiki/Vortex_tube

Now these you can actually buy. They sell little ones that go on the end of an air compressor hose to deliver "spot cold" as it's called. I once emailed a company who makes them to ask what would happen if you set it up so that the cold (or hot) output chilled (or heated) the incoming air, would you get a feedback loop? But they weren't interested.

The vortex tube is less efficient than a heat pump, so there are good reasons not to use it in every potential application, but I feel that there are good applications that go completely unrealized. That's my original point. Just because some cool technology exists doesn't guarantee that it will be used well, or at all.


"back of envelope" estimate (which you didn't actually share) "

If you are interested: going from v to 0 or going from 0 to v uniformly with acceleration a in time t is related by v = a t. The distance traveled is s = 1/2 a t^2. Plugging the first into the second gives s = 1/2 v^2/a. So a = 1/2 v^2/s = 1/2 (10 m/s)^2 / 0.1 m = 500 m/s^2 ~ 50 g.

Interestingly this directly follows from the definition of acceleration and doesn't use anything like Newton's laws.

Looking at the video, why didn't they just do it in a controlled environment? Some gauges/meter markings and high speed cameras. The time it takes for the stop is on the order of t = v/a = 10/500 s = 1/50 s, so only about one frame at normal video rate.
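
The same estimate as a small script, for anyone who wants to plug in other numbers. It is a sketch that assumes perfectly uniform deceleration; the function names are made up for illustration. It runs both the rounded figures (10 m/s over 10 cm) and the figures from the quoted article (45 km/h over 16 cm).

```python
G = 9.81  # standard gravity in m/s^2


def constant_deceleration(v_mps, stop_distance_m):
    """Return (deceleration in m/s^2, stop time in s) for a uniform stop from v to 0."""
    a = v_mps ** 2 / (2.0 * stop_distance_m)  # from s = v^2 / (2a)
    t = v_mps / a                             # from v = a t
    return a, t


def kmh_to_mps(kmh):
    return kmh / 3.6


# Rounded figures: 10 m/s over 10 cm.
a_round, t_round = constant_deceleration(10.0, 0.10)
# Figures from the quoted article: 45 km/h over 16 cm.
a_claim, t_claim = constant_deceleration(kmh_to_mps(45.0), 0.16)

print(f"{a_round:.0f} m/s^2 = {a_round / G:.1f} g over {t_round * 1000:.1f} ms")
print(f"{a_claim:.0f} m/s^2 = {a_claim / G:.1f} g over {t_claim * 1000:.1f} ms")
```

Both cases come out at roughly 50 g, so the rounding in the estimate does not change the conclusion.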


Cheers for showing your work. But again, in the video, whatever its faults, you do see a car hit a barrier and not go crunch, yeah?

> why didn't they just do it in a controlled environment?

Well, there is more than one video. The one we're talking about is obviously a public demonstration and not a scientific test. (There was one video that seems to have been removed now that showed a very good and clear demo of the ramp/glass being done at some trade show or convention. That video or others might still be on YT somewhere.)

What about that ramp/glass demo? A glass shatters when the little car thingy hits it, and then another glass doesn't shatter when the flywheel device is active.

And really, I haven't dug into this particular tech too deeply, it could well be a hoax.

But my point still stands, there are lots of interesting and useful ideas that work and get ignored or neglected. Magnus effect rotors, the Tesla turbine, desalinizing batteries, "Aircrete", etc...

I could literally go on all day, just listing the less "woo-woo" stuff off of Rex Research.


I've looked at this and thought it looked amazing, but also haven't used it for anything. Some thoughts...

Rust is a blessing and a curse. It seems like the obvious choice for data pipelines, but everything big currently exists in Java and the small stuff is in Javascript, Python or R. Maybe this will slowly change, but it's a big ship to turn. I'm hopeful that tools like this and Ballista [1] will eventually get things moving.

Since the Rust community is relatively small, language bindings would be very helpful. Being able to configure pipelines from Java or Typescript(!) would be great.

Or maybe it's just that this form of computation is too foreign. By the time you need it, the project is so large that it's too late to redesign it to use it. I'm also unclear on how it would handle changing requirements and recomputing new aggregations over old data. Better docs with more convincing examples would be helpful here. The GitHub page showing counting isn't very compelling.

[1] https://github.com/ballista-compute/ballista


These products are competing for mindshare in an incredibly saturated market. There are a lot of ways to skin the data pipeline cat. I think a lot of companies have already founded data engineering teams, all of whom have established tech-stacks for data engineering tasks.

Personally, I keep an eye out on new technologies, but I'm not likely to embrace them without good reason. A fragmented tech stack is annoying.

This looks an awful lot like Spark to me. And doesn't seem to really solve the problems I typically experience with data engineering. For me, the biggest issue is orchestration. I don't see any facilities here for managing and executing data pipelines.

So, it seems to me that people aren't using dataflow more because it looks a lot like legacy products on the market. And it doesn't solve the massive problem of job orchestration and management. Apache Airflow + python + BigQuery is immensely powerful and dead simple to use. It's going to be hard to compete with.


The problem is these two:

The api is too hard to use?

The docs / tutorials are not good enough?

DD falls into an uncanny valley where the API surface is simple enough to grasp quickly yet foreign enough that actually grokking it is pretty hard, let alone applying it in an organization where maintenance is a top concern. To do anything nontrivial, you need knowledge of timely-dataflow too and the DD documentation doesn't do a good job of integrating knowledge from TD docs - they're written by someone who has already internalized that knowledge so it's an afterthought. Getting data in and out of the dataflow and orchestrating workers is pretty much undocumented outside of Github issue discussions. Trying to abstract away dataflows behind types and functions turns into a big ol' generic mess. There are a lot of rough edges like that (and the abomination crate is... well... an abomination).

McSherry's blog posts, while tantalizing, are often focused too much on static examples (entire dataset is available upfront) and are too academic-focused to make up for holes in the book. As far as I can tell, the library hasn't seen enough use for best practices to emerge and there's almost no guidance on how to build a real world system with DD.

By far the biggest problem I've had: I can avoid a DD project for a week or two at most before enough knowledge leaves my memory that I have to spend days rereading my own code to get reoriented and productive again. You either use unlabeled tuples which turns the dataflow into an unholy mess or you spend half your time writing and deleting boilerplate when doing R&D. DD is just too weird and the API too awkward - I haven't figured out a method for writing straightforward DD code.

That said, when I have gotten it to work on nontrivial problems, the performance and capabilities have been really impressive. I've just never been able to get the stars to align to use it in a professional context with future maintainers.

I think what DD needs is a LINQ-like composable query language that abstracts away the tuple datatypes and provides an ORM/query builder layer on top of dataflow statements. Most developers are familiar with SQL statements which would make DD a lot easier to adopt.


Personally, after now seeing this, I think it's going to solve a problem for us that we're going to run into in the medium term so that's pretty neat, but we are dealing with a lot of data and re-computing certain things from scratch for us would be potentially prohibitive at our scale of data.

I think the main issue is that in most shops the scale of their data isn't so large that re-computing a query with new data takes long enough to justify the engineering effort of switching off more common tools like Spark, Airflow and columnar storage DBs. They're also likely, with decent engineering, not yet at a point where they run into tuning issues on their ingest side. An ETL taking an hour every night and then taking a couple of seconds to run that query, or even having that query set up as a job that just sends out a report, isn't really an issue for most small to medium-sized companies, and even at larger ones, if your data throughput isn't particularly high, I don't see people needing to reach for this for the same reasons.

You obviously can do those less intensive tasks in DDF, but it doesn't strongly make a case for itself there, largely because DDF doesn't seem to offer any more benefit on those smaller tasks. 15s to 230ms is a really tremendous leap in performance, but for many companies I doubt the 15s is a bottleneck in the first place, so it's not actually solving a problem there; it would be a nice-to-have.


A possible reason not mentioned in the post is that writing efficient incremental algorithms is just fundamentally hard, despite the primitives and tooling afforded by the differential dataflow library. For example, even with a lot of machine learning libraries targeting python, there are only a couple that really implement online algorithms.


Can confirm, am still unsure if DDlog [0] can be switched to Worst-case optimal joins (WCOJ) [1] with the recent (unreleased, but almost 1y old) calculus operators of DDflow [2][3], because at least the original dogs^3 approach supposedly doesn't work in iterative contexts (which are necessary for recursive operations, like graph computations). The calculus blog post ends on a promising note, however.

I'm trying to help a couple (friends?) with getting the analysis of rslint [4] running well on DDlog or at least DDflow, with the end-goal being a perceptually zero-latency linter that typically responds faster than a human types. We're currently seeing initial delays in the single-digit second range, and that's not even on large projects (the incremental performance is far better, but we would like to out-compete the official TS typechecker even in CI settings that don't keep the linter's state across runs). The good news: we're making nice progress on profiling tools and I might get to trying some WCOJ code later today.

[0]: https://github.com/vmware/differential-datalog [1]: https://github.com/TimelyDataflow/differential-dataflow/tree... [2]: https://github.com/frankmcsherry/blog/blob/master/posts/2020... [3]: https://mtrlz.dev/api/rust/dogsdogsdogs/calculus/index.html (3rd-party hosting of docs for the calculus/dogs^3 crate) [4]: https://github.com/rslint/rslint


> It's missing some important feature, like persistence?

For the use cases I'm envisioning, this strikes me as a nice-to-have, and even then only if the persistence API were sufficiently easy to use (or at least to avoid).

> It's had very little advertising?

I hadn't heard much about it till now.

> Rust is intimidating?

At work I need a killer reason to inflict _any_ language on everyone else. We have a lot of shallow computation graphs (really the same few graphs on different datasets) and a few deep graphs which need incremental updates. The cost of an ad-hoc solution is less than the perceived cost (maybe the real cost) of adopting an additional language.

> Other

Broad classes of algorithms will basically expand to being full re-computations with this framework (based on a quick read of the whitepaper), and adopting a tool for efficient incremental updates is less enticing if I'm going to have to manually fiddle with a bunch of update algorithms anyway. E.g., kernel density estimation needs to be designed from the ground up to support incremental updates; a naive translation of those algorithms to dataflow would efficiently update some sub-components like covariance matrices, but you'd still wind up doing O(full_rebuild) work.
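
For the "sub-components are easy" half of that point, here is a minimal sketch of an incrementally maintained variance using Welford's update. It is a scalar stand-in for the covariance case, and the class name is made up for illustration. It is not differential dataflow code, and it does nothing about the harder part of the argument, which is that the surrounding algorithm may still need a full rebuild.

```python
class RunningVariance:
    """Mean and sample variance maintained in O(1) work per new value (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the already updated mean

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


rv = RunningVariance()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    rv.add(value)
print(rv.mean, rv.variance())
```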


I missed the edit window, but I'm rethinking that last point. It's not clear to me yet whether the current implementation supports this, but I don't see any fundamental reason why one couldn't extend the framework with user-defined operators, and that could make for an extremely pleasant end-user experience.


We have something very similar to differential dataflow implemented at my current place of work, with our own home-brewed libraries and patterns that leverage the relatively unique way we store our data (most similar to TimeScaleDB).

Like map reduce, most people do not understand how it works and why it is a useful paradigm.

Unlike map reduce, there is not an entire sub-industry of companies offering it as a service, and engineers who have used it for years without contemplating its alternatives. In absence of this background noise, people assume DD is niche, and even "wrong" or harmful.

We have new people who come in from time to time, who have experience working at a giant MR shop, who spend the first few months wondering aloud why we don't "just use <MR Framework>". They usually come around (if they care to understand how this new system works) or give up (usually because they never understood the MR trade-offs in the first place, but were unwilling to part with its style of thinking/working).

One thing I'll note is our jargon around it is extremely minimal and literal. The diction employed by DD (and TimeScaleDB!) feels very formal in comparison, which can be off-putting to prospective users.

I'm not one to advocate for dumbing down your tone (quite the opposite). However, it's interesting to note that the successful-yet-complicated projects (like MR, kafka) have an accretion disk around them of dumbed-down explanations on Medium, youtube and the like, that can lure in people who are curious but less-academic.

I don't think you can manufacture these. It's just a matter of time, until things like this appear for DD.


I’ve been curious about it, but it’s difficult to wrap my mind around. I’ve read a lot of frank mcsherry’s blog posts, watched his videos, been through the book, and I guess it just hasn’t clicked for me! I also don’t have any use cases that make sense as a hobby project, and abstractly I know it could be useful at work but I can’t evangelize something I don’t really understand.

Rust took me around three attempts to get into, and it took a motivated project to really seal the deal, but at some point I understood enough that it just became programming again. Haven’t reached that with differential dataflow yet, but I’ll keep trying.


Hey,

in case you're interested, we're a loose group working on DDlog [0] and DDflow to be able to use them for a JS/TS linter [1].

There are a couple fairly concrete and varyingly-isolated tasks on Timely [2], DDlog, as well as DDflow's dogs^3 [3](for the latter, boilerplate-encapsulating tooling with documentation for WCOJ [4]). Let me know if you'd like to talk.

[0]: https://github.com/vmware/differential-datalog [1]: https://github.com/rslint/rslint [2]: https://github.com/TimelyDataflow/timely-dataflow [3]: https://mtrlz.dev/api/rust/dogsdogsdogs/index.html_ [4]: https://github.com/TimelyDataflow/differential-dataflow/tree...


I wanted to pick it up, I feel it's under-appreciated technology that has lots of potential. Reasons why I didn't:

- It's somewhat hard to sell to management. There (was) no company behind it to provide support; and it's not a "successful Apache project"/ with large-ish community, either. And generally for a long while it was a passion project more than something Frank McSherry would actively encourage you to use in production.

- As others have said, the "hello world" is somewhat tricky. Not a lot of people know Rust. If you say "let's do this project in Rust", this will likely not go well; if I were able to use it from .NET and JVM, as a library, it might be an easier sell (I'm personally more invested in .NET now but earlier in my career it would've been JVM)

- last but not least: the "productization" story is a bit tricky; comparing it to Spark does it no service. For Spark, not only do I have managed clusters like EMR, but I have a decent amount of tooling to see how the cluster executes (the spark web UI). Also I can deploy it in mesos, yarn not just standalone (and mesos/yarn have their own tooling). For differential dataflow, one had none of that (at least last time I checked). Maybe it'd be more fair to compare it to Kafka Streams?

  * Might I add: spark-ec2 was a huge help for me picking up spark, since before the 1.0 version. You can do tons of work on a single machine, yes... but, for this kind of systems, the very first question is "how do you distribute that?". And you have the story that "it's possible", but you don't have easy examples of "word count, done on 3 machines, not because it's necessary but because we demonstrate how easy it is to distribute the computing across machines".

  * Compared to Kafka Streams: the thing about Kafka Streams is that you know what to use it for (Kafka!) and one immediately groks how one uses this in production (all state management is delegated to Kafka, this is truly just a library that helps you work better with Kafka). With differential dataflow, it's much less clear. You could use it with Kafka, but also with Twitter directly, or with something else. And what happens if it crashes? How do you recover from that? What are the data loss risks? Does it give you any guarantees or do you have to manage that?


I am a huge fan of Frank McSherry's work and don't necessarily agree with the premise that DD is somehow failing. However,...

Batch data processing is very well understood, cheap and getting cheaper every year. So, if you can afford to boil the ocean every night, DD is a tough sell.

The addressable market, customers with problems which can only be solved with DD (instantaneous exactly correct answers) is probably small right now.


I think the killer app for differential dataflow would be an easy to set up realtime database like Firebase, but with much richer real-time queries and materialized views.

Materialize (built on differential dataflow) is cool but doesn't have the complete package of a persisted database.


Do you happen to have any examples of real-time queries or apps you would be interested in?

Re: the second point — you’re right, Materialize has historically leveraged existing upstream systems (like Kafka) for things like persistence. But we also hear you loud and clear that not everyone wants to stand up Kafka :)


Yeah, I think there's a tremendous amount of use cases for Materialize for companies that already have data store infra and want real time analytics or such use cases.

However, I also think differential dataflow solves a big problem for smaller companies building out their MVP or in early stages. Firebase is popular because it's easy to set up, and its realtime functionality on the client side means you don't need to write a client-side data management layer, you can just use Firebase's realtime functionality.

The issue is that firebase is completely untyped, isn't relational, and has limited queries. So you end up writing gnarly non-transactional code that makes many round-trip requests to query basic stuff.

I think there may be an opportunity for a product that combines the performance and client-side tools of Firestore, the ease of use of Airtable, and the real-time query and materialized view functionality of Materialize into a database platform for businesses that want to scale their product.

Big ask obviously, but I know that a product like that would help me launch products much faster, I'd pay a lot for it.


Having a possibility to update (query) output with new input data rather than process the whole input again even if the changes are very small is indeed a very useful feature. Assume that you have one huge input table and you computed the result consisting of a few rows. Now you add 1 record to the input. A traditional data processing system will again process all the input records while the differential system will update the existing output result.

There are the following difficulties in implementing such systems:

o (Small) changes in input have to be incrementally propagated to the output as updates rather than new results. This changes the paradigm of data processing because now any new operator has to be "update-aware"

o Only simple operators can be easily implemented as "update-aware". For more complex operators like aggregation or rolling aggregations, it is frequently not clear how it can be done conceptually (efficiently)

o Differential updates have to be propagated through a graph of operations (topology) which makes the task more difficult.

o Currently popular data processing approaches (SQL or map-reduce) were not designed for such a scenario so some adaptation might be needed
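
To make "update-aware" concrete, here is a minimal sketch of a grouped sum that consumes signed changes (insert = +1, delete = -1) and emits only the output rows that moved. It is an illustration of the general idea and not the API of differential dataflow or any other library; the class and method names are invented for the example.

```python
from collections import defaultdict


class IncrementalSumByKey:
    """Maintains SUM(value) GROUP BY key under inserts and retractions."""

    def __init__(self):
        self.sums = defaultdict(int)
        self.counts = defaultdict(int)

    def apply(self, changes):
        """changes: iterable of (key, value, diff), diff = +1 insert or -1 delete.

        Returns (key, old_sum, new_sum) for every key whose sum changed;
        None stands for "no row for this key".
        """
        before = {}
        for key, value, diff in changes:
            if key not in before:
                before[key] = self.sums.get(key)  # remember pre-batch output
            self.sums[key] += value * diff
            self.counts[key] += diff
            if self.counts[key] == 0:             # group became empty
                del self.sums[key]
                del self.counts[key]
        updates = []
        for key, old in before.items():
            new = self.sums.get(key)
            if old != new:
                updates.append((key, old, new))
        return updates


view = IncrementalSumByKey()
first = view.apply([("a", 5, +1), ("a", 7, +1), ("b", 1, +1)])
second = view.apply([("a", 3, +1)])   # one new record touches one group
third = view.apply([("b", 1, -1)])    # a retraction removes group "b"
print(first, second, third)
```

The work per batch is proportional to the number of changes, not to the size of the table. A sum is the easy case; the bullets above are about operators where no such cheap inverse exists.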

Another system where such an approach was implemented, called incremental evaluation, is Lambdo:

https://github.com/asavinov/lambdo - Feature engineering and machine learning: together at last!

Yet, this Python library relies on a different novel data processing paradigm where operations are applied to columns. Mathematically, it uses two types of operations: set operations and functions operations, as opposed to traditional approaches based on only set operations.

A new implementation is here:

https://github.com/asavinov/prosto - Functions matter! No join-groupby, No map-reduce.

Yet, currently incremental evaluation is implemented only for simple operations (calculated columns).


How is a rolling aggregate hard to update? If the value at index i is changed, just update everything from i-n to i+n (where n is the rolling window size).


Yes, this is the basic logic: for any incremental aggregation we need to detect groups which can be influenced by this new record or updated record. If we do row-based rolling aggregation then indeed we need to update records (i-n, i+n). Yet, the following difficulties may arise:

o Generally, we do not want to re-compute aggregates - aggregates should be also updated, particularly, if n is very large

o In real applications, rolling aggregation is performed using partitioning on some objects. For example, we append new events from many different devices to one table and want to compute rolling aggregates for each individual device. Hence, this (i-n, i+n) will not work anymore.

o Rolling aggregation using absolute time windows will also work differently. Although, if records are ordered (like in stream processing) and there are no partitions, then it is easy.
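
A minimal sketch of the first two points, assuming in-order appends only and a row-based trailing window per device. The names are made up for illustration. Each new event adjusts the running sum of its own partition in O(1) instead of recomputing the window, and a plain (i-n, i+n) index range over the shared table would not work here because the rows of different devices are interleaved.

```python
from collections import defaultdict, deque


class PartitionedRollingSum:
    """Rolling sum over the last n rows of each partition (e.g. each device)."""

    def __init__(self, n):
        self.n = n
        self.windows = defaultdict(deque)  # device -> values currently in window
        self.sums = defaultdict(float)     # device -> sum of that window

    def append(self, device, value):
        window = self.windows[device]
        window.append(value)
        self.sums[device] += value
        if len(window) > self.n:
            # The oldest row leaves the window; subtract it instead of re-summing.
            self.sums[device] -= window.popleft()
        return self.sums[device]


rolling = PartitionedRollingSum(n=2)
results = [
    rolling.append("x", 1),
    rolling.append("y", 10),
    rolling.append("x", 2),
    rolling.append("x", 3),
    rolling.append("y", 20),
]
print(results)
```

Updating an old row, or using absolute time windows, needs more bookkeeping than this sketch has.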


A few others and I have done a lot of research on performing sliding window aggregation updates without recomputing everything. Our code is on github, and the README has links to the papers: https://github.com/IBM/sliding-window-aggregators
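
For readers who have not seen this family of algorithms, here is a minimal sketch of one well-known technique, the two-stacks approach, which handles operators like max that have no inverse. It is written from scratch for illustration and is not taken from that repository; the class name is an assumption.

```python
class TwoStacksWindow:
    """FIFO window with amortized O(1) insert, evict and query for any
    associative combine function (it does not need an inverse)."""

    def __init__(self, combine):
        self.combine = combine
        self.front = []  # (value, aggregate of this value and all newer front values)
        self.back = []   # (value, aggregate of all older back values and this value)

    def insert(self, value):
        agg = value if not self.back else self.combine(self.back[-1][1], value)
        self.back.append((value, agg))

    def evict(self):
        if not self.front:
            # Flip: move everything to the front stack, newest first,
            # so the oldest element ends up on top.
            while self.back:
                value, _ = self.back.pop()
                agg = value if not self.front else self.combine(value, self.front[-1][1])
                self.front.append((value, agg))
        self.front.pop()

    def query(self):
        if self.front and self.back:
            return self.combine(self.front[-1][1], self.back[-1][1])
        if self.front:
            return self.front[-1][1]
        return self.back[-1][1]  # raises IndexError on an empty window


window = TwoStacksWindow(max)
maxima = []
for i, x in enumerate([3, 1, 4, 1, 5]):
    window.insert(x)
    if i >= 3:          # keep at most the 3 most recent values
        window.evict()
    maxima.append(window.query())
print(maxima)
```

Each element is pushed and popped at most twice, which is where the amortized constant cost comes from.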


I am not working in this domain, but why not put some (a lot of) numbers on the claim that it is dramatically faster than Spark etc.? Maybe show how a 10 hour Spark problem can be reduced to minutes.


Frank McSherry .. the one behind the timely dataflow library and materialize.com .. did show many years ago that there is plenty of performance headroom above Spark by showing how one laptop can beat a Spark cluster.

"Scalability .. but at what COST?" http://www.frankmcsherry.org/assets/COST.pdf


It's easy to get into situations where you're paying massive costs with serialization, deserialization, and network I/O, and I believe graph operations with Spark are one of those situations. I would be curious if running Spark in local mode with a single thread would actually improve the runtime, or if it would reveal other issues with the Spark graph libraries.


Generally memory layout is extremely important for graph problems, even on a single node. As I understand it the Spark approach does not embrace a "flat" layout, but rather does lots of pointer chasing, which can really slow things down. Because Spark isn't very careful about memory usage and layout, you outgrow a single node quite fast, and then you're back to really bad distributed scaling characteristics.


So can a timely dataflow cluster beat a Spark data center? Seriously, one shouldn't hide their value points, one has to advertise them. Maybe even put up a spreadsheet with costs saved for the decision makers.


Where and how in dataflow is late data being handled? How can I configure in which ways refinements relate? These questions are the standard "What Where When How" I want to answer and put into code when dealing with streaming data. I was not able to find this in the documentation, but I only spent a few minutes scanning it.

https://www.oreilly.com/radar/the-world-beyond-batch-streami...

https://www.oreilly.com/radar/the-world-beyond-batch-streami...

Also "Materialize" seems not to support needed features like tumbling windows (yet) when dealing with streaming data in SQL: https://arxiv.org/abs/1905.12133

Additionally, "Materialize" states in their docs: "State is all in totally volatile memory; if materialized dies, so too does all of the data." This is not true of, for example, Apache Flink, which stores its state in systems like RocksDB.

Having SideInputs or seeds is pretty neat; imagine you have two tables of several TiBs or larger. This is also something that "Materialize" currently lacks: "Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data."


Late data is very deliberately not handled. The reasoning for that is best available at [0]. Now, there are ways [1] to handle bitemporal data, but they have fairly significant issues in ergonomics and performance, due to the additional work needed to allow the bitemporal aggregations.

As for the data persistence, that's something the underlying approach for the aggregations could handle relatively well with LSM trees [2] (back then, `Aggregation` was called `ValueHistory`).

Along with syncing that state to replicated storage, it should not be a big problem to make it recover quickly from a dead node.

[0]: https://github.com/frankmcsherry/blog/blob/master/posts/2020...

[1]: https://github.com/frankmcsherry/blog/blob/master/posts/2018...

[2]: https://github.com/TimelyDataflow/differential-dataflow/issu...


Taken from [0]: "If you wanted to use the information above to make decisions, it could often be wrong. Let's say you want to wait for it to be correct; how long do you wait?"

I know how long I want to wait, 30 minutes in one of my cases as I know that I've seen 95% of the important data by then. In the streaming world there is _always_ late data so being able to tell what should happen when the rest (5%) arrives is crucial for me.

This differs from use-case to use-case for me and being able to configure this and handling out-of-order data at scale is key for me when selecting a framework for stream processing. Apache Beam and Apache Flink do this very well.

Taken from [1]: "Apache Beam has some other approach where you use both and there is some magical timeout and it only works for windows or something and blah blah blah... If any of you all know the details, drop me a note."

It obviously only works when you window your data, as it needs to fit in memory. The event-time and system-time concepts in Beam and Flink are very similar, as is the watermark approach.

Thank you for sharing the links. For me it is now clearer where the difference lies between differential-dataflow and stream-processing frameworks (which also offer SQL and even ACID conformity!). I'm using Beam/Flink in production, and missing out on one of these mentioned points is a deal-breaker for me.


What do you usually want to happen with late data? In DD you have the option to ignore it at the source but not to update already-emitted results. Is the latter important for you?


In DDflow, you could also use the `Product` timestamp combinator, and track both the time the event came from and the time you ingested it. You can then make use of the data as soon as the frontier says it's current for the relevant ingestion timestamp, and occasionally advance the frontier for the origin timestamp at the input, so that arrangements can compact historic data. An affected example would be a query that counts "distinct within some time window". It only has to keep that window's `distinct on` values around as long as you can still feed events with timestamps in that window. Once you no longer can, the `distinct on` values become irrelevant for this operator, and only the count for that window needs to be retained.
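
The "distinct within some time window" example can be sketched in plain Python. This is a toy model of the idea only, not DD code: the class, its method names and the single integer frontier are assumptions made for illustration, and real arrangements compact in a more general way.

```python
class WindowedDistinctCount:
    """Toy model: distinct counts per tumbling window with frontier-driven compaction."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.open_values = {}    # window start -> set of distinct values seen so far
        self.closed_counts = {}  # window start -> final distinct count
        self.frontier = 0        # promise: no future event has a timestamp below this

    def insert(self, timestamp, value):
        if timestamp < self.frontier:
            raise ValueError("timestamp is behind the frontier")
        start = timestamp - timestamp % self.window_size
        self.open_values.setdefault(start, set()).add(value)

    def advance_frontier(self, frontier):
        if frontier < self.frontier:
            raise ValueError("the frontier can only move forward")
        self.frontier = frontier
        for start in sorted(self.open_values):
            if start + self.window_size <= frontier:
                # The window can no longer change: keep the count, drop the values.
                self.closed_counts[start] = len(self.open_values.pop(start))

    def count(self, start):
        if start in self.closed_counts:
            return self.closed_counts[start]
        return len(self.open_values.get(start, ()))


counter = WindowedDistinctCount(window_size=10)
counter.insert(1, "a")
counter.insert(2, "a")
counter.insert(5, "b")
counter.insert(12, "c")
counter.advance_frontier(10)   # window [0, 10) is now closed
print(counter.count(0))        # 2
```

After `advance_frontier(10)` the set of values for the first window is gone and only its count remains, which is the memory saving described above.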


If I have to report transactions (aka money), then yes, I need to update already-emitted results. If it's just a log-based metric for internal use, then no.

What I would like to have is a choice - and Apache Beam for example lets you choose this.


> whether there are also potential users who would have been perfectly happy with javascript/python/R bindings and a good tutorial

If you think people in some communities would benefit then you should be proactive in advertising there and in particular providing bindings for their favourite languages. This would enhance discoverability. In my corner of the tech and science world, people mostly use python and/or R but few know about Rust and fewer have knowingly used it.


The architecture of DDflow does not lend itself well to bindings of the common, interactive type. It relies on the code for the operators being compiled and optimized by LLVM, which inhibits run-time configurability.


Nah, Materialize is basically interactive SQL bindings for DD. There is some loss of optimization potential compared to running everything through the Rust compiler, but it's still decent.


This is interesting! For me - we just didn’t hear about differential dataflow. But we will probably use it in an upcoming project, because I was looking for a solution like that.


For me, it's the fact that we aren't (currently at least) using Rust. I would possibly, maybe, consider porting it to another language but haven't had the time...

In general I wonder how many people sit in the intersection of those who are free and willing to base their system on Rust, have dataflow problems to solve, and understand the advantages differential dataflow brings to the table?


Do you think you would be using it if it had bindings to $WORK_LANGUAGE?

Figuring out how to manage memory without constant de/serialization would be tricky, and it's unfortunate that it's so hard to do fusion across FFI, but I'd still expect it to happily outperform e.g. Spark SQL.


It seems Reflow falls in this category:

https://github.com/grailbio/reflow

> Reflow thus allows scientists and engineers to write straightforward programs and then have them transparently executed in a cloud environment. Programs are automatically parallelized and distributed across multiple machines, and redundant computations (even across runs and users) are eliminated by its memoization cache. Reflow evaluates its programs incrementally: whenever the input data or program changes, only those outputs that depend on the changed data or code are recomputed.


> People don't automatically adopt new better things. I don't know why though.

I consulted for Motorola many years ago. I remember one of the senior guys explaining their product view to me. New things needed to be 10x better than existing ones for MOT to get excited or want to invest in a new product; otherwise the switching cost and effort made it too likely that people wouldn’t bother to adopt the new thing.


This looks like it’s comparable to Incremental in OCaml.

https://opensource.janestreet.com/incremental/

Jane Street uses Incremental quite heavily in their trading platform.

My guess is that not a lot of people are using Rust to build the kinds of platform where this kind of library would see adoption, yet.


Palantir has an incremental computation framework incorporated into its data processing platform.[1]

[1] Search for "Incremental Computability of a Dataset Transformation" https://patents.justia.com/patent/20180196862


Could this be used to build a compiler? That's what I really want, a compiler that updates the binary as I type.


TensorFlow and Theano are quite popular, and they're all about expressing differentiable computations in a "dataflow"-based framework. It might be a simple case of needing to write some support code to make OP's desired use cases more straightforward when using these frameworks.


Why is there so little mention (one comment) of online algorithms?

I was surprised by how little attention online algorithms received when I first had to implement one.

My conclusion is that processing power currently overcomes the lack of definition or understanding people have about what they're building.


I've been using DD in production for just over a year now for low-latency (sub-second from event IRL to pipeline CDC output) processing in a geo-distributed environment (100s of locations coordinating globally), some days at the TB-per-day level of event ingest.

DD for me was one of the final attempts to find something, anything, that could handle the requirements I was working with, because Spark, Flink, and others just couldn't reasonably get close to what I was looking for. The closest 2nd place was Apache Flink.

Over the last year I've read through the DD and TD codebases about 5-7 times fully. Even with that, I'm often in a position where I go back to my own applications to see how I had already solved a type of problem. I liken the project to taking someone used to NASCAR and dropping them into a Formula One vehicle. You've seen it work so much faster, and the tech and capabilities are clearly designed for so much more than you can make it do right now.

A few learning examples that I consider funny:

1. I had a graph that was on the order of about 1.2 trillion edges with about 90 million nodes. I was using serde-derived structs for the edge and node structs (not simplified numerical types), which means I have to implement (or derive) a bunch of traits myself. I spent way more time than I'd like to admit trying to get .reduce() to work to remove 'surplus' edges that have already been processed from the graph to shrink the working dataset. Finally, in frustration and while reading through the DD codebase again, I 'rediscovered' .consolidate(), which 'just worked', taking the 1.2 trillion edges down into the 300-million-edge range. For instance, some of the edge values I need to work with have histograms for the distributions, and some of the scoring of those histograms is custom. Not usually an issue, except that having to figure out how to implement a bunch of the traits has been a significant hurdle.

2. I get to constantly dance between DD's runtime and trying to ergonomically connect the application into the tonic gRPC and tokio interfaces. Luckily I've found a nice pattern where I create my inter-thread communication constructs, then start up 2 rust threads, and start tokio based interfaces in one, and DD runtime and workers in the other. On bigger servers(packet.net has some great gen3 instances) I usually pin tokio to 2-8 cores, and leave the rest of the cores to DD.

3. Almost every new app I start, I run into the gotcha where I want to have a worker that runs only once 'globally' and it's usually the thread that I'd want to use to coordinate data ingestion. Super simple to just have a guard for if worker.index() == 0, but when deep in thought about an upcoming pipeline, it's often forgotten.

4. For diagnostics, there is https://github.com/TimelyDataflow/diagnostics, which has provided much-needed insights when things have gotten complex. Usually it's been 'just enough' to point in the right direction, but only once was the output able to point exactly to the issue I was running into.

5. I have really high hopes for materialize.io. That's really the type of system I'd want to use in 80% of the cases I'm using DD for right now. I've been following them for about a year now, and the progress is incredible, but my use cases seem more likely to be supported in the 0.8->1.3 roadmap range.

6. I've wanted to have a way to express 'use no more than 250GB of RAM' and get compile-time feedback that a fixed dataset can't be processed by the pipeline within those resources. It'd be far better if the system could adjust its internal runtime approach in order to stay within the limits.
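
For anyone wondering why .consolidate() helped in (1): as I understand it, it adds up the diffs of identical records and drops whatever nets out to zero. A minimal plain-Python sketch of that idea follows. It is not DD code, the function is hypothetical, and it leaves out timestamps, which real DD updates also carry.

```python
from collections import defaultdict


def consolidate(updates):
    """Sum the diffs of identical records and drop records whose net diff is zero.

    `updates` is an iterable of (record, diff) pairs; timestamps are omitted
    here to keep the sketch small.
    """
    totals = defaultdict(int)
    for record, diff in updates:
        totals[record] += diff
    return sorted((record, diff) for record, diff in totals.items() if diff != 0)


updates = [
    (("a", "b"), +1),   # edge added
    (("b", "c"), +1),
    (("a", "b"), -1),   # same edge retracted after processing
    (("b", "c"), +1),
]
print(consolidate(updates))   # [(('b', 'c'), 2)]
```

An edge that was added and later retracted disappears entirely, which is how a huge stream of updates can shrink to a much smaller working set.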


On (6), are you already using WCOJs and/or delta-based joins? If not, check out [0][1].

If you do, you might be interested in the LSM tree ideas [2] from when arrangements were called `ValueHistory`, to offload part of the memory usage to SSDs.

[0]: https://github.com/TimelyDataflow/differential-dataflow/tree...

[1]: https://mtrlz.dev/api/rust/dogsdogsdogs/index.html (3rd party rustdoc for convenience)

[2]: https://github.com/TimelyDataflow/differential-dataflow/issu...


People should stop assuming that merit and popularity are the same thing.

Anyway, now that I know something with that name exists, maybe someday I can learn how it works. Or will have a project where it is important.


I use Snakemake and this sounds like it does the same? I think Snakemake is pretty popular? At least among bioinformaticians.


Perhaps it will be. I'm super excited about Materialize! If it really takes off, it will surely inspire other projects.


On a quick glance, it seems that you should compare it more to Flink than Spark.


As others mentioned, you cannot really compare a concept (differential dataflow) with a product like Flink. I have not looked into differential dataflow so far, but my impression is that it would not handle all the problems Flink can handle, such as handling late data, or complex event processing with state while still getting correct results. E.g. if you scale out and some of your nodes fail, how do you ensure correctness? This is a rather hard problem, which Flink, for example, solves. If you do not solve that problem, then basically you cannot scale to as many nodes as Flink does (which might not be necessary for some use cases).


Scanned through the GitHub page. Don't really see immediate value for using it vs. other solutions.



